How can you teach a computer to draw like Picasso, Vangogh, Monet or any other artist ? Is it even possible to encode the style of a painting and apply to an existing image ? Gatys et al. introduce a novel concept[^1] that uses Deep neural network to generate artistic images of high perceptual quality.
The paper shows that it is possible to learn the style of a painting (style image) and use it to transform the existing image (content) into an artistic version that appears similar to the painting. In order to make the content image appear artistically similar to the style image, we need a metric that could capture the perceptual style differences between the two.The paper introduces a new approach that uses the deep neural network to extract the content and style features and jointly minimise the white noise image to make its contents similar to content image and style similar to style image. The level of content vs style can be tuned by tuning the amount of content or style to apply.
The paper uses images of painting as reference style image but we can experiment with other style images such as textures, tile maps or even just another photograph.
A few samples generated using neural style transfer (Content + Style = Generated)
We start with inputs, the content, style and white noise image, run it through the feature extraction process and make use of optimizer to generate the output image.
A pretrained convolutional network (vgg19 in the example above) is used to extract the content and style features. For content features, the activation map from a higher layer is used because the higher layers capture objects and their arrangements (the lower layers capture edges and textures). For style features, they use activation maps from multiple layers starting from low to high layers. This helps in generating smoother and visually pleasing images.
The paper uses L-BFGS to minimise the content and style loss by changing white noise image to bring it closer to content and style. We haven’t defined the loss functions yet but we have all the ingredients to define the loss function.
Minimising the content loss will make white noise image look like source image
is content image vector, is white noise image vector and is the layer
is a matrix of feature representation of the content image at layer
is a matrix of feature representation of white noise image at layer
is filter number, is the position of a filter in the matrix
The idea of using correlation matrix of feature maps is key to generate an artistic image as it seems to capture the style representation. The paper calls it gram matrix and it is generated by taking inner product between the vectorised feature map. In order to define style loss, we need following:
- gram matrix to capture style representation of each layer
- contribution of each layer to the style loss
is gram matrix at layer
is the color channel
The following formula shows the contribution of layer to the total style loss:
is the loss at layer
is gram matrix from the source image
is the gram matrix from white noise image
is a number of feature maps at layer (# of feature maps = # of filters)
is height x width of feature maps at layer
The style loss is a weighted sum of loss at each layer
is source image vector
is white noise image vector
is weighting at layer
Minimising the style loss will make the texture of white noise image look like style image
We jointly minimise the content and style loss of white noise image using the following equation. and are the hyperparams that control weighting factors.
The paper presents an amazing technique to generate artistic versions of stock photos. The image generation process is slower than other techniques such as real-time style transfer by Johnson et al but allows us to choose any style photo whereas real-time style transfer is done only on pre-trained style images.