Neural Style Transfer

4 min read (Updated: March 3, 2018)

How can you teach a computer to paint like Picasso, van Gogh, Monet or any other artist? Is it even possible to encode the style of a painting and apply it to an existing image? Gatys et al. introduce a novel technique that uses a deep neural network to generate artistic images of high perceptual quality.

The paper shows that it is possible to learn the style of a painting (the style image) and use it to transform an existing image (the content image) into an artistic version that resembles the painting. To make the content image appear artistically similar to the style image, we need a metric that captures the perceptual style difference between the two.

The paper introduces a new approach: a deep neural network extracts content and style features, and a white-noise image is iteratively optimised so that its content matches the content image and its style matches the style image. The balance between content and style can be tuned by weighting the two terms.

The paper uses paintings as reference style images, but we can experiment with other styles such as textures, tile maps or even just another photograph.

Neural Style Transfer Flow

Proposed network architecture

Generated Samples

A few samples generated using neural style transfer (Content + Style = Generated)

Example 1: Sunset with City Art

Sunset content
City art style
Generated sunset

Example 2: Road with Abstract Paint

Road content
Abstract paint style
Generated road

Example 3: Tree Scene

Tree content
Tree style
Generated tree

Example 4: Waterfall

Waterfall content
Waterfall style
Generated waterfall

Example 5: Silo Park

Silo park content
Udnie style
Generated silo park

Implementation details

We start with three inputs, the content image, the style image and a white-noise image, run them through the feature-extraction process and use an optimiser to generate the output image.

Feature extraction

A pretrained convolutional network (VGG-19 in the example above) is used to extract the content and style features. For content features, the activation map from a higher layer is used, because higher layers capture objects and their arrangement (lower layers capture edges and textures). For style features, the paper uses activation maps from multiple layers, from low to high. This helps generate smoother, more visually pleasing images.
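To make the idea of collecting activations at chosen layers concrete without pulling in a deep-learning framework, here is a minimal sketch. The two "layers" (a ReLU and a 2×2 average pool) are crude stand-ins for VGG-19 convolution blocks; all names and shapes are invented for illustration.

```python
import numpy as np

def toy_network_activations(image, layers=("conv1", "conv2")):
    """Collect activation maps at named layers of a toy 'network'.

    Stand-in for a pretrained VGG-19: 'conv1' is just a ReLU and
    'conv2' a 2x2 average pool, so the example runs without any
    deep-learning framework installed.
    """
    acts = {}
    x = np.maximum(image, 0.0)                # "conv1": ReLU non-linearity
    if "conv1" in layers:
        acts["conv1"] = x                     # fine detail survives here
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    pooled = x[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    if "conv2" in layers:
        acts["conv2"] = pooled                # coarser, more abstract map
    return acts

img = np.random.rand(8, 8) - 0.5
feats = toy_network_activations(img)
print(feats["conv1"].shape, feats["conv2"].shape)  # (8, 8) (4, 4)
```

With a real VGG-19 the same pattern applies: run the image forward once and keep the activation maps at the layers you care about.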

Optimization

The paper uses L-BFGS to minimise the content and style loss, updating the white-noise image to bring it closer to both the content and the style. We haven't defined the loss functions yet, but we now have all the ingredients to do so.
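As a sketch of the optimisation step (assuming SciPy is available), here is L-BFGS adjusting the pixels of a flattened noise image to minimise a toy quadratic loss pulling it toward a "content" target. The real content and style losses, defined in the following sections, slot into the same optimiser call; the target and shapes below are made up.

```python
import numpy as np
from scipy.optimize import minimize

target = np.linspace(0.0, 1.0, 16)         # pretend content image, flattened
x0 = np.random.rand(16)                    # white-noise starting image

def loss_and_grad(x):
    # toy quadratic loss and its gradient; the paper's content/style
    # losses would replace this
    diff = x - target
    return 0.5 * np.dot(diff, diff), diff

# jac=True tells SciPy the objective returns (loss, gradient) together
res = minimize(loss_and_grad, x0, jac=True, method="L-BFGS-B")
print(np.allclose(res.x, target, atol=1e-4))  # True: pixels match the target
```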

Content Loss

Minimising the content loss makes the white-noise image look like the content image:

$$L_{content}(\vec{p},\vec{x},l)=\frac{1}{2}\sum_{i,j}(F_{ij}^l - P_{ij}^l)^2$$

Where:

  • $\vec{p}$ is the content image
  • $\vec{x}$ is the white-noise image
  • $l$ is the layer
  • $F_{ij}^l$ is the feature representation of the white-noise image at layer $l$
  • $P_{ij}^l$ is the feature representation of the content image at layer $l$
  • $i$ is the filter index
  • $j$ is the position within the (vectorised) feature map

Style Loss

The idea of using the correlations between feature maps is key to generating an artistic image, as they capture the style representation. The paper calls this the Gram matrix; it is computed by taking inner products between the vectorised feature maps. To define the style loss, we need the following:

  • gram matrix to capture style representation of each layer
  • contribution of each layer to the style loss

Gram matrix:

$$G^l_{ij}=\sum_k F^l_{ik} F^l_{jk}$$

Where:

  • $G_{ij}^l$ is the Gram matrix at layer $l$
  • $k$ indexes the positions within the vectorised feature map
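For a vectorised feature map `F` of shape ($N_l$ filters × $M_l$ positions), the sum over $k$ is just a matrix product with the transpose. The tiny feature map below is made up for illustration.

```python
import numpy as np

def gram_matrix(F):
    # entry (i, j) is the inner product of filter i's and filter j's
    # responses across all positions k
    return F @ F.T

F = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])          # 2 filters, 3 positions
G = gram_matrix(F)
print(G)  # [[ 5.  2.]
          #  [ 2. 10.]]
```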

Layer Loss:

The following formula shows the contribution of layer ll to the total style loss:

$$E_l = \frac{1}{4N^2_lM^2_l} \sum_{i,j}(G^l_{ij} - A^l_{ij})^2$$

Where:

  • $E_l$ is the loss at layer $l$
  • $G_{ij}^l$ is the Gram matrix of the white-noise image
  • $A_{ij}^l$ is the Gram matrix of the style image
  • $N_l$ is the number of feature maps at layer $l$ (# of feature maps = # of filters)
  • $M_l$ is height × width of a feature map at layer $l$
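The per-layer term is again a direct translation of the formula; the Gram matrices and dimensions below are illustrative.

```python
import numpy as np

def layer_style_loss(G, A, N_l, M_l):
    # squared Gram-matrix mismatch, normalised by layer dimensions
    return np.sum((G - A) ** 2) / (4.0 * N_l**2 * M_l**2)

G = np.array([[5.0, 2.0], [2.0, 10.0]])  # white-noise image Gram matrix
A = np.array([[4.0, 2.0], [2.0, 8.0]])   # style image Gram matrix
print(layer_style_loss(G, A, N_l=2, M_l=3))  # 5/144, roughly 0.0347
```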

Style Loss:

The style loss is a weighted sum of loss at each layer:

$$L_{style}(\vec{a}, \vec{x}) = \sum^L_{l=0}w_lE_l$$

Where:

  • $\vec{a}$ is the style image
  • $\vec{x}$ is the white-noise image
  • $w_l$ is the weighting of layer $l$

Minimising the style loss makes the texture of the white-noise image look like that of the style image.
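The weighted sum is a one-liner; the equal per-layer weights ($w_l = 1/L$) used as the default below are an assumption for this sketch, as are the layer-loss values.

```python
import numpy as np

def style_loss(E, weights=None):
    # weighted sum of per-layer losses E_l; defaults to equal weights
    E = np.asarray(E, dtype=float)
    if weights is None:
        weights = np.full(len(E), 1.0 / len(E))
    return float(np.dot(weights, E))

print(style_loss([0.2, 0.4, 0.6]))  # the mean of the three layer losses
```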

Total Loss

We jointly minimise the content and style losses of the white-noise image using the following equation, where $\alpha$ and $\beta$ are hyperparameters that control the weighting of the two terms.

$$L_{total}(\vec{p},\vec{a}, \vec{x})=\alpha L_{content}(\vec{p}, \vec{x}) + \beta L_{style}(\vec{a}, \vec{x})$$
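Putting the pieces together, here is an end-to-end toy run of the joint optimisation. To stay self-contained, the raw 2×3 image serves as its own single-layer, identity "feature map" (so $F = \vec{x}$ and the Gram matrices come straight from the pixels), and plain gradient descent stands in for L-BFGS; with a real network the features would come from VGG-19 layers. All values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
content = np.array([[0.1, 0.5, 0.9],
                    [0.2, 0.6, 1.0]])      # N_l = 2 "filters", M_l = 3
style = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
A = style @ style.T                         # style Gram matrix (fixed)
alpha, beta, lr = 1.0, 1.0, 0.05

def total_loss(x):
    G = x @ x.T
    return (alpha * 0.5 * np.sum((x - content) ** 2)
            + beta * np.sum((G - A) ** 2) / (4 * 2**2 * 3**2))

x = rng.random((2, 3))                      # white-noise starting image
loss0 = total_loss(x)
for _ in range(500):
    g_content = x - content                          # d L_content / d x
    g_style = ((x @ x.T - A) @ x) / (2**2 * 3**2)    # d E_l / d x
    x -= lr * (alpha * g_content + beta * g_style)
print(total_loss(x) < loss0)  # True: the joint loss decreases
```

Trading off `alpha` against `beta` is what moves the result between a faithful reconstruction and a heavily stylised image.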

Conclusion

The paper presents an impressive technique for generating artistic versions of stock photos. The image-generation process is slower than other techniques such as the real-time style transfer of Johnson et al., but it allows us to choose any style image, whereas real-time style transfer works only with pre-trained styles.


References:

  1. Gatys et al., A Neural Algorithm of Artistic Style (arXiv)
  2. Justin Johnson's GitHub repo, with further enhancements such as multiple style images and style interpolation
  3. Keras neural style transfer example