Painting like Van Gogh using Deep Learning

5 min read · Apr 5, 2021

As a not-so-great artist, I needed another way to create art. Luckily, with a splash of TensorFlow, and a splash of math, I was able to create my own Van Gogh paintings.

Toronto stylized with Starry Night and Lake Louise stylized with The Kiss

How?

Neural Style Transfer

Neural style transfer combines two images to create a product image that has the content of one and the style of the other.

The technique was introduced in the paper A Neural Algorithm of Artistic Style (Gatys et al.).

It works by using a deep neural network to extract a style representation from one image and a content representation from another, then fusing the two to create a new image that matches both representations. The result keeps the original contents but takes on new colours and textures.

How exactly do we extract style and content?

Convolutional Neural Networks

Convolutional neural networks excel with images. They can classify images and identify their contents with high accuracy, and they do so by building increasingly abstract representations of the input. Early layers detect simple features like lines and curves, while deeper layers combine those findings to locate more complex objects like eyes or limbs. This processing hierarchy transforms an input image into a representation that captures detailed information about it, including everything we need to extract and express an image’s style and content.

Implementation

The full Google Colab

Our CNN, VGG-19

As previously stated, a strong CNN is needed to capture the intricacies of paintings and photos. Instead of spending time training a brand new network, we can harness the power of a pretrained one. VGG-19 is a 19-layer CNN trained on the ImageNet dataset, so it already has a strong comprehension of images. It can detect and extract all the lines, curves, and patterns we need.

Since we’re not training the network, or using it to make predictions, we can set trainable to false, and get rid of the fully connected layers.

We pick intermediate layers from the network to use. Because they come from different blocks, they respond to different levels of detail in an image, which lets us capture both content and style.

For style, we use multiple layers to capture the correlations between the different feature maps. These feature correlations allow us to obtain a multi-level representation of the input image to capture texture information and not the specific global arrangement.

Feature Representations

To capture the content features, we simply pass the content image into the model and save the feature maps for the intermediate layers that we selected.

Likewise, for the style representation, we pass the style image through the model and take the feature maps for the intermediate layers that we chose for style.
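The network setup and feature extraction described above could look roughly like the sketch below. The layer names are an assumption: they use the names Keras gives the VGG-19 layers and a commonly used selection, which may differ from the layers chosen in the Colab. The helper names build_extractor and extract_features are also made up for this illustration, as is the assumption that images arrive as RGB arrays with values in [0, 255].

```python
import tensorflow as tf

# Assumed layer choice, using the names Keras gives the VGG-19 layers.
# The article's Colab may select different layers.
CONTENT_LAYERS = ["block5_conv2"]
STYLE_LAYERS = ["block1_conv1", "block2_conv1", "block3_conv1",
                "block4_conv1", "block5_conv1"]


def build_extractor(layer_names, weights="imagenet"):
    """Return a frozen VGG-19 that outputs the activations of `layer_names`."""
    # include_top=False drops the fully connected classifier layers.
    vgg = tf.keras.applications.VGG19(include_top=False, weights=weights)
    vgg.trainable = False  # we never update the network itself
    outputs = [vgg.get_layer(name).output for name in layer_names]
    return tf.keras.Model(inputs=vgg.input, outputs=outputs)


def extract_features(extractor, image):
    """Run one RGB image (height, width, 3), values in [0, 255], through the extractor."""
    batch = tf.cast(image, tf.float32)[tf.newaxis, ...]  # add a batch dimension
    batch = tf.keras.applications.vgg19.preprocess_input(batch)
    return extractor(batch)  # one feature map per requested layer
```

Calling build_extractor(STYLE_LAYERS + CONTENT_LAYERS) once and then extract_features on the style image and on the content image gives the saved feature maps that the losses compare against.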

Loss

To optimize the product image toward the style of one image and the content of another, we need two separate losses: one that measures the difference in style and one that measures the difference in content.

Calculating content loss is fairly straightforward: we take the mean squared error between the features of the original content image and the features of the image that we’re creating. Because the comparison is made position by position, it measures differences in specific, localized shapes.

For style loss, it gets a bit trickier. As artistic style consists of things like colour and texture, we can’t just find the difference between the feature maps. We need a way to capture more general non-localized information about an image.
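As a minimal sketch of the content loss, here it is in NumPy. NumPy is used only to make the arithmetic visible; inside a TensorFlow training loop the same operations would be written with TensorFlow ops so that gradients can flow. The function name content_loss is chosen for this example.

```python
import numpy as np


def content_loss(content_features, generated_features):
    """Mean squared error between two feature maps of identical shape."""
    content_features = np.asarray(content_features, dtype=np.float64)
    generated_features = np.asarray(generated_features, dtype=np.float64)
    # Element-wise difference: every position in every channel is compared
    # with the same position in the other feature map.
    return float(np.mean((generated_features - content_features) ** 2))
```

Since each value is compared with the value at the same position, moving a shape to a different place in the image changes this loss, which is what we want for content.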

Introducing the Gram matrix

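The Gram matrix allows us to capture a more general picture of an image and not specific localized information. To compute it, we flatten a layer’s feature maps into a matrix with one row per spatial position and one column per channel, then multiply the transpose of that matrix by the matrix itself.

Each entry of the result measures how strongly two feature maps activate together, summed over every position in the image. Because of that sum, the Gram matrix records which features occur together but not where they occur, which is exactly the non-localized information that style needs.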
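Here is a small NumPy sketch of the Gram matrix for a single feature map of shape (height, width, channels). Dividing by the number of positions is one common normalisation and is an assumption here; the Colab may scale things differently.

```python
import numpy as np


def gram_matrix(features):
    """Gram matrix of a (height, width, channels) feature map.

    Returns a (channels, channels) matrix whose entry [i, j] is the average,
    over all spatial positions, of channel i multiplied by channel j.
    """
    features = np.asarray(features, dtype=np.float64)
    height, width, channels = features.shape
    flat = features.reshape(height * width, channels)  # one row per position
    # Assumed normalisation: divide by the number of positions.
    return flat.T @ flat / (height * width)
```

One way to see that location is discarded: shuffling the spatial positions of a feature map leaves its Gram matrix unchanged, because the sum over positions does not care about their order.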

To calculate style loss, we measure the mean squared error between the Gram matrix of the style representation of the reference style image and the Gram matrix of the style representation of the image that is being created. As we are using multiple layers, we assign different weights to different layers to emphasize the features of certain layers.

As training progresses, the product image’s content and style losses are reduced, creating an image with the original contents but a new style.
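The weighted, multi-layer style loss can be sketched in NumPy as follows. The equal default weights and the Gram normalisation are assumptions made for this example, not values taken from the Colab, and the function names are illustrative.

```python
import numpy as np


def gram_matrix(features):
    """(channels, channels) Gram matrix of a (height, width, channels) map."""
    features = np.asarray(features, dtype=np.float64)
    flat = features.reshape(-1, features.shape[-1])
    return flat.T @ flat / flat.shape[0]  # assumed normalisation by positions


def style_loss(style_features, generated_features, layer_weights=None):
    """Weighted sum over layers of the MSE between Gram matrices.

    Both feature arguments are lists with one feature map per style layer.
    """
    if len(style_features) != len(generated_features):
        raise ValueError("expected one generated feature map per style layer")
    if layer_weights is None:
        # Assumption: weight every layer equally unless told otherwise.
        layer_weights = [1.0 / len(style_features)] * len(style_features)
    total = 0.0
    for style, generated, weight in zip(style_features, generated_features, layer_weights):
        difference = gram_matrix(generated) - gram_matrix(style)
        total += weight * np.mean(difference ** 2)
    return float(total)
```

Raising the weight of an early layer would emphasize fine textures, while raising a deeper layer’s weight would emphasize larger patterns.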

Training

Now all together.

In our training loop, we load our style image once and the content image twice. One copy of the content image is used to extract content features, and the other becomes our product image. During training, we perform gradient descent on the product image to minimize the two losses.

Before beginning training, we extract the content and style features of the original images and calculate the Gram matrix for style.

For each training step, we compute gradients, then apply them to the product image. At each step, we save the product image if its loss is the lowest so far.

In the end, we have a product image with both losses minimized: a blend of our original content and style images.
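The loop described above might be sketched in TensorFlow like this. It uses plain gradient descent on the pixels; an optimizer such as Adam could be substituted. The function name run_style_transfer, the loss_fn argument (any function mapping an image tensor to a scalar loss, for example a weighted sum of the content and style losses), the default step count and learning rate, and the [0, 1] pixel range are all assumptions made for this illustration.

```python
import tensorflow as tf


def run_style_transfer(content_image, loss_fn, steps=100, learning_rate=0.02):
    """Gradient descent on the pixels of a copy of `content_image`.

    `loss_fn` is a hypothetical callable: image tensor -> scalar loss tensor.
    Returns the lowest-loss image seen during training and that loss.
    """
    # The second copy of the content image: this is the product image.
    image = tf.Variable(tf.cast(content_image, tf.float32))
    best_loss, best_image = float("inf"), tf.identity(image)

    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = loss_fn(image)
        # Keep a snapshot of the image that produced the lowest loss so far.
        if float(loss) < best_loss:
            best_loss, best_image = float(loss), tf.identity(image)
        gradient = tape.gradient(loss, image)
        image.assign_sub(learning_rate * gradient)        # gradient descent step
        image.assign(tf.clip_by_value(image, 0.0, 1.0))   # assumed pixel range [0, 1]

    return best_image, best_loss
```

Only the image is a variable here. The network that computes the features inside loss_fn stays frozen, so every step changes pixels and nothing else.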

Results

We can very easily combine landscapes with famous paintings and formations

Toronto landscape stylized with Pillars of Creation, The Scream, and Starry Night

It also works for portraits

Mona Lisa and American Gothic stylized with The Scream

Conclusion

Neural artistic style transfer can combine style and content from separate images and create a new one. It shows us that style and content can be separated and abstracted.
CNNs can extract abstract representations from images.
A Gram matrix allows us to capture non-localized information about an image.

Before you go, make sure to check out the Google Colab for the full code