The One-Stop Guide to Convolutional Neural Networks

William Chen
7 min read · Oct 25, 2020

Convolutional Neural Networks (CNNs) are the backbone of computer vision: they allow a system to understand images. They are what power technologies like Tesla’s Autopilot.

What a Tesla autopilot sees from nerdist

CNNs are also used to power facial recognition, object detection, and even natural language processing applications.

So how do they work?

Convolutional layers excel at digesting large amounts of image data and reducing it to a compact, easily interpreted form.

CNN architecture from Wikipedia

Let’s look at how that’s done.

Convolution Layer

Convolutional layers are used to extract features from the input image. Features can be anything from lines and diagonals to limbs or wings. To extract them, the convolutional layer scans the image, looking at one small section at a time and applying a filter to each. Applying the filter means taking the dot product of the filter and the small patch of the image beneath it. This process is called convolving (to convolve).
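
To make the sliding dot product concrete, here is a minimal NumPy sketch (a toy implementation of my own, not a library API; real frameworks use heavily optimized versions):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` with no padding and a stride of 1."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # dot product of patch and kernel
    return out
```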

matrix dot product (try it yourself)
sliding filter in green from sci2lab

The filter is made up of one or more kernels. A kernel is simply a 2D array of values. When convolved with an image, a kernel can pick up features like edges and lines.

Kernels are an extremely powerful tool and can do things like blur images. In fact, kernels are what power many Photoshop filters.

some kernels from: Wikipedia

Simple filters may use only one kernel, but more complicated filters use multiple kernels (typically one per input channel). A convolutional layer may also use multiple filters, each producing its own feature map.

The output of convolving a filter with the image is called a feature map.
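
As a toy example, here is an edge-detection kernel (one of the kernels in the Wikipedia table above) producing a feature map, using SciPy to do the convolving:

```python
import numpy as np
from scipy.signal import convolve2d

edge = np.array([[-1, -1, -1],
                 [-1,  8, -1],
                 [-1, -1, -1]])   # edge-detection kernel
blur = np.ones((3, 3)) / 9.0      # box-blur kernel (a Photoshop-style blur)

image = np.zeros((5, 5))
image[:, 2] = 1.0                 # a vertical white line on a black background

feature_map = convolve2d(image, edge, mode='valid')  # 3x3 feature map
print(feature_map)                # strongest response where the line sits
```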

Padding

(left) kernel with no padding from Austinwatelter | (right) filter with padding from stackoverflow

As shown in the figure above (left), the convolved result is smaller than the original input image. To counteract this, padding can be used: a border of zeros is added around the image, increasing each dimension by 2 (one pixel on each side, for a 3×3 kernel). This preserves the original image’s dimensions through the convolution and helps the model keep information at the borders.
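
A quick NumPy illustration of zero padding (sizes here are arbitrary):

```python
import numpy as np

image = np.arange(16).reshape(4, 4)

# Add a one-pixel border of zeros: each dimension grows by 2,
# so a 3x3 kernel now produces an output the same size as the input.
padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)
print(image.shape, '->', padded.shape)   # (4, 4) -> (6, 6)
```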

Strides

(left) stride of 1 | (right) stride of 2 | from sci2lab

Strides refer to the movement of the sliding window. A stride of (1, 1) means the window moves along the height and width one row/column at a time; a stride of (2, 2) means it moves two rows/columns at a time. Increasing the stride decreases the size of the feature map.
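
Kernel size, padding, and stride together determine the feature-map size via a standard formula, sketched here:

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output size along one spatial dimension of a convolution."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(28, 3))                       # 26: no padding shrinks the map
print(conv_output_size(28, 3, padding=1))            # 28: padding preserves the size
print(conv_output_size(28, 3, padding=1, stride=2))  # 14: stride 2 halves it
```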

(left) feature map from block 1 | (middle) feature map from block 4 | (right) feature map from block 5 from machinelearningmastery

The purpose of convolutional layers, as mentioned previously, is to extract features or details from an image. A complete CNN will have many convolutional layers. As a general trend, shallower layers extract general shapes like lines and curves, while deeper layers extract more specific shapes, such as eyes.

Pooling Layer

Pooling layers are used to reduce the size of the feature maps. They summarize the features in a feature map and decrease the number of parameters needed to train the model, which lowers the computing power needed for training.

Similar to convolutional layers, pooling layers operate with a sliding window. Unlike convolutional layers, pooling layers do not take a dot product; they summarize each window with a single value.

There are quite a few different types of pooling.

  • Max Pooling
  • Average Pooling
Different pooling layers from bouvet

Max pooling is the most common type of pooling used. It returns the maximum value from the patch of the feature map in the sliding window.

Average pooling returns the average of all the values.

Other types of pooling layers include sum pooling and min pooling. They do as their names suggest.
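
All of these fit one sliding-window sketch; only the summarizing operation changes (a toy NumPy version of my own, not a library API):

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, op=np.max):
    """Slide a window over the map and summarize each patch with `op`.

    Swap op for np.mean, np.sum, or np.min to get average, sum,
    or min pooling.
    """
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i * stride:i * stride + size,
                                j * stride:j * stride + size]
            out[i, j] = op(patch)
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]])
print(pool2d(fm))              # max pooling -> [[6, 8], [3, 4]]
print(pool2d(fm, op=np.mean))  # average pooling
```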

The different pooling layers have their own advantages and disadvantages. Average pooling takes the mean of each patch, which smooths the feature map but can wash out sharp features.

On the other hand, max-pooling will mainly be interested in the stronger, sharper, and brighter features of the image.

More information on the differences between pooling layers is in this article.

Non-Linearity

Like most deep neural networks, a CNN needs non-linearity to model non-linear data; without it, a stack of layers would collapse into a single linear transformation. In most CNNs, either the ReLU or the leaky ReLU function is used as the non-linear activation.

ReLU, or Rectified Linear Unit, is defined by f(x) = max(0, x). The ReLU function returns x if the input is positive and 0 otherwise.

ReLU function graphed on desmos

Leaky ReLU is similar. It is defined by f(x) = max(0.01x, x).

It outputs 0.01x for values at or below 0, and x for values above 0.

leaky ReLU graphed on desmos

Leaky ReLU lets small negative values leak through. This can result in better performance in certain cases, and it fixes the dying ReLU problem, in which a neuron always receives negative input and therefore always outputs 0 (with a zero gradient), rendering it useless.

Most CNNs use either ReLU or Leaky ReLU instead of other functions like the sigmoid functions, because of their simplicity and therefore speed.
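
Both activations are one-liners in NumPy:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)               # f(x) = max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small slope for negative inputs

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(leaky_relu(x))  # [-0.02 -0.005 0. 0.5 2.]
```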

Dropout Layers

Dropout is a regularization technique used to avoid overfitting. Overfitting is when a network has memorized the training examples and is unable to generalize to new inputs. Dropout layers work by randomly ignoring certain neurons during training. Because any neuron might disappear, subsequent neurons are forced not to rely on any specific one. This prevents the network from learning redundant, example-specific details from the training set and thus prevents overfitting.
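
A minimal sketch of “inverted” dropout, the variant most frameworks implement (the rescaling keeps expected activations unchanged, so nothing special is needed at inference time):

```python
import numpy as np

def dropout(activations, rate=0.5, training=True):
    """Randomly zero out a fraction `rate` of neurons during training."""
    if not training:
        return activations                     # dropout is disabled at inference
    mask = np.random.rand(*activations.shape) > rate
    return activations * mask / (1.0 - rate)   # rescale the survivors
```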

Fully Connected Layer

fully connected layer from missinglink

Finally, there is the fully connected layer. This is where the classification happens.

flattening matrix from rubikscode

The feature maps are first flattened into a single vector, normally by simply reshaping the matrices into a 1D vector. This vector is then fed into a fully connected neural network whose number of outputs corresponds to the number of classes. An activation function like sigmoid or softmax is applied to the output.
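
A toy NumPy version of this classification head (the weights here are random for illustration; a real network learns them during training):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # subtract max for numerical stability
    return e / e.sum()

feature_maps = np.random.rand(8, 4, 4)  # 8 maps from the last pooling layer
flat = feature_maps.reshape(-1)         # flatten to a 128-value vector

W = np.random.randn(3, 128) * 0.01      # fully connected layer: 128 -> 3 classes
b = np.zeros(3)
probs = softmax(W @ flat + b)
print(probs, probs.sum())               # class probabilities summing to 1
```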

Though fully connected layers are the most popular method to classify images, some networks use global average pooling instead.

Global pooling from alexisbcook

Global average pooling (GAP) is essentially the average pooling layer mentioned previously, but with the pool size set to the size of the feature map. This means that the output is the average of all the values in the feature map.
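
In code, GAP is just a mean over each map’s spatial dimensions:

```python
import numpy as np

feature_maps = np.random.rand(10, 7, 7)  # one 7x7 feature map per class

gap = feature_maps.mean(axis=(1, 2))     # average each map down to one number
print(gap.shape)                         # (10,)
```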

To replace the output nodes of the fully connected layers, this method generates one feature map for each class in the dataset. GAP is applied to each map and the outputs are fed into an activation function. Compared to traditional fully connected layers, GAP reinforces the correspondence between feature maps and categories and follows the convolutional structure more closely. Additionally, by removing the fully connected network, overfitting is avoided at this step of the CNN.

Another approach to classification is a mixture of GAP and fully connected layers. In this method, GAP simply replaces the flatten layer of the traditional approach, and the output of the pooling layer is fed into a dense layer. This method is used in ResNet-50.
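
A sketch of that hybrid head in Keras, assuming ResNet-50’s usual 7×7×2048 final feature maps (the class count of 1000 matches ImageNet):

```python
import tensorflow as tf

head = tf.keras.Sequential([
    # GAP replaces the flatten layer...
    tf.keras.layers.GlobalAveragePooling2D(input_shape=(7, 7, 2048)),
    # ...and its output is fed into a dense layer for classification.
    tf.keras.layers.Dense(1000, activation='softmax'),
])
```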

All Together

When all these layers are combined, a convolutional neural network is formed (a code sketch of the full pipeline follows the figure below):

  • an image is fed into the first convolutional layer
  • one or more filters are applied
  • the ReLU function is applied to the values in the feature maps
  • pooling is performed on the feature maps to reduce their size
  • the convolution + ReLU + pooling block is repeated as many times as needed
  • the output is flattened
  • the flattened values enter the fully connected layers
  • the network makes its prediction
CNN architecture from Nvidia
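
Here is a minimal Keras sketch of this recipe (layer counts and sizes are illustrative, chosen for something like 28×28 grayscale digits):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # convolution + ReLU + pooling block 1
    tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu',
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2),
    # convolution + ReLU + pooling block 2
    tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(2),
    # classification head: flatten, regularize, predict
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.summary()
```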

Check out these links to see a convolutional neural network in action.
