In a previous article I introduced Convolutional Neural Networks (CNN) in an intuitive way. In this article I will present a few of the modern CNN architectures.
Convolutional networks have shown exceptional results in image classification for more than a decade. Here we will review some of the standard CNN architectures and the context in which they were presented, before discussing more recent developments.
I'm not trying to be exhaustive in this post, as there are many architectures and variations that made an impact. I will only introduce the best-known ones.
LeNet-5, introduced by LeCun et al. in 1998, is one of the very first implementations of a CNN, and it showed impressive results on the handwritten digit recognition problem. The following figure shows the architecture as it appeared in the original article.
LeCun demonstrated the concepts of convolution, of increasing the number of filters/channels with depth, and of fully connected layers with a cost function to backpropagate errors, ideas that are now the backbone of all CNN frameworks.
What stands out
- LeNet-5 used the tanh activation function rather than ReLU.
- It used Euclidean radial basis function (RBF) units in the output layer rather than softmax.
- A sigmoid non-linearity was applied after the pooling (subsampling) layers.
- It has about 60K parameters.
- It did not use padding.
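The ~60K figure can be checked by tallying the layer shapes from the original paper. The sketch below treats the C3 layer as fully connected across input maps for simplicity (the original uses a sparse connection table, which brings the true total down slightly), so it is an estimate:

```python
# Rough parameter tally for a LeNet-5-like architecture.
# Assumes dense connections in C3 (the paper uses a sparse table),
# so this slightly overestimates the original's count.

def conv_params(k, c_in, c_out):
    """Weights plus one bias per output channel for a k x k convolution."""
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

layers = {
    "C1 (5x5, 1->6)":    conv_params(5, 1, 6),
    "C3 (5x5, 6->16)":   conv_params(5, 6, 16),
    "C5 (5x5, 16->120)": conv_params(5, 16, 120),
    "F6 (120->84)":      fc_params(120, 84),
}

total = sum(layers.values())
for name, n in layers.items():
    print(f"{name}: {n:,}")
print(f"total: {total:,}")  # about 60K
```

The subsampling layers contribute only a handful of trainable coefficients each, so they are omitted here.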
AlexNet, by Krizhevsky et al., had a large impact on the field of machine learning, specifically in the application of deep learning to machine vision. It famously won the ImageNet LSVRC-2012 competition by a large margin, with a top-5 error rate of 15.3% versus 26.2% for the second-place entry. The network has a very similar architecture to LeNet-5 but with more layers and filters, resulting in a larger network and more parameters to learn. This work showed that features can be learned rather than generated manually through feature engineering.
What stands out
- The ReLU activation function is used instead of tanh to add non-linearity; the authors report it trained about six times faster while reaching the same accuracy.
- Dropout is used, rather than conventional regularisation alone, to deal with overfitting.
- Overlapping pooling is used to downsample the feature maps; the paper reports it slightly reduced error rates.
- The number of parameters is approximately 62 million.
- Training was split across 2 GPUs to speed up the computations.
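The ~62 million figure can be approximated from the published layer sizes. The sketch below ignores the two-GPU split of the original (which halves the input depth of some convolutions), so it is an estimate rather than an exact replication:

```python
# Approximate parameter count for AlexNet from the published layer sizes.
# The original's two-GPU split (halving some input depths) is ignored,
# so this is an estimate, not an exact replication.

def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

total = (
    conv_params(11, 3, 96)          # conv1
    + conv_params(5, 96, 256)       # conv2
    + conv_params(3, 256, 384)      # conv3
    + conv_params(3, 384, 384)      # conv4
    + conv_params(3, 384, 256)      # conv5
    + fc_params(6 * 6 * 256, 4096)  # fc6: flattened 6x6x256 feature map
    + fc_params(4096, 4096)         # fc7
    + fc_params(4096, 1000)         # fc8: 1000 ImageNet classes
)
print(f"approximate AlexNet parameters: {total:,}")  # about 62 million
```

Note how the fully connected layers dominate: fc6 alone accounts for well over half of the parameters.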
VGG is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford. The model achieves 92.7% top-5 accuracy on ImageNet, a dataset of over 14 million images belonging to 1000 classes. It is known for its uniform design and has been very successful in many domains. The uniformity of having every convolution be 3 × 3 with stride s = 1 and every max pooling be 2 × 2, with the number of channels doubling from 64 up to 512, makes it very appealing and easy to set up.
VGG-16 has two fully connected layers at the end, followed by a softmax layer for classification.
What stands out
- Uniform design
- Stacked 3 × 3 convolutions cover the receptive field of a larger filter with fewer parameters and computations.
- About 140 million parameters; even with a GPU setup, a long time is required to train the model.
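Because of the uniform design, the entire VGG-16 configuration fits in a single list, and the parameter count follows directly from it (assuming the standard 224 × 224 RGB input, so the final feature map is 7 × 7 × 512):

```python
# VGG-16: the uniform design means the whole network is just this list.
# Numbers are channel widths; "M" marks a 2x2 max-pooling layer.
cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
       512, 512, 512, "M", 512, 512, 512, "M"]

total, c_in = 0, 3  # RGB input
for v in cfg:
    if v == "M":
        continue  # pooling layers have no parameters
    total += 3 * 3 * c_in * v + v  # 3x3 conv weights + biases
    c_in = v

# Fully connected head: 7x7x512 flattened -> 4096 -> 4096 -> 1000 classes.
for n_in, n_out in [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]:
    total += n_in * n_out + n_out

print(f"VGG-16 parameters: {total:,}")  # about 138 million
```

Again the fully connected head dominates the count, which is one reason later architectures replaced it with pooling.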
Modern CNN architectures
In the NLP context, applying a basic CNN to sentences with a convolution filter of size k is analogous to an n-gram token detector in classic NLP settings. The idea of a hierarchical CNN is to extend this principle by adding more layers. In doing so, the receptive field grows, and a larger window of words or context is captured as features, as shown in the following figure.
If we generalise the stride of the stacked CNN to size s, we get a dilated CNN. In this regard, a stacked CNN becomes a special case of the dilated CNN.
A dilated CNN increases the receptive field exponentially with respect to the number of layers. Another approach is to keep the stride constant at s = 1 but shorten the sequence at each layer using local pooling, taking the maximum or average over each window.
The following figure shows how, by progressively increasing the dilation from layer to layer, the receptive field can grow exponentially to cover a larger span at every layer.
- A dilated CNN helps capture the structure of sentences over longer spans of text and is effective at capturing context.
- A dilated CNN can have fewer parameters, which speeds up training while capturing more context.
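The exponential growth can be checked in a few lines. Each convolution of width k with dilation d widens the receptive field by (k − 1) · d, so doubling the dilation at every layer gives exponential growth, while a plain stacked CNN (dilation 1 everywhere) grows only linearly:

```python
# Receptive field of a stack of 1-D convolutions with kernel size k:
# each layer with dilation d adds (k - 1) * d to the receptive field.

def receptive_field(k, dilations):
    r = 1
    for d in dilations:
        r += (k - 1) * d
    return r

k, n_layers = 3, 6
stacked = receptive_field(k, [1] * n_layers)                    # dilation 1 everywhere
dilated = receptive_field(k, [2 ** i for i in range(n_layers)]) # dilations 1, 2, 4, ...

print(f"stacked CNN, {n_layers} layers: receptive field {stacked}")  # grows linearly
print(f"dilated CNN, {n_layers} layers: receptive field {dilated}")  # grows exponentially
```

With six layers of width-3 filters, the stacked CNN sees 13 tokens while the dilated CNN sees 127, using the same number of parameters per layer.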
Inception networks by Szegedy et al. are among the best-performing CNNs in computer vision.
The "Inception" micro-architecture was first introduced by Szegedy et al. in their 2014 paper, Going Deeper with Convolutions. The following figure shows the Inception module as used in the first incarnation of this architecture, GoogLeNet.
In this module, the results of the various convolutions are concatenated at the output and form the input of the next layer.
The goal of the Inception module is to act as a "multi-level feature extractor" by computing 1×1, 3×3, and 5×5 convolutions within the same module of the network; the outputs of these filters are then stacked along the channel dimension before being fed into the next layer of the network.
This repeated block, also called the Inception block, lies at the heart of the Inception network. The central idea behind using 1 × 1 filters is to reduce the volume, and hence the computation, before feeding it to a larger filter such as 3 × 3.
The above figure shows an example: a 28×28×192 output from a previous layer convolved with 128 filters of size 3×3×192 gives an output of 28×28×128, which costs about 173 million multiply-accumulate operations (MACs). If we first apply 96 filters of size 1×1×192 to reduce the volume and then convolve with 128 filters of size 3×3×96, the total drops to approximately 101 million MACs, a saving of roughly 40%.
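The MAC arithmetic is easy to reproduce; with a 96-channel 1×1 bottleneck, the saving works out to about 42%:

```python
# MAC (multiply-accumulate) counts for the Inception bottleneck example:
# a 28x28x192 input volume, producing 128 channels via a 3x3 convolution.

def conv_macs(h, w, c_in, c_out, k):
    """Each of the h*w*c_out outputs needs k*k*c_in multiply-accumulates."""
    return h * w * c_out * k * k * c_in

# Direct 3x3 convolution on the full 192-channel volume.
direct = conv_macs(28, 28, 192, 128, 3)

# Bottleneck: 1x1 convolution down to 96 channels, then the 3x3 convolution.
bottleneck = conv_macs(28, 28, 192, 96, 1) + conv_macs(28, 28, 96, 128, 3)

saving = 1 - bottleneck / direct
print(f"direct:     {direct:,} MACs")      # ~173 million
print(f"bottleneck: {bottleneck:,} MACs")  # ~101 million
print(f"saving:     {saving:.0%}")
```

The 1×1 stage itself is cheap (about 14 million MACs); almost all of the remaining cost is the 3×3 convolution on the thinner 96-channel volume.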
- The 1 × 1 convolution block plays an important role in reducing the volume before a larger-filter convolution.
- The Inception block allows multiple filter sizes to be learned in parallel, removing the need to commit to a single filter size.