The task of object recognition is a striking example of the success of AI and deep learning. For a human being, object recognition is completely trivial: when we are shown a picture of a common object, we can easily name it. However, writing a computer program to perform the same task to a human-level standard was traditionally seen as impossible.
This all changed in the 2010s when amazing breakthroughs were made in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) – a competition where computer programs must classify images into one of 1000 categories. The breakthrough was made using an algorithm called a Convolutional Neural Network (CNN). In 2011, a good ILSVRC classification error rate was 25%. In 2012, a CNN called AlexNet achieved an error rate of 16% [1]. In the next couple of years, error rates fell to a few percent, and then in 2015, researchers at Microsoft reported that their CNNs exceeded human ability at ILSVRC [2]!
Today, CNNs are being applied in many different industries. In healthcare, CNNs are used to classify images of skin lesions as benign lesions or malignant skin cancers [3]. In the retail industry, Amazon’s walk-in stores use CNNs to detect when an item is removed from a shelf. The item is then billed to the shopper automatically when they exit the store, without the shopper having to go to a checkout [4]. In the transportation industry, CNNs are used by driverless cars to interpret the video input of the road [5].
As you can see from the above examples, CNNs are having a revolutionary effect on the world. In the rest of this article, we will explore some of the details of how they work.
As mentioned previously, object recognition is trivial for human beings. For that reason, researchers turned to biology for inspiration when they were designing computer vision algorithms.
In 1962, Hubel and Wiesel performed an experiment where they inserted electrodes into specific parts of the visual cortex of a cat and measured the activation of neurons when the cat saw some basic shapes [6]. They identified two types of neurons: simple cells, whose output is maximised by straight edges at particular orientations within their receptive field, and complex cells, which have larger receptive fields and combine the outputs of the simple cells. They also discovered that neighbouring cells have similar and overlapping receptive fields.
The discoveries of Hubel and Wiesel directly inspired today’s CNNs which are made up of a hierarchy of layers of neurons, where each layer uses the same receptive field. The first layers will detect low level features such as oriented line segments, deeper layers will detect higher level features such as corners or closed loops, and the deepest layers might detect high level features such as an eye, nose or mouth. The point to remember here is that CNNs would not have been possible without initial findings from the world of biology.
Now that we know about the recent success of CNNs and their biological inspiration, let’s try to understand some of the technical details. First, we must understand how images are passed to a computer program. Fortunately, this is quite easy to understand: if we pass a picture of dimension m by n pixels to a computer program, the computer sees a 3-dimensional array of dimensions m, n and 3. The 3 refers to red, green and blue values and each entry in the array will be a number between 0 and 255, representing the intensity of the three colours at each pixel.
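As a minimal sketch, here is how a tiny 2 by 2 pixel image might look to a Python program, with nested lists standing in for the 3-dimensional array (real programs would use a numerical library such as NumPy):

```python
# A tiny 2 by 2 pixel "image": a 3-dimensional array of shape (m, n, 3),
# holding one red, green and blue intensity (0-255) per pixel.
image = [
    [[255, 0, 0], [0, 255, 0]],      # row 0: a red pixel, a green pixel
    [[0, 0, 255], [255, 255, 255]],  # row 1: a blue pixel, a white pixel
]

m = len(image)               # image height in pixels
n = len(image[0])            # image width in pixels
channels = len(image[0][0])  # 3 colour channels: red, green and blue
print(m, n, channels)        # -> 2 2 3
```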
The first layer in a CNN is a Convolutional Layer. The layer is made up of several slices. Each neuron in each slice has a receptive field, say of size 5 by 5. This means that it is connected to a 5 by 5 square of pixels in the original image. The neuron takes this 5 by 5 square (really a 5 by 5 by 3 cube, since we have three colour channels) and outputs a single number, representing its activation level. We do this by choosing a 5 by 5 by 3 cube of weights and taking the dot product of these weights with the corresponding 5 by 5 by 3 cube of pixel values. The first neuron operates on the top-left square of pixels. We then slide the receptive field across the whole image one pixel at a time, sending each output to a new neuron. It is important to note that every neuron in the slice uses the same set of weights, and that their receptive fields overlap – just as Hubel and Wiesel discovered that neighbouring cells in the visual cortex have similar and overlapping receptive fields. On the other hand, each slice in the layer has a different set of weights, so that each slice recognises a different low-level feature.
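The sliding dot product described above can be sketched in pure Python. This is only an illustration of a single slice with a stride of one pixel and no padding – the function name is made up, and real CNN libraries use heavily optimised implementations:

```python
def convolve_slice(image, weights):
    """Compute one slice of a convolutional layer.

    `image` is an m x n x 3 nested list of pixel values; `weights` is a
    5 x 5 x 3 nested list shared by every neuron in the slice.  The
    receptive field slides over the image one pixel at a time, and each
    position produces one activation: the dot product of the weight cube
    with the cube of pixels underneath it.
    """
    f = len(weights)                    # receptive field size (here 5)
    m, n = len(image), len(image[0])
    activations = []
    for i in range(m - f + 1):          # row of the field's top-left corner
        row = []
        for j in range(n - f + 1):      # column of the top-left corner
            total = 0.0
            for di in range(f):
                for dj in range(f):
                    for c in range(3):  # red, green, blue
                        total += weights[di][dj][c] * image[i + di][j + dj][c]
            row.append(total)
        activations.append(row)
    return activations
```

Every position of the receptive field uses the same `weights` – exactly the weight sharing described above – so a 6 by 6 image, for example, yields a 2 by 2 slice of activations.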
We then stack Convolutional Layers together, passing the output of one layer as the input of the next (in fact, CNNs also contain other types of layers, such as Pooling Layers, which we won’t go into here). Thus, our CNN has an intrinsically hierarchical nature, just as Hubel and Wiesel observed in the visual cortex. Once the image data has passed through the CNN, we end up with an array of numbers representing the high-level features in the image.
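One detail worth working out is how the array changes shape as it passes through the stack. With a receptive field of size f, a stride of one pixel and no padding, an m by n input shrinks to (m - f + 1) by (n - f + 1), and the depth of the output equals the number of slices in the layer. A small sketch, with made-up slice counts:

```python
def conv_output_shape(m, n, f, num_slices):
    # Valid convolution with stride 1: each spatial dimension shrinks
    # by f - 1, and the depth becomes the number of slices in the layer.
    return (m - f + 1, n - f + 1, num_slices)

shape = (32, 32, 3)            # a small 32 by 32 colour image
for num_slices in (8, 16):     # two stacked 5 by 5 convolutional layers
    shape = conv_output_shape(shape[0], shape[1], 5, num_slices)
    print(shape)               # -> (28, 28, 8), then (24, 24, 16)
```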
Once we have this final array, we can pass it to a fully-connected layer, which outputs the probability that the image belongs to each category (typically by applying a softmax function to the final scores). We then train the whole CNN on our training data using back-propagation, just as we would a feed-forward neural network.
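As a rough sketch of that final step, a fully-connected layer computes one score per category (a dot product plus a bias) and then normalises the scores into probabilities, commonly with a softmax function; the numbers below are invented purely for illustration:

```python
import math

def fully_connected_softmax(features, weights, biases):
    """Map a flattened feature vector to one probability per category.

    `weights` holds one row of len(features) numbers per category.
    """
    # One raw score per category: dot product with that row, plus a bias.
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    # Softmax turns scores into probabilities; subtracting the largest
    # score first keeps the exponentials numerically stable.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: 4 high-level features, 3 categories.
features = [0.5, 1.2, -0.3, 2.0]
weights = [[0.1, 0.4, 0.0, 0.2],
           [0.3, -0.2, 0.5, 0.1],
           [-0.1, 0.2, 0.3, 0.4]]
biases = [0.0, 0.1, -0.2]
probs = fully_connected_softmax(features, weights, biases)
# probs is a list of 3 positive numbers summing to 1.
```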
We have seen that CNNs, inspired by biology, have facilitated amazing improvements in computer vision in recent years, leading to some applications which are in the process of revolutionising various industries. We have also given a brief technical overview which will hopefully inspire some of you to dive deep into the details.