One reason artificial intelligence and neural nets became so popular is probably Professor Fei Fei Li's CS231n visual recognition open course at Stanford. Along with many other influential figures such as Andrew Ng, she attracted lots of attention to deep learning. Whether it is detecting cancerous tissue in medical images, adding cute animal ears on top of selfies, or driving a car down the street, Computer Vision (CV) has painted us a bright future where machines can make our lives more convenient in many ways.
In this article, let us first learn about some of the history of CV. Then we will go over some rudimentary CV techniques such as line and edge detection. After we build up the basics and the intuitions, we will go into the details of how modern CV models work. After a brief introduction to deep learning, we will cover the fundamentals of convolutional neural networks in more detail. In the end, we will go over some of today's state-of-the-art algorithms.
Computer Vision, a Brief History
“If we want machines to think, we need to teach them to see.”
— Fei Fei Li
Branching off from artificial intelligence, research in the field of CV began around the 1960s. One of the earliest relevant papers may be Receptive Fields of Single Neurons in the Cat's Striate Cortex by Hubel and Wiesel in 1959, where they inserted electrodes into a cat's brain and observed its responses as they changed what the cat saw.
The two published another paper in 1962, presenting a more detailed investigation of how the cat's brain processes visual information. Still, early research following this neuroscience-inspired approach progressed slowly. As I mentioned in one of my previous articles, the neural-network-based approach went through a dark time until the 2000s.
There were other advances in the field. The Hough transform, for example, named after Paul Hough's 1962 patent, is now widely used in fields such as autonomous driving. The edge detector proposed by John Canny in 1986 is still used quite a lot, and edge detection is applied in many different fields such as face recognition.
When I was studying at Queen's University in Canada (where Elon Musk also went), Professor Greenspan demonstrated their group's progress on object permanence for machines. Canadians have contributed a lot to the field of AI research, including the well-known CIFAR.
Another notable mention before we go into deep learning would be Lena Forsén. If you are in any field related to digital image processing, chances are you have seen her picture somewhere. Lawrence Roberts is said to have been the first to use a photo scanned from a Playboy magazine, in his MIT master's thesis in the early 1960s; the Lenna photo itself comes from a 1972 issue of the magazine.
The picture then somehow became a standard test image in the field of image processing. Playboy considered enforcing its rights over the picture, but kindly let the matter go once it realized the image was being used for research and education.
Now let us get back to neural networks. Inspired by Hubel and Wiesel’s work, Kunihiko Fukushima invented the “Neocognitron” in 1980. Neocognitron was arguably the earliest version of a convolutional neural network. The network was used to recognize hand-written characters.
Yann LeCun, famous for his work on convolutional neural networks, applied backpropagation to them in 1989. He then published LeNet-5 in 1998, together with gradient-based learning algorithms.
The turning point for neural networks came in 2012, when Alex Krizhevsky and his AlexNet won the ImageNet competition on September 30, demonstrating the superior performance of approaches based on convolutional neural networks.
The leading scientist and principal investigator of ImageNet, Professor Fei Fei Li, had been working on the idea of ImageNet since 2006. She traveled to conferences and gave talks around the world, drawing others into the field of computer vision.
Another notable mention would probably be Joseph Redmon, who invented the YOLO nets in 2016. The name probably nods to the popular internet slang "You Only Live Once," which urges people to live life to the fullest and is generally used when young people are about to do something reckless. In the paper, YOLO stands for "You Only Look Once," a method that offers fast object detection based on neural networks.
Redmon also lists himself as a pony in his résumé, a reference to "My Little Pony," an American franchise aimed at little girls.
Advancements in computer vision are making our world better in many different ways. It is time for us to get into how those algorithms work!
Computer Vision: The Basics, and Before Neural Networks
Since the days of the Cathode Ray Tube (CRT) television, screens have displayed images by adjusting the brightness of three color channels: Red (R), Green (G), and Blue (B). Although screens nowadays use more advanced hardware such as Liquid Crystal Displays (LCD) or Organic Light-Emitting Diodes (OLED), images are usually still stored and transmitted digitally in some form of RGB table. Bitmaps, for example, store images as arrays of hexadecimal values ranging from 0x000000 to 0xFFFFFF, each representing three numbers ranging from 0 to 255.
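As a quick sketch of the last point (plain Python, no libraries needed), here is how one of those packed 24-bit hexadecimal values splits into its three channels:

```python
def unpack_rgb(pixel: int) -> tuple:
    """Split a packed 24-bit value (0x000000 to 0xFFFFFF) into R, G, B channels."""
    r = (pixel >> 16) & 0xFF  # top 8 bits
    g = (pixel >> 8) & 0xFF   # middle 8 bits
    b = pixel & 0xFF          # bottom 8 bits
    return r, g, b

print(unpack_rgb(0xFFFFFF))  # pure white -> (255, 255, 255)
print(unpack_rgb(0x00FF00))  # pure green -> (0, 255, 0)
```

Each channel is one byte, which is exactly why the three numbers range from 0 to 255.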
Some image formats store images differently due to compression or other conventions, but they can usually be converted back into arrays of RGB values. Mathematically, however, the numerical difference between two RGB values does not reflect how different the colors look to human perception. The Commission Internationale de l'Éclairage (CIE) therefore came up with the ΔE metric, which represents color difference more accurately in perceptual terms.
How computers store and process graphical information is another immense field, which we can get deeper into some other time in another article.
Edge Detection and Feature Extraction
“A visual image is built by our brain’s ability to package groups of pixels together in the form of edges.”
— Jill Bolte Taylor
It would be difficult to analyze an image outright, as the pixel arrays can be complex and noisy. This is why researchers usually extract features such as edges and lines first; these features can be represented by much simpler numerical values or relationships. There are many ways to detect edges, such as taking derivatives or Canny's method. In this article, we will briefly go over Sobel's method.
Edges are essentially regions of high difference between neighboring pixels, so most edge detection methods try to extract the regions where those differences are large. Sobel's method detects edges by conducting convolutions (indicated by the operator *) on 3×3 regions of the image with two special kernels. The kernels yield the horizontal and vertical components of the gradient, which can be combined into a vector representing the direction and strength of an edge.
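Here is a minimal NumPy sketch of that idea, with a naive loop-based convolution written out for clarity (real code would reach for scipy.signal.convolve2d or OpenCV instead):

```python
import numpy as np

# The two Sobel kernels: horizontal (KX) and vertical (KY) gradient estimators.
KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)
KY = KX.T

def convolve2d(image, kernel):
    """Naive 'valid' 2D convolution (kernel flipped, matching the * operator)."""
    k = np.flipud(np.fliplr(kernel))
    h, w = image.shape
    kh, kw = k.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

def sobel_magnitude(image):
    gx = convolve2d(image, KX)  # horizontal gradient component
    gy = convolve2d(image, KY)  # vertical gradient component
    return np.hypot(gx, gy)    # edge strength; np.arctan2(gy, gx) would give direction
```

Running `sobel_magnitude` on an image with a sharp vertical boundary produces large values exactly along that boundary and zeros in the flat regions.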
More details on the math can be found on Wikipedia and are explained in the video below. The author has also made videos on other image-processing algorithms such as the Canny edge detector and blur filters.
Finding the Edges (Sobel Operator), Video From Youtube
Hough Transform and Autonomous Driving
“In Soviet Russia, car drives you!”
— Russian Reversal
Imagine we are engineers working for an automobile company that wishes to build its own self-driving cars. At some point, we will have to teach the car to stay inside its lane on the road. Otherwise, our car will drive like that Asian woman driver in Family Guy.
If the lane lines were continuous, it would be easy: we could make the car move back a little whenever it gets too close to the line on either side. But we know that in most places, California for example, the lines on the street are dashed. If the car only detects its proximity to the lines, it would probably go wild in the gaps between the dashes.
Thanks to the Hough transform, we are able to reconstruct the straight line from the dashed ones. The Hough transform converts a line in x-y space into a point in m-c space, while a point in x-y space is converted into a line in m-c space. The m-c space also has a useful property: all the lines that intersect at the same point in m-c space correspond to points lying on the same line in x-y space. For more about line detection with the Hough transform, here is a paper.
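As a toy sketch of that voting idea in the m-c parameterization described above (the function name and grid resolution here are made up for illustration; production code typically uses the ρ-θ form, e.g. OpenCV's cv2.HoughLines, to handle vertical lines):

```python
import numpy as np

def hough_mc(points, m_values, c_values):
    """Vote in a discretized m-c accumulator: each (x, y) point votes for
    every (slope m, intercept c) cell whose line y = m*x + c passes near it."""
    acc = np.zeros((len(m_values), len(c_values)), dtype=int)
    step = c_values[1] - c_values[0]
    for x, y in points:
        for i, m in enumerate(m_values):
            c = y - m * x                        # the point's line in m-c space
            j = np.argmin(np.abs(c_values - c))  # nearest intercept bin
            if abs(c_values[j] - c) < step:
                acc[i, j] += 1
    return acc

# Points sampled from dashes of the line y = 2x + 1, with gaps in between.
pts = [(0, 1), (1, 3), (2, 5), (5, 11), (6, 13)]
ms = np.linspace(-5, 5, 101)    # candidate slopes
cs = np.linspace(-10, 10, 201)  # candidate intercepts
acc = hough_mc(pts, ms, cs)
i, j = np.unravel_index(acc.argmax(), acc.shape)
print(ms[i], cs[j])  # recovers m ≈ 2, c ≈ 1 despite the gaps
```

The cell that collects votes from every point corresponds to the one straight line passing through all the dashes, which is exactly how the gaps stop mattering.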
Advancing to Deep Learning: Convolutional Neural Networks & Residual Neural Networks
“No one tells a child how to see, especially in the early years. They learn this through real-world experiences and examples.”
— Fei Fei Li
The phrase "Machine Learning" was coined by Arthur Samuel in the 1950s. The earliest version of an artificial neuron dates back even further, to 1943, when neurophysiologist Warren McCulloch and mathematician Walter Pitts invented it to describe how neurons work inside our brains. Even so, research in neural networks was heavily discouraged until the 2000s.
Convolutional Neural Networks (CNNs) became a star after the 2012 ImageNet contest. Since then, neural-network-based models have come to dominate newly published papers in practically every field they can be applied to. There were many other machine learning algorithms, such as k-NN and Random Forest, but they were outmatched by the performance of CNNs in image processing.
Deep Convolutional Neural Networks
Whenever convolutional neural networks are mentioned, the first person who comes to a data scientist's mind is probably Yann LeCun, who published his paper on LeNet-5 in 1998.
Typically, a CNN is constructed from four types of layers. I mentioned three of them in my AlphaGo article, as those are the more commonly seen ones; researchers may also add little twists in their papers and invent new types of layers. The four layers are the convolutional layer, the pooling layer, the Rectified Linear Unit (ReLU) layer, and the fully connected layer.
1. Convolutional Layer
The convolutional layer usually appears as the first layer of a convolutional neural network. Layers of this type scan across the source layer with a filter, writing each weighted sum into the destination layer. Some filters are good at detecting edges, and others are good at other tasks. The process of convolution extracts spatial information from the 2D image and passes it to the next layer. More details on different kinds of convolutional filters and their applications in Computer Vision can be found here.
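The scanning process can be sketched as below, with a small bank of filters producing one feature map each (a naive loop for clarity; the two filters are illustrative examples, and note that deep-learning libraries actually compute cross-correlation, skipping the kernel flip):

```python
import numpy as np

def conv_layer(image, filters):
    """Apply a bank of k×k filters to a 2D input, producing one feature map
    per filter (stride 1, no padding, no kernel flip)."""
    n, k = filters.shape[0], filters.shape[-1]
    h, w = image.shape
    maps = np.zeros((n, h - k + 1, w - k + 1))
    for f in range(n):
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                maps[f, i, j] = np.sum(image[i:i + k, j:j + k] * filters[f])
    return maps

# Two illustrative 3×3 filters: a vertical-edge detector and an averaging blur.
bank = np.array([
    [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
], dtype=float)
bank[1] /= 9.0

image = np.arange(25, dtype=float).reshape(5, 5)
print(conv_layer(image, bank).shape)  # (2, 3, 3): two 3x3 feature maps
```

Each output channel is a map of how strongly its filter's pattern appears at every position, which is the "spatial information" passed on to the next layer.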
2. Pooling Layer
The pooling layer iterates over the source layer and selects a specific value inside each bounded region, typically the maximum, minimum, or average within the region. Reducing information to a smaller size like this is also called "downsampling" or "subsampling".
When computational resources were limited, networks that preprocessed the input with convolutional and pooling layers were much more resource-friendly than feeding the pixels directly into the nodes of a multi-layer perceptron. In special cases, "upsampling" techniques can also be used to generate more data.
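The downsampling described above can be sketched in a few NumPy lines (assuming, for simplicity, that the window size divides the input evenly):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Max pooling with stride equal to the window size: keep the strongest
    activation in each region, shrinking width and height by `size`."""
    h, w = feature_map.shape
    h2, w2 = h // size, w // size
    cropped = feature_map[:h2 * size, :w2 * size]        # drop any ragged edge
    blocks = cropped.reshape(h2, size, w2, size)          # group into windows
    return blocks.max(axis=(1, 3))                        # max within each window

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]], dtype=float)
print(max_pool(fm))  # max of each 2x2 block: [[4, 2], [2, 8]]
```

Swapping `.max` for `.mean` or `.min` gives the average- and min-pooling variants mentioned above.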
3. ReLU Layer
The ReLU layer feeds the values into a ReLU function, which simply zeroes out the negative values. ReLU is a popular activation function in neural networks because it reduces the likelihood of the vanishing gradient problem.
4. Fully Connected Layer
The fully connected layer is essentially a multi-layer perceptron: each node computes a weighted sum of all its inputs, and the final layer's scores are often passed through a "softmax" function to produce class probabilities. I have explained more about the multi-layer perceptron, including feedforward and backpropagation, in my AlphaGo article.
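As a sketch of this final stage (the weights and feature values below are made up purely for illustration):

```python
import numpy as np

def fully_connected(x, W, b):
    """Each output node is a weighted sum of every input, plus a bias."""
    return W @ x + b

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

x = np.array([0.5, -1.2, 3.0])        # flattened features from earlier layers
W = np.array([[0.2, 0.1, 0.4],        # made-up weights: 2 classes x 3 features
              [-0.3, 0.5, 0.1]])
b = np.array([0.0, 0.1])

probs = softmax(fully_connected(x, W, b))
print(probs.sum())  # ≈ 1.0: a proper probability distribution over the classes
```

Training adjusts `W` and `b` via backpropagation so that the correct class ends up with the highest probability.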
Deep Residual Neural Networks
In 2015, Kaiming He and his team at Microsoft Research published their paper Deep Residual Learning for Image Recognition. The paper applied residual learning to convolutional neural networks and achieved better performance than all other State-of-the-Art (SOTA) models at that time.
One key concept of residual learning is the use of "shortcut connections": extra connections between layers that let information pass forward by skipping the layers in between. Similar skipping structures have been observed in studies of the human brain. A closely related architecture with gated skip connections is the Highway Network.
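A minimal sketch of the shortcut idea, with plain weight matrices standing in for the convolutional layers of the real ResNet:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Compute F(x) + x: the weight layers only need to learn the *residual*
    correction, and the identity shortcut lets signals (and gradients) skip them."""
    fx = W2 @ relu(W1 @ x)   # F(x): two weight layers stand in for the convs
    return relu(fx + x)      # identity shortcut added before the final ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = np.zeros((4, 4))  # degenerate weights: F(x) = 0 ...
W2 = np.zeros((4, 4))
print(np.allclose(residual_block(x, W1, W2), relu(x)))  # ... so the block passes x through
```

The zero-weight case shows the design point: a residual block can trivially represent the identity mapping, so stacking many of them does not degrade the network the way stacking plain layers can.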
The network also achieved better performance on the CIFAR-10 dataset.
Computer Vision Today: Semantic Segmentation, Object Detection, and Image Classification
“Big brother is watching you!”
— 1984, George Orwell
Along with the advancements in computer vision, artificial intelligence nowadays can do many crazy things. Out of concern over privacy invasion, the European Union is even considering banning facial recognition, while other countries, such as China, have already embedded facial recognition into their social credit systems.
Everything has a trade-off. With the advance of electronic payments and cameras on the street, I can assure you that there is far less pickpocketing on the streets of Shanghai than there used to be. There is also much less crime in Shanghai compared to Los Angeles.
When we talk about computer vision, what comes to mind might be labeled images in bounding boxes: objects are first outlined through a region proposal process, and then whatever is inside each region is classified. There are three popular fields of research: Semantic Segmentation, Object Detection, and Image Classification.
Semantic Segmentation: CASENet and Mask R-CNN
Semantic Segmentation is highly related to the detection of boundaries and edges. CASENet, for example, outlines the boundaries of the objects.
Object Detection: YOLOv1, YOLOv2, and YOLOv3
“You only look once!”
— Joseph Redmon
Published by Joseph Redmon in 2016, YOLOv1 already demonstrated much faster object detection than other models at the time. The author then incrementally improved the model into YOLOv2 and YOLOv3, with even better performance.
Image Classification: ImageNet State-of-the-Art and Epic Failures of CNN
ImageNet invites researchers around the world to compete on its data set, and the state-of-the-art models are still iterating rapidly. Maybe you will define the next SOTA?
There are other interesting findings about CNNs. A group of researchers published a paper at ICLR 2019 showing that CNNs classify images based on texture, whereas humans rely more on shape.
Neural networks are also vulnerable to high-frequency disturbances. The recent paper on DiffTaichi showed that VGG-16 would classify a squirrel as a goldfish when ripples are added to the picture.
Ian Goodfellow studied adversarial attacks against neural networks long ago, and we will look at his work more when we learn about generative models.
Words in the end…
“AI is everywhere. It’s not that big, scary thing in the future. AI is here with us.”
— Fei Fei Li
I have finally finished this article as planned. There are far too many topics in this field to cover, and you can never fit them all inside a 10-minute article. I hope this article has raised your interest in some of them. I might write about gesture recognition and social behavior prediction in future articles.
CV and NLP are the two hot topics in artificial intelligence, and cloud AI platforms nowadays are putting most of their effort into these two because of the potential applications that can be built from them. I have also written an article on NLP, and I plan to write about automated machine learning, generative models, and cloud platforms later on. Follow me to learn more.
Also, I have another article on front-end frameworks for computer vision. These frameworks are easy to try for yourself; maybe you can make some cool applications out of them tonight!