What is computer vision? You may have heard of this term, and there is a lot of complexity surrounding this field of artificial intelligence. In this blog I hope to provide some clarity and simplify the subject for you. In a nutshell, it's basically giving eyes to a computer, along with the brains to process the image or video. Before we delve deep into it, let's look at the examples below:
That is one of the greatest footballers smiling!
And this is a heartbreaking picture of a sad David Beckham.
We can clearly differentiate between a sad Beckham and a happy Beckham, but can a computer do the same thing?
Well it can, and maybe much better and quicker than us humans!! Let's use the simplified examples below to further understand how this works for a computer.
Now that’s a simplified version of a smiling face, a bit creepy but still a smile!!
A sad face!! So how can the computer differentiate between the two faces? First, we have to see how we ourselves differentiate between them: we need to apply logic. Whenever we (humans) look at an image we instantly see the differences; we are able to subconsciously pick out features like smiles within seconds. In the above example, the computer does something similar to what we would do to check for a smile: it looks at the alignment of the lips.
-> Means a smile
-> Means a sad face
So, when the computer is fed (trained) with many labelled smiling and sad faces, it searches for the features that differentiate one kind of image from the other. The detailed process of how the computer selects features will be discussed in another post; for the time being, let us discuss what happens when the detected features are applied to an unknown image.
The unique features which are detected:
Now whenever a new image is obtained, the image is scanned for the above features.
The image is first broken down into boxes:
Now each box is scanned for all the features: boxes a1 to e4 are checked to see whether they contain either of the two features.
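This box-by-box scan can be sketched in a few lines of Python. Everything here is made up for illustration (the tiny 2x2 "lip" patterns, the `matches` check, the box names); a real detector would compare pixel patterns far more loosely than an exact equality test:

```python
# Toy sketch of scanning named boxes of an image for two hypothetical features.

def matches(patch, feature):
    """Toy check: does this box's pixel patch exactly equal the feature?"""
    return patch == feature

# Two made-up "lip" features: 1s mark dark pixels in a tiny 2x2 patch.
SMILE_FEATURE = [[1, 0], [0, 1]]
SAD_FEATURE = [[0, 1], [1, 0]]

def scan(image_boxes):
    """image_boxes maps box names like 'b3' to their pixel patch.
    Returns a 0/1 result per box for each feature."""
    results = {}
    for name, patch in image_boxes.items():
        results[name] = {
            "smile": int(matches(patch, SMILE_FEATURE)),
            "sad": int(matches(patch, SAD_FEATURE)),
        }
    return results
```

Feeding in a patch that happens to equal `SMILE_FEATURE` under box `"b3"` would give `results["b3"]["smile"] == 1`, mirroring the scan described above.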
In the above image, the smile feature is detected in boxes b3, b4 and d1, while the sad feature is detected in boxes d3, d4 and b1.
So, the results of the feature detection for the smiling face are given below (0 stands for negative and 1 stands for positive), with the boxes read in the order [a1,b1,c1,d1,e1,a2,b2,c2,d2,e2,a3,b3,c3,d3,e3,a4,b4,c4,d4,e4]:

[0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0]
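Turning the per-box detections into that ordered 0/1 vector is a simple flattening step. A minimal sketch (the `to_vector` helper and its input format are my own, not from any library):

```python
# Box names in the order used for the 0/1 result vector.
BOX_ORDER = ["a1", "b1", "c1", "d1", "e1",
             "a2", "b2", "c2", "d2", "e2",
             "a3", "b3", "c3", "d3", "e3",
             "a4", "b4", "c4", "d4", "e4"]

def to_vector(detected_boxes):
    """detected_boxes: set of box names where the feature was found."""
    return [1 if box in detected_boxes else 0 for box in BOX_ORDER]

# Boxes where the smile feature was detected in the example:
print(to_vector({"d1", "b3", "b4"}))
# [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
```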
Similarly, for the sad face, the image is first broken down into a number of regions, and the features are run over the regions to find a match.
So, in this case, the results of the feature detection would be different, as shown below (same box order, with the sad feature found in boxes b1, d3 and d4):

[0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0]
So, in the end, the computer compares this result with the results of the feature detection run on the original labelled smiling/sad faces. The image is then classified as a smiling or a sad face according to whichever labelled result is the closest match.
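One simple way to do this "closest match" step is a toy nearest-neighbour comparison: count how many positions of the 0/1 vector disagree with each labelled template, and pick the label with the fewest disagreements. This is my own illustrative sketch, not necessarily the exact matching rule a real system would use:

```python
# Toy nearest-match classification over 0/1 feature-detection vectors.

def hamming(a, b):
    """Number of positions where two equal-length 0/1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def classify(result, templates):
    """Return the label whose template vector is closest to `result`."""
    return min(templates, key=lambda label: hamming(result, templates[label]))

# Template vectors for the two labelled example faces.
templates = {
    "smiling": [0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0],
    "sad":     [0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0],
}

# A new image whose scan found only two of the three smile boxes:
new_result = [0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0]
print(classify(new_result, templates))  # -> smiling (1 mismatch vs 5)
```

Even with one feature box missed, the new image is still far closer to the smiling template, which is why approximate matching works.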
So, in brief, we can reduce the smiling face to its feature-detection vector.
This is also one of the core concepts used in the very powerful convolutional neural network.
So, this is how a computer can classify images. This was explained using a very basic example, but with thousands of features we can extend this concept to very complicated images and identify even the minutest of differences.
So, this was it for the post, have a great day and keep smiling!