The above is a frame from a video generated by the eye-tracking hardware built by Jeff Pelz and his group at the Center for Imaging Science at the Rochester Institute of Technology. The crosshairs in the main image show where the subject was looking. The coordinates for this crosshair are labelled "H" and "V" in the overlay text in the lower left.
In the lower right, you can see the timecode, which shows how far into the observation we are. The format for this timecode is HOURS:MINUTES:SECONDS:FIELDS.
The images that follow are a few examples of this from around 3 minutes, 8 seconds into the tape.
Most OCR applications have to take into account many variables: the angle of the text on the page, discrepancies between different instances of the same character, and so on. Typically, OCR software runs some sort of neural network to figure out which character is which. The inputs to this neural network are usually various pixels from the region of the captured image where the character sits. There is one output for each character the input could possibly be, and the output with the highest value determines which character it is. This process can be slow, and the learning process to build up the neural network is time consuming.
The process I have decided on borrows concepts from this, but eliminates a lot of the gruntwork and learning, since there are far fewer variables to get in the way. It also eliminates the neural network weighting system entirely, leaving just a few if-statements to determine which numerical digit is on the screen.
For my application, there are only 10 different characters I need to work with. These characters are always in the same place on the screen, are never at an angle, and are always white on a dark gray background. Since these constraints eliminate most of the reasons to use a neural network, I have come up with a simpler alternative.
I took the above captured timecode imagery and separated out just the individual digits. I zoomed them up huge and added some guides to make it easy to distinguish one pixel from another. We're only looking at the green channel here, rather than computing the luminance, which is more computationally intensive.
Each digit has distinct features in certain areas of the grid. By looking at just a few pixels in key areas of the image, you can easily tell one character from another.
From this imagery, we will pick out a few key pixels in each character area. This is shown as blue outlines on the pixels in the above image.
We need to threshold the image. Programmatically, this is done with a simple:
if( pixelvalue > 128 ) pixel = ON; else pixel = OFF;
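As a concrete sketch, sampling and thresholding the green channel of one key pixel might look like the following. The packed-RGB, row-major frame layout and the names `frame`, `width`, `x`, and `y` are assumptions for illustration; the real capture format may differ.

```c
#include <stdint.h>

/* Return 1 if the green channel of pixel (x, y) is above the
 * mid-gray threshold, 0 otherwise. Assumes an 8-bit packed RGB
 * frame buffer in row-major order. */
int pixel_on(const uint8_t *frame, int width, int x, int y)
{
    uint8_t green = frame[(y * width + x) * 3 + 1]; /* G of packed RGB */
    return green > 128;
}
```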
For the above example, we have thirteen boxes selected. I picked these by just looking at various sections of the image and guessing which pixels seemed likely to differentiate the characters. At the time, I didn't really think about whether I could do it with fewer than the selected pixels; it wasn't a concern. I figured I'd go over it in the next step and determine which key pixels I actually need.
Next, I just created a simple binary tree on a sheet of paper. Each node is a query to see whether a given pixel is above the threshold. If it is, we follow the right fork; otherwise we follow the left fork. In the final code, this logic is implemented with a set of nested if-statements.
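As an illustrative sketch, such a tree of nested if-statements might look like this, with each key pixel's on/off state packed into one bit of an unsigned int. The particular pixel numbers and branch outcomes below are invented for illustration; the real tree depends on the actual font and on which key pixels were chosen.

```c
/* Returns 1 if key pixel n is above the threshold; bit n of 'pixels'
 * holds the thresholded on/off state of key pixel n. */
int lit(unsigned pixels, int n) { return (pixels >> n) & 1; }

/* Walk the decision tree: at most four pixel tests per digit.
 * Pixel numbers and digit assignments here are placeholders. */
int classify_digit(unsigned pixels)
{
    if (lit(pixels, 1)) {                     /* pixel 1 lit: one subset  */
        if (lit(pixels, 4))
            return lit(pixels, 7) ? 8 : 0;
        return lit(pixels, 9) ? 6 : 5;
    }
    if (lit(pixels, 2))                       /* pixel 1 dark: the rest   */
        return lit(pixels, 5) ? 3 : 7;
    if (lit(pixels, 11))
        return lit(pixels, 3) ? 1 : 9;
    return lit(pixels, 6) ? 4 : 2;
}
```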
First, I'll show a non-optimized tree, going down the key pixel list.
For this, I simply looked at pixel location 1 and determined in which characters that pixel was light and in which it was dark. As you can see in the graph above, it's a good choice for separating one set of characters from another.
However, you can also see quite a few dead ends in the above tree. These comparisons are useless, and they have been removed from the following optimized tree.
Now that I've figured this out, I'm going to pick my pixels better. For example, the pixel '5' comparison above, used to discern between 7 and 3, seems like a borderline case. If pixel '11' were used instead, it would distinguish those two characters much more reliably.
So there you have it. By looking at the brightness of the green channel in at most 4 pixel positions, we can determine which digit from '0' to '9' is at a specific location in a frame of video.
I just realized that this can be accomplished even more easily, since the characters on the screen never vary.
If there were N strategically placed key pixels in each character, we could store a "fingerprint" of each character as a standard 'int', with each pixel value (above or below 128 -- on or off) stored as a bit in that int. Then all that would need to be done is to compute the fingerprint of the character we're testing and compare it with the list of known fingerprints to determine which character is on the screen.
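The fingerprint idea might be sketched like this. The table of known fingerprints below is filled with invented placeholder values; the real ones would be read off the reference digit images.

```c
#include <stdint.h>

enum { NUM_KEY_PIXELS = 13 };

/* Pack the thresholded on/off state of n key pixels into an int:
 * key pixel i contributes bit i. 'green' holds the green-channel
 * sample for each key pixel. */
unsigned fingerprint(const uint8_t *green, int n)
{
    unsigned fp = 0;
    for (int i = 0; i < n; i++)
        if (green[i] > 128)   /* same threshold as before */
            fp |= 1u << i;
    return fp;
}

/* Compare against the table of known fingerprints; returns the
 * digit 0-9, or -1 if nothing matches (e.g. the ':' separator). */
int match_digit(unsigned fp, const unsigned known[10])
{
    for (int d = 0; d < 10; d++)
        if (known[d] == fp)
            return d;
    return -1;
}
```

One nice property of this approach over the decision tree is that an unrecognized pattern falls out naturally as "no match" instead of being forced into the nearest leaf.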