Heuristic-Based OCR
Scott Lawrence
September 2005
0. Table Of Contents
1. Overview
Screenshot from the eye tracker
Figure 1: Screenshot from the eye tracker

The above is a frame from a video generated by the Eye Tracking hardware built by Jeff Pelz and his group at the Center For Imaging Science, at the Rochester Institute of Technology. The crosshairs in the main image show where the subject was looking. The coordinates for this crosshair are labelled "H" and "V" in the overlay text in the lower left.

In the lower right, you can see the timecode which shows how long into the observation we are. The format for this timecode is: HOURS:MINUTES:SECONDS:FIELDS
The image that follows shows a few examples of this from around 3 minutes, 8 seconds into the tape.
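Once the digits have been recognized, turning the HOURS:MINUTES:SECONDS:FIELDS string back into numbers is straightforward. This helper is an illustration, not part of the original tool:

```c
#include <stdio.h>

/* Parse a HOURS:MINUTES:SECONDS:FIELDS timecode string into its
   four components. Returns 0 on success, -1 on a malformed string.
   This is an illustrative helper, not code from the tracker. */
int parse_timecode(const char *s, int *h, int *m, int *sec, int *f)
{
    if (sscanf(s, "%d:%d:%d:%d", h, m, sec, f) != 4)
        return -1;
    return 0;
}
```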

Sample of the digits on the screen
Figure 2: Sample of the digits on the screen

Most OCR applications have to take many variables into account: the angle of the text on the page, discrepancies between different instances of the same character, and so on. Typically, OCR software uses some sort of neural network to figure out which character is which. The inputs to this neural network are usually the pixels of a matrix covering the character's location in the captured image, and there is one output for each character it could possibly be. In a traditional system like this, the output with the highest value determines the character. This process can be slow, and training the neural network is time-consuming.

The process I have decided on borrows concepts from this, but eliminates a lot of the gruntwork and learning, since there are fewer variables to get in the way. It also eliminates the complete neural network weighting system, leaving just a few if-statements to determine which numerical digit is on the screen.

2. My Algorithm

For my application, there are only 10 different characters I need to work with. These characters are always in the same place on the screen, are never at an angle, and are always white on a dark gray background. Since these factors basically eliminate most of the reason to use a neural network, I have come up with an alternate, simpler method.

Enlarged pixels of the digits
Figure 3: Enlarged pixels of the digits

I took the above captured timecode imagery and separated out the individual digits. I enlarged them greatly and added some guides to make it easy to tell one pixel from another. We're only looking at the green channel here, rather than computing the luminance, which is more computationally intensive.
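As a concrete sketch of that green-channel test (the packed 0xRRGGBB pixel layout is an assumption about the capture format; adjust the shift for the actual frame buffer):

```c
#include <stdint.h>

/* Extract the green channel from a packed 0xRRGGBB pixel and test
   it against the threshold used later in this article. The pixel
   layout is assumed, not taken from the original code. */
static int pixel_is_on(uint32_t rgb)
{
    uint8_t green = (rgb >> 8) & 0xFF;   /* green byte of 0xRRGGBB */
    return green > 128;
}
```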

Each digit has distinct features in certain areas of the grid. By looking at just a few pixels in key areas of the image, you can easily determine one character from another.

Key sample points
Figure 4: Key sample points

From this imagery, we will pick out a few key pixels in each character area. This is shown as blue outlines on the pixels in the above image.

Increased contrast
Figure 5: Increased contrast

We need to threshold the image. Programmatically, this is done with a simple:

    if( pixelvalue > 128 )
        pixel = ON;
    else
        pixel = OFF;
Figure 6: Threshold function
Ordering of key pixels
Figure 7: Ordering of key pixels

For the above example, we have thirteen boxes selected. I picked these by looking at various sections of the image and guessing which pixels seemed likely to differentiate the characters. At the time, I didn't worry about whether I could do it with fewer pixels; I figured I'd go over it in the next step and determine which key pixels I actually need.

Next, I created a simple binary tree on a sheet of paper. Each node is a query to see if a given pixel is above the threshold. If it is, we follow the right fork; otherwise we follow the left fork. In the final code, this logic is implemented as a block of nested if statements.
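A sketch of what that nested-if block might look like. The key-pixel numbers and the branch layout below are invented for illustration; the real mapping comes from the trees worked out in the figures:

```c
/* Illustrative nested-if classifier. on[n] is nonzero when key
   pixel n was above the threshold. The pixel numbers and branch
   structure are made up for demonstration, not the actual tree.
   Returns the digit 0-9, or -1 if nothing matches (e.g. noise). */
int classify_digit(const int on[13])
{
    if (on[0]) {
        if (on[1]) {
            if (on[2])
                return on[3] ? 8 : 0;   /* deepest path: 4 pixel tests */
            return on[4] ? 6 : 5;
        }
        return on[5] ? 9 : 4;
    }
    if (on[6]) {
        if (on[7])
            return on[8] ? 3 : 2;
        return 7;
    }
    return on[9] ? 1 : -1;
}
```

Note that even in this made-up tree, no digit requires more than four pixel tests to identify.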

First, I'll show a non-optimized tree, going down the key pixel list.

Full tree
Figure 8: Full tree

For this, I simply looked at pixel location 1 and determined in which characters this pixel was light and in which it was dark. As you can see in the above graph, it's a good choice for splitting one set of characters from another.

However, you can also see quite a few dead ends in the above tree. These comparisons are unnecessary, and have been removed from the following optimized tree.

Optimized tree
Figure 9: Optimized tree

Now that I've figured this out, I'm going to pick my pixels better. For example, the pixel '5' comparison above, used to discern between 7 and 3, seems like it might be a borderline case. If pixel '11' were used instead, it would discern between those two characters much more reliably.

So there you have it. By looking at the brightness of the green channel in at most 4 pixel positions, we can determine which digit from '0' to '9' is at a specific location in a frame of video.

3. 2006 June Addendum

I just realized that this can be accomplished even more easily, since the characters on the screen are non-varying.

If there are N strategically placed key pixels in each character, we can store a "fingerprint" of each character as a standard 'int', where each pixel value (above or below 128 -- on or off) is stored as a bit of that int. Then all that needs to be done is to compute the fingerprint of the character being tested and compare it against the list of known fingerprints to determine which character is on the screen.
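The fingerprint idea can be sketched in a few lines. The table of known fingerprints is hypothetical here; in practice it would be filled in once from reference frames of each digit:

```c
#include <stdint.h>

#define NUM_KEY_PIXELS 13

/* Pack the on/off state of each key pixel into one int, one bit
   per pixel. on[i] is nonzero when key pixel i is above threshold. */
unsigned make_fingerprint(const int on[NUM_KEY_PIXELS])
{
    unsigned fp = 0;
    for (int i = 0; i < NUM_KEY_PIXELS; i++)
        if (on[i])
            fp |= 1u << i;
    return fp;
}

/* Compare a fingerprint against the table of known digit
   fingerprints (known[d] is the fingerprint of digit d). Returns
   the digit, or -1 if nothing matches (noise, or the threshold
   needs adjusting). The table contents are an assumption. */
int match_digit(unsigned fp, const unsigned known[10])
{
    for (int d = 0; d < 10; d++)
        if (known[d] == fp)
            return d;
    return -1;
}
```

Since each comparison is a single integer equality test, this replaces the nested-if tree with a ten-entry lookup.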

4. Figures