Tuesday, April 28, 2009

A first glance at tesseract-ocr

I have started looking into the name card OCR project. As Alex suggested, I first looked at tesseract-ocr, the OCR engine currently developed by Google.

From Wikipedia:
Tesseract is a free optical character recognition engine. It was originally developed as proprietary software at Hewlett-Packard between 1985 and 1995. After ten years without any development taking place, Hewlett-Packard and UNLV released it as open source in 2005. Tesseract is currently developed by Google and released under the Apache License, Version 2.0.

Oddly, the featured tarballs listed on the project home page fail to compile. After modifying the source code I managed to build the program binaries, but the language character sets were not built. I then switched to the svn HEAD version, and that works.

To use tesseract, I simply type:
[bergwolf@bin]$./tesseract phototest.tif result
Tesseract Open Source OCR Engine
[bergwolf@bin]$cat result.txt
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.

The Tesseract OCR engine handled this test image very accurately, and it looks well suited to our name card OCR service, because our images usually have black letters on a white background.

The drawback is that it doesn't support many image formats. Most mobile devices save camera photos as JPEG, but tesseract currently reads only TIFF and BMP. If we want to use it as our OCR engine, we have two options: patch tesseract to support other image formats, or use a tool like ImageMagick to convert the photos to TIFF or BMP first. Neither should be very hard.
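The second option could look roughly like the sketch below. It is only an illustration, not something I have wired into the project: the helper names build_commands and ocr_photo are made up here, it assumes ImageMagick's convert and the tesseract binary are both on the PATH, and the "-compress None" option reflects my assumption that an uncompressed TIFF is the safest input for tesseract.

```python
import subprocess
from pathlib import Path


def build_commands(photo, out_base):
    """Return the two command lines: convert the photo to TIFF, then OCR it.

    Hypothetical helper; the file names are whatever the caller passes in.
    """
    photo = Path(photo)
    tiff = photo.with_suffix(".tif")  # card.jpg -> card.tif
    # ImageMagick: read the JPEG, write an uncompressed TIFF (assumption:
    # uncompressed is the form tesseract is most likely to accept).
    convert_cmd = ["convert", str(photo), "-compress", "None", str(tiff)]
    # Same invocation as the session above: tesseract <image> <output base>
    ocr_cmd = ["tesseract", str(tiff), str(out_base)]
    return convert_cmd, ocr_cmd


def ocr_photo(photo, out_base):
    """Run both steps and return the recognized text."""
    for cmd in build_commands(photo, out_base):
        # check=True raises CalledProcessError if either tool fails
        subprocess.run(cmd, check=True)
    # tesseract appends .txt to the output base name (result -> result.txt)
    return Path(str(out_base) + ".txt").read_text()
```

Calling ocr_photo("card.jpg", "result") would then write card.tif next to the photo and return the contents of result.txt. Keeping the command construction separate from the execution makes it easy to check the command lines without having either tool installed.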
