Tech rider: name card ocr

Friday, June 19, 2009

tesseract-ocr: jpeg support added

I just pushed some patches to the github host to add jpeg image support with libjpeg.

One drawback: tesseract's image operation infrastructure doesn't allow libjpeg's private structures to be passed between the open and read operations, so I have to open the file stream twice, once when opening a jpeg image and once more when reading it.

To test the patch, pull from git://github.com/bergwolf/tesseract-ocr-copy.git and build tesseract-ocr. Then run some tests, this time against jpeg images :)

github copy of tesseract-ocr

We need to modify tesseract-ocr to support more image formats. So I set up a project host at github and pulled the source code from tesseract-ocr's googlecode host.

So, to get the new code base, just type the following commands:
1. git clone git://github.com/bergwolf/tesseract-ocr-copy.git
2. cd tesseract-ocr-copy
3. ./configure
4. make
5. make install

Works perfectly :)

Great news: we have passed 1st round evaluation

I just got a notification email saying that our online namecard ocr project has passed the first round of the Nokia Innovation Contest.

Nokia has generously offered to sponsor us with an N95. Cool!

Tuesday, May 12, 2009

libevent and so on

Since we need an HTTP server that can serve all kinds of clients efficiently, I started to look into some lightweight open source solutions. The first one that caught my eye was libevent, because I happened to read a blog post by a facebook developer stating that facebook uses libevent as an HTTP server in their haystack photo storage service.

Libevent is a lightweight event-driven library widely used in many applications, such as memcached and tor. It has simple but efficient HTTP support. Here is some sample code that builds a simple HTTP server with evhttp:

#include <sys/queue.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

#include <string.h>
#include <stdio.h>

#include <evhttp.h>

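void ocr_handler(struct evhttp_request *req, void *arg)
{
 struct evbuffer *buf;
 struct evkeyval *header;

 /* Dump the request headers to stdout. */
 TAILQ_FOREACH(header, req->input_headers, next) {
  fprintf(stdout, "key:%s\tvalue:%s\n", header->key, header->value);
 }

 /* Send a reply; otherwise the request is never answered. */
 buf = evbuffer_new();
 if (buf == NULL) {
  evhttp_send_error(req, HTTP_SERVUNAVAIL, "Service Unavailable");
  return;
 }
 evbuffer_add_printf(buf, "request received\n");
 evhttp_send_reply(req, HTTP_OK, "OK", buf);
 evbuffer_free(buf);
}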

int main(int argc, char **argv)
{
 int err;
 struct evhttp *httpd;
 struct event_base *evbase;
 int port = 1234;

 evbase = event_init();

 fprintf(stderr, "event method: %s\n", event_base_get_method(evbase));

 httpd = evhttp_new(evbase);
 if ((err = evhttp_bind_socket(httpd, NULL, port)) < 0) {
  fprintf(stderr, "error binding http server to port %d\n", port);
  return -1;
 }

 /* set callback for "/" */
 evhttp_set_cb(httpd, "/", ocr_handler, NULL);

 /* No generic callback is set, so all other URIs get libevent's default 404 reply. */
 evhttp_set_gencb(httpd, NULL, NULL);

 event_base_dispatch(evbase);

 evhttp_free(httpd);
 event_base_free(evbase);
 return 0;
}

However, as I looked into the library in detail and wrote some test programs, it turned out that life is not that easy. I tried to dynamically create threads to serve incoming HTTP requests, but the code didn't work as I expected. After searching for a while, I found the problem:

Steven Grimm:

What libevent doesn't support is sharing a libevent instance across threads. It works just fine to use libevent in a multithreaded process where only one thread is making libevent calls.

Therefore, to use libevent in a multi-threaded program, we should create an event base for each thread when initialising the program, and call event_base_set() after event_set() but before event_add(). Then we will have a thread pool to serve HTTP requests. A main thread will listen for all incoming HTTP requests. When a request comes in, it passes the request to a thread from the pool and wakes it up to handle the request.
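
To make the hand-off concrete, here is a minimal sketch of the "main thread wakes a worker" idea. It deliberately leaves libevent out: plain POSIX pipes stand in for the wake-up mechanism, integer ids stand in for HTTP requests, and the names run_pool, worker_main and NUM_WORKERS are made up for illustration. In a real server each worker would presumably register the read end of its pipe with its own event base instead of blocking in read().

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

#define NUM_WORKERS 4   /* hypothetical pool size */

struct worker {
    pthread_t tid;
    int notify_fd[2];   /* pipe: main writes to [1], worker reads from [0] */
    int handled;        /* number of requests this worker has handled */
};

/* Worker loop: sleep in read() until the main thread hands over a request. */
static void *worker_main(void *arg)
{
    struct worker *w = arg;
    int request_id;

    /* read() returns 0 once the main thread closes the write end. */
    while (read(w->notify_fd[0], &request_id, sizeof(request_id))
           == (ssize_t)sizeof(request_id)) {
        w->handled++;   /* a real worker would serve the request here */
    }
    return NULL;
}

/*
 * Start the pool, hand out nrequests ids round-robin, shut the pool down.
 * Returns the total number of requests handled, or -1 on a setup error.
 */
int run_pool(int nrequests)
{
    struct worker workers[NUM_WORKERS];
    int i, total = 0;

    assert(nrequests >= 0);

    for (i = 0; i < NUM_WORKERS; i++) {
        workers[i].handled = 0;
        if (pipe(workers[i].notify_fd) < 0)
            return -1;
        if (pthread_create(&workers[i].tid, NULL, worker_main, &workers[i]) != 0)
            return -1;
    }

    /* The main thread is the only one that accepts and dispatches work. */
    for (i = 0; i < nrequests; i++) {
        struct worker *w = &workers[i % NUM_WORKERS];
        if (write(w->notify_fd[1], &i, sizeof(i)) != (ssize_t)sizeof(i))
            return -1;
    }

    /* Closing the write end makes each worker leave its loop. */
    for (i = 0; i < NUM_WORKERS; i++) {
        close(workers[i].notify_fd[1]);
        pthread_join(workers[i].tid, NULL);
        close(workers[i].notify_fd[0]);
        total += workers[i].handled;
    }
    return total;
}
```

Calling run_pool(10) from a main() should report that all 10 ids were handled. Note that every request still goes through the single dispatching thread before it reaches a worker.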

Although this will work, we end up with a master/worker thread architecture in which the main thread handles all reads from the network. This will certainly be a bottleneck when there are thousands of clients (think of the C10K problem). I don't know how the facebook guys deal with this problem (maybe they patched libevent? :), but IMO, using an evhttp dispatcher in a multi-threaded process, we'll end up this way.

So, currently, I'm planning to look at other solutions like lighttpd before making any decision on the server architecture.

Tuesday, April 28, 2009

A first glance at tesseract-ocr

So, I have started to look into the name card OCR project. As suggested by Alex, I'm first looking at tesseract-ocr, the OCR engine developed by Google.

From wikipedia:
Tesseract is a free optical character recognition engine. It was originally developed as proprietary software at Hewlett-Packard between 1985 until 1995. After ten years without any development taking place, Hewlett Packard and UNLV released it as open source in 2005. Tesseract is currently developed by Google and released under the Apache License, Version 2.0

It's a little strange that the featured tarballs listed on the project home page have compile errors. After modifying the source code, I managed to build the program binaries, but the language charsets weren't built. I then switched to the svn HEAD version, and it works.

To use tesseract, I simply type:
[bergwolf@bin]$./tesseract phototest.tif result
Tesseract Open Source OCR Engine
[bergwolf@bin]$cat result.txt
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.

The Tesseract OCR engine is very accurate on this test, and it suits our name card OCR service well, because our images usually have only black letters on a white background.

However, the drawback is that it doesn't support many image formats. Most mobile devices save camera photos in jpeg format, but currently tesseract can only read tiff and bmp. If we want to use it as our OCR engine, two options are available: either patch tesseract to support other image formats, or use a tool like imagemagick to convert other image formats to tiff or bmp. Neither should be very hard.
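
As a rough sketch of the second option, the server could shell out to ImageMagick's convert tool before calling tesseract. The helper names is_jpeg_path and convert_image are made up for illustration, and the code assumes that convert is installed and on the PATH and that its default tiff output is something tesseract accepts, which I haven't verified.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return 1 if path ends in .jpg or .jpeg (case-insensitive), else 0. */
int is_jpeg_path(const char *path)
{
    const char *dot;

    assert(path != NULL);
    dot = strrchr(path, '.');
    if (dot == NULL)
        return 0;
    return strcasecmp(dot, ".jpg") == 0 || strcasecmp(dot, ".jpeg") == 0;
}

/*
 * Run "convert <in> <out>" in a child process; ImageMagick picks the
 * output format from the extension of <out>.
 * Returns 0 if convert exited successfully, -1 otherwise.
 */
int convert_image(const char *in, const char *out)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: no shell is involved, so the file names are not re-parsed. */
        execlp("convert", "convert", in, out, (char *)NULL);
        _exit(127);   /* only reached if exec failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

A caller would check is_jpeg_path() on the uploaded file, run convert_image("card.jpg", "card.tif"), and then pass card.tif to tesseract as shown earlier.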
