On Thu, Apr 7, 2011 at 9:33 PM, Mike Sandford <vade...@gmail.com> wrote:

> I don't know if it's strictly necessary for my application, but I am
> trying to analyze anywhere from a few characters up to a few lines of
> text rapidly.  Tesseract is a portion of my application pipeline.
> I've got my own document layout engine since there's a lot of really
> specialized, mostly useless domain knowledge.
>
> OCR is currently taking up over half the total analysis time.  I
> managed to reduce it from about 60% to about 20% (2sec per document
> to 0.8sec per document) using multiprocessing: I launch several jobs
> from the command line in parallel.  That's roughly 4x speedup on a
> quad-core so that's good.  But I'm still interested in pushing
> further.  In the ideal world I'd have four tesseract daemons on all
> the time and when I need OCR done I pipe a filename in (or perhaps the
> image data) and get a string out.  Or something like that.
>
> My thought is that it takes a certain amount of time to load up the
> binary and the training data and get organized in memory.  Right now
> this whole process happens every time I need to process a file,
> perhaps 10-20 times per document.  That could be a substantial amount
> of overhead.  I fed tesseract a 1x1 white tiff 10 times and it took
> between 30ms and 44ms to load and tell me that there was no output.
> Let's just assume for a moment that those numbers aren't totally
> bogus, that means out of 0.8sec per document I'm spending 10x(30ms to
> 40ms)=300ms to 400ms of time just loading up the binary.  That could
> be half of my total document processing time.
>
> I haven't gone looking at the guts to try and figure out if this is
> possible yet.  I was hoping to get some feedback as to how dumb (or
> perhaps not!) of an idea this is before I really launched into it.  So
> what does everyone think?  Would this be helpful to anyone else?  Does
> tesseract's architecture lend itself to staying in RAM for an extended
> period of time, for multiple images?
>
> I do realize that I could potentially just write out a single image
> with all the regions of interest contained within it, but my guess is
> that tesseract does some learning about what the font is as it
> processes characters.  And since each document might have different
> fonts, font sizes, etc I think that may be more harmful than
> beneficial.
>
>
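The startup estimate quoted above can be reproduced with a few lines of
Python. This is only a sketch of the arithmetic: the figures (10-20 calls
per document, 30-44ms per launch, 0.8sec per document) are taken from the
message, while the function name startup_share is made up for illustration.

```python
def startup_share(calls: int, startup_ms: float, total_ms: float) -> float:
    """Fraction of the per-document time spent launching the binary,
    assuming the launches are simply summed (no overlap)."""
    return calls * startup_ms / total_ms


TOTAL_MS = 800  # 0.8sec per document, as reported in the message

# Low and high ends of the reported ranges.
for calls in (10, 20):
    for startup_ms in (30, 44):
        summed = calls * startup_ms
        share = startup_share(calls, startup_ms, TOTAL_MS)
        print(f"{calls} calls x {startup_ms}ms = {summed}ms "
              f"({share:.0%} of {TOTAL_MS}ms)")
```

At the high end the summed startup time is larger than the 0.8sec total,
so the two figures are probably not directly comparable. If 0.8sec is
wall-clock time with several jobs running in parallel (an assumption), the
launches overlap and the real share is smaller than the simple sum
suggests.
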
Just from my bookmarks (have-a-look-at-this-later ;-)): there is the
COSI project (The Common OCR Service Interface) [1], which used a patched
tesseract-ocr 2.04 in server mode. I have not tested it yet, but it may
be a good starting point.
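
To make the "load once, then pipe filenames in and get strings out" idea
concrete, here is a minimal sketch of such a worker loop. It is not how
COSI or tesseract work internally: load_engine and serve are hypothetical
names, and the engine is a stand-in that returns a placeholder string
instead of doing any OCR.

```python
import io
from typing import Callable, TextIO


def load_engine() -> Callable[[str], str]:
    """Hypothetical stand-in for the expensive one-time setup
    (loading the binary and the training data)."""
    def recognize(path: str) -> str:
        # Assumption: a real worker would run OCR on the file here.
        return f"<text recognized from {path}>"
    return recognize


def serve(inp: TextIO, out: TextIO,
          loader: Callable[[], Callable[[str], str]] = load_engine) -> int:
    """Read one filename per line from inp and write one result line to out.

    The engine is loaded once, before the loop, so the startup cost is
    not paid again for every request. Returns the number of requests handled.
    """
    recognize = loader()  # paid once per worker, not once per image
    handled = 0
    for line in inp:
        path = line.strip()
        if not path:
            continue  # ignore blank lines
        out.write(recognize(path) + "\n")
        out.flush()  # let the caller read each result immediately
        handled += 1
    return handled


# Small in-memory demonstration; a real worker would use sys.stdin/sys.stdout.
demo_out = io.StringIO()
count = serve(io.StringIO("region1.tif\nregion2.tif\n"), demo_out)
print(count, "requests handled")
print(demo_out.getvalue(), end="")
```

Four such workers, one per core, would correspond to the four daemons
described in the original message.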

Zdenko

[1] http://sourceforge.net/apps/mediawiki/cosi/index.php?title=Main_Page
