On Thu, Apr 7, 2011 at 9:33 PM, Mike Sandford <vade...@gmail.com> wrote:
> I don't know if it's strictly necessary for my application, but I am trying to analyze anywhere from a few characters up to a few lines of text rapidly. Tesseract is one part of my application pipeline. I've got my own document layout engine, since there's a lot of really specialized, mostly useless domain knowledge involved.
>
> OCR currently takes up over half of the total analysis time. I managed to reduce it from about 60% to about 20% (2 sec per document down to 0.8 sec per document) using multiprocessing: I launch several tesseract jobs from the command line in parallel. That's roughly a 4x speedup on a quad-core, so that's good, but I'm still interested in pushing further. In the ideal world I'd have four tesseract daemons running all the time, and when I need OCR done I'd pipe a filename in (or perhaps the image data) and get a string out. Or something like that.
>
> My thought is that it takes a certain amount of time to load the binary and the training data and get organized in memory. Right now that whole process happens every time I need to process a file, perhaps 10-20 times per document, so it could be a substantial amount of overhead. I fed tesseract a 1x1 white TIFF 10 times and it took between 30 ms and 44 ms to load and tell me there was no output. Assuming for a moment that those numbers aren't totally bogus, that means that out of 0.8 sec per document I'm spending 10 x (30 ms to 40 ms) = 300 ms to 400 ms just loading the binary. That could be half of my total document processing time.
>
> I haven't gone digging into the guts yet to figure out whether this is possible. I was hoping to get some feedback on how dumb (or perhaps not!) an idea this is before I really launch into it. So what does everyone think? Would this be helpful to anyone else? Does tesseract's architecture lend itself to staying in RAM for an extended period of time, for multiple images?
>
> I do realize that I could potentially just write out a single image with all the regions of interest contained within it, but my guess is that tesseract does some learning about the font as it processes characters. Since each document might have different fonts, font sizes, etc., I think that could be more harmful than beneficial.

Just from my bookmarks (a have-a-look-at-this-later item ;-)): there is the COSI project - the Common OCR Service Interface [1] - which uses a patched tesseract-ocr 2.04 in server mode. I have not tested it yet, but it might be a good starting point.

Zdenko

[1] http://sourceforge.net/apps/mediawiki/cosi/index.php?title=Main_Page
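
For reference, the question above about keeping tesseract resident in RAM maps fairly directly onto the C++ API: tesseract::TessBaseAPI can be initialized once (loading the engine and the language data) and then be reused for any number of images. Below is a rough sketch of the "pipe a filename in, get a string out" worker, assuming tesseract 3.x with Leptonica for image loading; the filename-per-line loop is only an illustration, not an existing tool.

// Minimal resident OCR worker: init once, then OCR every filename read from stdin.
// Link against the tesseract and leptonica libraries (exact names vary by version/install).
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <iostream>
#include <string>

int main() {
  tesseract::TessBaseAPI api;
  // Load the engine and the English language data once, up front.
  if (api.Init(NULL, "eng") != 0) {
    std::cerr << "could not initialize tesseract\n";
    return 1;
  }

  // Read one image filename per line and print the recognized text.
  std::string filename;
  while (std::getline(std::cin, filename)) {
    Pix* image = pixRead(filename.c_str());
    if (image == NULL) {
      std::cerr << "could not read " << filename << "\n";
      continue;
    }
    api.SetImage(image);
    char* text = api.GetUTF8Text();  // runs layout analysis + recognition
    std::cout << text << std::flush;
    delete[] text;
    pixDestroy(&image);
    api.Clear();  // drop per-image results, keep the loaded language data
  }

  api.End();
  return 0;
}

Running one of these per core and feeding them filenames over pipes would avoid paying the startup cost on every call. Note that the adaptive classifier state persists across images within a single API instance, which is related to the font-learning concern above; if carrying font adaptation across documents turns out to be harmful, ClearAdaptiveClassifier() should reset it, whereas Clear() only discards the per-image results.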