On Sat, Jul 10, 2004 at 12:15:28AM +0200, martin f krafft wrote:
> also sprach Andrew Perrin <[EMAIL PROTECTED]> [2004.07.09.2221 +0200]:
> > Correct - if you want searchable text you need some OCR filter.
> > I've used gocr with some, moderate, success, but it's by no means
> > perfect. Others have recommended clara, which is probably better
> > but requires too much user involvement for my taste!
>
> Yes, I am starting to notice that we need to get into the OCR
> domain. I am new to scanning, so please excuse me not making that
> jump before posting.
>
> So far it sounds like HP has open source drivers for their
> all-in-ones... if I can find one with automated pagefeeding, I am
> off to try clara...

Search the archives for my and others' discussions about Project
Gutenberg's tests with gocr and other open source OCR programs. They
are all perfect with perfect texts, but basically unusable with
"typical" texts. If the text is not perfectly straight and set in a
great big font, i.e., printed with OCR in mind, gocr does an abysmal
job, whereas closed source OCR software reached 95% accuracy on these
"typical" tests back in, oh, I don't know, 1996.

The OCR software that comes with Microsoft Office beats the crap out
of gocr, even on cleanly printed books with nice fonts that you'd
expect to be easy to scan. What's missing in gocr is a "slanted text
straightener" algorithm, i.e., deskewing.
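
To make the "slanted text" point concrete, here is a rough sketch of one common way to estimate page skew, the projection-profile method. It is only an illustration under my own assumptions, not what gocr, clara, or any closed source product actually does. The function names (profile_score, estimate_deskew_angle, make_skewed_lines) and the "page as a list of (x, y) ink coordinates" representation are made up for the example.

```python
import math
from collections import Counter


def profile_score(points, angle_deg):
    """Score how sharply the ink falls into horizontal rows after
    rotating every (x, y) point by angle_deg. Higher means sharper."""
    t = math.radians(angle_deg)
    sin_t, cos_t = math.sin(t), math.cos(t)
    # Only the rotated y coordinate matters for a horizontal profile.
    rows = Counter(round(x * sin_t + y * cos_t) for x, y in points)
    # Sum of squares rewards a few full rows over many sparse ones.
    return sum(count * count for count in rows.values())


def estimate_deskew_angle(points, max_angle=10.0, step=0.5):
    """Try rotations in [-max_angle, +max_angle] and return the one
    (in degrees) that best straightens the text lines."""
    n = int(round(max_angle / step))
    candidates = [i * step for i in range(-n, n + 1)]
    return max(candidates, key=lambda a: profile_score(points, a))


def make_skewed_lines(skew_deg, n_lines=5, width=200, spacing=20):
    """Hypothetical test page: thin 'text lines' tilted by skew_deg."""
    slope = math.tan(math.radians(skew_deg))
    return [(x, line * spacing + x * slope)
            for line in range(n_lines)
            for x in range(width)]


if __name__ == "__main__":
    page = make_skewed_lines(3.0)
    print("rotate by", estimate_deskew_angle(page), "degrees to straighten")
```

The returned value is the correcting rotation, so a page tilted by +3 degrees should come back as roughly -3. A real scan would need binarisation first, plus a finer angle search, and this brute-force loop is slow on full-size images.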