On 21 Dec 2008, Hugo Vanwoerkom wrote:
> Hi,
>
> Recently there was a post mentioning tesseract.
>
> Turns out that is an award winning opensource OCR that works!
>
> I tried it out:
>
> 1. apt-get install tesseract-ocr
> 2. apt-get install tesseract-ocr-eng
> 3. use xsane to scan a page at dpi 300 and save as .tif
> 4. run: convert foo.tif -depth 8 foo1.tif
> 5. doit: tesseract foo1.tif foo2 -l eng
>
> And voilà! There is foo2.txt with the text.
>
> This is a page that I scanned:
> http://www.scribd.com/doc/9267859/p13x1
>
> This is the result:
> http://www.scribd.com/doc/9269769/p13
>
> The only errors were some punctuation marks.
>
> [2] tesseract comes by default with the German dic.
> [3] don't scan at less than 300 dpi
> [4] the result from xsane is depth 16 which tesseract can't handle so
> you have to convert the result to depth 8.
>
> Hugo
>
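For anyone who wants the quoted steps 4 and 5 in one go, here is a minimal sketch as a shell function. The names ocr_page, run and DRY_RUN are my own, purely for illustration; the convert and tesseract invocations are the ones quoted above, and I am assuming ImageMagick's convert is on the PATH.

```shell
#!/bin/sh
# Sketch only: ocr_page, run and DRY_RUN are hypothetical helper names,
# not part of tesseract or ImageMagick.

# With DRY_RUN=1 the command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# ocr_page SCAN.tif OUTBASE [LANG]
ocr_page() {
    src="$1"                   # scanned image, e.g. saved from xsane at 300 dpi
    out="$2"                   # output base name; tesseract writes "$out.txt"
    lang="${3:-eng}"           # language pack, defaults to English
    tmp="${src%.tif}-8bit.tif" # intermediate 8-bit copy of the scan

    run convert "$src" -depth 8 "$tmp" || return 1  # quoted step 4
    run tesseract "$tmp" "$out" -l "$lang"          # quoted step 5
}
```

After sourcing this, "ocr_page foo.tif foo2" should leave the text in foo2.txt. Setting DRY_RUN=1 first just prints the two commands, which is handy for checking the file names before running anything.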
As we seem to be reposting this, here are my comments again.

Yes, tesseract does work well. Here, xsane gives depth 24, but conversion
to depth 8 is neither possible nor necessary.

Following the docs, I did

    export TESSDATA_PREFIX="/usr/share/tesseract-ocr/"

There was no need for "-l eng" since I only had the English version of
tesseract installed. So to OCR a page scanned at 300 dpi I just do:

    tesseract foo.tif foo

The result is excellent. I got pretty good results with ocrad but
tesseract is definitely better.

Anthony

-- 
Anthony Campbell - a...@acampbell.org.uk
Microsoft-free zone - Using Debian GNU/Linux
http://www.acampbell.org.uk (blog, book reviews, and sceptical articles)