tesseract: ocr that works

Hugo Vanwoerkom Sat, 27 Dec 2008 02:33:01 -0800

Hi,

Recently there was a post mentioning tesseract.


Turns out it is an award-winning open-source OCR engine that works!

I tried it out:

1. apt-get install tesseract-ocr
2. apt-get install tesseract-ocr-eng
3. use xsane to scan a page at 300 dpi and save as .tif
4. run: convert foo.tif -depth 8 foo1.tif
5. run: tesseract foo1.tif foo2 -l eng

And voilà! There is foo2.txt with the text.
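
For more than one page, steps 4 and 5 can be wrapped in a small shell function. This is only a sketch: ocr_page is a made-up helper name, the "-8bit" suffix for the intermediate file is my own choice, and it assumes convert (ImageMagick) and tesseract are both on the PATH.

```shell
#!/bin/sh
# Sketch only: ocr_page is a hypothetical helper, not part of tesseract.
# Usage: ocr_page SCAN.tif [LANG]
ocr_page() {
    in="$1"                  # scan as saved by xsane, e.g. foo.tif
    lang="${2:-eng}"         # language data to use, eng as in step 5
    base="${in%.tif}"        # foo.tif -> foo

    # Step 4: write a depth 8 copy of the scan (assumed file name)
    convert "$in" -depth 8 "${base}-8bit.tif" || return 1

    # Step 5: tesseract adds the .txt extension to the output base itself
    tesseract "${base}-8bit.tif" "$base" -l "$lang" || return 1

    # Print the name of the resulting text file
    echo "${base}.txt"
}
```

Calling "ocr_page foo.tif" should then leave the text in foo.txt, next to the intermediate foo-8bit.tif.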

This is a page that I scanned:
http://www.scribd.com/doc/9267859/p13x1

This is the result:
http://www.scribd.com/doc/9269769/p13

The only errors were some punctuation marks.

[2] tesseract comes by default with the German dictionary, hence the separate English package.
[3] don't scan at less than 300 dpi

[4] the result from xsane is depth 16, which tesseract can't handle, so you have to convert the result to depth 8.
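
To see whether a given scan needs the depth conversion at all, ImageMagick's identify can report the depth. A rough sketch, assuming identify is installed alongside convert; needs_depth_fix is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch only: needs_depth_fix is a hypothetical helper name.
# Succeeds (exit 0) when the first frame of the image is deeper than 8.
needs_depth_fix() {
    # %z is identify's format escape for the image depth;
    # [0] selects the first frame of the file
    depth=$(identify -format '%z' "${1}[0]") || return 2
    [ "$depth" -gt 8 ]
}
```

For example: needs_depth_fix foo.tif && convert foo.tif -depth 8 foo1.tif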


Hugo

