Hi tesseract community!

I've found a case where a simple 4-digit number, cropped from a PDF
(i.e. from a region rendered from a vector font, not from an embedded
bitmap), is OCR'd incorrectly. I used ImageMagick to extract a .png from
the source PDF, like this:

convert -density 1600 -trim input.pdf[42] -rotate 90 +repage -crop 600x720+900+3400 crop.png

...and then used tesseract to OCR it:

tesseract crop.png stdout --psm 6

The digits "1552" in the source image are OCR'd as "15562".
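
For anyone who wants to script the check, here is a minimal sketch of how the mismatch could be detected automatically. The helper name `ocr_matches` is made up for illustration and is not part of tesseract; it assumes that stripping all whitespace from the OCR output is an acceptable way to compare it against the expected digits.

```shell
#!/bin/sh
# ocr_matches is a hypothetical helper, not part of tesseract.
# It succeeds only if the OCR text, with all whitespace removed,
# equals the expected string.
ocr_matches() {
  expected=$1
  actual=$(printf '%s' "$2" | tr -d '[:space:]')
  [ "$actual" = "$expected" ]
}

# Only run the real check when tesseract and the cropped image are present.
if command -v tesseract >/dev/null 2>&1 && [ -f crop.png ]; then
  if ocr_matches 1552 "$(tesseract crop.png stdout --psm 6 2>/dev/null)"; then
    echo "OCR matches"
  else
    echo "OCR mismatch"
  fi
fi
```

With the result described above ("15562" instead of "1552"), this would be expected to report a mismatch.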

You can try it yourself like this:

wget https://i.imgur.com/0swZuoU.png
tesseract 0swZuoU.png stdout --psm 6

The image as hosted on imgur is not bitwise-equivalent to crop.png, but
the two are indistinguishable by eye. I can upload the original crop.png
somewhere else, if necessary.
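
In case it helps, this is roughly how the two files could be compared. It is only a sketch: `same_bytes` is a made-up helper name, and the pixel-level step assumes ImageMagick's `compare` tool is installed and that both files are in the current directory.

```shell
#!/bin/sh
# same_bytes is a hypothetical helper: it succeeds only if both
# files are byte-for-byte identical.
same_bytes() {
  cmp -s "$1" "$2"
}

if [ -f crop.png ] && [ -f 0swZuoU.png ]; then
  if same_bytes crop.png 0swZuoU.png; then
    echo "byte-identical"
  else
    echo "bytes differ"
    # Pixel-level check with ImageMagick, if available: the AE metric
    # reports the number of pixels that differ between the two images.
    if command -v compare >/dev/null 2>&1; then
      compare -metric AE crop.png 0swZuoU.png null: 2>&1 || true
      echo
    fi
  fi
fi
```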

I'm using the latest commit (30ebb31f) of the tesseract engine, and I tried 
with the latest commits (4767ea9 & e2aad9b) of both tessdata and 
tessdata_best.

Can I do anything to improve the OCR result in this sort of scenario?

Chris
