Hi tesseract community! I've found an interesting scenario where a simple 4-digit number cropped from a PDF (i.e. from a region rendered from a vector font, not from an embedded bitmap) is OCR'd incorrectly. I used ImageMagick to extract a .png from the source PDF, like this:

    convert -density 1600 -trim input.pdf[42] -rotate 90 +repage -crop 600x720+900+3400 crop.png

...and then used tesseract to OCR it:

    tesseract crop.png stdout --psm 6

The digits "1552" in the source image are OCR'd as "15562". You can try it for yourself like this:

    wget https://i.imgur.com/0swZuoU.png
    tesseract 0swZuoU.png stdout --psm 6

The image as hosted on imgur is not bitwise-equivalent to crop.png, but the two are impossible to tell apart by eye. I can upload the original crop.png somewhere else, if necessary.

I'm using the latest commit (30ebb31f) of the tesseract engine, and I tried with the latest commits (4767ea9 & e2aad9b) of both tessdata and tessdata_best.

Can I do anything to improve the OCR result in this sort of scenario?

Chris