At https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#binarisation, the Tesseract docs say:
*While tesseract version 3.05 (and older) handle inverted image (dark background and light text) without problem, for 4.x version use dark text on light background*

and

*If you OCR just text area without any border, tesseract could have problems with it. See for some details in tesseract user forum <https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/v26a-RYPSOE/2Sppq61GBwAJ> #427 <https://github.com/tesseract-ocr/tesseract/issues/427>. You can easy add small border (e.g. 10 pt) with ImageMagick® <http://imagemagick.org/script/index.php>:*

*convert 427-1.jpg -bordercolor White -border 10x10 427-1b.jpg*

I'm a little puzzled about two things:

1. If we're using a light background, won't "adding a white border" typically just mean making a larger image in which the target text makes up less of the area (because the border will match the color of the background)? Is that the intended interpretation: to avoid text that directly touches the boundaries of the image?

2. The inversion advice talks about Tesseract 3 and 4. Does Tesseract 5 maintain the "dark text on light background" preference of 4?

P.S. I tried to post this message once before and it didn't show up for some reason. Giving it one more shot; sorry if this double-posts.
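To make question 1 concrete, here is a minimal pure-Python sketch of what I understand the preprocessing to amount to: invert a light-on-dark crop so the text is dark on light, then pad it with white on all sides. The function names (`add_border`, `invert`, `dark_fraction`) and the tiny 3x2 test image are my own, purely for illustration; this is not Tesseract's or ImageMagick's code, and I'm assuming the `10x10` in the `convert` command means 10 pixels per side.

```python
WHITE = 255  # 8-bit grayscale value for white


def add_border(pixels, border, fill=WHITE):
    """Return a copy of a grayscale image (list of rows) padded on all sides."""
    if border < 0:
        raise ValueError("border must be non-negative")
    width = len(pixels[0]) if pixels else 0
    new_width = width + 2 * border
    # Blank rows above the original content.
    padded = [[fill] * new_width for _ in range(border)]
    # Original rows, padded on the left and right.
    for row in pixels:
        padded.append([fill] * border + list(row) + [fill] * border)
    # Blank rows below the original content.
    padded.extend([fill] * new_width for _ in range(border))
    return padded


def invert(pixels):
    """Swap light and dark, e.g. light text on dark becomes dark text on light."""
    return [[255 - value for value in row] for row in pixels]


def dark_fraction(pixels, threshold=128):
    """Fraction of pixels darker than the threshold (a rough proxy for text area)."""
    total = sum(len(row) for row in pixels)
    dark = sum(1 for row in pixels for value in row if value < threshold)
    return dark / total if total else 0.0


# Hypothetical tight crop: one light stroke on a dark background, touching the edges.
crop = [
    [0, 255, 0],
    [0, 255, 0],
]

normalized = invert(crop)              # dark text on light background
bordered = add_border(normalized, 10)  # 10 white pixels on every side

print(len(bordered[0]), len(bordered))  # 23 22
print(dark_fraction(normalized), dark_fraction(bordered))
```

If this matches what the docs intend, then the border changes nothing about the text itself. It only grows the canvas so the dark pixels no longer touch the image boundary, and the text's share of the total area drops accordingly.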