Try preprocessing your documents. Convert each page to a black-and-white image first and crop it to the text area, then enhance the text by thresholding. In my experience, Tesseract does not do well when there are stray lines or boxes on the page. You can also experiment with different page segmentation modes (--psm); I found changing them useful in my application. If all the documents are in a similar font, you could also fine-tune the eng/Latin model for that font.
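To make the preprocessing steps concrete, here is a rough sketch in plain Python. It is only an illustration, not something tuned on your scans: it assumes the page is already loaded as a 2-D list of 0-255 grey values, uses Otsu's method as one possible way to pick the threshold, and the helper names (otsu_threshold, binarize, crop_to_text, write_pgm, tesseract_cmd, run_ocr) are made up for this example. In practice you would probably do the same thing with OpenCV or Pillow.

```python
import shutil
import subprocess


def otsu_threshold(gray):
    """Pick a global threshold with Otsu's method (2-D list of 0-255 ints)."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance; the best threshold maximises it.
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(gray):
    """Black ink (0) on white paper (255)."""
    t = otsu_threshold(gray)
    return [[255 if v > t else 0 for v in row] for row in gray]


def crop_to_text(binary, margin=1):
    """Crop to the bounding box of the black pixels, plus a small margin."""
    rows = [y for y, row in enumerate(binary) if 0 in row]
    cols = [x for x in range(len(binary[0]))
            if any(row[x] == 0 for row in binary)]
    if not rows or not cols:
        return binary  # nothing dark found, leave the image alone
    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + margin, len(binary) - 1)
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + margin, len(binary[0]) - 1)
    return [row[left:right + 1] for row in binary[top:bottom + 1]]


def write_pgm(binary, path):
    """Save as binary PGM (P5), a format Tesseract can read."""
    height, width = len(binary), len(binary[0])
    with open(path, "wb") as f:
        f.write(f"P5\n{width} {height}\n255\n".encode("ascii"))
        for row in binary:
            f.write(bytes(row))


def tesseract_cmd(image_path, psm=6, lang="eng"):
    """Build the command line; psm=6 is just a starting point to vary."""
    return ["tesseract", image_path, "stdout",
            "--oem", "1", "--psm", str(psm), "-l", lang]


def run_ocr(image_path, psm=6):
    """Run Tesseract if it is installed and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    result = subprocess.run(tesseract_cmd(image_path, psm),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

You would then call write_pgm(crop_to_text(binarize(gray)), "page.pgm") and loop run_ocr("page.pgm", psm) over a few --psm values (for example 3, 4 and 6) to see which one gives the fewest errors on your pages.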
On Monday, April 21, 2025 at 12:03:33 PM UTC-4 mcarlo...@gmail.com wrote:

> Hello everyone,
>
> A quick question regarding the use of the tessdata_best <https://github.com/tesseract-ocr/tessdata_best> models. I have simply copy-pasted the eng.traineddata file into the local directory where Tesseract takes models from (the one shown when running --list-langs: in my case, /opt/homebrew/share/tessdata/). I simply replaced the standard model that comes with the tesseract Homebrew package. *Should I adapt some other configuration in order to have better results (apart from --oem 1)?*
>
> Honestly, I am having the same amount of (or even more) errors than with the standard model. I am trying to automatically transcribe documents such as the one attached (a simple excerpt from a longer file; see also e.g. https://royalsocietypublishing.org/doi/epdf/10.1098/rstl.1720.0013). *Any idea if there are more suitable models for this kind of 18th-century documents?* (Seems like a 18th-century Caslon font, which uses the long S <https://en.wikipedia.org/wiki/Long_s> quite often)
>
> Thank you for any kind of help you can provide!
> Best,
> Massimiliano