Try preprocessing your documents first. Convert each page to a black and 
white image and crop it to the text area, then enhance the text by 
thresholding. In my experience, Tesseract does poorly when there are stray 
lines or boxes on the page. You can also experiment with the different page 
segmentation modes (--psm); I found changing them useful in my application. 
If all the documents are set in a similar font, you could also fine-tune the 
eng/Latin model for that font. 
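
To make the thresholding and psm suggestions concrete, here is a rough 
sketch, not a tested recipe for your scans. It assumes you already have the 
page as a flat list of 8-bit grayscale values (loading and cropping the image 
is left to whatever library you use), and the helper names, the example psm 
values (3, 4, 6) and the file name page.png are my own placeholders. The 
threshold is picked with Otsu's method, which is one common choice and not 
necessarily the best one for old prints.

```python
import shutil
import subprocess


def otsu_threshold(pixels):
    """Pick a global threshold (0-255) for 8-bit gray values using Otsu's method."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their gray values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); larger is better.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


def tesseract_cmd(image_path, psm=6, lang="eng"):
    """Build a tesseract command line for the LSTM engine and a given psm."""
    return ["tesseract", image_path, "stdout",
            "-l", lang, "--oem", "1", "--psm", str(psm)]


def try_psm_modes(image_path, modes=(3, 4, 6)):
    """Run tesseract once per psm value and return {psm: recognized text}."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    results = {}
    for psm in modes:
        proc = subprocess.run(tesseract_cmd(image_path, psm),
                              capture_output=True, text=True, check=True)
        results[psm] = proc.stdout
    return results
```

After saving the binarized page as, say, page.png, calling 
try_psm_modes("page.png") lets you compare the output of each mode side by 
side and keep whichever one makes the fewest errors on your pages.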

On Monday, April 21, 2025 at 12:03:33 PM UTC-4 mcarlo...@gmail.com wrote:

> Hello everyone,
>
> A quick question regarding the use of the tessdata_best 
> <https://github.com/tesseract-ocr/tessdata_best> models. I have simply 
> copy-pasted the eng.traineddata file into the local directory where 
> Tesseract takes models from (the one shown when running --list-langs: in my 
> case, /opt/homebrew/share/tessdata/). I simply replaced the standard model 
> that comes with the tesseract Homebrew package. *Should I adapt some 
> other configuration in order to have better results (apart from --oem 1)?*
>
> Honestly, I am having the same amount of (or even more) errors than with 
> the standard model. I am trying to automatically transcribe documents such 
> as the one attached (a simple excerpt from a longer file; see also e.g. 
> https://royalsocietypublishing.org/doi/epdf/10.1098/rstl.1720.0013). *Any 
> idea if there are more suitable models for this kind of 18th-century 
> documents? *(Seems like an 18th-century Caslon font, which uses the long S 
> <https://en.wikipedia.org/wiki/Long_s> quite often)
>
> Thank you for any kind of help you can provide!
> Best,
> Massimiliano
>
