[tesseract-ocr] Fine Tuning with image containing multiple languages

Jacob Pedersen Fri, 16 Dec 2022 06:00:54 -0800

Hi

Consider an image containing a mix of English and German text. I am 
extracting WordStr boxes from it and fixing the mistakes.

When fine-tuning the two languages, I get encoding errors for English 
because its character set does not contain the German characters.
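
To illustrate the mismatch, here is a minimal Python sketch that lists the 
characters used in WordStr box lines but absent from a unicharset. It rests 
on two assumptions about the file formats: that a unicharset starts with a 
count line followed by one line per symbol (the symbol being the first 
space-separated field), and that a WordStr line ends with '#' followed by 
the transcription. The function names and the sample data are made up for 
illustration and are not part of Tesseract.

```python
def unicharset_chars(unicharset_text):
    """Return the set of symbols listed in unicharset-style text."""
    lines = unicharset_text.splitlines()
    # Assumption: line 1 is the entry count; every later line starts with
    # the symbol, followed by a space and its properties.
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def wordstr_texts(box_text):
    """Yield the transcription of each WordStr line in box-file text."""
    for line in box_text.splitlines():
        # Assumption: the transcription follows the first '#' on the line.
        if line.startswith("WordStr ") and "#" in line:
            yield line.split("#", 1)[1]


def missing_chars(box_text, unicharset_text):
    """Return the sorted characters used in the boxes but not in the unicharset."""
    known = unicharset_chars(unicharset_text)
    seen = {
        ch
        for text in wordstr_texts(box_text)
        for ch in text
        if not ch.isspace()
    }
    return sorted(seen - known)


# Hypothetical, heavily simplified sample data (real files carry more fields).
SAMPLE_UNICHARSET = "8\nNULL 0\nD 5\na 3\nn 3\ne 3\nm 3\nr 3\nk 3\n"
SAMPLE_BOX = "WordStr 10 20 300 60 0 #Dänemark\n"

missing = missing_chars(SAMPLE_BOX, SAMPLE_UNICHARSET)
# missing == ['ä']
```

Running such a check against the eng unicharset would show exactly which 
characters trigger the encoding errors, whichever of the options below turns 
out to be the right fix.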

What is the correct approach here?

1. Ignore encoding errors? What effect does this have on the result?
2. Create two box files, changing German words like 'Dänemark' to 'Danemark' 
in the one for eng?
3. Remove the German WordStr entries from the box file when fine-tuning deu?
4. Add the German characters to the English unicharset?
5. Something else?

/Jacob
