[tesseract-ocr] Tesseract sometimes performs wrong auto-correction: how to disable it?

Youcef Wed, 25 Apr 2018 07:59:54 -0700

Hi,


Tesseract seems to post-process its predictions.

Here is what I get after running OCR on images (same font, same size, 
generated with text2image):

- an image containing "12345678I" => `123456781`
- an image containing "GLOTHUVFI" => `GLOTHUVFI`
- an image containing "12345678H" => `12345678H`
- an image containing "GLOTHUVFH" => `GLOTHUVFH`
- an image containing "12345678A" => `123456784`
- an image containing "GLOTHUVFA" => `GLOTHUVFA`

It looks like Tesseract doesn't like a word made of several digits followed 
by a single letter. In fact, if the letter looks like a digit ("I" and "A" 
look like "1" and "4" respectively), it replaces it with the closest digit.
I have tried tuning the following parameters, without any change in the 
result:

- segment_penalty_dict_frequent_word
- language_model_penalty_chartype
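
For reference, here is a minimal sketch of how parameters like these can be 
passed to the tesseract command line with its `-c name=value` option. It 
assumes the `tesseract` binary is on the PATH; the helper names, the image 
file name and the values are placeholders for illustration, not the exact 
ones used for the results above.

```python
import subprocess


def build_tesseract_cmd(image_path, params):
    """Return the argv list for one tesseract run that prints to stdout.

    Each entry of `params` becomes one `-c name=value` option.
    """
    cmd = ["tesseract", image_path, "stdout"]
    for name, value in params.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


def run_ocr(image_path, params):
    """Run tesseract (assumed to be installed) and return the recognized text."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, params),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Example call (placeholder file name and values):
# run_ocr("12345678I.png", {
#     "segment_penalty_dict_frequent_word": 1,
#     "language_model_penalty_chartype": 0,
# })
```

Building the argument list separately from running it makes it easy to print 
the exact command and check that each parameter is really being passed.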

Thanks for any help.

Regards

