Hello, I am using Tesseract to extract some data from screenshots.
I've noticed that it sometimes confuses similar-looking characters, such as 
'0' and 'O', 'P' and 'R', or '-' and '—', in either direction. This 
happens even with the same font, and sometimes even after 
preprocessing such as binarization.
Is there a comprehensive map of visually similar characters that are 
commonly misrecognised for one another?
I need such a map to compute an effective string distance with 
Levenshtein, adjusting the substitution cost for characters that are very similar. 
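
A minimal sketch of the kind of weighted distance described above, in plain Python with no external library. The confusion pairs are only the three examples from this message, and the cost value 0.2 and the names CONFUSABLE, SIMILAR_COST, sub_cost and weighted_levenshtein are illustrative assumptions, not measured values or part of Tesseract.

```python
# Hypothetical confusion pairs taken from the examples in this message.
# "\u2014" is the long dash that gets confused with the hyphen.
CONFUSABLE = {frozenset("0O"), frozenset("PR"), frozenset("-\u2014")}

# Assumed placeholder cost for swapping two look-alike characters.
SIMILAR_COST = 0.2


def sub_cost(a: str, b: str) -> float:
    """Cost of substituting character a with character b."""
    if a == b:
        return 0.0
    return SIMILAR_COST if frozenset((a, b)) in CONFUSABLE else 1.0


def weighted_levenshtein(s: str, t: str) -> float:
    """Levenshtein distance with cheaper substitutions for look-alikes."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                 # delete a
                curr[j - 1] + 1.0,             # insert b
                prev[j - 1] + sub_cost(a, b),  # substitute a with b
            ))
        prev = curr
    return prev[-1]
```

With these assumed costs, weighted_levenshtein("R0OM", "ROOM") is much smaller than the distance between two strings that differ by an unrelated character, which is the effect I am after. A full map would simply replace CONFUSABLE, possibly with a separate cost per pair.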
Thanks.
