Hello, I am using Tesseract to extract data from screenshots. I've noticed that it sometimes confuses visually similar characters, such as '0' and 'O', 'P' and 'R', or '-' and '—', in either direction. This happens even when the font is the same throughout, and sometimes even after preprocessing such as binarization. Is there a comprehensive map of the character pairs that are commonly confused because they look alike? I need such a map to compute an effective Levenshtein string distance in which substitutions between visually similar characters cost less than other substitutions. Thanks.
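For illustration, here is a minimal sketch of the kind of weighted distance I have in mind. The pairs and costs in CONFUSION_COST are placeholder assumptions, not measured Tesseract confusion rates, and the function names are hypothetical ones of my own.

```python
# Hypothetical confusion costs: the pairs and weights below are illustrative
# assumptions, not values measured from Tesseract output.
CONFUSION_COST = {
    frozenset("0O"): 0.2,
    frozenset("PR"): 0.4,
    frozenset(("-", "\u2014")): 0.2,  # hyphen-minus vs. em dash
}


def substitution_cost(a, b):
    """Return 0 for identical characters, a reduced cost for known
    look-alike pairs, and the standard cost of 1 otherwise."""
    if a == b:
        return 0.0
    return CONFUSION_COST.get(frozenset((a, b)), 1.0)


def weighted_levenshtein(s, t):
    """Levenshtein distance with unit insert/delete costs and a
    pair-dependent substitution cost (row-by-row dynamic programming)."""
    # Distance from the empty prefix of s to each prefix of t.
    previous = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        # Distance from s[:i] to the empty prefix of t is i deletions.
        current = [float(i)]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1.0,                           # delete a
                current[j - 1] + 1.0,                        # insert b
                previous[j - 1] + substitution_cost(a, b),   # substitute
            ))
        previous = current
    return previous[-1]
```

With these placeholder costs, weighted_levenshtein("R0OM", "ROOM") comes out at 0.2 instead of the 1.0 that plain Levenshtein would give, which is the effect I am after. What I am missing is a realistic set of pairs and weights to put in the table.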