Hello Deborah,

Hopefully this isn't off-topic, and I don't mean to derail your thread, but I wanted to chime in that I am running into very similar difficulties, in the hope that this generates enough interest to yield an effective solution.

On Sunday, June 16, 2024 at 2:41:27 AM UTC-4 Deborah wrote:
> Hello, I am using Tesseract to extract some data from screenshots.
> I've noticed that sometimes there are mistakes in interpreting characters
> like '0' and 'O', 'P' and 'R' or '-' and '—' or the other way around. This
> happens with the same font. And it happens sometimes even with some
> preprocessing, like binarization.
> Is there a comprehensive map of all characters that are usually mistakenly
> recognised that are very similar?
> I need that map in order to calculate effective string distance with
> Levenshtein and adjust the cost for characters that are very similar.
> Thanks.
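
In case it helps frame the problem, here is a minimal Python sketch of the cost-adjusted Levenshtein distance described in the quoted message. The three confusable pairs are only the examples mentioned there. The `CONFUSABLE` table, the 0.2 substitution cost, and the function names are my own placeholders and not anything provided by Tesseract; a real table would have to be filled in from errors observed on your own screenshots.

```python
# Hypothetical confusion table: pairs come from the examples in this thread,
# and the 0.2 cost is an arbitrary placeholder, not a measured value.
CONFUSABLE = {
    frozenset(("0", "O")): 0.2,
    frozenset(("P", "R")): 0.2,
    frozenset(("-", "\u2014")): 0.2,  # hyphen-minus vs. em dash
}


def sub_cost(a: str, b: str) -> float:
    """Cost of substituting a for b: 0 if equal, reduced if confusable, else 1."""
    if a == b:
        return 0.0
    return CONFUSABLE.get(frozenset((a, b)), 1.0)


def weighted_levenshtein(s: str, t: str) -> float:
    """Levenshtein distance with unit insert/delete cost and weighted substitution."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                 # delete a from s
                curr[j - 1] + 1.0,             # insert b into s
                prev[j - 1] + sub_cost(a, b),  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(weighted_levenshtein("P0-1", "PO-1"))  # one confusable substitution
    print(weighted_levenshtein("P0-1", "PX-1"))  # one ordinary substitution
```

Using `frozenset` keys keeps the table symmetric, so '0' read as 'O' and 'O' read as '0' share one entry. If the two directions turn out to have different error rates, keying on ordered tuples instead would let each direction carry its own cost.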