*Bottom line up front:* Has anyone compiled a list of common misrecognitions made by Tesseract? E.g.: e is often read as o, l can be mistaken for 1, etc.
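For context, here is roughly how I imagine using such a list: a weighted edit distance whose substitution cost is discounted for known look-alike pairs. This is only a minimal sketch; the confusion table and its costs below are my own illustrative assumptions, not a compiled Tesseract list.

```python
# Weighted Levenshtein distance where known OCR look-alike pairs
# (0/O, 1/l, m/n, ...) substitute at a reduced cost.
# NOTE: the pairs and costs here are assumed for illustration only.
CONFUSIONS = {
    frozenset("0O"): 0.1,
    frozenset("1l"): 0.1,
    frozenset("1I"): 0.1,
    frozenset("5S"): 0.2,
    frozenset("mn"): 0.3,
    frozenset("eo"): 0.3,
}

def sub_cost(a: str, b: str) -> float:
    """Substitution cost: free if equal, cheap if a known confusion, else 1."""
    if a == b:
        return 0.0
    return CONFUSIONS.get(frozenset((a, b)), 1.0)

def ocr_distance(s: str, t: str) -> float:
    """Standard Levenshtein DP, but with confusion-aware substitution costs."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [float(i)]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1.0,                  # deletion
                           cur[j - 1] + 1.0,               # insertion
                           prev[j - 1] + sub_cost(a, b)))  # substitution
        prev = cur
    return prev[-1]

# "60 mg" vs "6Ong": 0->O and m->n are cheap, only the lost space costs 1.
print(round(ocr_distance("60 mg", "6Ong"), 2))
```

With plain Levenshtein the same pair is distance 3; the confusion-aware version scores it much closer, which is the behavior I'm after.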
Forgive me if this is well known, but a cursory search turned up nothing, though it is always possible I was too hasty or overlooked an obvious resource.

*What I'm doing:* As part of a longer pipeline, one step reasons over very small but highly characteristic strings, such as drug dosages like "60 mg". Edit distance (Levenshtein or a variant) and n-grams, even unigrams, do only a so-so job. I'd like to calculate match probabilities based on the look-alikes described above. For example, a not-unreasonable error on a poor document is to read "60 mg" as "6Ong", which gives a similarity ratio of only 44%. But if the program knew that 0 and O, as well as m and n, are frequently mistaken for one another, matching would improve.

I've also considered mixing in individual character probabilities from Tesseract's API, but I'm new to Tesseract, haven't gotten there yet, and I'm not convinced this would be a better solution.

Thanks in advance to anyone who has the time to answer.

Regards,
John

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

