*Bottom line up front:* Has anyone compiled a list of the characters that 
Tesseract commonly confuses? For example, e is often read as o, and l can 
be mistaken for 1. 

Forgive me if this is well known. A cursory search turned up nothing, 
though I may have been too hasty or overlooked an obvious resource.

*What I'm doing:* As part of a longer pipeline, one step reasons over very 
short but highly characteristic strings such as a drug dosage, "60 mg". 
Edit distance (Levenshtein or a variation) and n-grams, even unigrams, only 
do a so-so job. I'd like to calculate match probabilities based on the 
look-alikes described above. For example, a not unreasonable misreading on 
a poor document is "6Ong" for "60 mg", which gives a similarity ratio of 
only 44%. But if the program knew that 0 and O, as well as m and n, are 
frequently mistaken for each other, the matching would improve. I've also 
considered adding the per-character probabilities from Tesseract's API to 
the mix, but I'm new to Tesseract, haven't gotten there yet, and I'm not 
even convinced that this would be a better solution. 
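
To make the idea concrete, here is a rough sketch of the kind of matching I 
have in mind: an ordinary Levenshtein distance in which swapping two 
look-alike characters costs less than swapping unrelated ones. The 
confusable pairs and all the cost values are placeholders I made up for 
illustration; they are not measured Tesseract error rates, and the function 
names are my own.

```python
# Hypothetical look-alike pairs; illustrative only, not measured OCR statistics.
CONFUSABLE = {
    frozenset("0O"),
    frozenset("l1"),
    frozenset("eo"),
    frozenset("mn"),
}
CONFUSABLE_COST = 0.2  # assumed cost of swapping two look-alike characters
SPACE_COST = 0.5       # assumed cost of a dropped or inserted space


def sub_cost(a, b):
    """Cost of reading character a as character b."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return CONFUSABLE_COST
    return 1.0


def indel_cost(ch):
    """Cost of inserting or deleting ch; spaces are cheaper to lose."""
    return SPACE_COST if ch == " " else 1.0


def weighted_distance(s, t):
    """Levenshtein distance with the substitution/indel costs defined above."""
    # prev[j] is the cost of turning the processed prefix of s into t[:j].
    prev = [0.0]
    for ch in t:
        prev.append(prev[-1] + indel_cost(ch))
    for a in s:
        cur = [prev[0] + indel_cost(a)]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + indel_cost(a),        # delete a from s
                cur[j - 1] + indel_cost(b),     # insert b from t
                prev[j - 1] + sub_cost(a, b),   # substitute a with b
            ))
        prev = cur
    return prev[-1]


def similarity(s, t):
    """Score in [0, 1]: 1 minus the distance normalised by the longer length."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - weighted_distance(s, t) / longest


print(similarity("60 mg", "6Ong"))
```

With these made-up costs, "60 mg" against "6Ong" comes out at a distance of 
about 0.9 (two look-alike swaps plus one lost space) rather than the 3 that 
plain Levenshtein gives. A compiled list of real confusions would replace 
the placeholder table and costs.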

Thanks in advance to anyone who has the time to answer,
Regards,
John

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
