On Wednesday, March 12, 2014 7:57:38 AM UTC-4, John Green wrote:
> *What I'm doing:* As part of a longer pipeline, at one step I am
> reasoning over very small but highly characteristic strings like drug
> dosage, "60 mg". Edit distance (Levenshtein or a variation) and n-grams,
> even unigrams, only do a so-so job. I'd like to calculate probabilities
> based on look-alikes per above. That is, a not unreasonable case on a poor
> document is to mistake "60 mg" for "6Ong", which gives a ratio of only 44%,
> for example. But, if the program knew that 0 and O as well as m and n can
> be frequently mistaken for the same character ... better matching. I've
> also considered dumping individual character probabilities into the mix
> from Tesseract's API, but I'm new to Tesseract, haven't gotten there yet,
> and I'm not even convinced that this would be a better solution.
It's not clear from your description whether you're already doing this, but you might want to consider modeling the target domain that you're matching against, either in terms of n-gram probabilities or with something even stricter. There's going to be much less variability in something like a dosage string than there is in general text. You could use something like a medical term ontology to create a pretty comprehensive list of things like units, frequencies, routes, etc.

Tom
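For what it's worth, the two ideas in this thread (cheaper substitutions for look-alike characters, and matching against a restricted list of valid domain strings) can be combined in a few lines of Python. This is only a rough sketch under stated assumptions: the confusion pairs, the 0.2 substitution cost, the whitespace stripping, and the candidate list are all made up for illustration and would need to be tuned on real OCR errors. It does not use Tesseract's API at all.

```python
# Hypothetical look-alike pairs; real ones should come from observed OCR errors.
CONFUSABLE = {frozenset("0O"), frozenset("mn"), frozenset("1l"), frozenset("5S")}
CONFUSION_COST = 0.2  # assumed cost of swapping two look-alike characters


def sub_cost(a: str, b: str) -> float:
    """Substitution cost: free if equal, cheap if look-alike, 1.0 otherwise."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return CONFUSION_COST
    return 1.0


def weighted_distance(s: str, t: str) -> float:
    """Levenshtein distance with confusion-aware substitution costs."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [float(i)]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1.0,                 # delete a from s
                cur[j - 1] + 1.0,              # insert b into s
                prev[j - 1] + sub_cost(a, b),  # substitute a -> b
            ))
        prev = cur
    return prev[-1]


def squash(s: str) -> str:
    """Drop all whitespace (assumption: OCR often loses or adds spaces)."""
    return "".join(s.split())


def best_match(ocr_text: str, candidates: list[str]) -> str:
    """Return the candidate closest to the OCR output, ignoring whitespace."""
    return min(
        candidates,
        key=lambda c: weighted_distance(squash(ocr_text), squash(c)),
    )


# Hypothetical domain list; in practice this could be generated from a
# medical term ontology (numbers x units, frequencies, routes, ...).
DOSAGES = ["6 mg", "60 mg", "600 mg", "60 mcg"]
```

With these assumed costs, `best_match("6Ong", DOSAGES)` picks "60 mg", because both O/0 and n/m are treated as cheap substitutions. Without the whitespace stripping, "6 mg" would actually score slightly better (the O gets substituted for the space), so how spaces are handled matters about as much as the confusion table.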

