On Wednesday, March 12, 2014 7:57:38 AM UTC-4, John Green wrote:
>
>
> *What I'm doing: *As part of a longer pipeline, at one step I am 
> reasoning over very small but highly characteristic strings like drug 
> dosage, "60 mg". Edit distance (Levenshtein or a variation) and n-grams, 
> even unigrams, only do a so-so job. I'd like to calculate probabilities 
> based on look-alikes per above. That is, a not unreasonable case on a poor 
> document is to mistake "60 mg" for "6Ong", which gives a ratio of only 44%, 
> for example. But, if the program knew that 0 and O as well as m and n can 
> be frequently mistaken for the same character ... better matching. I've 
> also considered dumping individual character probabilities into the mix 
> from Tesseract's API, but I'm new to Tesseract, haven't gotten there yet, 
> and I'm not even convinced that this would be a better solution. 
>
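One way to encode the look-alike idea described above is a weighted Levenshtein distance in which substitutions between confusable characters cost less than a full edit. A minimal sketch in Python -- the confusion set and the 0.2 cost here are illustrative guesses, not Tesseract's actual confusion data:

```python
# A hypothetical set of OCR-confusable pairs; a real table could be
# derived from error statistics on your own documents.
CONFUSABLE = {frozenset(p) for p in [("0", "O"), ("m", "n"),
                                     ("1", "l"), ("5", "S")]}

def sub_cost(a, b):
    """Substitution cost: free for a match, cheap for look-alikes."""
    if a == b:
        return 0.0
    return 0.2 if frozenset((a, b)) in CONFUSABLE else 1.0

def weighted_levenshtein(s, t):
    """Standard dynamic-programming edit distance, but with the custom
    substitution cost; insertions and deletions stay at cost 1."""
    d = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        d[i][0] = float(i)
    for j in range(1, len(t) + 1):
        d[0][j] = float(j)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
                          d[i][j - 1] + 1.0,      # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return d[-1][-1]

print(weighted_levenshtein("60 mg", "6Ong"))   # 1.4, vs. 3.0 unweighted
```

On the "60 mg"/"6Ong" example the distance drops from 3.0 (plain Levenshtein) to 1.4, because the 0/O and m/n substitutions are discounted and only the dropped space costs a full edit.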

It's not clear from your description if you're already doing this, but you 
might want to consider modeling the target domain that you're matching to 
either in terms of n-gram probabilities or something even stricter. 
 There's going to be much less variability in something like a dosage 
string than there is in general text.  You could use something like a 
medical term ontology to create a pretty comprehensive list of things like 
units, frequencies, routes, etc.
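To make that concrete: when the target strings follow a tight grammar, you can enumerate look-alike repairs of the OCR output and keep only those that parse as a valid dosage. A sketch -- the unit list and look-alike map below are hand-picked for illustration; a real system would pull units, frequencies, routes, etc. from a medical ontology:

```python
import itertools
import re

# Illustrative whitelist of units; in practice, take these from an
# ontology rather than hard-coding them.
UNITS = ["mg", "mcg", "g", "ml"]
DOSAGE = re.compile(r"^\d+(\.\d+)?\s?(%s)$" % "|".join(UNITS))

# Hypothetical look-alike map: each OCR'd character may really be
# any of the listed alternatives (including itself).
ALTS = {"O": "0O", "0": "0O", "n": "nm", "m": "mn", "l": "l1", "1": "1l"}

def repair_candidates(s, limit=200):
    """Yield strings matching the dosage grammar that are reachable by
    swapping look-alike characters (optionally re-inserting a space
    between the number and the unit)."""
    pools = [ALTS.get(c, c) for c in s]
    for combo in itertools.islice(itertools.product(*pools), limit):
        cand = "".join(combo)
        # Also try the variant with a space restored after the digits.
        for variant in (cand, re.sub(r"(?<=\d)(?=[a-z])", " ", cand)):
            if DOSAGE.match(variant):
                yield variant

print(sorted(set(repair_candidates("6Ong"))))   # → ['60 mg', '60mg']
```

Here "6Ong" is repaired to "60 mg" (and the space-less "60mg") because those are the only look-alike variants the grammar accepts; the strict target model does the disambiguation that raw edit distance cannot.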

Tom 

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
