Re: mis-decoding a single line of text

patrickq Thu, 29 Jul 2010 05:18:45 -0700

OK so I did my first experiment with this:
1. Created a plain text file with a single line containing "Vice"
2. Saved as "eng-user-words" in the tessdata folder (next to
eng.traineddata)
3. Ran recognition on an image where "Vice President" was previously
returned as "Woe President"


No effect, same mistake. Too far apart to correct?

On Jul 27, 6:09 pm, "Jimmy O'Regan" <[email protected]> wrote:
> On 27 July 2010 21:55, patrickq <[email protected]> wrote:
>
> > I assume you are referring to
> > http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
> > ?
> > It's helpful, thanks, and I should have checked what's there first.
>
> > My understanding is that:
> > - one dictionary file (eng.word-dawg) is included as part of building
> > the training data, and includes a separation between frequent and
> > infrequent words
>
> Two separate files (or subfiles, in Tess 3).
>
> > - there is no guideline explaining what's "frequent" versus not, nor
> > how the two sets interact. Do the frequent words get picked any time
> > the recognized text is two letters away (as opposed to infrequent
> > words where they trigger only if text is one letter away)? Unclear.
>
> How to calculate frequency?
>
> Courtesy of my main project
> (http://wiki.apertium.org/wiki/Building_dictionaries):
> $ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' | sed
> 's/$/\n/g' | sed 's/\[\[.*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
> sed 's/&.*;/ /g'
>
> will get you a frequency list from a wikipedia xml dump (you probably
> won't want Afrikaans, but everything else in the pipeline should stay
> the same).
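
As quoted, the pipeline only seems to strip wiki markup and entities from the dump; a counting step would still be needed to turn that text into a frequency list. A minimal Python sketch of such a step, assuming the cleaned text is already in a string (the function name and the sample sentence are made up for illustration):

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    # Lowercase, then keep runs of letters (allowing inner apostrophes) as words.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    counts = Counter(words)
    # Sort by descending count, then alphabetically so ties are stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


sample = "the vice president met the president of the board"
for word, count in frequency_list(sample)[:3]:
    print(count, word)
```

The top of the resulting list would be the "frequent" words; where to cut it off is a judgment call that this sketch does not settle.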
>
> (As for why it's important: http://en.wikipedia.org/wiki/Zipf%27s_law
> - though it'd be more apparent in longer texts)
>
> I guess that by 'two letters away'/'one letter away' that you have a
> passing familiarity with the use of Levenshtein distance in spell
> checkers... you're on a similar track, but not quite.
>
> In a simple (non-statistical) spell checker, you have a dictionary -
> normally, just a simple list of words - and if the word being checked
> isn't in the dictionary, it provides a list of suggestions based on
> the Levenshtein distance between the word being checked, and available
> words. (This word list is a language model - a unigram model - at its
> simplest).
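
To make the simple spell checker concrete, here is a rough Python sketch of suggestions by Levenshtein distance. The word list, the function names and the distance cut-off of 2 are all assumptions for illustration, and this describes a generic spell checker, not what Tesseract does:

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def suggest(word, dictionary, max_distance=2):
    """Suggest dictionary words close to an unknown word, nearest first."""
    if word in dictionary:
        return []  # known word, nothing to suggest
    scored = sorted((levenshtein(word, entry), entry) for entry in dictionary)
    return [entry for distance, entry in scored if distance <= max_distance]


WORDS = {"vice", "voice", "wine", "president"}  # hypothetical word list
print(levenshtein("woe", "vice"))
print(suggest("woe", WORDS))
```

With this toy list, "woe" is three edits away from "vice", so a checker with a cut-off of 2 would not offer it.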
>
> At the most basic, a statistical spell checker does the same, but the
> wordlist has frequencies attached, so higher frequency correctly
> spelled words are given a higher priority in the suggestions list.
> (This language model is a statistical unigram language model).
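
A sketch of the statistical variant, in Python: candidates one edit away from the unknown word are generated, filtered against the word list, and ordered by frequency. The counts in FREQ are invented, and restricting candidates to a single edit over lowercase ASCII letters is a simplification chosen for the example:

```python
import string

# Hypothetical counts; in practice these would come from a corpus.
FREQ = {"meet": 120, "meat": 40, "melt": 5, "president": 300}


def edits1(word):
    """All strings one deletion, substitution or insertion away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    replaces = {left + c + right[1:]
                for left, right in splits if right for c in letters}
    inserts = {left + c + right for left, right in splits for c in letters}
    return deletes | replaces | inserts


def ranked_suggestions(word, freq=FREQ):
    """Known words one edit away, most frequent first."""
    if word in freq:
        return [word]  # already a known word
    candidates = [w for w in edits1(word) if w in freq]
    return sorted(candidates, key=lambda w: (-freq[w], w))


print(ranked_suggestions("meqt"))
```

All three candidates here are equally close to "meqt"; only the attached frequency decides the order.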
>
> This sort of unigram based spell checker is good at finding typos, but
> not at finding misuses - it won't find the error in "I'll meat you at
> the cinema". If you switch to a bigram model (2 words), you can catch
> that sort of error - as long as you have a list of correctly spelled,
> but easily confused words like 'meet' and 'meat' - by giving you a
> small context window. "I'll meat" will almost never occur in normal
> English text, but "I'll meet" will.
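
The bigram idea can be sketched in a few lines of Python. The tiny corpus and the confusable-word table below are invented for the example, and only the word to the left is used as context:

```python
from collections import Counter

# Hypothetical training text; a real model would use a large corpus.
CORPUS = (
    "i'll meet you at the cinema . "
    "i'll meet him later . "
    "we eat meat at home . "
    "they meet at noon ."
)
# Correctly spelled but easily confused words (assumed list).
CONFUSABLE = {"meet": "meat", "meat": "meet"}

tokens = CORPUS.split()
BIGRAMS = Counter(zip(tokens, tokens[1:]))


def check(sentence):
    """Return (position, word, suggestion) for likely misuses."""
    words = sentence.lower().split()
    flagged = []
    for i in range(1, len(words)):
        word = words[i]
        alternative = CONFUSABLE.get(word)
        if alternative is None:
            continue
        # Flag the word if its confusable twin follows the previous
        # word more often in the corpus than the word itself does.
        if BIGRAMS[(words[i - 1], alternative)] > BIGRAMS[(words[i - 1], word)]:
            flagged.append((i, word, alternative))
    return flagged


print(check("I'll meat you at the cinema"))
```

A unigram checker would pass that sentence, since "meat" is a valid word; the pair ("i'll", "meat") is what gives the error away.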
>
> So far, so erudite. I'm getting to the point, honest.
>
> So, Tesseract has unigram word models with no real statistics
> attached (though it does attach a fake statistical weight, depending
> on the type of dictionary). In practice it's actually closer to the
> more complicated bigram model, because Tesseract isn't testing
> words, per se; it's testing the characters.
>
> At any point, if you ask Tesseract what the 'word' it sees is, it will
> simply give you a string composed of the highest-confidence
> characters: the word structure also keeps an array of possible
> characters along with the confidence from the recogniser. The weight
> from a dictionary can add extra weight to a set of characters, but
> only if the set of characters that word is composed from is among the
> set of choices (some other steps can add or remove characters... etc).
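
To check my understanding of that description, here is a toy Python model. It is not Tesseract's actual algorithm or data structure: the per-position candidate lists, the confidence values, the averaging and the size of the dictionary bonus are all invented for illustration.

```python
from itertools import product


def best_word(choices, dictionary, bonus=0.3):
    """Pick the highest-scoring string from per-position candidates.

    choices: one list per character position, each holding
    (character, confidence) pairs from a hypothetical recogniser.
    """
    best_text, best_score = None, float("-inf")
    for combo in product(*choices):
        text = "".join(ch for ch, _ in combo)
        score = sum(conf for _, conf in combo) / len(combo)
        if text.lower() in dictionary:
            score += bonus  # a dictionary hit adds weight
        if score > best_score:
            best_text, best_score = text, score
    return best_text


USER_WORDS = {"vice"}

# Case 1: the image was cut into three cells, so no candidate string
# can ever spell a four-letter word.
three_cells = [[("W", 0.6), ("V", 0.5)], [("o", 0.7), ("i", 0.4)], [("e", 0.9)]]
# Case 2: four cells, with the right letters among the weaker choices.
four_cells = [[("W", 0.6), ("V", 0.5)], [("o", 0.6), ("i", 0.5)],
              [("e", 0.5), ("c", 0.45)], [("e", 0.9)]]

print(best_word(three_cells, USER_WORDS))
print(best_word(four_cells, USER_WORDS))
```

In this toy model the first case stays "Woe" whatever is in the word list, while the second flips to "Vice". If that matches how the real thing behaves, a user-words entry can only help when the letters of the word are already among the recogniser's choices.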
>
> > - there is no mention of the advantages of the eng.word-dawg method
> > versus eng.user-words but I guess eng.user-words is the only option to
> > anyone who is not building his own training data?
>
> Yes. The user-words, IIRC, also has a slightly higher weight.
>
> > I'd like to give it a try (using eng.user-words) - the one question I
> > still have is how do results get affected by adding a word to the
> > dictionary? Auto-correct when replacing letters with a low score gets
> > a match? How many corrections per word? Anyone with answers, please
> > share and I volunteer to add to the doc - it's a wiki after all, why
> > do I get a sense that a thick layer of dust covers the doc :-)?
>
> The wiki can only be added to by people with commit access... which
> kind of defeats the purpose, IMO. You can add comments, but I'm sure
> most people would get fed up of wading through the 'plz send me source
> of tesseract' dross.
>
> On the bright side, the wiki source is held in SVN, and if you send me
> a diff, or put it in the issue tracker, I'll commit it.
>
>
>
>
>
> > Patrick
>
> > On Jul 27, 4:20 pm, Eugene Reimer <[email protected]> wrote:
> >> A quick glance at the documentation will tell you that "the dictionary"
> >> lives in several DAWG files, as well in that user-words file.
>
> >> patrickq wrote, On 2010-07-27 14:59:
>
> >> > I get HAX 6 5-5,- with Tesseract 3.0
>
> >> > What I find remarkable is that half the folks on this forum would love
> >> > to disable the word recognition (i.e. dictionary), the other half
> >> > would like to enable it - and absolutely no one knows how to enable/
> >> > disable the dictionary nor can say for sure if it's actually enabled
> >> > or not by default. I am included in the group of the clueless - we
> >> > have scanned thousands of business cards and still have no idea
> >> > whatsoever what the hell is going on with that elusive dictionary.
>
> >> > I gather from Jimmy's recent answer that the dictionary is contained
> >> > in a single file of type text, one word per line, in a file called
> >> > eng.user-words (any support for regular expressions there? for example
> >> > to say that [\\d]*th is a common word) placed in the Tessdata folder
> >> > but we await final confirmation. Is it enough that the file exists?
> >> > Does removing the file disable the dictionary?
>
> >> > Clearly many have used the dictionary but sadly it appears that these
> >> > knowledgeable people deserted this forum once they got the answers
> >> > they needed - if you see one of these gentlemen (or ladies, yes) roaming
> >> > the streets, please admonish them for not staying subscribed to forum
> >> > messages to give back in helping others!
>
> > --
> > You received this message because you are subscribed to the Google Groups 
> > "tesseract-ocr" group.
> > To post to this group, send email to [email protected].
> > To unsubscribe from this group, send email to 
> > [email protected].
> > For more options, visit this group at
> > http://groups.google.com/group/tesseract-ocr?hl=en.
>
> --
> <Leftmost> jimregan, that's because deep inside you, you are evil.
> <Leftmost> Also not-so-deep inside you.

