Re: mis-decoding a single line of text

Jimmy O'Regan Tue, 27 Jul 2010 15:09:12 -0700

On 27 July 2010 21:55, patrickq <[email protected]> wrote:
> I assume you are referring to 
> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
> ?
> It's helpful, thanks, and I should have checked what's there first.
>
> My understanding is that:
> - one dictionary file (eng.word-dawg) is included as part of building
> the training data, and includes a separation between frequent and
> infrequent words

Two separate files (or subfiles, in Tess 3).

> - there is no guideline explaining what's "frequent" versus not, nor
> how the two sets interact. Do the frequent words get picked any time
> the recognized text is two letters away (as opposed to infrequent
> words where they trigger only if text is one letter away)? Unclear.

How to calculate frequency?

Courtesy of my main project
(http://wiki.apertium.org/wiki/Building_dictionaries):
$ bzcat afwiki-20070508-pages-articles.xml.bz2 | grep '^[A-Z]' |
  sed 's/$/\n/g' | sed 's/\[\[.*|//g' | sed 's/\]\]//g' |
  sed 's/\[\[//g' | sed 's/&.*;/ /g'

will get you a cleaned-up plain-text corpus from a Wikipedia XML dump
(you probably won't want Afrikaans, but everything else in the
pipeline should stay the same). Piping that through something like
tr -s ' ' '\n' | sort | uniq -c | sort -rn
then gives you a rough frequency list.
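
If you'd rather do the counting step in Python than in the shell, a
minimal sketch might look like this. The function name and the
tokenising regex are my own assumptions for illustration, not anything
taken from the Apertium or Tesseract tools:

```python
import re
from collections import Counter

# Assumption: a "word" is any run of Unicode letters; real tokenisers
# are more careful about apostrophes, hyphens and so on.
WORD_RE = re.compile(r"[^\W\d_]+")

def word_frequencies(lines):
    """Count lower-cased words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts

if __name__ == "__main__":
    sample = ["Die kat sit op die mat", "die hond blaf"]
    for word, count in word_frequencies(sample).most_common(3):
        print(count, word)
```

Feeding it sys.stdin instead of the sample list would let it sit at
the end of the shell pipeline above.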

(As for why it's important: http://en.wikipedia.org/wiki/Zipf%27s_law
- though it'd be more apparent in longer texts)

I guess from 'two letters away'/'one letter away' that you have a
passing familiarity with the use of Levenshtein distance in spell
checkers... you're on a similar track, but not quite.

In a simple (non-statistical) spell checker, you have a dictionary -
normally just a simple list of words - and if the word being checked
isn't in the dictionary, the checker offers a list of suggestions
based on the Levenshtein distance between that word and the words in
the dictionary. (This word list is a language model - a unigram
model - at its simplest.)

At the most basic, a statistical spell checker does the same, but the
wordlist has frequencies attached, so higher frequency correctly
spelled words are given a higher priority in the suggestions list.
(This language model is a statistical unigram language model).
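
To make the two kinds of spell checker concrete, here is a rough
sketch: suggestions are found by Levenshtein distance, and ties at the
same distance are broken by frequency. The word list, the counts and
the function names are all invented for the example; no real spell
checker is being quoted here.

```python
from collections import Counter

def levenshtein(a, b):
    """Plain edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def suggest(word, freqs, max_dist=2, limit=5):
    """Return suggestions for a word that is not in the word list.

    Candidates are ordered by distance first, then by descending
    frequency (the 'statistical' part), then alphabetically.
    """
    if word in freqs:
        return []  # known word, nothing to suggest
    cands = []
    for known, count in freqs.items():
        dist = levenshtein(word, known)
        if dist <= max_dist:
            cands.append((dist, -count, known))
    cands.sort()
    return [known for _, _, known in cands[:limit]]

# Made-up unigram counts, purely for illustration.
FREQS = Counter({"the": 500, "then": 40, "they": 60, "tea": 5, "ten": 12})

if __name__ == "__main__":
    print(suggest("teh", FREQS))
```

Drop the counts (or make them all equal) and you are back to the
simple, non-statistical checker.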

This sort of unigram based spell checker is good at finding typos, but
not at finding misuses - it won't find the error in "I'll meat you at
the cinema". If you switch to a bigram model (2 words), you can catch
that sort of error - as long as you have a list of correctly spelled,
but easily confused words like 'meet' and 'meat' - by giving you a
small context window. "I'll meat" will almost never occur in normal
English text, but "I'll meet" will.
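
A toy version of that bigram check, with a made-up four-sentence
"corpus" and a hand-written confusion list (both are assumptions for
the sake of the example; a real checker would use counts from a large
corpus):

```python
from collections import Counter

# Hypothetical list of correctly spelled but easily confused words.
CONFUSABLE = {"meat": {"meet"}, "meet": {"meat"}}

def bigram_counts(corpus):
    """Count adjacent word pairs over an iterable of sentences."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

def check(sentence, bigrams):
    """Flag words whose confusable alternative fits the previous word better.

    Returns a list of (position, word, suggested_alternative).
    """
    words = sentence.lower().split()
    fixes = []
    for i in range(1, len(words)):
        prev, word = words[i - 1], words[i]
        for alt in sorted(CONFUSABLE.get(word, ())):
            # A Counter returns 0 for pairs it has never seen.
            if bigrams[(prev, alt)] > bigrams[(prev, word)]:
                fixes.append((i, word, alt))
    return fixes

CORPUS = [
    "I'll meet you later",
    "we'll meet at noon",
    "I'll meet them there",
    "she ate the meat",
]
BIGRAMS = bigram_counts(CORPUS)

if __name__ == "__main__":
    print(check("I'll meat you at the cinema", BIGRAMS))
```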

So far, so erudite. I'm getting to the point, honest.

So, Tesseract has unigram word models with no real statistics
attached, though it does attach a fake statistical weight that depends
on the type of dictionary. In practice it behaves more like the more
complicated bigram model, because Tesseract isn't testing words, per
se; it's testing the characters.

At any point, if you ask Tesseract what 'word' it sees, it will simply
give you a string composed of the highest-confidence characters. The
word structure also keeps an array of possible characters, along with
the confidence from the recogniser. A dictionary match can add extra
weight to a set of characters, but only if the characters that word is
composed from are among the set of choices (some other steps can add
or remove characters... etc).
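
As a very loose illustration of that idea (this is NOT Tesseract's
actual code or scoring; the data layout, the confidence values and the
bonus multiplier are all invented), imagine each character position
holding a few candidate characters with confidences:

```python
from itertools import product

# Hypothetical recogniser output: one list of (char, confidence)
# choices per character position.
CHOICES = [
    [("H", 0.90), ("M", 0.40)],
    [("A", 0.80), ("4", 0.30)],
    [("X", 0.55), ("Y", 0.50)],
]
DICTIONARY = {"HAY", "MAY", "HAT"}
DICT_BONUS = 1.2  # made-up weight given to dictionary words

def top_choice(choices):
    """String built from the highest-confidence character at each position."""
    return "".join(max(pos, key=lambda c: c[1])[0] for pos in choices)

def best_word(choices, dictionary, bonus=DICT_BONUS):
    """Best-scoring string once dictionary words get a bonus.

    Only strings that can be spelled from the available choices are
    ever considered, so a dictionary word outside them cannot win.
    """
    best, best_score = None, -1.0
    for combo in product(*choices):
        word = "".join(ch for ch, _ in combo)
        score = 1.0
        for _, conf in combo:
            score *= conf
        if word in dictionary:
            score *= bonus
        if score > best_score:
            best, best_score = word, score
    return best

if __name__ == "__main__":
    print(top_choice(CHOICES))             # raw recogniser answer
    print(best_word(CHOICES, DICTIONARY))  # after the dictionary bonus
```

Here "HAT" is in the dictionary but can never be picked, because "T"
is not among the choices for the last position.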

> - there is no mention of the advantages of the eng.word-dawg method
> versus eng.user-words but I guess eng.user-words is the only option to
> anyone who is not building his own training data?
>

Yes. The user-words, IIRC, also has a slightly higher weight.

> I'd like to give it a try (using eng.user-words) - the one question I
> still have is how do results get affected by adding a word to the
> dictionary? Auto-correct when replacing letters with a low score gets
> a match? How many corrections per word? Anyone with answers, please
> share and I volunteer to add to the doc - it's a wiki after all, why
> do I get a sense that a thick layer of dust covers the doc :-)?

The wiki can only be added to by people with commit access... which
kind of defeats the purpose, IMO. You can add comments, but I'm sure
most people would get fed up of wading through the 'plz send me source
of tesseract' dross.

On the bright side, the wiki source is held in SVN, and if you send me
a diff, or put it in the issue tracker, I'll commit it.

>
> Patrick
>
> On Jul 27, 4:20 pm, Eugene Reimer <[email protected]> wrote:
>> A quick glance at the documentation will tell you that "the dictionary"
>> lives in several DAWG files, as well in that user-words file.
>>
>> patrickq wrote, On 2010-07-27 14:59:
>>
>> > I get HAX 6 5-5,- with Tesseract 3.0
>>
>> > What I find remarkable is that half the folks on this forum would love
>> > to disable the word recognition (i.e. dictionary), the other half
>> > would like to enable it - and absolutely no one knows how to enable/
>> > disable the dictionary nor can say for sure if it's actually enabled
>> > or not by default. I am included in the group of the clueless - we
>> > have scanned thousands of business cards and still have no idea
>> > whatsoever what the hell is going on with that elusive dictionary.
>>
>> > I gather from Jimmy's recent answer that the dictionary is contained
>> > in a single file of type text, one word per line, in a file called
>> > eng.user-words (any support for regular expressions there? for example
>> > to say that [\\d]*th is a common word) placed in the Tessdata folder
>> > but we await final confirmation. Is it enough that the file exists?
>> > Does removing the file disable the dictionary?
>>
>> > Clearly many have used the dictionary but sadly it appears that these
>> > knowledgeable people deserted this forum once they got the answers
>> > they need - if you see one of these gentlemen (or ladies, yes) roaming
>> > the streets, please admonish them for not staying subscribed to forum
>> > messages to give back in helping others!
>
> --
> You received this message because you are subscribed to the Google Groups 
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to 
> [email protected].
> For more options, visit this group at 
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
