Re: [tesseract-ocr] how to test the file kan.unicharambigs used for tesseeract-ocr 3.04 or 3.05

Tom Morris Tue, 08 Dec 2015 10:34:58 -0800

FreeOCR is closed source and Windows only, so it's difficult for me to tell
what it's doing (or even what version of Tesseract it includes).  However,
the test case that you're using doesn't appear realistic.  Tesseract is
optimized for recognizing words, not short random strings of characters, so
rather than testing on "vv w" I think you'd get more representative results
if you used something like "Novv is the time to go dovvn" and see if it
turns the vv's into w's.  Having said that, vv ==> w isn't an entry in the
standard eng.unicharambigs.  The only mandatory entries are for quotes, so
you could try things like `' or '` to see if they get turned into ".
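
For the before-and-after comparison, a small Python sketch like the one
below could count how often each "error" string still shows up in the OCR
output.  It assumes you have already run Tesseract and read its text output
into a string; the function name remaining_errors and the sample pair are
made up for illustration and are not part of Tesseract.

```python
def remaining_errors(ocr_text, ambigs):
    """Count how often each 'error' string still appears in OCR output.

    ocr_text: text produced by an OCR run (already read into a str).
    ambigs:   iterable of (error, replacement) pairs from your ambigs file.
    """
    counts = {}
    for bad, _replacement in ambigs:
        # str.count counts non-overlapping occurrences of the error string.
        counts[bad] = ocr_text.count(bad)
    return counts


# Hypothetical output from a run without the vv ==> w substitution.
print(remaining_errors("Novv is the time to go dovvn", [("vv", "w")]))
# {'vv': 2}
```

If the counts drop to zero after you swap in the modified traineddata, the
substitution is probably being applied; if they stay the same, it is not.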


As far as I know, there's no way to specify a different unicharambigs file
on the command line.  You need to replace it in the kan.traineddata file
for it to be found.  The combine_tessdata utility is used for packing and
unpacking the traineddata files, e.g. to extract the component (-e), edit
it, and then write it back (-o):

    $ combine_tessdata -e kan.traineddata kan.unicharambigs
    $ combine_tessdata -o kan.traineddata kan.unicharambigs

One thing that I noticed when looking at the source is that there's an
upper limit of 10 characters for the bad and replacement strings, which I'm
not sure is documented anywhere.  This should be plenty for most
applications, but it's something to keep in mind.
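
Before repacking, it may be worth checking the file mechanically.  The
Python sketch below is only a rough check of the v2 format as described in
the wiki extract quoted further down (a "v2" header, three space-separated
fields, type 0 or 1, trailing newline) plus the 10-character limit.  The
function name check_unicharambigs is hypothetical, and the limit is counted
here in Unicode code points, which is an assumption: Tesseract may count
differently for a complex script such as Kannada.

```python
MAX_LEN = 10  # limit mentioned above; counted in code points (an assumption)


def check_unicharambigs(text):
    """Return a list of (line_number, message) problems in a v2 ambigs file."""
    problems = []
    if not text.endswith("\n"):
        problems.append((0, "file does not end with a newline"))
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline
    has_header = bool(lines) and lines[0].strip() == "v2"
    if not has_header:
        problems.append((1, "first line is not the 'v2' header"))
    first = 1 if has_header else 0
    for number, line in enumerate(lines[first:], start=first + 1):
        if not line.strip():
            problems.append((number, "empty line"))
            continue
        fields = line.split(" ")
        if len(fields) != 3:
            problems.append(
                (number, "expected 3 space-separated fields, got %d" % len(fields))
            )
            continue
        bad, replacement, kind = fields
        if kind not in ("0", "1"):
            problems.append((number, "type must be 0 or 1, got %r" % kind))
        for name, value in (("error", bad), ("replacement", replacement)):
            if len(value) > MAX_LEN:
                problems.append(
                    (number, "%s string is longer than %d characters" % (name, MAX_LEN))
                )
    return problems
```

To use it, read the file as UTF-8 (for example with
open("kan.unicharambigs", encoding="utf-8").read()) and pass the resulting
string in; an empty list means none of these particular checks failed, not
that the substitutions themselves are sensible.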

Good luck.  Let us know how you make out.

Tom



On Tue, Dec 8, 2015 at 4:11 AM, Sriranga(83yrsold) <
[email protected]> wrote:

> Another question is how to test the <lang>.unicharambigs file in
> tesseract-ocr and add more entries to it. If it can be tested from CMD or a
> terminal, what command line should be used?
>
> On Tue, Dec 8, 2015 at 2:18 PM, Sriranga(83yrsold) <
> [email protected]> wrote:
>
>> Hi Tom,
>> attached herewith is a sample of the post-proc.txt used in FreeOCR, a
>> feature that its creator Ralph Richardson incorporated at my special
>> request more than 3 years ago. The attached screenshots speak for
>> themselves. I have done the sample in English so that it is easy for you
>> to understand.
>> You can test in any language. FreeOCR is available as a free download.
>> You will notice that the post-processor text sample (apart from having no
>> option like 0 or 1) offers a feature similar to <lang>.unicharambigs.
>> *Advantage of the built-in* "unicharambigs": the substitutions are part of
>> the <lang>.traineddata, so misspellings are corrected automatically in the
>> final OCR output, and the same tessdata can be used to correct the output
>> text for any image.
>> *Disadvantage of the post-processor*: being an external program, its
>> post-proc.txt has to be updated every time, for each OCRed text.
>> I am puzzled why unicharambigs, the internal mechanism, does not work
>> correctly when the post-processor program works fine.
>> With regards,
>> sriranga(83yrs)
>>
>>
>> On Mon, Dec 7, 2015 at 11:44 PM, Tom Morris <[email protected]> wrote:
>>
>>> Hi Sriranga.  I haven't used the training tools, but since no one else
>>> has answered, I'll give it my best attempt.  Shree might have better
>>> insights.
>>>
>>> First, a question of clarification.  Are you having problems with the
>>> file or are you just trying to determine whether it is working properly or
>>> not?
>>>
>>> If you just want to see if it's working correctly, my impression is that
>>> most people do this empirically by a) visual inspection of the file to see
>>> if the substitutions look correct and b) running a corpus of text through
>>> to see how the contents of the file affect accuracy.
>>>
>>> To my untrained eye, the things I wonder about are:
>>> - are all those mandatory substitutions (lines ending in 1) correct? ie
>>> is it true that the string in column 1 can *never* be a valid word?
>>> - there is an empty line which probably should be removed
>>> - there are a few lines which have junk after the third column which
>>> don't match the specified format e.g.:
>>>
>>> ಚಟಿಲ್ಕೆ ಚಟ್ನಿ,, 1   "
>>> ಹೊರಿದಿವೆ ಹೊಂದಿವೆ.1   .
>>>
>>> Some of the words with embedded punctuation also look a little
>>> suspicious to me.  Not knowing the script or language I don't know how
>>> common these errors are, but I'd probably start with a very basic list of
>>> substitutions and add to it as I found more common errors.
>>>
>>> Hopefully someone else can give you better advice which is based on more
>>> than bystander guesswork!
>>>
>>> Tom
>>>
>>>
>>> On Friday, December 4, 2015 at 10:36:13 PM UTC-5, sriranga(83yrsold)
>>> wrote:
>>>>
>>>> Solution is requested urgently.
>>>>
>>>> On Wed, Dec 2, 2015 at 4:25 PM, sriranga(83yrsold) <
>>>> [email protected]> wrote:
>>>>
>>>>>
>>>>>  I have created kan.unicharambigs (attached below) based on the output
>>>>> text of the Kan.training_text file (which is big). I could not work out
>>>>> how to test the attached file and find out whether it works or not.
>>>>> Kindly point out my mistakes in the attached file, if any, for
>>>>> which I shall be thankful to you. I would prefer a command-line test if
>>>>> possible.
>>>>>
>>>>>
>>>>> ==========================================================================
>>>>> Based on wiki instruction (extract reproduced below for ready
>>>>> reference) =
>>>>>
>>>>> The rules are not bidirectional, so if you want 'rn' to be considered
>>>>> when 'm' is detected and vice versa you need a rule for each.
>>>>>
>>>>> Version 3.03 and on supports a new, simpler format for the
>>>>> unicharambigs file:
>>>>>
>>>>> v2
>>>>> '' " 1
>>>>> m rn 0
>>>>> iii m 0
>>>>>
>>>>> In this format, the "error" and "correction" are simple utf-8 strings
>>>>> separated by *a space*, and, after another space, the same type
>>>>> specifier as v1 (0 for optional and 1 for mandatory substitution). Note the
>>>>> downside of this simpler format is that Tesseract has to encode the utf-8
>>>>> strings into the components of the unicharset. In complex scripts, this
>>>>> encoding may be ambiguous. In this case, the encoding is chosen such as to
>>>>> use the least utf-8 characters for each component, ie the shortest
>>>>> unicharset components will make up the encoding.
>>>>>
>>>>> Like most other files used in training, the 'unicharambigs' file must
>>>>> be encoded as UTF8, and must end with a newline character. The
>>>>> unicharambigs format is also described in the unicharambigs(5) man
>>>>> page
>>>>> <https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharambigs.5.html>.
>>>>>
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0d30025d-cc11-4f69-9e98-ec919d3f43df%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/0d30025d-cc11-4f69-9e98-ec919d3f43df%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>
>>

