Re: Training Tesseract on Android

zdenko podobny Sun, 09 Mar 2014 10:50:27 -0700

Dear Bernard,

thank you for your reaction - your experience with combining languages is
especially useful.


On the other hand, I am not sure which point you disagree with. I did not
write that training is not useful. E.g. there are several good experiences
in this forum with training for old fonts (Fraktur, aka Gothic)[1], even
though I am not sure anybody reached 100% - see the Polish experience
described in "Report on the comparison of Tesseract and ABBYY FineReader
OCR engines"[2].

[1]
https://code.google.com/p/tesseract-ocr/wiki/AddOns#Community_training_project
[2] https://code.google.com/p/tesseract-ocr/wiki/Documentation#Other

Zdenko


On Wed, Mar 5, 2014 at 3:23 PM, Bernard Polarski <[email protected]> wrote:

> I am forced to disagree, for the simple reason that I am making progress
> over the current FRA module. But this is true only if I create a custom
> language and use it together with the standard one.
> This is probably due to the Cube stuff, which seems to be a real game
> changer in the standard language. But Cube is for the moment out of reach
> for custom training.
>
> Example: I have a moderately damaged image that gives 92% accuracy using
> -l FRA.
> I created another library called ADF and added a new word-dawg, new
> ambigs entries and a set of 15 box/tif pairs in certified 'georgia' font
> taken from scanned books. Certified box files are re-checked to ensure
> 100% correctness.
>
> Next I used it with -l FRA+ADF and got 98% accuracy on this same
> picture.
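
The message does not say how the 92% and 98% figures were measured. As a
minimal sketch, assuming character-level accuracy against a hand-checked
transcript, such a number could be computed as follows. The function name
and the sample strings are illustrative and not taken from the thread.

```python
from difflib import SequenceMatcher


def char_accuracy(truth: str, ocr: str) -> float:
    """Share of ground-truth characters that the OCR output reproduces in order."""
    if not truth:
        return 1.0 if not ocr else 0.0
    # autojunk=False keeps frequent characters (spaces, 'e') in the comparison.
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth)


if __name__ == "__main__":
    truth = "le livre des enfants"
    # Hypothetical OCR output showing the <des> -> <(les> regression.
    print(char_accuracy(truth, "le livre (les enfants"))
```

Running the same measure over the output of -l FRA and of -l FRA+ADF for
one image gives two comparable numbers, which makes gains and regressions
visible side by side.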
>
> BUT there are also regressions. Some characters that are correctly
> recognized with -l FRA (namely a damaged 'd' in the word <des>) become
> <(les> with -l FRA+ADF.
> But many more characters that were not recognized with FRA are now
> correctly recognized with FRA+ADF. I tried to fix this with new rules in
> the ADF ambigs file, but the regression remains.
>
> Good to know: when I tried to backport all the goodies from my
> custom-made ADF library directly into FRA, the results were disappointing.
> The best combination so far is to segregate all custom training into a new
> library and run tesseract with both (-l FRA+ADF in my case).
>
> I suspect that the failure to obtain improvements directly from
> modifications in the base language library (for me FRA) is due to the Cube
> stuff, which alters the rules.
>
> Also, I wrote a series of scripts around ImageMagick which, for a set of
> given images, each with its certified box file, generate a new box file
> using tesseract and compare the certified box file with the generated one
> in order to extract from the image every character that was recognized
> incorrectly. This process is fully automated; next, the script collates
> all these failed-to-be-recognized characters into a new image to be
> retrained using the certified box. The result is disappointing for the
> moment, but not totally without effect.
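
Bernard's scripts are not included in the thread. The comparison step he
describes could look roughly like the following sketch, which assumes the
Tesseract 3 box-file layout (`<char> <left> <bottom> <right> <top> <page>`,
with y counted from the bottom of the image) and pairs boxes by overlap.
The function names and the 0.5 overlap threshold are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_text(text: str) -> List[Box]:
    """Parse box-file lines of the form '<char> <left> <bottom> <right> <top> <page>'."""
    boxes = []
    for line in text.splitlines():
        parts = line.strip().rsplit(" ", 5)  # split from the right: glyph field stays whole
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, *nums = parts
        boxes.append(Box(char, *map(int, nums)))
    return boxes


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area of a and b divided by the area of a."""
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.top, b.top) - max(a.bottom, b.bottom)
    if width <= 0 or height <= 0:
        return 0.0
    area_a = (a.right - a.left) * (a.top - a.bottom)
    return width * height / area_a if area_a else 0.0


def mismatches(certified: List[Box], generated: List[Box],
               min_overlap: float = 0.5) -> List[Tuple[Box, Optional[str]]]:
    """Return (certified box, recognized text or None) for every misread glyph."""
    failed = []
    for ref in certified:
        best = max((g for g in generated if g.page == ref.page),
                   key=lambda g: overlap_ratio(ref, g), default=None)
        if best is None or overlap_ratio(ref, best) < min_overlap:
            failed.append((ref, None))       # nothing recognized at this position
        elif best.char != ref.char:
            failed.append((ref, best.char))  # recognized as a different character
    return failed


def crop_geometry(box: Box, image_height: int) -> str:
    """ImageMagick-style 'WxH+X+Y' geometry; flips y because box files count from the bottom."""
    return (f"{box.right - box.left}x{box.top - box.bottom}"
            f"+{box.left}+{image_height - box.top}")


if __name__ == "__main__":
    certified = parse_box_text("d 10 10 30 50 0\ne 32 10 48 38 0\ns 50 10 64 38 0")
    generated = parse_box_text(
        "( 10 8 17 52 0\nl 18 10 30 50 0\ne 32 10 48 38 0\ns 50 10 64 38 0")
    for ref, got in mismatches(certified, generated):
        print(ref.char, "read as", got, "crop", crop_geometry(ref, image_height=100))
```

The geometry string is in the form ImageMagick accepts for its crop
option, so each failed glyph can be cut out of the page image and collated
into a new training image.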
>
> I plan to improve the process: instead of extracting just the character,
> extract the word that contains a failed character and create a new image
> from these failed words. To achieve that I intend to explore the new
> text2image option --output_word_boxes to help identify a word.
> Otherwise I am prepared to write a procedure that finds the word
> boundaries for a given box.
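
A procedure for finding the word boundaries of a given box is not spelled
out in the thread. One simple approach, sketched here under the assumption
that all boxes belong to a single text line and that words are separated
by a visibly larger horizontal gap, is to split at gaps above a threshold.
The `max_gap` value and all names are hypothetical and would need tuning
per font and scan resolution.

```python
from typing import List, Optional, Tuple

CharBox = Tuple[str, int, int, int, int]  # (char, left, bottom, right, top)


def group_into_words(boxes: List[CharBox], max_gap: int = 6) -> List[List[CharBox]]:
    """Split one text line of character boxes into words at large horizontal gaps."""
    words: List[List[CharBox]] = []
    for box in sorted(boxes, key=lambda b: b[1]):
        # Gap between this glyph's left edge and the previous glyph's right edge.
        if words and box[1] - words[-1][-1][3] <= max_gap:
            words[-1].append(box)
        else:
            words.append([box])
    return words


def word_bounds(word: List[CharBox]) -> Tuple[int, int, int, int]:
    """Rectangle (left, bottom, right, top) enclosing every glyph of a word."""
    return (min(b[1] for b in word), min(b[2] for b in word),
            max(b[3] for b in word), max(b[4] for b in word))


def word_containing(words: List[List[CharBox]],
                    target: CharBox) -> Optional[List[CharBox]]:
    """Return the word that holds the given character box, if any."""
    for word in words:
        if target in word:
            return word
    return None


if __name__ == "__main__":
    line = [("l", 10, 10, 16, 50), ("e", 18, 10, 32, 38),
            ("d", 50, 10, 68, 50), ("e", 70, 10, 84, 38), ("s", 86, 10, 98, 38)]
    failed = line[2]  # the damaged 'd'
    word = word_containing(group_into_words(line), failed)
    print("".join(b[0] for b in word), word_bounds(word))
```

The bounding rectangle of the returned word can then be cropped from the
page image instead of the single failed glyph.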
>
> Last, I noticed a very big improvement from 3.02 to 3.03 on the
> more-or-less damaged images. Version 3.03 alone showed more improvement
> than my tweaked 3.02 (-l FRA+ADF).
> When it comes to clean images (like one saved from MS Word to a TIFF), I
> already see 100% correct results, but then ABBYY also gives 100% under
> these favorable conditions.
> The real challenge is scanned book images.
>
>
>
> On Wednesday, 5 March 2014 10:54:36 UTC+1, zdenop wrote:
>
>> You need to port all training tools to android.
>>
>> Generally (my opinion):
>>
>>    1. Unless you have proof that you MUST do custom training, training
>>    is a waste of time (nobody has been able to create better language
>>    data than Google's for the existing languages and common fonts)
>>    2. Unless you understand the training process (you will probably need
>>    to read the source code), training is a waste of time
>>
>>
>>
>> Zdenko
>>
>>
>> On Wed, Mar 5, 2014 at 9:39 AM, Tushar Makkar <[email protected]> wrote:
>>
>>> I am using the tess-two library (https://github.com/rmtheis/tess-two)
>>> for OCR on Android. I want to create the training data on Android. I
>>> have followed
>>> https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>> and successfully created training data on a Linux system. How can I do
>>> the same on Android using tess-two or any other library?
>>>
>>> --
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To post to this group, send email to [email protected]
>>>
>>> To unsubscribe from this group, send email to
>>> [email protected]
>>>
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>>
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>
>>

