I am forced to disagree, for the simple and good reason that I am making
progress with the current FRA module. But this holds only if I create a
custom language and use it in combination with the standard one.
This is probably due to the Cube stuff, which seems to be a real game
changer in the standard language data. But Cube is for the moment out of
reach for custom training.
Example: I have a moderately damaged image that gives 92% accuracy using
-l FRA.
I created another library called ADF and added a new
word-dawg, new ambigs entries, and a set of 15 box/tif pairs of a certified
'Georgia' font taken from scanned books. The certified box files are
re-checked to assert 100% correctness.
Next I ran it with -l FRA+ADF and got 98% accuracy on this same
image.
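For reference, the two runs above boil down to two plain tesseract invocations. This is just a sketch: the input file name is hypothetical, and the adf.traineddata file must sit in the same tessdata directory as fra.traineddata for the combined run to work.

```shell
# Hypothetical input image; language codes are the lowercase
# .traineddata base names.
tesseract damaged.tif out_fra  -l fra       # baseline, standard data only
tesseract damaged.tif out_both -l fra+adf   # combined custom + standard data
```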
BUT: there are also regressions. Some characters that are
correctly recognized with -l FRA regress (namely a damaged 'd': the word
<des> becomes <(les> with -l FRA+ADF),
but many more characters that were not recognized with FRA are now
correctly recognized with FRA+ADF. I tried to fix this with new entries
in the ADF ambigs rules, but the regressions remain.
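For what it's worth, the <des> → <(les> regression looks like a classic glyph-split ambiguity, which the v1 unicharambigs format can express in reverse. The entry below is a hypothetical sketch, not one of my actual rules; fields are tab-separated: the count of wrong unichars, the wrong unichars themselves, the count of correct unichars, the correct unichars, and a type flag (1 meaning, as I understand it, a mandatory replacement):

```
v1
2	(	l	1	d	1
```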
Good to know: when I tried to backport all the goodies from my custom
ADF library directly into FRA, the results were disappointing.
The best combination so far is to keep all custom training in a separate
library and run tesseract with both (-l FRA+ADF in my case).
I suspect that the failure to obtain improvements directly from
modifications to the base language library (FRA for me) is due to the Cube
data, which alters the rules.
I also wrote a series of scripts around ImageMagick which, for a set of
given images, each with its certified box file, generate a new box file
using tesseract and compare the certified box file with the generated one,
in order to extract from the image every character with a mismatched
transcription. The process is fully automated: the script then collates
all these certified but failed-to-be-recognized characters into a new
image to be retrained using the certified box data. The result was
disappointing, but not totally without effect. Just disappointing for the
moment.
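The compare step of that pipeline can be sketched roughly as follows. This is a simplified sketch: `parse_box` and `mismatches` are my own hypothetical names, and boxes are matched positionally by line order, whereas a robust tool would match them by coordinate overlap.

```python
def parse_box(text):
    """Parse Tesseract box-file lines: 'char left bottom right top page'."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 6:
            ch, left, bottom, right, top, page = parts
            entries.append((ch, int(left), int(bottom),
                            int(right), int(top), int(page)))
    return entries

def mismatches(certified_text, generated_text):
    """Return (index, certified_char, generated_char, box) for each
    position where the generated transcription disagrees with the
    certified one."""
    certified = parse_box(certified_text)
    generated = parse_box(generated_text)
    diffs = []
    for i, (c, g) in enumerate(zip(certified, generated)):
        if c[0] != g[0]:
            diffs.append((i, c[0], g[0], c[1:5]))
    return diffs
```

Each reported box can then be handed to an ImageMagick crop to cut the offending character out of the source image.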
I plan to improve the process: instead of extracting just the character,
extract the whole word containing a failed character and recreate a new
image from these failed words. To achieve that I intend to explore the new
--output_word_boxes feature of image2text to help identify a word;
otherwise I am happy to write a procedure to find the word boundaries for a
given box.
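A procedure along those lines can be sketched from the character boxes alone, by grouping boxes into words whenever the horizontal gap between consecutive characters exceeds a threshold. `group_words`, `word_box`, and the gap value are my own hypothetical choices, not anything from the Tesseract API:

```python
def group_words(boxes, gap_threshold=8):
    """Split (char, left, bottom, right, top) boxes into words wherever
    the horizontal gap to the previous box exceeds gap_threshold pixels.
    Assumes a single line of text; a multi-line page would first need
    to be split into lines by vertical position."""
    words, current, prev_right = [], [], None
    for box in sorted(boxes, key=lambda b: b[1]):
        if prev_right is not None and box[1] - prev_right > gap_threshold:
            words.append(current)
            current = []
        current.append(box)
        prev_right = box[3]
    if current:
        words.append(current)
    return words

def word_box(chars):
    """Text and bounding box of one word: (text, left, bottom, right, top)."""
    return ("".join(c[0] for c in chars),
            min(c[1] for c in chars), min(c[2] for c in chars),
            max(c[3] for c in chars), max(c[4] for c in chars))
```

The resulting word bounding boxes can then drive the same ImageMagick crop step, only at word granularity instead of character granularity.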
Lastly, I noticed a very big improvement from 3.02 to 3.03 on
the more-or-less damaged images. Version 3.03 alone showed more improvement
than my tweaked 3.02 with <-l FRA+ADF>.
When it comes to images with clean glyphs (like one saved from MS Word
to a TIFF), I already get 100% correct results, but then ABBYY also gives
100% under such favorable conditions.
The real challenge is scanned book images.
On Wednesday, March 5, 2014 10:54:36 UTC+1, zdenop wrote:
> You would need to port all the training tools to Android.
>
> Generally (my opinion):
>
> 1. Unless you have proof that you MUST do custom training, training
> is a waste of time (nobody has been able to create better language data
> for the existing languages and common fonts than Google's).
> 2. Unless you understand the training process (you will probably need
> to read the source code), training is a waste of time.
>
>
>
> Zdenko
>
>
> On Wed, Mar 5, 2014 at 9:39 AM, Tushar Makkar <[email protected]> wrote:
>
>> I am using the tess-two (https://github.com/rmtheis/tess-two) library
>> for OCR recognition on Android. I want to create the training data on
>> Android. I have followed
>> https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
>> successfully created training data on a Linux system. How can I do the
>> same on Android using tess-two or another library?
>>
--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en