I am forced to disagree, for the simple and good reason that I am making
progress with the current FRA module. But this holds only if I create a
custom language and use it in combination with the standard one.
This is probably due to the Cube stuff, which seems to be a real game
changer in the standard language data. But Cube is for the moment out of
reach for custom training.
Example: I have a moderately damaged image that gives 92% accuracy using
-l FRA.
I created another library called ADF and added a new
word-dawg, new ambigs entries, and a set of 15 box/tif pairs of a certified
'Georgia' font taken from scanned books. The certified box files are
re-checked to assert 100% correctness.
Next I ran it with -l FRA+ADF and got 98% accuracy on this same
image.
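For reference, the two runs above boil down to two plain tesseract invocations. This is just a sketch: the input file name is hypothetical, and the adf.traineddata file must sit in the same tessdata directory as fra.traineddata for the combined run to work.

```shell
# Hypothetical input image; language codes are the lowercase
# .traineddata base names.
tesseract damaged.tif out_fra  -l fra       # baseline, standard data only
tesseract damaged.tif out_both -l fra+adf   # combined custom + standard data
```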
BUT: there are also regressions. Some characters that are
correctly recognized with -l FRA regress (namely a damaged 'd': the word
<des> becomes <(les> with -l FRA+ADF),
but many more characters that were not recognized with FRA are now
correctly recognized with FRA+ADF. I tried to fix this with new entries
in the ADF ambigs rules, but the regressions remain.
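For what it's worth, the <des> → <(les> regression looks like a classic glyph-split ambiguity, which the v1 unicharambigs format can express in reverse. The entry below is a hypothetical sketch, not one of my actual rules; fields are tab-separated: the count of wrong unichars, the wrong unichars themselves, the count of correct unichars, the correct unichars, and a type flag (1 meaning, as I understand it, a mandatory replacement):

```
v1
2	(	l	1	d	1
```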
Good to know: when I tried to backport all the goodies from my custom
ADF library directly into FRA, the results were disappointing.
The best combination so far is to keep all custom training in a separate
library and run tesseract with both (-l FRA+ADF in my case).
I suspect that the failure to obtain improvements directly from
modifications to the base language library (FRA for me) is due to the Cube
data, which alters the rules.
I also wrote a series of scripts around ImageMagick which, for a set of
given images, each with its certified box file, generate a new box file
using tesseract and compare the certified box file with the generated one,
in order to extract from the image every character with a mismatched
transcription. The process is fully automated: the script then collates
all these certified but failed-to-be-recognized characters into a new
image to be retrained using the certified box data. The result was
disappointing, but not totally without effect. Just disappointing for the
moment.
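The compare step of that pipeline can be sketched roughly as follows. This is a simplified sketch: `parse_box` and `mismatches` are my own hypothetical names, and boxes are matched positionally by line order, whereas a robust tool would match them by coordinate overlap.

```python
def parse_box(text):
    """Parse Tesseract box-file lines: 'char left bottom right top page'."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 6:
            ch, left, bottom, right, top, page = parts
            entries.append((ch, int(left), int(bottom),
                            int(right), int(top), int(page)))
    return entries

def mismatches(certified_text, generated_text):
    """Return (index, certified_char, generated_char, box) for each
    position where the generated transcription disagrees with the
    certified one."""
    certified = parse_box(certified_text)
    generated = parse_box(generated_text)
    diffs = []
    for i, (c, g) in enumerate(zip(certified, generated)):
        if c[0] != g[0]:
            diffs.append((i, c[0], g[0], c[1:5]))
    return diffs
```

Each reported box can then be handed to an ImageMagick crop to cut the offending character out of the source image.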
I plan to improve the process: instead of extracting just the character,
extract the whole word containing a failed character and recreate a new
image from these failed words. To achieve that I intend to explore the new
--output_word_boxes feature of image2text to help identify a word;
otherwise I am happy to write a procedure to find the word boundaries for a
given box.
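A procedure along those lines can be sketched from the character boxes alone, by grouping boxes into words whenever the horizontal gap between consecutive characters exceeds a threshold. `group_words`, `word_box`, and the gap value are my own hypothetical choices, not anything from the Tesseract API:

```python
def group_words(boxes, gap_threshold=8):
    """Split (char, left, bottom, right, top) boxes into words wherever
    the horizontal gap to the previous box exceeds gap_threshold pixels.
    Assumes a single line of text; a multi-line page would first need
    to be split into lines by vertical position."""
    words, current, prev_right = [], [], None
    for box in sorted(boxes, key=lambda b: b[1]):
        if prev_right is not None and box[1] - prev_right > gap_threshold:
            words.append(current)
            current = []
        current.append(box)
        prev_right = box[3]
    if current:
        words.append(current)
    return words

def word_box(chars):
    """Text and bounding box of one word: (text, left, bottom, right, top)."""
    return ("".join(c[0] for c in chars),
            min(c[1] for c in chars), min(c[2] for c in chars),
            max(c[3] for c in chars), max(c[4] for c in chars))
```

The resulting word bounding boxes can then drive the same ImageMagick crop step, only at word granularity instead of character granularity.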
Lastly, I noticed a very big improvement from 3.02 to 3.03 on
the more-or-less damaged images. Version 3.03 alone showed more improvement
than my tweaked 3.02 with <-l FRA+ADF>.
When it comes to images with clean glyphs (like one saved from MS Word
to a TIFF), I already get 100% correct results, but then ABBYY also gives
100% under such favorable conditions.
The real challenge is scanned book images.
On Wednesday, March 5, 2014 10:54:36 UTC+1, zdenop wrote:
> You would need to port all the training tools to Android.
>
> Generally (my opinion):
>
> 1. Unless you have proof that you MUST do custom training, training
> is a waste of time (nobody has been able to create better language data
> for the existing languages and common fonts than Google's).
> 2. Unless you understand the training process (you will probably need
> to read the source code), training is a waste of time.
>
>
>
> Zdenko
>
>
> On Wed, Mar 5, 2014 at 9:39 AM, Tushar Makkar <[email protected]> wrote:
>
>> I am using the tess-two (https://github.com/rmtheis/tess-two) library
>> for OCR recognition on Android. I want to create the training data on
>> Android. I have followed
>> https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
>> successfully created training data on a Linux system. How can I do the
>> same on Android using tess-two or another library?
>>
--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en