I wonder what your latest observations are. I am looking for answers to 
your questions as well.

On Tuesday, February 21, 2017 at 11:59:27 AM UTC-5 kolomiyets wrote:

> Hi,
>
>
> I have been trying to train Tesseract 4.0 on my own data in order to 
> extract text that is a mix of natural-language words and domain-specific 
> (non-natural-language) words (acronyms, identifiers, abbreviations). The 
> standard Tesseract model has trouble recognizing the domain-specific 
> words: words visible in the source image are either dropped entirely or 
> recognized with parts missing. So I decided to train my own model. 
>
>
> I went through the tutorials and set up a number of experiments, but so 
> far with no real success. While I could fix the dropped-words problem by 
> lowering the hard-coded confidence threshold, and achieved partial 
> success in recognizing domain-specific words, the accuracy on 
> natural-language words went down. 
>
>
> Here are two observations I have made so far from the following experiments:
>
>
>
>    1. In Experiment 1 I used the available data as-is for training (~1 M 
>    tokens, ~150 fonts). I then generated an evaluation data set of 
>    another ~200 k tokens in the ~15 most relevant fonts. I trained the 
>    model by replacing the top layer of the existing Tesseract 
>    traineddata, as described at 
>    
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replace-Top-Layer
>  
>    Training converged a couple of days later, and I evaluated the model 
>    on a held-out dataset with a gold standard (tiff – plain txt). The 
>    accuracy was lower than with the standard Tesseract model. The new 
>    model can recognize some (though not all) domain-specific words, but 
>    its performance on natural-language words went down (where the 
>    standard model worked fine). So I analyzed the errors and designed 
>    another experiment to address them; in my opinion they were caused by 
>    data skewness, i.e. confusions between characters in rare and complex 
>    contexts.
>    2. In Experiment 2 I used the entire data set I have (~120 M tokens) 
>    and extracted word and character-bigram statistics. I took all words 
>    with frequencies over a certain threshold as the core of the final 
>    training data set. In addition, I boosted the counts of words 
>    containing low-frequency character bigrams (which had caused me 
>    trouble in the previous experiment) and appended them to the final 
>    training data set. In the end, this resulted in a training set of 
>    ~600 k unique words, which was rendered into tiffs in ~150 fonts; the 
>    evaluation data set remained natural-language text of ~200 k tokens in 
>    the ~15 most relevant fonts. It turned out that training converges too 
>    slowly – it has been running for over a week now, with the best model 
>    at a ~0.17% error rate. Evaluating pairs of subsequent model snapshots 
>    on the held-out dataset showed no consistent improvement from one to 
>    the next, only random fluctuations between more accurate 
>    natural-language words vs. domain-specific words and vice versa. More 
>    interestingly, models with a lower character error rate (< 0.5%) 
>    perform worse (especially on natural-language words) than models with 
>    a higher character error rate (~ 0.5%). I also noticed that the model 
>    captures “language-modeling features”, which makes recognizing 
>    misspelled words and non-natural-language unique identifiers and 
>    acronyms difficult. Moreover, unique identifiers, rare words, etc. 
>    remain a big problem: they can be recognized in chunks, but not as 
>    whole words. Specific trouble cases are “like-this”, “like/this” or 
>    “this-or-like-this”. 
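
The word-selection scheme described in Experiment 2 could be sketched roughly like this (a minimal illustration; the frequency threshold, rarity quantile, and boost values are assumptions, not the ones actually used):

```python
from collections import Counter

def build_training_words(tokens, min_word_freq=5, rare_quantile=0.05, boost=10):
    # Hypothetical parameter values -- the post does not state its thresholds.
    word_freq = Counter(tokens)

    # Character-bigram statistics, weighted by word frequency in the corpus.
    bigram_freq = Counter()
    for w, c in word_freq.items():
        for i in range(len(w) - 1):
            bigram_freq[w[i:i + 2]] += c

    # Step 1: keep all words above the frequency threshold.
    selected = {w: c for w, c in word_freq.items() if c >= min_word_freq}

    # Step 2: boost words that contain a low-frequency char bigram, so rare
    # character contexts are better represented in the training text.
    cutoff = sorted(bigram_freq.values())[int(len(bigram_freq) * rare_quantile)]
    rare = {bg for bg, c in bigram_freq.items() if c <= cutoff}
    for w in word_freq:
        if any(w[i:i + 2] in rare for i in range(len(w) - 1)):
            selected[w] = selected.get(w, 0) + boost
    return selected
```

The returned word-to-count mapping would then be expanded into training text and rendered with text2image.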
>
>
>
> At this point I doubt that the way I am training Tesseract is correct. So 
> I would like to ask the community the following questions:
>
>
>    - Should I use natural-language text or a dictionary of words for the 
>    training and evaluation data sets?
>    - How important is the effect of token redundancy? (Are the errors in 
>    recognizing natural-language words caused by those words occurring 
>    only once in the training data?)
>    - How can I get Tesseract to recognize freely generated tokens that 
>    are not present in the training dataset? 
>    
>
> Thanks,
>
> Alex
>
>
>

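For reference, the replace-top-layer flow from the wiki page linked in Experiment 1 looks roughly like the sketch below. The paths, model names, and the `O1c111` output-layer size are illustrative assumptions; the output layer must be sized to match your own unicharset.

```shell
# Extract the LSTM model from the stock traineddata (paths are illustrative).
combine_tessdata -e tessdata/eng.traineddata eng.lstm

# Retrain with the top (output) layer cut off at index 5 and replaced,
# so the new softmax matches the custom unicharset.
lstmtraining \
  --model_output output/mymodel \
  --continue_from eng.lstm \
  --traineddata output/mymodel/mymodel.traineddata \
  --old_traineddata tessdata/eng.traineddata \
  --append_index 5 --net_spec '[Lfx256 O1c111]' \
  --train_listfile train/mymodel.training_files.txt \
  --max_iterations 3000
```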
-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/ff5f4f48-aeaf-4ed6-ab37-faa8dedb215an%40googlegroups.com.
