I wonder, what are your latest observations? I am looking for answers to your questions as well.
On Tuesday, February 21, 2017 at 11:59:27 AM UTC-5 kolomiyets wrote:
> Hi,
>
> I have been trying to train Tesseract 4.0 with my own data in order to extract text that is a mix of natural language words and domain-specific (non-natural-language) words (acronyms, identifiers, abbreviations). The standard Tesseract model has trouble recognizing the domain-specific words: "visual" words from the source are either dropped entirely or recognized with missing parts. So I decided to train my own model.
>
> I went through the tutorials and set up a number of experiments, but so far with no real success. I could fix the dropped-words problem by lowering the hard-coded confidence threshold, and had partial success in recognizing domain-specific words, but the accuracy on natural language words went down.
>
> I have made two observations so far from the following experiments:
>
> 1. In Experiment 1 I used the available data as-is for training (~1 M tokens, ~150 fonts). I then generated an evaluation data set from another ~200 k tokens and the ~15 most relevant fonts. I trained the model by replacing the top layer of the existing Tesseract traineddata, as described at
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replace-Top-Layer
> Training converged a couple of days later, and I evaluated the model on a held-out dataset with a gold standard (tiff – plain txt). The accuracy was lower than with the standard Tesseract model. The new model recognized some (though not all) domain-specific words, but performance on natural language words went down (where the standard model worked fine). I analyzed the errors, which in my opinion were caused by data skewness, i.e. confusions between characters in rare and complex contexts, and designed another experiment to address them.
> 2.
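For anyone following along, the replace-top-layer step from that wiki page looks roughly like the command below. This is a sketch, not a verified command line: all paths are placeholders, `--append_index 5` assumes you cut the stock eng network at the same layer as the wiki's example, and the `O1c111` output size is the one from the eng example and must match your own unicharset.

```shell
# Replace-top-layer sketch (placeholders throughout, not a verified command):
# - eng.lstm is the LSTM model extracted from the stock eng.traineddata
# - --append_index 5 cuts the network where the wiki's eng example does
# - O1c111 is the eng example's output size; yours must match YOUR unicharset
lstmtraining \
  --continue_from eng.lstm \
  --old_traineddata eng.traineddata \
  --traineddata my/eng.traineddata \
  --append_index 5 --net_spec '[Lfx256 O1c111]' \
  --model_output my/base \
  --train_listfile my/eng.training_files.txt \
  --max_iterations 3000
```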
> In Experiment 2 I used the entire data set I have (~120 M tokens) and extracted word and character-bigram statistics. I took all words with frequencies over a certain threshold into the final training data set. In addition, I boosted the word statistics for words containing low-frequency character bigrams (which caused me trouble in the previous experiment) and appended them to the final training data set. This resulted in a training set of ~600 k unique words, which I rendered into tiffs with ~150 fonts; the evaluation data set remained natural language text of ~200 k tokens in the ~15 most relevant fonts. It turned out that training converges too slowly: it has been running for over a week now, with the best model at a ~0.17% error rate. Evaluating pairs of subsequent model snapshots on the held-out dataset showed no general improvement of one over another, only random fluctuations between more accurate natural language words vs. domain-specific words and vice versa. More interestingly, models with a lower char error rate (< 0.5%) perform worse, especially on natural language words, than models with a higher char error rate (~0.5%). I also noticed that the model captures "language modeling features", which makes recognition of misspelled words, "non-natural-language" unique identifiers, and acronyms difficult. Moreover, unique identifiers, rare words, etc. remain a big problem: they can be recognized in chunks, but not as whole words. Typical trouble cases are "like-this", "like/this", or "this-or-like-this".
>
> At this point I doubt that the way I am training Tesseract is correct, so I would like to ask the community the following questions:
>
> - Should I use natural language text or a dictionary of words for the training and evaluation data sets?
> - How important is the effect of token redundancy?
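The wordlist construction described in Experiment 2 (frequency threshold, plus boosting words that contain rare character bigrams) can be sketched like this. A minimal illustration only: the threshold, the rare-bigram quantile, and the boost factor are made-up knobs, not values from the original post.

```python
from collections import Counter


def char_bigrams(word):
    """All adjacent character pairs in a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def build_training_wordlist(tokens, word_freq_threshold=5,
                            rare_bigram_quantile=0.05, boost_factor=10):
    """Select frequent words, then boost words containing rare char bigrams.

    All three keyword parameters are illustrative, not values from the post.
    Returns a Counter mapping word -> (possibly boosted) frequency.
    """
    word_counts = Counter(tokens)

    # Corpus-level character-bigram statistics, weighted by word frequency.
    bigram_counts = Counter()
    for word, count in word_counts.items():
        for bg in char_bigrams(word):
            bigram_counts[bg] += count

    # Treat the least frequent fraction of bigrams as "rare".
    sorted_bigrams = sorted(bigram_counts.items(), key=lambda kv: kv[1])
    n_rare = max(1, int(len(sorted_bigrams) * rare_bigram_quantile))
    rare = {bg for bg, _ in sorted_bigrams[:n_rare]}

    wordlist = Counter()
    for word, count in word_counts.items():
        if count >= word_freq_threshold:
            wordlist[word] = count
        if any(bg in rare for bg in char_bigrams(word)):
            # Boost statistics for words containing low-frequency bigrams.
            wordlist[word] = max(wordlist[word], count * boost_factor)
    return wordlist
```

The resulting list would then be rendered to tiffs (e.g. with text2image) as the post describes; the sketch only covers the statistics step.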
> (Are the errors in recognizing natural language words caused by there being only single instances of those words in the training data?)
> - How can I get Tesseract to recognize freely generated tokens that are not available in the training dataset?
>
> Thanks,
> Alex

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ff5f4f48-aeaf-4ed6-ab37-faa8dedb215an%40googlegroups.com.
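For anyone reproducing the snapshot comparisons above: the char error rate being compared is just edit distance normalized by reference length. A minimal sketch (plain Levenshtein in Python, not Tesseract's own lstmeval):

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance between OCR output and gold text, per reference char."""
    if not reference:
        return float(len(hypothesis) > 0)
    return levenshtein(reference, hypothesis) / len(reference)
```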