[tesseract-ocr] Re: Any success story?

Des Bw Fri, 17 Nov 2023 09:15:52 -0800

Dear Tom, thank you for listing all the sources. Indeed, I didn't look 
hard. I was mostly reading this forum; and sure, I am familiar with Shree's 
(Nick White?) work.


> (like, a model that can detect with higher accuracy: 98% or more ?)
> An accuracy figure without context is meaningless. What language? What 
> domain? What image source? What resolution? Word or character accuracy? 
> etc, etc

When I wrote that, I was thinking of ordinary scanned documents. The 
standard (default) model, for example, seems to reach roughly 92-95% 
accuracy on most 300 dpi scanned books (prose). It can be better or worse 
for some languages, but that seems to be the average in most cases. 
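
Since the question of word versus character accuracy came up, here is a 
minimal sketch of how the two figures can differ on the same OCR output. 
It is only an illustration under my own assumptions: accuracy is taken as 
1 minus the edit distance divided by the reference length, the function 
names are hypothetical, and the sample strings are made up rather than 
taken from any real scan.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(reference, hypothesis):
    """1 - (edit distance / reference length), clamped at 0 (an assumed definition)."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))


def char_accuracy(ref_text, ocr_text):
    # Compare character by character.
    return accuracy(ref_text, ocr_text)


def word_accuracy(ref_text, ocr_text):
    # Compare whitespace-separated words; one wrong character spoils a whole word.
    return accuracy(ref_text.split(), ocr_text.split())


# Made-up example: two character errors that fall in two different words.
truth = "the quick brown fox"
ocr = "the qu1ck brown f0x"
print("character accuracy:", char_accuracy(truth, ocr))
print("word accuracy:", word_accuracy(truth, ocr))
```

With these made-up strings, 2 of 19 characters are wrong but 2 of 4 words 
are wrong, so the same output scores very differently depending on which 
measure is quoted.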

My frustration has been the absence of good documentation of successful 
training runs done by others, so that we beginners could learn from them. 
Shree's GitHub is the only place I have found that contains relevant 
information on how the training was done: with what settings, and with 
what results. That is all I meant. Success stories are encouraging, and 
the lessons learned by those people could be invaluable for newcomers. 

On Thursday, November 16, 2023 at 8:05:53 PM UTC+3 [email protected] wrote:

> On Tuesday, November 14, 2023 at 12:55:07 AM UTC-5 [email protected] 
> wrote:
>
> It looks like everyone is having issues with tesseract. 
>
>
> That's not true. It just looks like that because this list is dominated by 
> newcomers
> to the field of OCR and image processing.
>  
>
> I am not able to find anyone who has had great success with this software. 
>
>
> With all due respect, you must not have looked very hard.
>  
>
> It would be really encouraging to hear any success story from 
> any language. 
>
>
> As Merlijn already mentioned, the Internet Archive has used Tesseract to 
> OCR over 10 million *documents* (so 100s of millions of pages?) in 
> hundreds of languages
> https://archive.org/search?query=ocr%3Atesseract*
>
> TAMU's eMOP project used Tesseract with custom training to OCR 45 million
> old crufty page images from the dawn of the printing press
> https://emop.tamu.edu/software
>
> State of the Art Optical Character Recognition of 19th Century Fraktur 
> Scripts using Open Source Engines
> https://arxiv.org/abs/1810.03436
>
> German Parliamentary Corpus (GerParCor)
> https://arxiv.org/abs/2204.10422
>
> Additional arXiv papers using this search 
> <https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=tesseract&terms-0-field=all&classification-computer_science=y&classification-physics_archives=all&classification-include_cross_list=exclude&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first>.
>  
> Following the citation graphs of any of the
> papers will turn up additional potentially interesting papers.
>
> Has anybody successfully trained tesseract?
>
>
> Yes, many.
>
> Nick White trained Ancient Greek. 
> Shree has posted copiously about his efforts training Tesseract. See the 
> list archives as well as his repos:
> https://github.com/Shreeshrii?tab=repositories&q=tessdata_
>
> Exploiting Script Similarities to Compensate for the Large Amount of Data 
> in Training Tesseract LSTM: Towards Kurdish OCR
> https://www.mdpi.com/2076-3417/11/20/9752
>
> Adapting the Tesseract Open-Source OCR Engine for Tamil and Sinhala Legacy 
> Fonts and Creating a Parallel Corpus for Tamil-Sinhala-English
> https://arxiv.org/abs/2109.05952
>
> There's a contrib repository with Akkadian, polytonic Greek, and other 
> user-trained languages
> https://github.com/tesseract-ocr/tessdata_contrib
>  
>
> (like, a model that can detect with higher accuracy: 98% or more ?)
>
>
> An accuracy figure without context is meaningless. What language? What 
> domain?
> What image source? What resolution? Word or character accuracy? etc, etc
>
> If you read some of the papers and descriptions of the large scale 
> projects, you'll see
> that OCR model training is a non-trivial problem which people spend 
> months/years on.
>
> Tom
>

