Re: [tesseract-ocr] Improve text extraction

2022-07-22 Thread Lorenzo Bolzani
Hi Atef, I think your best option is to generate a lot of images as bad as this one and use them for training. So you take thousands of good images (with the corresponding text) and ruin/blur them in many different ways. In this way, for example, from 1000 good images you get 5000/1 bad ima
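
A minimal sketch of the "ruin/blur the good images" idea from the reply above, assuming a grayscale image is represented as a list of rows of 0-255 integers. The helper names (`box_blur`, `add_noise`, `degrade_variants`) and the parameter ranges are hypothetical and chosen only for illustration; a real pipeline would more likely use an imaging library and degradations matched to the actual bad scans.

```python
import random


def box_blur(img, radius=1):
    """Blur a grayscale image (list of rows of 0-255 ints) with a box filter."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = count = 0
            # Average over the neighbourhood, clipped at the image borders.
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            row.append(total // count)
        out.append(row)
    return out


def add_noise(img, amount, rng):
    """Add uniform noise in [-amount, amount] to each pixel, clamped to 0-255."""
    return [[min(255, max(0, p + rng.randint(-amount, amount))) for p in row]
            for row in img]


def degrade_variants(img, n, seed=0):
    """Return n degraded copies of img, each with random blur and noise."""
    rng = random.Random(seed)  # seeded so the generated set is reproducible
    variants = []
    for _ in range(n):
        radius = rng.randint(1, 2)   # hypothetical blur range
        amount = rng.randint(5, 30)  # hypothetical noise range
        variants.append(add_noise(box_blur(img, radius), amount, rng))
    return variants
```

Each degraded copy would keep the ground-truth text of the clean image it came from, so one labelled image yields several training samples.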

[tesseract-ocr] Improve text extraction

2022-07-20 Thread Atef Chatty
Hi, I want to extract information from unclear images. I tried many filters, but they don't help. Here is an example of the input pictures: Example.png. Why do I want to extract this information? I am working on a project to extract information from driver’s licenses. The extraction i

Re: [tesseract-ocr] Improve text extraction when some text is inverted

2021-07-02 Thread 'Chris' via tesseract-ocr
Thanks to both of you for replying. I'm using Charles Weld's NuGet package (https://github.com/charlesw/tesseract/) so at the moment I think I am stuck on version 4.1.1. I have to admit Tesseract is a bit of a black box to me, and short of setting a few variables I am at a bit of a los

Re: [tesseract-ocr] Improve text extraction when some text is inverted

2021-07-02 Thread Zdenko Podobny
You provided no example, so just a hint: have a look at the leptonica function pixAutoPhotoinvert[1], which should help in such cases. The function is available, IMO, from version 1.79.0. [1] https://github.com/DanBloomberg/leptonica/blob/5aaf1c187deeef7f47288c6b0833a07021940da7/src/pageseg.c#L2370-L2391 Z
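
To illustrate the general idea of detecting and flipping inverted (white-on-black) regions before OCR, here is a simplified sketch. It is not the leptonica `pixAutoPhotoinvert` algorithm; it is a plain dark-pixel-majority heuristic, assuming a region is a list of rows of 0-255 grayscale integers. The function names and the threshold of 128 are hypothetical.

```python
def looks_inverted(region, threshold=128):
    """Return True if more than half of the pixels are darker than threshold."""
    pixels = [p for row in region for p in row]
    dark = sum(1 for p in pixels if p < threshold)
    return dark * 2 > len(pixels)


def invert(region):
    """Flip every pixel so white-on-black becomes black-on-white."""
    return [[255 - p for p in row] for row in region]


def normalize_polarity(region, threshold=128):
    """Invert the region only when it appears to be white text on black."""
    return invert(region) if looks_inverted(region, threshold) else region
```

Because only some of the text on a form is inverted, a check like this would be applied per region (for example per form field or text block) rather than to the whole page.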

Re: [tesseract-ocr] Improve text extraction when some text is inverted

2021-07-02 Thread Merlijn B.W. Wajer
Hi, On 01/07/2021 18:39, 'Chris' via tesseract-ocr wrote: > I am experimenting with Tesseract 4.1.1 using C# to extract text from black > and white or greyscale TIF images of semi structured forms that are 300 > dpi. > > The results are really promising except when some of the text is inverted

[tesseract-ocr] Improve text extraction when some text is inverted

2021-07-01 Thread 'Chris' via tesseract-ocr
I am experimenting with Tesseract 4.1.1 using C# to extract text from black and white or greyscale TIF images of semi-structured forms that are 300 dpi. The results are really promising except when some of the text is inverted (i.e. white on black). In these cases the results are poor. Can anyon