Re: [tesseract-ocr] OCR problem with condensed text

Augustin Fourcaud Wed, 10 May 2023 01:19:04 -0700

Thanks for your answer, I'll try it.
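
For anyone following along, here is a minimal sketch of how the fine-tuning run suggested below could be launched from Python. MODEL_NAME, START_MODEL, GROUND_TRUTH_DIR, TESSDATA and MAX_ITERATIONS are variables of the tesstrain Makefile; the model name "condensed", the directory paths and the iteration count are placeholders made up for illustration, not values from this thread. As far as I know, GROUND_TRUTH_DIR should contain single-line images with matching .gt.txt transcriptions, and the start model should be the tessdata_best version.

```python
import subprocess


def tesstrain_command(model_name, start_model, ground_truth_dir,
                      tessdata_dir, max_iterations=None):
    """Build the argv for a tesstrain `make training` fine-tuning run."""
    cmd = [
        "make", "training",
        f"MODEL_NAME={model_name}",              # name of the new model
        f"START_MODEL={start_model}",            # existing model to fine-tune
        f"GROUND_TRUTH_DIR={ground_truth_dir}",  # line images + .gt.txt files
        f"TESSDATA={tessdata_dir}",              # where START_MODEL lives
    ]
    if max_iterations is not None:
        cmd.append(f"MAX_ITERATIONS={max_iterations}")
    return cmd


# Placeholder values; adjust to your own checkout and data.
cmd = tesstrain_command(
    "condensed", "eng",
    "data/condensed-ground-truth", "path/to/tessdata_best",
    max_iterations=2000,
)
print(" ".join(cmd))

# To actually start training, run it from inside a tesstrain checkout:
# subprocess.run(cmd, cwd="path/to/tesstrain", check=True)
```

Building the command as a list keeps it easy to inspect before launching a long training run.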

On Sunday, 7 May 2023 at 21:49:29 UTC+2, zdenop wrote:


> Hello,
>
> Yes, you can train Tesseract with your images. Have a look at
> https://github.com/tesseract-ocr/tesstrain and the example project
> https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip
> You can also retrain (fine-tune) an existing model, e.g. just to add new
> letters/symbols or a font, by using the START_MODEL parameter.
>
> Zdenko
>
>
> On Fri, 5 May 2023 at 20:13, Augustin Fourcaud <a.fou...@gmail.com> wrote:
>
>> Hello, I am using Tesseract for a project (I'm not used to OCR) and I am
>> encountering some issues that I haven't been able to resolve with the
>> documentation, and I saw that questions can be asked here. I am coding in
>> Python, on Jupyter, and using pytesseract.
>>
>> I would like to extract the text contained in tables (see img1). Most of 
>> these tables are scanned PDFs, which is why I am using OCR. So, I tried to 
>> apply it to specific cells to test the accuracy and adjust Tesseract. I 
>> created PNG images of different sizes and did several preprocessing tests. 
>> I have three types of images: table titles (title.png), cells with spaced 
>> text (light_cell.png), and cells with tight text (cellXXprcent.png). This 
>> is where I encounter problems that I cannot solve:
>>
>> In the case of cells with tight text (cellXXprcent.png; the 3 images are
>> a small sample of all the formats I tested), I cannot get good results,
>> either on very zoomed-in text of good quality or on text of the right size
>> (about 30px high) but of average quality. I have tried resizing the images
>> in several ways (scaling directly from the PDF with the scaleBy method of
>> PyPDF2, saving at 300 DPI and resizing the PNGs with OpenCV) and
>> preprocessing them (thresholding, erosion, dilation, opening, top_hat,
>> with different sizes of ellipse and rectangle kernels) without really
>> increasing the accuracy. I have applied everything (I think) that the
>> documentation recommends: binarization and an image border of 5 and 10
>> pixels; the images are not noisy, are straight, and have no alpha channel.
>> I have also tested different OEMs and PSMs, and disabled the Tesseract
>> dictionaries, since my text is not in the form of words (should I write
>> "-c load_system_dawg=false -c load_freq_dawg=false" or "-c
>> load_system_dawg=false+load_freq_dawg=false" in the config? Both work, so
>> I don't know which format is correct). Is there a solution that I haven't
>> tried yet?
>>
>> I also have more general questions:
>>
>> Would training Tesseract with my own images be the solution, and if so,
>> can I train it with images at a large size or only at a specific size? I
>> haven't done any training myself yet because my images are of good quality
>> and don't use a particularly unusual font.
>>
>> How can I add the Greek script to the parameters to detect Lambda without
>> disrupting the recognition of English characters? Currently, when I write
>> (-l eng+greek), some English characters are recognized as Greek
>> characters; I would like it to recognize only Lambda as Greek. Could the
>> whitelist argument be a solution?
>>
>> Thank you very much in advance if you take the time to answer me, and 
>> have a good weekend.
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/99b58936-7fc1-4ee2-976b-5a942f58e5fcn%40googlegroups.com
>>
>
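
On the config-format question in the quoted message: as far as I know, each -c option takes exactly one name=value pair, so repeating -c for every variable is the form described in the tesseract manual, and I would not rely on the "+" form. Below is a small sketch of a helper that builds such a string for pytesseract's config argument. The helper name build_config and the choice of --psm 6 are my own assumptions for illustration.

```python
import shlex


def build_config(psm=None, oem=None, variables=None):
    """Build a Tesseract config string with one `-c name=value` per variable."""
    parts = []
    if oem is not None:
        parts += ["--oem", str(oem)]
    if psm is not None:
        parts += ["--psm", str(psm)]
    for name, value in (variables or {}).items():
        parts += ["-c", f"{name}={value}"]
    # Quote each piece so the string splits back into the same arguments.
    return " ".join(shlex.quote(part) for part in parts)


config = build_config(
    psm=6,  # assumed page segmentation mode, pick the one that fits the cell
    variables={"load_system_dawg": "false", "load_freq_dawg": "false"},
)
print(config)
# --psm 6 -c load_system_dawg=false -c load_freq_dawg=false

# With pytesseract installed, it would be passed as:
# text = pytesseract.image_to_string(image, config=config)
```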

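On the Lambda question, one option worth trying is a character whitelist that contains the Latin characters you expect plus only λ and Λ, so that other Greek letters cannot be returned. This is a sketch under assumptions: that your Tesseract version honours tessedit_char_whitelist with the LSTM engine (4.1 or later, as far as I know), that the Greek model is installed under its usual name "ell", and that the punctuation set below matches your tables. The helper name whitelist_config is hypothetical.

```python
import shlex
import string

LAMBDA = "\u03bb\u039b"  # lowercase and uppercase lambda


def whitelist_config(extra_chars=LAMBDA, punctuation=".,%+-/"):
    """Return a config string allowing ASCII letters, digits, the given
    punctuation and a few extra characters (by default only lambda)."""
    allowed = string.ascii_letters + string.digits + punctuation + extra_chars
    # Quote the option so the non-ASCII characters survive argument splitting.
    return "-c " + shlex.quote(f"tessedit_char_whitelist={allowed}")


config = whitelist_config()
print(config)

# With pytesseract installed, both models are loaded and the whitelist applied:
# text = pytesseract.image_to_string(image, lang="eng+ell", config=config)
```

If the whitelist alone is not enough, fine-tuning the English model with START_MODEL to add the single symbol, as suggested in the reply above, is the other route mentioned in this thread.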
