Re: [tesseract-ocr] OCR problem with condensed text

Zdenko Podobny Sun, 07 May 2023 12:49:28 -0700

Hello,

yes, you can train Tesseract with your own images. Have a look at
https://github.com/tesseract-ocr/tesstrain and an example project
https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip
You can also fine-tune an existing model (e.g. just to add new
letters/symbols or a font) by using the START_MODEL parameter.
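
A rough sketch of how the fine-tuning setup could be prepared from Python. It writes one `.gt.txt` transcription per line image into a ground-truth directory and builds the `make training` call with START_MODEL. The model name `cells`, the tessdata path, and the iteration count are assumptions for illustration; check the tesstrain README for the variables your checkout supports.

```python
from pathlib import Path


def write_ground_truth(gt_dir, transcriptions):
    """Write one `<image stem>.gt.txt` file per line image.

    `transcriptions` maps an image file name (e.g. "cell_001.png") to the
    exact text of the single line shown in that image.
    """
    gt_dir = Path(gt_dir)
    gt_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_name, text in transcriptions.items():
        gt_path = gt_dir / f"{Path(image_name).stem}.gt.txt"
        gt_path.write_text(text.strip() + "\n", encoding="utf-8")
        written.append(gt_path)
    return written


def training_command(model_name, start_model="eng",
                     tessdata="/usr/share/tessdata", max_iterations=2000):
    """Build the `make training` call to run inside the tesstrain checkout."""
    return [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",   # fine-tune from this existing model
        f"TESSDATA={tessdata}",         # directory holding <start_model>.traineddata
        f"MAX_ITERATIONS={max_iterations}",
    ]
```

With the images copied next to the `.gt.txt` files in `data/cells-ground-truth`, the command could be run with `subprocess.run(training_command("cells"), cwd="tesstrain", check=True)`. As far as I know, START_MODEL has to be a float model from tessdata_best; the integer models from tessdata_fast cannot be fine-tuned.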


Zdenko


Fri, 5 May 2023 at 20:13, Augustin Fourcaud <a.fourca...@gmail.com> wrote:

> Hello, I am using Tesseract for a project (I'm not used to OCR) and I am
> encountering some issues that I haven't been able to resolve with the
> documentation, and I saw that questions can be asked here. I am coding in
> Python, in Jupyter, and using pytesseract.
>
> I would like to extract the text contained in tables (see img1). Most of
> these tables are scanned PDFs, which is why I am using OCR. So, I tried to
> apply it to specific cells to test the accuracy and adjust Tesseract. I
> created PNG images of different sizes and did several preprocessing tests.
> I have three types of images: table titles (title.png), cells with spaced
> text (light_cell.png), and cells with tight text (cellXXprcent.png). This
> is where I encounter problems that I cannot solve:
>
> In the case of cells with tight text (cellXXprcent.png; the 3 images are
> a small sample of all the formats I tested), I cannot get good results,
> even on very zoomed-in text of good quality, or on text of the right size
> (about 30px high) but of average quality. I have tried resizing the images
> in several different ways (scaling directly from the PDF with the scaleBy
> method of PyPDF2, saving at 300 DPI and resizing the PNGs with OpenCV) and
> preprocessing them (thresholding, erosion, dilation, opening, top_hat,
> with different sizes of ellipse and rectangle kernels) without really
> increasing the accuracy. I have applied everything (I think) that the
> documentation recommends: binarization and an image border of 5 and 10
> pixels; the images are not noisy, are straight, and have no alpha channel.
> I have also tested different OEMs and PSMs and disabled the Tesseract
> dictionaries, since my text is not in the form of words (should I write
> "-c load_system_dawg=false -c load_freq_dawg=false" or
> "-c load_system_dawg=false+load_freq_dawg=false" in the config? Both work,
> so I don't know which format is correct). Is there a solution that I
> haven't tried yet?
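
On the config format: each variable is normally passed as its own `-c name=value` pair, so the first form is the usual one. With the `+` form, everything after the first `=` is probably taken as the value of `load_system_dawg`, so `load_freq_dawg` may not be changed at all even though no error is shown; I have not verified this against every Tesseract version. A minimal sketch follows; the helper names and the PSM/OEM defaults are assumptions for illustration.

```python
def build_config(psm=7, oem=1, variables=None):
    """Build a Tesseract config string with one `-c name=value` per variable."""
    parts = [f"--oem {oem}", f"--psm {psm}"]
    for name, value in (variables or {}).items():
        parts.append(f"-c {name}={value}")
    return " ".join(parts)


# Switch off both word lists, since the cell contents are not dictionary words.
NO_DICTIONARY = {"load_system_dawg": "false", "load_freq_dawg": "false"}


def ocr_cell(image_path, lang="eng"):
    """Run pytesseract on one cell image with the dictionaries disabled."""
    # Imported here so build_config() can be used without pytesseract/Pillow.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=build_config(variables=NO_DICTIONARY)
        )
```

Here `--psm 7` treats the image as a single text line and `--oem 1` selects the LSTM engine; a cell with several lines would more likely need `--psm 6`.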
>
> I also have more general questions:
>
> Is training Tesseract with my own images the solution, and if so, can I
> train it with large images or with images of a specific size? I haven't
> done any training myself yet because my images are of good quality and
> don't use a particularly unusual font.
>
> How can I add the Greek script to the parameters to detect Lambda without
> disrupting the recognition of English characters? Currently, when I write
> (-l eng+greek), some English characters are recognized as Greek characters;
> I would like it to recognize only Lambda as Greek. Could the whitelist
> argument be a solution?
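
Two hedged options for the Lambda question, sketched below. The first builds a `tessedit_char_whitelist` that allows Latin letters, digits, a few symbols, and lambda only; as far as I know the whitelist is honoured by the LSTM engine from Tesseract 4.1 onward, but not in 4.0. The second is a post-processing step (not a Tesseract feature) that maps Greek capitals which look identical to Latin letters back to Latin while leaving lambda alone. The helper names and the symbol set are assumptions for illustration.

```python
import string
import unicodedata

# Greek capitals that are visually identical to Latin letters. Lambda is
# deliberately missing, so it survives the clean-up below.
_CAPITAL_LOOKALIKES = {
    "ALPHA": "A", "BETA": "B", "EPSILON": "E", "ZETA": "Z", "ETA": "H",
    "IOTA": "I", "KAPPA": "K", "MU": "M", "NU": "N", "OMICRON": "O",
    "RHO": "P", "TAU": "T", "UPSILON": "Y", "CHI": "X",
}
GREEK_TO_LATIN = {
    ord(unicodedata.lookup(f"GREEK CAPITAL LETTER {name}")): latin
    for name, latin in _CAPITAL_LOOKALIKES.items()
}
GREEK_TO_LATIN[ord(unicodedata.lookup("GREEK SMALL LETTER OMICRON"))] = "o"

LAMBDAS = "\u03bb\u039b"  # small and capital lambda


def whitelist_config(extra=".,-+/%()"):
    """Config fragment allowing Latin letters, digits, `extra`, and lambda."""
    # No spaces or quotes in the value: the config string is split shell-style.
    chars = string.ascii_letters + string.digits + extra + LAMBDAS
    return f"-c tessedit_char_whitelist={chars}"


def latinize_lookalikes(text):
    """Replace Greek look-alike letters with Latin ones; lambda is kept."""
    return text.translate(GREEK_TO_LATIN)
```

The whitelist fragment would be appended to the existing config and used with both languages loaded, for example `lang="eng+ell"`, assuming the Greek model is installed under its standard name `ell`. The look-alike table is not exhaustive and would need extending for the characters that actually get confused in your images.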
>
> Thank you very much in advance if you take the time to answer me, and have
> a good weekend.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/99b58936-7fc1-4ee2-976b-5a942f58e5fcn%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/99b58936-7fc1-4ee2-976b-5a942f58e5fcn%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

