Hello,

yes, you can train Tesseract with your own images. Have a look at https://github.com/tesseract-ocr/tesstrain and at an example project: https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip

You can also fine-tune an existing model (e.g. just to add new letters/symbols or a font) by using the START_MODEL parameter.
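Since you are working in Python, here is a minimal sketch of how a fine-tuning run could be launched from a notebook. It only assembles the tesstrain `make training` call with START_MODEL set. The model name "tables", the tessdata path and the checkout location are placeholders (assumptions, not values from this thread); the sketch also assumes your ground truth sits where tesstrain looks by default (data/MODEL_NAME-ground-truth) and that the matching eng.traineddata is available in the TESSDATA directory.

```python
import shlex


def build_finetune_command(model_name, start_model, tessdata_dir,
                           ground_truth_dir=None):
    """Assemble the argv list for a tesstrain fine-tuning run.

    MODEL_NAME, START_MODEL, TESSDATA and GROUND_TRUTH_DIR are variables
    of the tesstrain Makefile; the values passed in are up to you.
    """
    cmd = [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",  # fine-tune from this existing model
        f"TESSDATA={tessdata_dir}",    # directory holding <start_model>.traineddata
    ]
    if ground_truth_dir is not None:
        # Only needed when the images/transcriptions are not in the default place.
        cmd.append(f"GROUND_TRUTH_DIR={ground_truth_dir}")
    return cmd


# "tables" and "../tessdata_best" are hypothetical placeholders.
cmd = build_finetune_command("tables", "eng", "../tessdata_best")
print(shlex.join(cmd))
```

To actually start the training, the list can be passed to `subprocess.run(cmd, cwd="path/to/tesstrain", check=True)`, where the `cwd` value stands for wherever you cloned the tesstrain repository.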
Zdenko

On Fri, 5 May 2023 at 20:13, Augustin Fourcaud <a.fourca...@gmail.com> wrote:

> Hello, I am using Tesseract for a project (I'm not used to OCR) and I am
> encountering some issues that I haven't been able to resolve with the
> documentation, and I saw that questions can be asked here. I am coding in
> Python, on Jupyter, and using Pytesseract.
>
> I would like to extract the text contained in tables (see img1). Most of
> these tables are scanned PDFs, which is why I am using OCR. So, I tried to
> apply it to specific cells to test the accuracy and adjust Tesseract. I
> created PNG images of different sizes and did several preprocessing tests.
> I have three types of images: table titles (title.png), cells with spaced
> text (light_cell.png), and cells with tight text (cellXXprcent.png). This
> is where I encounter problems that I cannot solve:
>
> In the case of cells with tight text (cellXXprcent.png; the 3 images are
> a small part of all the formats I tested), even on very zoomed-in text,
> which is of good quality, or on text of the right size (about 30px high)
> but of average quality, I cannot get good results. I have tried on the
> images by modifying the size in several different ways (scaling directly
> from the PDF with the scaleBy method of PyPDF2, saving at 300 DPI and
> resizing the PNGs with OpenCV) and with preprocessing (thresholding,
> erosion, dilation, opening, top_hat, and different sizes of ellipse and
> rectangle kernels) without really increasing the accuracy. I have applied
> everything (I think) that is said in the documentation: binarization, an
> image border of 5 and 10 pixels; the images are not noisy and are
> straight, and there is no alpha channel. I have also tested with
> different OEMs and PSMs and by disabling the Tesseract dictionaries, since
> my text is not in the form of words (should I write
> "-c load_system_dawg=false -c load_freq_dawg=false" or
> "-c load_system_dawg=false+load_freq_dawg=false" in the config? Both
> work, so I don't know which format is correct). Is there a solution that
> I haven't tried yet?
>
> I also have more general questions:
>
> Is the solution to train Tesseract with my own images, and if so, can I
> train it with a large size or with a specific size? I haven't done any
> training myself yet because my images are of good quality and don't have
> a particularly extravagant font.
>
> How can I add the Greek script to the parameters to detect Lambda without
> disrupting the recognition of English characters? Currently, when I write
> (-l eng+greek), some English characters are recognized as Greek
> characters; I would like it to only recognize Lambda as Greek. Could the
> whitelist argument be a solution?
>
> Thank you very much in advance if you take the time to answer me, and
> have a good weekend.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8z5s_F3y32QTY4ua_7cng7_5vN66LwTe5antY1NX00K-w%40mail.gmail.com.