Thanks for your answer, I'll try it.

On Sunday, 7 May 2023 at 21:49:29 UTC+2, zdenop wrote:
> Hello,
>
> Yes, you can train Tesseract with your images. Have a look at
> https://github.com/tesseract-ocr/tesstrain and an example project,
> https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip
> You can retrain (fine-tune) an existing model (e.g. just to add new
> letters/symbols or a font) by using the START_MODEL parameter.
>
> Zdenko
>
>
> On Fri, 5 May 2023 at 20:13, Augustin Fourcaud <a.fou...@gmail.com> wrote:
>
>> Hello, I am using Tesseract for a project (I'm not used to OCR) and I am
>> encountering some issues that I haven't been able to resolve with the
>> documentation, and I saw that questions can be asked here. I am coding in
>> Python, on Jupyter, and using Pytesseract.
>>
>> I would like to extract the text contained in tables (see img1). Most of
>> these tables are scanned PDFs, which is why I am using OCR. So, I tried to
>> apply it to specific cells to test the accuracy and adjust Tesseract. I
>> created PNG images of different sizes and did several preprocessing tests.
>> I have three types of images: table titles (title.png), cells with spaced
>> text (light_cell.png), and cells with tight text (cellXXprcent.png). This
>> is where I encounter problems that I cannot solve:
>>
>> In the case of cells with tight text (cellXXprcent.png; the 3 images are
>> a small sample of all the formats I tested), I cannot get good results,
>> even on heavily zoomed-in text of good quality, or on text of the right
>> size (about 30 px high) but of average quality. I have tried resizing the
>> images in several ways (scaling directly from the PDF with the scaleBy
>> method of PyPDF2, saving at 300 DPI, and resizing the PNGs with OpenCV)
>> and preprocessing them (thresholding, erosion, dilation, opening, top_hat,
>> with different sizes of ellipse and rectangle kernels) without really
>> increasing the accuracy. I have applied everything (I think) that the
>> documentation recommends: binarization and an image border of 5 and 10
>> pixels; the images are not noisy, are straight, and have no alpha channel.
>> I have also tested different OEMs and PSMs and disabled the Tesseract
>> dictionaries, since my text is not in the form of words (should I write
>> "-c load_system_dawg=false -c load_freq_dawg=false" or
>> "-c load_system_dawg=false+load_freq_dawg=false" in the config? Both work,
>> so I don't know which format is correct). Is there a solution that I
>> haven't tried yet?
>>
>> I also have more general questions:
>>
>> Is training Tesseract with my own images the solution, and if so, can I
>> train it with large images or with a specific size? I haven't done any
>> training myself yet because my images are of good quality and don't have a
>> particularly extravagant font.
>>
>> How can I add the Greek script to the parameters to detect Lambda without
>> disrupting the recognition of English characters? Currently, when I write
>> (-l eng+greek), some English characters are recognized as Greek characters;
>> I would like it to recognize only Lambda as Greek. Could the whitelist
>> argument be a solution?
>>
>> Thank you very much in advance if you take the time to answer me, and
>> have a good weekend.
>>
>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/258d2ce9-0a31-428b-9bfd-1944b44ea6fen%40googlegroups.com.
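
On the config-format question quoted above: the Tesseract command-line documentation describes `-c VAR=VALUE` and states that multiple `-c` arguments are allowed, so repeating `-c` once per variable is the documented form. The following is a minimal sketch of building such a config string for Pytesseract. The helper names `build_config` and `ocr_cell`, and the default `psm`/`oem` values, are assumptions made for illustration and do not come from the thread.

```python
def build_config(variables, psm=6, oem=1):
    """Build a Tesseract config string with one -c flag per variable.

    psm=6 (single uniform block of text) and oem=1 (LSTM engine only)
    are illustrative defaults, not values recommended in the thread.
    """
    parts = [f"--oem {oem}", f"--psm {psm}"]
    for name, value in variables.items():
        # One "-c NAME=VALUE" pair per variable, as the CLI help documents.
        parts.append(f"-c {name}={value}")
    return " ".join(parts)


def ocr_cell(image_path, lang="eng"):
    """Run OCR on one cell image with the dictionaries disabled (hypothetical helper)."""
    # Imported lazily so build_config can be used without pytesseract installed.
    import pytesseract
    from PIL import Image

    config = build_config({
        "load_system_dawg": "false",
        "load_freq_dawg": "false",
    })
    return pytesseract.image_to_string(Image.open(image_path), lang=lang, config=config)


if __name__ == "__main__":
    print(build_config({"load_system_dawg": "false", "load_freq_dawg": "false"}))
```

A call such as `ocr_cell("light_cell.png")` would then pass the config to Tesseract with each variable set by its own `-c` flag. Whether disabling the dictionaries improves accuracy on these particular cells is not something the sketch can show.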