Tesseract is an OCR engine, so remove the graphic elements yourself, or send only the text areas to OCR.
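To make "send only the text areas" concrete, here is a minimal sketch in plain Python. It is not part of Tesseract or tesseract.js: the helper names (`luma`, `crop`, `binarize`), the threshold of 128, and the tiny 4x3 test image are all assumptions made up for illustration. It cuts the text region out of an RGB pixel grid and turns it into black text on a white background before it would be handed to OCR.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0-255
Image = List[List[Pixel]]     # rows of pixels


def luma(p: Pixel) -> int:
    """Approximate perceived brightness (0-255) using integer BT.601 weights."""
    r, g, b = p
    return (299 * r + 587 * g + 114 * b) // 1000


def crop(img: Image, left: int, top: int, width: int, height: int) -> Image:
    """Keep only the rectangle that contains text; drop everything else."""
    return [row[left:left + width] for row in img[top:top + height]]


def binarize(img: Image, threshold: int = 128) -> List[List[int]]:
    """Map dark pixels to 0 (ink) and light pixels to 255 (background).

    The threshold is an assumption; a dark background colour would need
    a different value or an inverted mapping.
    """
    return [[0 if luma(p) < threshold else 255 for p in row] for row in img]


# Hypothetical image: column 0 is a blue graphic, columns 1-3 hold
# black text on a yellow background.
BLUE, YELLOW, BLACK = (30, 30, 200), (250, 220, 50), (0, 0, 0)
img = [
    [BLUE, YELLOW, BLACK, YELLOW],
    [BLUE, BLACK,  BLACK, YELLOW],
    [BLUE, YELLOW, BLACK, YELLOW],
]

text_area = crop(img, left=1, top=0, width=3, height=3)
clean = binarize(text_area)
for row in clean:
    print(row)
```

With a real image you would do the same two steps with an image library and pass only the cleaned text region to the OCR call. The right crop rectangle and threshold depend on the images, so treat the numbers above as placeholders.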
Zdenko

On Tue, 20 Apr 2021 at 10:40, Soul Green <soulu...@gmail.com> wrote:

> Omg thanks.
> I hadn't thought about checking *that* documentation. I've been using
> tesseract.js with node so I completely forgot that it was based on
> something else. How amateur.
> I also didn't know that tesseract did its own processing as well.
> Thanks again, I'll try everything there.
>
> On Tuesday, 20 April 2021 at 5:14:56 pm UTC+10 zdenop wrote:
>
>> Hint: read documentation, stop guessing. You can start here
>> https://github.com/tesseract-ocr/tessdoc/blob/master/ImproveQuality.md
>>
>> Zdenko
>>
>> On Tue, 20 Apr 2021 at 9:11, Soul Green <soul...@gmail.com> wrote:
>>
>>> I am very new to coding so forgive me.
>>>
>>> I have been having an extremely low success rate with tesseract.
>>> Here are 3 examples, both pre- and post-processing:
>>>
>>> [image: red1.jpg] [image: croppedred1.jpg]
>>> [image: yellow1.jpg] [image: croppedyellow1.jpg]
>>> [image: blue1.jpg] [image: croppedblue1.jpg]
>>>
>>> These were scanned as "a", "Ss30", and "moh" respectively.
>>> I consider the yellow one a success, as I can just regex the 30 out of
>>> the result, but I still don't understand how it could be so off for the
>>> rest.
>>>
>>> I've tried different traineddatas, even including one that I trained
>>> myself on over 200 data examples.
>>>
>>> I have three theories as to why I couldn't train it:
>>> 1. The different colours are processed differently, causing differently
>>> shaped characters. (Red looks bold and yellow looks thin.)
>>> 2. The different sizes of the images cause the characters to be
>>> slightly differently shaped when cropped.
>>> 3. Tesseract assumes that the two lines of text are one, and reads them
>>> together.
>>>
>>> Could someone please give me a hint on what to try? I don't want to
>>> spend another day training it on just blue ones (for example) only to find
>>> that colour isn't the problem.
>>> Thanks
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/9d819bc5-cf07-4c28-91a6-61b142ccc324n%40googlegroups.com