There is no such thing as 100% OCR. Please provide an example of an image that causes a problem. The ones you provided work out of the box:

tesseract image_save1.png -
Estimating resolution as 445
S/N: 0112182
DATE: DECEMBER 2024

tesseract image_save2.png -
Estimating resolution as 450
5X

Zdenko

On Sat, 4 Jan 2025 at 0:18, Jokūbas Žižiūnas <jokubas0...@gmail.com> wrote:

> I wanted to ask what the best pre-processing techniques are for the
> characters I would like to read. I am using pytesseract for character
> recognition, but sometimes my characters are not recognized properly.
>
> I have added a couple of samples of the images I am using, but I am
> using more.
>
> The most common issues are:
> - 5 gets recognized as S (but not vice versa)
> - S gets recognized as O (but not vice versa)
> - / gets recognized as I
>
> I have tried multiple techniques, but if one technique fixes an issue,
> then another issue pops up. The character recognition works most of the
> time, but it is not consistent; I would say ~80%. I can take a picture,
> do the processing, and recognition works, then take a new picture in the
> same conditions and recognition does not work. It seems the result is
> within the tolerance of noise.
>
> I believe that a large part of the issue is that the font is bold. For
> example, I noticed that the wider / is, the more likely it is to be
> recognized as I. I have tried cv2.resize(fx=2, fy=2) + cv2.erode(), but
> then for some reason I noticed that the thicker the 5 is, the less
> likely it is to be recognized as S. At the same time, if the characters
> are thicker, or I reduce the threshold in binarization, the hole in 4
> gets filled in and causes problems.
>
> I cannot change the font. I have tried taking pictures at various
> exposures; nothing seems to fix the core of the issue. This is the best
> focus I am able to obtain. I cannot whitelist certain symbols, because
> both letters and numbers are possible. I do not want to do
> .replace('SX', '5X') because the point of the check is to validate that
> the label has been printed correctly.
>
> Techniques I have tried:
> - Regular binarization
> - Otsu binarization
> - Adaptive thresholding
> - Resize + erode()
> - Upscaling the image with cv2.dnn_superres: somewhat better, but too
>   slow, because I have a lot of images to process
> - Histogram equalization before any of the above
>
> NOTE: I am able to get a solution for the sample images, but I am unable
> to get a consistent solution if the images vary slightly. I cannot get
> it to work 100% of the time.
>
> Can someone provide info on how you would go about cleaning up these
> images?
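
For reference, the Otsu binarization named in the list of attempted techniques picks one global threshold from the image histogram, so a small shift in exposure between two pictures can move the threshold and change stroke thickness. Below is a minimal pure-Python sketch of that selection, assuming 8-bit grayscale values in a flat list. The names otsu_threshold and binarize are illustrative only and do not come from the thread; in practice cv2.threshold with the cv2.THRESH_OTSU flag does this job.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximizes between-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    `pixels` is assumed to be a flat list of 8-bit grayscale integers.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0      # number of pixels in the lower class
    sum0 = 0    # sum of pixel values in the lower class
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue            # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break               # upper class empty, nothing left to split
        sum0 += t * hist[t]
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        score = w0 * w1 * (m0 - m1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(pixels, t):
    """Map pixels above the threshold to 255 and the rest to 0."""
    return [255 if p > t else 0 for p in pixels]


if __name__ == "__main__":
    # Hypothetical sample: dark ink around 10, bright background around 200.
    sample = [10] * 50 + [200] * 50
    t = otsu_threshold(sample)
    print(t, binarize([10, 200], t))
```

Running the selection on two pictures taken in the same conditions and comparing the returned thresholds is one way to check whether the inconsistency starts at the binarization step.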