Re: [tesseract-ocr] OpenCV Python preprocessing strategies for OCR (pytesseract) character recognition

Zdenko Podobny Fri, 03 Jan 2025 15:27:47 -0800

There is no such thing as 100% OCR.

Please provide an example of an image that causes a problem. The ones you
provided work out of the box:
tesseract image_save1.png -
Estimating resolution as 445
S/N: 0112182
DATE: DECEMBER 2024


tesseract image_save2.png -
Estimating resolution as 450
5X
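
For scripted checks, the same command can be driven from Python. A minimal
sketch using only the standard library, assuming the tesseract executable is
on PATH; build_command and run_tesseract are illustrative names, not part of
tesseract or pytesseract, and the expectation that diagnostics arrive on
stderr is an assumption:

```python
import subprocess


def build_command(image_path):
    # Same invocation as above: "-" sends the recognized text to stdout.
    return ["tesseract", image_path, "-"]


def run_tesseract(image_path):
    """Run the tesseract CLI on one image and return (text, diagnostics)."""
    result = subprocess.run(
        build_command(image_path),
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode both streams to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    # Assumption: messages such as "Estimating resolution as ..." are
    # written to stderr, so stdout holds only the recognized text.
    return result.stdout.strip(), result.stderr.strip()
```

Calling run_tesseract("image_save1.png") on the raw, unprocessed file gives a
baseline to compare against whatever the OpenCV pipeline produces.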

Zdenko


On Sat, 4 Jan 2025 at 0:18, Jokūbas Žižiūnas <jokubas0...@gmail.com> wrote:

> I wanted to ask which pre-processing techniques work best for the
> characters I am trying to read. I am using pytesseract for character
> recognition, but sometimes my characters are not recognized properly.
>
> I have attached a couple of sample images; I am working with more than these.
>
>
> The most common issues are:
> - 5 gets recognized as S (but not vice versa)
> - S gets recognized as O (but not vice versa)
> - / gets recognized as I
>
> I have tried multiple techniques, but if one technique fixes an issue,
> then another issue pops up. The character recognition works most of the
> time, but it is not consistent; I would say ~80%. I can take a picture, do
> the processing, and recognition works; then I take a new picture in the
> same conditions and recognition fails. It seems that ordinary image noise
> is enough to change the result.
>
> I believe that a large part of the issue is that the font is bold. For
> example, I noticed that the wider the / is, the more likely it is to be
> recognized as I. I have tried cv2.resize(fx=2, fy=2) + cv2.erode(), but
> then for some reason I noticed that the thicker the 5 is, the less
> likely it is to be recognized as S. At the same time, if the characters
> are thicker, or I reduce the binarization threshold, the hole in 4 gets
> filled in and causes problems.
>
> I cannot change the font. I have tried taking pictures at various
> exposures, but nothing seems to fix the core of the issue. This is the
> best focus I am able to obtain. I cannot whitelist certain symbols, because
> both letters and numbers are possible. I do not want to do .replace('SX',
> '5X') because the point of the check is to validate that the label has
> been printed correctly.
>
> Techniques I have tried:
> - Regular binarization
> - Otsu binarization
> - Adaptive thresholding
> - Resize + erode()
> - Upscaling the image with cv2.dnn_superres: somewhat better, but too
> slow, because I have a lot of images to process
> - Histogram equalization before any of the above
>
> NOTE: I can find a solution that works for the sample images, but not one
> that stays consistent when the images vary slightly; I cannot get it to
> work 100% of the time.
>
> Can someone advise on how you would go about cleaning up these images?
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion visit
> https://groups.google.com/d/msgid/tesseract-ocr/59cbd128-37c6-4c06-abbc-f79a05d95a5dn%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/59cbd128-37c6-4c06-abbc-f79a05d95a5dn%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
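
One thing worth checking in the "Resize + erode()" step from the question:
cv2.erode() takes the local minimum of each neighbourhood, so on dark text
over a light background it thickens the strokes, while cv2.dilate() thins
them. A minimal pure-Python sketch of that behaviour, assuming a 3x3 kernel
and dark-on-light polarity; morph, erode, dilate and dark_pixels are
illustrative helpers written for this example, not OpenCV functions:

```python
def morph(image, pick):
    """Apply a 3x3 filter to a 2D list of grey values.

    pick=min gives erosion, pick=max gives dilation. Pixels outside the
    image are ignored, so only in-bounds neighbours are considered.
    """
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(pick(neighbours))
        out.append(row)
    return out


def erode(image):
    # Local minimum: bright areas shrink, dark areas grow.
    return morph(image, min)


def dilate(image):
    # Local maximum: bright areas grow, dark areas shrink.
    return morph(image, max)


def dark_pixels(image):
    # Count pure-black pixels as a crude measure of stroke thickness.
    return sum(value == 0 for row in image for value in row)


# Dark (0), one-pixel-wide vertical stroke on a white (255) 5x5 background.
stroke = [[0 if x == 2 else 255 for x in range(5)] for _ in range(5)]
```

Here erode(stroke) widens the stroke from one column to three, and
dilate(stroke) removes it entirely. If the labels are dark on light, thinning
the bold strokes would therefore call for cv2.dilate() (or inverting the image
before cv2.erode()), which may also help keep the hole in 4 open.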

