Hi Nor,
I would crop the text as tightly as possible; that way you control the exact text region (see the attached image). Also try adding a white border of 1 or 2 pixels afterwards and see if this works better.
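The crop-and-pad step above could be sketched roughly as follows. This is only an illustration on a plain grayscale grid (a list of rows), assuming dark text on a light background; the names `tight_bbox`, `crop_with_border` and the `dark_below` cutoff are made up for this example, and a real script would do the same thing with an image library.

```python
from typing import List, Tuple

Gray = List[List[int]]  # rows of 0..255 grayscale values, 0 = black


def tight_bbox(img: Gray, dark_below: int = 128) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the smallest box holding every
    pixel darker than dark_below; right and bottom are exclusive."""
    rows = [y for y, row in enumerate(img) if any(p < dark_below for p in row)]
    cols = [x for x in range(len(img[0]))
            if any(row[x] < dark_below for row in img)]
    if not rows or not cols:
        raise ValueError("no dark pixels found")
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1


def crop_with_border(img: Gray, border: int = 2, dark_below: int = 128) -> Gray:
    """Crop tightly around the dark text, then pad with white pixels."""
    left, top, right, bottom = tight_bbox(img, dark_below)
    cropped = [row[left:right] for row in img[top:bottom]]
    width = (right - left) + 2 * border
    # White margin on the left and right of every cropped row.
    padded = [[255] * border + row + [255] * border for row in cropped]
    # White rows above and below.
    margin = [[255] * width for _ in range(border)]
    return margin + padded + [row[:] for row in margin]
```

Note that a stray gray line darker than the `dark_below` cutoff would still be pulled into the box, which is one reason a fixed, hand-picked crop rectangle may be more predictable.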
The image you sent is not pure black and white, so the automatic cropping may be getting confused. There is a gray line at the bottom of the image that is probably causing the problem. If you do not want to crop it yourself, threshold the image instead, but you will need to find a reasonable threshold value (experiment with GIMP). Cropping seems easier.

Use psm 7 or 6 (see tesseract --help-extra). With the tightly cropped images, try rescaling to a few fixed heights, such as "original size", 30, 35, 40, 45 and 50 px, and see what works best. Then do a second pass around the best height with a finer grid. As you have a reasonable number of test images, I would run a script to test all these preprocessing combinations, a few hundred of them, to find the sweet spot, even if it takes a couple of hours.

You can also use a whitelist to limit the valid characters, depending on the type of errors you are seeing. The image looks heavily compressed; if possible, reduce the compression or use PNG.

I do not know which tool/language you are using but, if you are programming, see if you can find real API bindings (like tesserocr for Python) rather than a command-line wrapper.

Bye
Lorenzo

On Wed, Jul 26, 2023 at 21:09, nor s <njsgas...@gmail.com> wrote:

> OK I think I found the sweet spot. Setting the location for the crop
> rectangle to +933+1013 from the top left corner of the image gives me an
> amazing result of 98.8% on average over 670 images. I think that's pretty
> good!
> I still don't know why moving the box around a few pixels makes such a
> difference.
>
> I think I'm where I want to be. If anyone has any ideas or suggestions
> about what's happening I'd love to hear from you.
>
> Cheers
> Nor
>
> On Wednesday, July 26, 2023 at 12:24:26 PM UTC-4 nor s wrote:
>
>> Just to add a bit more information. I have found that changing the
>> vertical position of the crop box by a few pixels seems to make a
>> difference.
>> One image that had a crop location of +930+1015 was not reading the
>> date/time. However, changing the vertical position to +1000 resulted in
>> 105 out of 133 correct readings. Again, not being familiar with the
>> internal workings of OCR, I am having difficulty understanding why OCR is
>> behaving this way.
>>
>> Still digging! :)
>>
>> Cheers
>> Nor
>>
>> On Wednesday, July 26, 2023 at 9:21:56 AM UTC-4 nor s wrote:
>>
>>> To show an example of an OCR that properly extracted the date/time, here
>>> are the files I used.
>>> ShowPix is the full image, Outpx.2.jpg is the cropped image and
>>> outpx2.txt is the result of the OCR.
>>>
>>> As you can see, the image that failed and the one that worked are very
>>> similar.
>>>
>>> Cheers
>>> Nor
>>> On Wednesday, July 26, 2023 at 9:05:04 AM UTC-4 nor s wrote:
>>>
>>>> Hi All,
>>>> As I had mentioned in an earlier message, I've got tesseract to
>>>> properly identify dates and times at a rate of about 84%. However, what
>>>> puzzles me is why the program reads the time stamp from one image
>>>> properly and on another image it fails. All the images are similar and
>>>> for all of them I crop out the date/time area to isolate it. I have
>>>> attached an example.
>>>>
>>>> The tempimage.jpg is the full image. outpx.jpx is the cropped image and
>>>> outpx.txt is the OCR result produced from the cropped image.
>>>>
>>>> If anyone has any idea why OCR fails on this I would love to hear from
>>>> you.
>>>>
>>>> Thanks for your help.
>>>>
>>>> Cheers
>>>> Nor
>>>
>>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/631ff8fd-660e-4bb2-b558-013bcc00218cn%40googlegroups.com
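
For the two-pass parameter sweep suggested in the reply (a coarse grid over heights, psm and border, then a finer grid around the best height), a minimal harness might look like the sketch below. Everything here is an assumption for illustration: the `ocr` callable is a hypothetical hook that the reader would implement with whatever binding is in use (for example tesserocr), and the grid values only mirror the ones mentioned in the message, so the coarse grid is smaller than the "few hundred" combinations a full test would cover.

```python
from itertools import product
from typing import Callable, List, Optional, Sequence, Tuple

# (height, psm, border); height None means "keep the original size".
Config = Tuple[Optional[int], int, int]
# Hypothetical hook: takes an image path and a config, returns the OCR text.
OcrFn = Callable[[str, Config], str]
Samples = Sequence[Tuple[str, str]]  # (image path, expected text)


def coarse_grid() -> List[Config]:
    """First pass: the heights, psm values and borders from the message."""
    heights = [None, 30, 35, 40, 45, 50]
    psms = [6, 7]
    borders = [0, 1, 2]
    return list(product(heights, psms, borders))


def fine_grid(best: Config, span: int = 4) -> List[Config]:
    """Second pass: 1 px height steps around the best coarse height."""
    height, psm, border = best
    if height is None:
        return [best]
    return [(h, psm, border) for h in range(height - span, height + span + 1)]


def accuracy(ocr: OcrFn, samples: Samples, config: Config) -> float:
    """Fraction of samples whose OCR text matches the expected text exactly."""
    hits = sum(1 for path, expected in samples
               if ocr(path, config).strip() == expected)
    return hits / len(samples)


def best_config(ocr: OcrFn, samples: Samples,
                grid: Sequence[Config]) -> Tuple[Config, float]:
    """Score every config in the grid and return the best one with its score."""
    scored = [(accuracy(ocr, samples, c), c) for c in grid]
    score, config = max(scored, key=lambda item: item[0])
    return config, score
```

Typical use would be `cfg, _ = best_config(my_ocr, samples, coarse_grid())` followed by `best_config(my_ocr, samples, fine_grid(cfg))`, where `my_ocr` resizes and pads the crop according to the config before calling Tesseract.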