L.S., apologies, I think I messed up: my last replies went out privately instead of to the group. Anyway, here's the thread up to now; see below.
(Clumsy on the mobile 😰, me)

@astro/nor: this means I'm the only one who got your first sample image, so it might be good to resend it to the group so everyone can follow. Sorry for messing up the reply chain here.

---------- Forwarded message ---------
From: Ger Hobbelt <g...@hobbelt.com>
Date: Mon, 24 Jul 2023, 20:50
Subject: Re: [tesseract-ocr] App to adjust imgage scaling
To: astro

Hi Nor,

Thanks for the background info and sample image.

I'm away from my machines for at least a week, online only via mobile with very short, sporadic checks, so this will have to wait unless someone would like to take a swing at it, but the sample looks good at first glance from here. There's some light grey noise in there, but tesseract's binarization step should easily take care of that (fingers crossed), so OCR is expected to succeed most of the time with this. But, as always, the truth is in the testing, so I'll have to see what a tesseract run does on my rig, for example.

If I read you correctly, you now have a high OCR success rate? (Perfect OCR is always a miracle, but better than 90% would be a good initial target to aim for. Tweaking that upwards is an *art* and I'm not an expert in that yet 😅)

Cheers,

Ger

PS: a handy next step might be to show the tesseract command line you issue from VB (plus sample image(s) and the output you get out of tesseract, good & bad): there are a couple of people on here who may suggest improvements if they spot any and have time to respond.

On Fri, 21 Jul 2023, 18:46 astro, wrote:

> Hi Ger,
> The images I'm scanning are trail camera images that have the date/time
> on the picture in the bottom corner. I'm trying to extract the date/time
> values from the image. Normally the images are 1440x1080 at 96 dpi. The
> only way I could get tesseract to read some of the time stamp was by upping
> the image size.
> I have since changed my strategy and used ImageMagick to
> crop the bottom corner of the image that contains the date/time to a 540x70
> image, leaving it at 96 dpi (see attached). That seems to work very well.
> I'm currently looking to increase the reliability by trying various things,
> including correcting the output where possible.
>
> Thanks for the reply.
>
> Cheers
> Nor
>
> On 7/21/2023 12:01 PM, Ger Hobbelt wrote:
>
> 6000*4500?!
>
> Hm, sounds way too large for a simple text.
>
> I'm guessing here, but it might be that you got thwarted by the various
> "dpi" notes re ocr/tesseract out there.
>
> Bottom line: IIRC tesseract was trained on text of around 30px high (note
> that I use PX = pixels as the relevant unit of measure; I don't care about
> dpi because that's something only really relevant to printing press people
> (desktop publishing, etc.)).
> While a lot of folks hang onto dpi as a unit of measure, it's derivative and
> only relevant when you scan printed pages, which turns "points" (and picas
> and ....) into pixels, which is where dpi pops up.
>
> Anyway, the key bit for every image you feed to an ocr engine like
> tesseract is attempting to match the "x height" vs. the training material as
> closely as possible for any attempt at a good/optimal match.
> For tesseract, this means you should aim for each line of text to be
> somewhere between 20 and 50 pixels high (and as clean looking in black &
> white / greyscale as possible, but that comes second, after getting that
> line height to the 20-50px range). Computers work in PX, not DPI, so it's PX
> that's the driving criterion.
>
> Since you mention "picking out a date" I ASSUME your text area is one line
> of text only.
>
> Drop all image areas that do not contain text.
> Make sure the text is black on a white background (you may need to invert
> your image when this is a video grab or some such, for example).
> There's a long wiki page about improving image quality for tesseract
> processing too.
> But first try to extract that line of text, scale it so the digits are
> between 20-50px high and try some sizes within that range.
>
> Second most important bit, I find, is making sure the input image has
> black text on a white background or anything greyscale/luminance-wise that
> approaches this as closely as possible. SOME tesseract modes / settings can
> cope with white text on black BG, but that's you getting rather lucky so
> don't bet on it.
>
> tesseract is *engineered* for black text on white background input images
> (paper book scans).
>
> If you need further assistance on this forum/mailing list, attach the
> image and tesseract commandline you tried; those messages get more feedback
> as they are less of a guessing game ;-)
>
> PS: third most important work item that lots of folks do wrong: when
> clipping/extracting lines of text, postprocess those line images by adding
> a nice large white=BACKGROUND COLOR boundary around the entire line.
> Personally, I favor a "border" like that of about 0.5 to 1.0 times the size
> of the line itself. The added border should be SMOOTHLY transitioning from
> the actual image background to prevent false edge detections in tesseract
> itself: this problem doesn't happen for clean paper book scans (which
> already have a plain white background) but is an important aspect when
> extracting from "busy backgrounds".
> Anyway, that topic is the size of a book all by itself, so take it slow
> and get prio 1 right first: 1 line of text to ocr = 20-50px high.
>
> Cheers,
>
> Ger
>
> On Fri, 21 Jul 2023, 13:35 astro, wrote:
>
>> Hi Ger,
>> Thanks for your response. Yes. I found ImageMagick. Looks to be very
>> powerful and easy to implement. I tried it out by upping the image to
>> 300 dpi and 6000x4500 and ran the image through the OCR process, but
>> tesseract had difficulty in picking out the date on the image. I guess I
>> will have to play around to see if I can improve things.
>>
>> Cheers
>> Nor
>>
>> On 7/21/2023 12:13 AM, Ger Hobbelt wrote:
>>
>> Check out ImageMagick, an open source image toolset. Specifically the
>> 'convert' tool: look for commandline usage and application
>> parameters/arguments, where you will find several ways to resize/rescale
>> the image.
>> Also useful to "tweak" the image as part of your ocr preprocessing
>> pipeline before your image reaches tesseract.
>>
>> Another big one would be OpenCV, but that would require you to write
>> programs (python software or similar) while ImageMagick can accomplish a
>> lot of what you want or might need and can be driven by some simple batch /
>> PowerShell / shell lines: much easier to get success that way if you're not
>> already comfortable with coding software.
>>
>> https://legacy.imagemagick.org/Usage/resize/
>> May appear overwhelming at first; read and try the various ways mentioned
>> there to get a grasp and discover what you need to do for your scenario
>> specifically. Ocr is not a simple process pipeline, so take your time with
>> it.
>>
>> On Thu, 20 Jul 2023, 15:03 nor s, wrote:
>>
>>> I'm trying to run tesseract-OCR on images that come to me at 72 DPI.
>>> The program is unable to decode these images and requires a 200 dpi or
>>> better scale to be successful. Is there a program available, similar to
>>> tesseract-OCR, that would read a command line and convert a 72 dpi image
>>> to 200 dpi or some other specified value and save it in a specified
>>> location? I'm running Windows 10.
>>> I can make these changes in Photoshop but I'm trying to automate the
>>> process since I have a lot of images to scan.
>>>
>>> Any suggestion would be greatly appreciated.
>>>
>>> Thanks
>>> Nor
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/b6075062-921e-4da9-acdf-b0364dc3c960n%40googlegroups.com
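
For anyone following along, the advice in the thread (crop to the timestamp strip, scale so the text is 20-50px high, make it black on white, add a white border, then OCR it as one line) could be sketched roughly as below. This is only a sketch under assumptions, not what anyone in the thread actually ran: the `magick` executable name (older installs use `convert`), the `SouthWest` corner, the file names, the 16px measured text height and tesseract's `--psm 7` single-line mode are all placeholders to adjust. The flat white border is a simplification of the smoothly blended border described in the thread. The script only builds the argument lists; it does not run anything.

```python
def resize_percent(text_height_px, target_px=32.0):
    """Percentage to scale by so the text ends up about target_px high."""
    if text_height_px <= 0:
        raise ValueError("text_height_px must be positive")
    if not 20 <= target_px <= 50:
        # The thread recommends keeping a text line between 20 and 50 px.
        raise ValueError("target_px should stay in the 20-50 px range")
    return round(100.0 * target_px / text_height_px)


def imagemagick_args(src, dst, text_height_px, *, exe="magick",
                     crop="540x70+0+0", gravity="SouthWest",
                     target_px=32.0, border_ratio=0.5, invert=False):
    """Build an ImageMagick argument list: crop, greyscale, resize, border.

    Assumptions (not from the thread): exe name, gravity (which corner),
    and text_height_px, which you measure yourself on a sample crop.
    """
    percent = resize_percent(text_height_px, target_px)
    # Border of 0.5 to 1.0 times the line height, as suggested in the thread.
    border = round(target_px * border_ratio)
    args = [exe, src,
            "-gravity", gravity, "-crop", crop, "+repage",
            "-colorspace", "Gray",
            "-resize", f"{percent}%"]
    if invert:
        # Only needed when the stamp is light text on a dark background.
        args.append("-negate")
    args += ["-bordercolor", "white", "-border", f"{border}x{border}", dst]
    return args


def tesseract_args(image, psm=7):
    """Build a tesseract argument list; psm 7 treats the image as one line."""
    return ["tesseract", image, "stdout", "--psm", str(psm)]


if __name__ == "__main__":
    # Hypothetical file names and a hypothetical 16 px measured text height.
    print(" ".join(imagemagick_args("IMG_0001.JPG", "stamp.png", 16)))
    print(" ".join(tesseract_args("stamp.png")))
```

The lists can be passed as-is to `subprocess.run(...)`, which avoids the `%` escaping you would need in a batch file. Trying a few `target_px` values within 20-50 and comparing the OCR output is probably the quickest way to find what works for a given camera.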