Re "X" checkbox: Since this is a (I assume) standardized form, those checkboxes are at known, fixed, positions.
Couple of thoughts:

1: assuming everyone "crosses" a checkbox is a faulty assumption. Some people, depending on circumstances, "blacken" the box in other ways, all legal and to be expected:
- a slash with a pen, sometimes a fat one: marked = checked.
- an arbitrary squiggle to "fill" the box more or less; when I observe people, I often see shapes like a flattened S or a Greek Xi, but expect circles (filled and not filled) and really any other shape that is not too much effort to put plenty of ink onto an area.
- rare, but it happens: fully blackened. Think: bored/upset/angry/tic/OCD/autism/...

Talk to people who process (paper) voting forms if you want to research it; that would be my first stop anyway. (The Dutch use paper voting forms, which, by law, must be inspected by humans, so you will find knowledge and observations there that a cheaper, less quality-oriented, machine process won't ever give you. Find out who volunteers for voting committee duty and take it from there.)

Bottom line: consider what humans do, observe and consider what they might do, before taking an example given to you by your client or boss as "it" -- your success here depends on the supplier: you'll have to train them too ;-)

2: given that a checkbox is not a letter/word field but an inked / not-inked field, feeding it to tesseract is both overkill and adverse to success. Better to apply those image filters and count the number of black pixels in the (known) area: ANYTHING (any ink blot) above a certain (low) threshold signifies "checked"; the threshold is low but not zero, to account for dirt, coffee stains and other real-world mishaps that can ruin a scan. To be determined in a field test during development, I would say. Ruin a few extra forms with partial cola or coffee baths and include them in your test sets if you care about input range / quality. For dirt, which will cause incorrect machine conclusions if you don't sensibly filter + threshold, drop a few forms on the wet earth outside and tread on them. (If I were in your spot, then yes, I would expect forms with boot tread marks across them as part of the feed once you get this "into production", aka "when we went live". Drop, abuse, dry, scan. People carry forms. Shit happens on the way to you. And they will gladly entertain the thought of lynching when your "software" gives them no meds or the wrong stuff.)

Bottom line: your "minimum acceptable output quality" is strongly dependent on where you are within (or outside of) the handing-out-meds primary process.

Anyhow: checkboxes, from my perspective, don't need heavy CPU-loading, power-burning AI solutions. Image filtering, however, is a must: preprocessing FTW! ;-) A rough sketch of the pixel-counting idea follows right below.
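For illustration only, an untested sketch (C# + System.Drawing, since that is what you already use). The box rectangle and both thresholds are made-up numbers you would calibrate in that field test, and it assumes the page has already been deskewed and binarized (e.g. grayscale + threshold):

using System.Drawing;

static class CheckboxReader
{
    // Untested sketch: decide "checked" by counting dark pixels inside the known,
    // fixed checkbox area of an already deskewed, binarized (black/white) scan.
    // Rectangle coordinates and thresholds are hypothetical; tune them on real forms.
    public static bool IsCheckboxMarked(Bitmap binarizedPage)
    {
        // Known, fixed position of the checkbox on the standardized form (example values).
        var box = new Rectangle(812, 1040, 48, 48);

        int darkPixels = 0;
        for (int y = box.Top; y < box.Bottom; y++)
        {
            for (int x = box.Left; x < box.Right; x++)
            {
                // After binarization, anything "inked" is near black (low brightness).
                if (binarizedPage.GetPixel(x, y).GetBrightness() < 0.5f)
                    darkPixels++;
            }
        }

        // Low but non-zero threshold: specks of dirt don't count as a mark,
        // while a slash, a squiggle or a fully blackened box clears it easily.
        double inkedFraction = (double)darkPixels / (box.Width * box.Height);
        return inkedFraction > 0.08; // ~8% inked -- a guess, to be determined in testing
    }
}

(GetPixel is slow; for production you would switch to LockBits or AForge's ImageStatistics, but the principle stays the same.)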
-----

About the text fields: I haven't tested your images, but I expect medium-grade success rates. As I wrote in another conversation on this mailing list a few weeks ago, tesseract is engineered to "read" books, academic papers, etc. That also means the specific JARGON of medicine prescription forms does not match that world view to a tee. Hence you will need further work (dedicated model training) for any OCR engine (tesseract, trocr, etc.) you apply.

The JARGON I see has at least two joint categories:

1: medical brand names, chemicals, etc. Commercial recognizers for speech and writing have dedicated medical models, and that's what you pay for. Big bucks, as it's a lot of specialized effort, and from what I saw in speech recognition it comes, where possible, segregated per medical field, because the vocabularies are huge: when you want to recognize psychiatry jargon, then anaesthesia jargon and all the rest is just horribly complicating NOISE, making recognition all that much harder. So you get rid of it at the earliest opportunity -- in this case the workflow design and system criteria phase. Take-away / lesson: for top quality you must investigate the incoming "language" and produce and train a tailored model. A larger language means a fast-increasing work cost estimate (human work for creating and training the model; then the same for the machine, as the machine model will be sized accordingly).

2: part of the JARGON, or "the language spoken here" if you will, are the shorthand "words", e.g. "3x", "70mg", "w.food" (to be taken together with some food), etc. etc. I expect a large and possibly *inconsistent* set of shorthands, as those will surely differ per human author. Those shorthands are specific to your input language (think of "language" as "anything that can be written here and is to be understood by the recipient: apothecary and/or client", not high-school language ed. -- another unrelated jargon/shorthand bit right there, btw ;-) ) and will be harder to recognize, as they have not featured, or have featured only lightly/sparingly, in the tesseract training sets AFAICT. Which is the reason I expect only *medium-grade* recognition result quality out of the box, for tesseract or anything else you grab off the shelf.

Last thought: if your input is always done in the same "look" -- same font and same brand of dot matrix printer serving as the "physical renderer" -- then you might want to consider looking into trocr or other alternatives, as those are (I suppose) possibly easier to train than a widely generic LSTM+CTC engine such as tesseract. But that's a wild guess, as I haven't done this myself for your or a similar enough scenario.

Another idea there (wild, as in: untested to date as far as I know) is to run this through TWO disparate OCR engines, say tesseract and something like trocr, have both output hOCR format or similar, i.e. the full gamut of content + scores + page pixel coordinates, and then feed those into a judge (using a NN or whatever is found to work for your scenario) as part of your post-processing phase, picking "the best / most agreeable of both worlds" per word or field processed, driven by the word/character scores from each. (A tiny, untested sketch of such a per-word judge is in the PPS below.)

*Ergo / Mgt. Summary*: application of tesseract and any and all preprocessing and postprocessing is highly dependent on your place in the overarching primary processes, and highly dependent on how and when it impacts any humans, patients/clients in particular (the ethics board and the DoJ may want a word one day, perhaps). Hence any and all thoughts/ideas or other work, effort and musings not closely involved with your project -- and not bound by a mutually agreed, written and signed contract -- are, at best, to be rated as conjecture and sans merit / subject to all disclaimers of liability of any form and any kind: YMMV.

My thoughts, HTH,

Ger

PS: in this story, I assume the proper placement (aka bracketing) of the form itself, i.e. the image preprocessing BEFORE segmentation and OCR image preprocessing, as *already solved*, so there won't be any doubt about the position and size of each form field, checkbox or otherwise, down to pixel coordinate level.
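PPS: a very rough, untested sketch of that two-engine "judge", again in C# to match your code. The WordHypothesis type and the score-based pick are invented for illustration; the real thing would parse the hOCR (word text + confidence + bounding box) coming out of each engine and align words by their page coordinates first:

using System;
using System.Collections.Generic;

// Hypothetical per-word judge: for each word position, keep whichever of the
// two engine hypotheses was scored higher by its own engine.
public record WordHypothesis(string Text, float Confidence);

public static class OcrJudge
{
    public static List<string> PickBestWords(
        IReadOnlyList<WordHypothesis> engineA,  // e.g. parsed tesseract hOCR for one field
        IReadOnlyList<WordHypothesis> engineB)  // e.g. parsed trocr output for the same field
    {
        var result = new List<string>();

        // Naive: assumes both engines produced the same number of words in the same order.
        // A real judge would first align words via their bounding-box overlap.
        int count = Math.Min(engineA.Count, engineB.Count);

        for (int i = 0; i < count; i++)
        {
            result.Add(engineA[i].Confidence >= engineB[i].Confidence
                ? engineA[i].Text
                : engineB[i].Text);
        }
        return result;
    }
}

A trained NN judge would of course look at more than the raw scores (the field's expected vocabulary, character shapes, ...), but even this dumb version shows where the per-word scores from both engines plug in.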
On Thu, 15 Feb 2024, 10:22 'Mert T' via tesseract-ocr, <tesseract-ocr@googlegroups.com> wrote:

> Any ideas?
>
> Mert T schrieb am Donnerstag, 8. Februar 2024 um 17:16:16 UTC+1:
>
>> Hello,
>>
>> I'm new to Tesseract and have the problem that the text recognition has
>> many errors. What I'm doing is scanning a prescription in German, and I
>> want to show only certain areas.
>> So I created certain areas (marked in blue) as new Bitmaps and used them
>> in the ProcessImage method. I edited the Bitmap with AForge to get rid of
>> the red text and make the gray text darker (Screenshot). The 'X' is not
>> recognized. If any letter is recognized, the checkbox should be checked.
>> I tried to get better results with a better scan quality (600 dpi), but I
>> got the best results with 150 dpi.
>> Tesseract has many functionalities, I tried some of them but I don't know
>> how to use them well to solve my problems. Could someone help me out?
>>
>> Thanks.
>>
>> Here is my code:
>>
>> public string ProcessImage(Bitmap image)
>> {
>>     image = RemovePinkTextAndMakeGrayTextDarker(image);
>>
>>     using var engine = new TesseractEngine("./tessdata", "deu", EngineMode.Default);
>>     using var img = PixConverter.ToPix(image);
>>     using var page = engine.Process(image, PageSegMode.AutoOsd);
>>     return page.GetText();
>> }
>>
>> private Bitmap RemovePinkTextAndMakeGrayTextDarker(Bitmap image)
>> {
>>     var filter = new EuclideanColorFiltering
>>     {
>>         CenterColor = new RGB(Color.HotPink),
>>         Radius = 80,
>>         FillColor = new RGB(Color.White),
>>         FillOutside = false
>>     };
>>     filter.ApplyInPlace(image);
>>
>>     var filter3 = new EuclideanColorFiltering
>>     {
>>         CenterColor = new RGB(Color.DarkGray),
>>         Radius = 80,
>>         FillColor = new RGB(Color.Black),
>>         FillOutside = false
>>     };
>>     filter3.ApplyInPlace(image);
>>
>>     return image;
>> }
>>
>> [image: 150 scan.png]
>>
>> [image: Screenshot marked.png]
>>
>> [image: Scanarea.png]