Re "X" checkbox:

Since this is (I assume) a standardized form, those checkboxes are at
known, fixed positions.

Couple of thoughts:

1: Assuming everyone "crosses" a checkbox is a faulty assumption. Some
people, depending on circumstances, "blacken" the box in other ways, all
legal and to be expected:
- a slash with a pen, sometimes a fat one: marked = checked.
- an arbitrary squiggle to "fill" the box more or less; when I observe
people, I often see shapes like a flattened S or a Greek Xi, but expect
circles (filled and unfilled) and really any other shape that doesn't take
too much effort to put plenty of ink onto an area.
- rare, but it happens: fully blackened. Think:
bored/upset/angry/tic/OCD/autism/...   Talk to people who process (paper)
voting forms if you want to research it; that would be my first stop
anyway. (The Dutch use paper voting forms, which, by law, must be inspected
by humans, so you will find knowledge and observations there that a
cheaper, less quality-oriented, machine process won't ever give you. Find
out who volunteers for voting committee duty and take it from there.)

Bottom line: observe what humans do, and consider what they might do,
before taking an example given to you by your client or boss as "it".
Your success here depends on the supplier: you'll have to train them too ;-)


2: Given that a checkbox is not a letter/word field but an "inked or not
inked?" field, feeding it to tesseract is both overkill and adverse to
success. Better to apply those image filters and then count the number of
black pixels in the (known) area: ANYTHING (any ink blot) above a certain
(low) threshold signifies "checked". The threshold is low but not zero, to
account for dirt, coffee stains and other real-world mishaps that can ruin
a scan; the exact value is to be determined in a field test during
development, I would say.
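
To make that concrete, here is a minimal sketch in C# (to match your
System.Drawing / AForge setup). The rectangle, the darkness cutoff and the
ink-fraction threshold are placeholders to be tuned in that field test, not
values taken from your form:

using System.Drawing;

static class CheckboxDetector
{
    // True when the fraction of "dark" pixels inside the known checkbox
    // rectangle exceeds a small, non-zero threshold.
    public static bool IsChecked(Bitmap image, Rectangle box,
                                 int darknessCutoff = 128,
                                 double inkFractionThreshold = 0.02)
    {
        int darkPixels = 0;
        for (int y = box.Top; y < box.Bottom; y++)
        {
            for (int x = box.Left; x < box.Right; x++)
            {
                Color c = image.GetPixel(x, y);
                // Simple luma approximation; whatever binarization you
                // already apply before OCR works just as well here.
                int luma = (c.R * 299 + c.G * 587 + c.B * 114) / 1000;
                if (luma < darknessCutoff)
                    darkPixels++;
            }
        }
        double inkFraction = (double)darkPixels / (box.Width * box.Height);
        return inkFraction > inkFractionThreshold;
    }
}

You would call it once per checkbox with that box's known rectangle, after
the same colour filtering you already apply for the text fields.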

Ruin a few extra forms with partial cola or coffee baths and include them
in your test sets if you care about input range / quality. For dirt, which
will cause incorrect machine conclusions if you don't sensibly
filter+threshold, drop a few forms on the wet earth outside and tread on
them. (If I were in your spot, then yes, I would expect forms with boot
tread marks across them as part of the feed once you get this "into
production", a.k.a. "when we went live". Drop, abuse, dry, scan.)

People carry forms. Shit happens on the way to you. And they will gladly
entertain the thought of lynching when your "software" gives them no meds
or the wrong stuff.
Bottom line: your "minimum acceptable output quality" strongly depends on
where you sit within (or outside) the handing-out-meds primary process.


Anyhow: checkboxes, from my perspective, don't need heavy, CPU-loading,
power-burning AI solutions. Image filtering, however, is a must:
preprocessing FTW! ;-)


-----

About the text fields:

I haven't tested your images, but I expect medium-grade success rates; as I
wrote in another conversation on this mailing list a few weeks ago,
tesseract is engineered to "read" books, academic papers, etc.
That also means the specific JARGON of medical prescription forms does not
match that world view to a tee. Hence you will need further work (dedicated
model training) for any OCR engine (tesseract, TrOCR, etc.) you apply.

The JARGON I see falls into at least two overlapping categories:

1: medical brand names, chemicals, etc.: commercial recognizers for speech
and writing have dedicated medical models, and that's what you pay for. Big
bucks, as it's a lot of specialized effort. From what I saw in speech
recognition, those models are, where possible, segregated per medical
field, because the vocabularies are huge: when you want to recognize
psychiatry jargon, then anaesthesia jargon and all the rest is just
horribly complicating NOISE that makes recognition that much harder. So you
get rid of it at the earliest opportunity -- in this case the workflow
design and system criteria phase.

Takeaway/lesson: for top quality you must investigate the incoming
"language" and produce and train a tailored model. A larger language means
a fast-increasing work/cost estimate. (Human work for creating and training
the model; then the same for the machine, as the machine model will be
sized accordingly.)


2: part of the JARGON, or "the language spoken here" if you will, is the
shorthand "words", e.g. "3x", "70mg", "w.food" (to be taken together with
some food), etc., etc. I expect a large and possibly *inconsistent* set of
shorthands, as those will surely differ per human author.

Those shorthands are specific to your input language (think of "language"
as "anything that can be written here and is to be understood by the
recipient: apothecary and/or client", not high school language ed. --
another unrelated jargon/shorthand bit right there, btw ;-) ) and will be
harder to recognize, as they have not featured, or only featured
lightly/sparingly, in the tesseract training sets AFAICT. Which is the
reason I expect only *medium-grade* recognition result quality out of the
box, for tesseract or anything else you grab off the shelf.
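
Whatever engine you end up with, a post-processing pass that maps those
shorthands onto something unambiguous will likely pay off. A minimal
sketch, assuming you run it over the recognized text of a dosage field; the
three patterns are just the examples above, the real table has to come from
your own authors (and doubles as documentation of "the language spoken
here"):

using System.Text.RegularExpressions;

static class DosageNormalizer
{
    public static string Normalize(string ocrText)
    {
        string s = ocrText.Trim();

        // "3x" / "3 x" -> "3 times daily" (assumption: a per-day frequency)
        s = Regex.Replace(s, @"\b(\d+)\s*[xX]\b", "$1 times daily");

        // "70mg" -> "70 mg"
        s = Regex.Replace(s, @"\b(\d+)\s*mg\b", "$1 mg",
                          RegexOptions.IgnoreCase);

        // "w.food" / "w food" -> "with food"
        s = Regex.Replace(s, @"\bw\.?\s*food\b", "with food",
                          RegexOptions.IgnoreCase);

        return s;
    }
}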



Last thought: if your input always has the same "look" -- same font and
same brand of dot matrix printer serving as the "physical renderer" -- then
you might want to consider looking into TrOCR or other alternatives, as
those are (I suppose) possibly easier to train than a widely generic
LSTM+CTC engine such as tesseract. But that's a wild guess, as I haven't
done this myself for your scenario or a similar enough one.


Another idea there (wild, as in: untested to date as far as I know) is to
run this through TWO disparate OCR engines, say tesseract and something
like TrOCR, have both output hOCR format or similar, i.e. the full gamut of
content + scores + page pixel coordinates, and then feed those into a judge
(using a NN or whatever is found to work for your scenario) as part of your
post-processing phase, picking "the best / most agreeable of both worlds"
per word or field processed, driven by the word/character scores from each.
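
For the simplest possible version of that judge -- no NN yet, just "trust
whichever engine was more confident per word" -- a rough sketch. The
OcrWord type is hypothetical and stands in for whatever you parse out of
the two hOCR outputs, and real input would first have to be aligned by page
coordinates, which is glossed over here:

using System;
using System.Collections.Generic;

record OcrWord(string Text, double Confidence);

static class OcrJudge
{
    // Per aligned word pair, keep whichever engine reported the higher score.
    public static List<string> PickBest(IList<OcrWord> engineA,
                                        IList<OcrWord> engineB)
    {
        var best = new List<string>();
        int n = Math.Min(engineA.Count, engineB.Count);
        for (int i = 0; i < n; i++)
        {
            best.Add(engineA[i].Confidence >= engineB[i].Confidence
                     ? engineA[i].Text
                     : engineB[i].Text);
        }
        return best;
    }
}

Anything smarter (per-character voting, a small NN over the score vectors)
bolts onto the same shape.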



*Ergo / Mgt. summary*: application of tesseract and any and all
preprocessing and postprocessing is highly dependent on your place in the
overarching primary processes, and highly dependent on how and when it
impacts any humans, patients/clients in particular (the ethics board and
the DoJ may want a word one day, perhaps). Hence any and all
thoughts/ideas or other work, effort and musings not closely involved with
your project -- and not bound by a mutually agreed, written and signed
contract -- are, at best, to be rated as conjecture and sans merit, subject
to all disclaimers of liability of any form and any kind: YMMV.

My thoughts, HTH,

Ger

PS: in this story, I assume the proper placement (a.k.a. bracketing) of the
form itself, i.e. the image preprocessing BEFORE segmentation and OCR image
preprocessing, is *already solved*, so there won't be any doubt about the
position and size of each form field, checkbox or otherwise, down to the
pixel-coordinate level.



On Thu, 15 Feb 2024, 10:22 'Mert T' via tesseract-ocr, <
tesseract-ocr@googlegroups.com> wrote:

> Any ideas?
>
> Mert T schrieb am Donnerstag, 8. Februar 2024 um 17:16:16 UTC+1:
>
>> Hello,
>>
>> I'm new to Tesseract and have the problem that the text recognition has
>> many errors. What I'm doing is scanning a prescription in German, and I
>> want to show only certain areas.
>> So I created certain areas (marked in blue) as new Bitmaps and used them
>> in the ProcessImage method. I edited the Bitmap with AForge to get rid of
>> the red text and make the gray text darker (screenshot). The 'X' is not
>> recognized; if any letter is recognized, the checkbox should be checked.
>> I tried to get better results with a better scan quality (600 dpi), but I
>> got the best results with 150 dpi.
>> Tesseract has many functionalities; I tried some of them, but I don't know
>> how to use them well to solve my problems. Could someone help me out?
>>
>> Thanks.
>>
>> Here my Code:
>>
>> public string ProcessImage(Bitmap image)
>> {
>>     image = RemovePinkTextAndMakeGrayTextDarker(image);
>>
>>     using var engine = new TesseractEngine("./tessdata", "deu",
>> EngineMode.Default);
>>     using var img = PixConverter.ToPix(image);
>>     using var page = engine.Process(img, PageSegMode.AutoOsd);
>>     return page.GetText();
>> }
>>
>> private Bitmap RemovePinkTextAndMakeGrayTextDarker(Bitmap image)
>> {
>>     var filter = new EuclideanColorFiltering
>>     {
>>         CenterColor = new RGB(Color.HotPink),
>>         Radius = 80,
>>         FillColor = new RGB(Color.White),
>>         FillOutside = false
>>     };
>>     filter.ApplyInPlace(image);
>>
>>     var filter3 = new EuclideanColorFiltering
>>     {
>>         CenterColor = new RGB(Color.DarkGray),
>>         Radius = 80,
>>         FillColor = new RGB(Color.Black),
>>         FillOutside = false
>>     };
>>     filter3.ApplyInPlace(image);
>>
>>     return image;
>> }
>>
>> [image: 150 scan.png]
>>
>> [image: Screenshot marked.png]
>>
>> [image: Scanarea.png]
>>
>>
