Re: [tesseract-ocr] Improve tesseract accuracy.

Alex Porter Thu, 23 Feb 2023 03:47:42 -0800

Thanks Ger, this has been incredibly helpful! Reducing the image size for
OCR has dramatically increased the accuracy and reliability of my output.
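
For anyone wanting to reproduce this, below is a minimal Python sketch of picking a working height from the sweep Ger describes in the quoted message. It assumes you already have one OCR result per height for a crop whose correct value you know. The names `sweep_heights`, `best_height` and `scaled_size` are made up for illustration, and taking the middle of the longest run of correct heights is only one plausible way to "rank" the results, not something prescribed in this thread.

```python
def sweep_heights(start=8, stop=500):
    """Yield pixel heights from start up to (not including) stop, +10% per step."""
    h = start
    while h < stop:
        yield h
        # Integer ceil(h * 1.1); avoids floating-point rounding surprises.
        h = -(-h * 11 // 10)


def best_height(results, expected):
    """Pick the middle of the longest consecutive run of correct heights.

    results: list of (height, ocr_text) pairs in sweep order.
    expected: the known-correct text for the test crop.
    Returns None if no height produced the expected text.
    """
    best_run, run = [], []
    for height, text in results:
        if text.strip() == expected:
            run.append(height)
            if len(run) > len(best_run):
                best_run = list(run)
        else:
            run = []
    return best_run[len(best_run) // 2] if best_run else None


def scaled_size(width, height, target_height):
    """Return (new_width, target_height), preserving the aspect ratio."""
    return max(1, round(width * target_height / height)), target_height
```

The actual rescale and OCR calls are left out on purpose; with Pillow and pytesseract (both assumptions about your tooling) they would be along the lines of `Image.resize(scaled_size(...))` followed by `pytesseract.image_to_string(...)`.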


On Wed, Feb 22, 2023 at 11:31 AM Ger Hobbelt <g...@hobbelt.com> wrote:

> Re the research on line pixel height that I mentioned: it's here:
> https://willus.com/blog.shtml?tesseract_accuracy and here:
> https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94
>
> I had forgotten I got it from this mailing list!
>
> On Sat, Feb 18, 2023, 18:39 Ger Hobbelt <g...@hobbelt.com> wrote:
>
>> Hi,
>>
>> Had a very quick look but got sidetracked into something else, so I
>> didn't write the tesseract test script I wanted, so TILAAEFTR. Here goes:
>>
>> your '4' output image is rather large for tesseract to treat it as a
>> 'single line'.
>>
>> tess is known to deliver different accuracies for (*wildly*) different
>> line sizes -- I seem to recall some research and graphs from 2019 where
>> accuracy went down for both too small (8-10px) and *way too high* (200+px),
>> producing a bit of a /skewed/ bathtub curve for the OCR error rate. So the
>> idea here is to rescale your extracted number images to a suitable size
>> before feeding them to the OCR engine.
>>
>> Test this remark/idea with a script:
>>
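>> ```
>> import math, subprocess
>> from PIL import Image  # assumes Pillow is installed
>> img = Image.open('out.png')  # the '4', f.e.
>> h = 8
>> while h < 500:
>>     w = max(1, round(img.width * h / img.height))  # keep aspect ratio
>>     img.resize((w, h)).save('img2.png')  # rescale to height h (px)
>>     txt = subprocess.run(['tesseract', 'img2.png', 'stdout', '--psm', '7'],
>>                          capture_output=True, text=True).stdout.strip()
>>     print(h, repr(txt))
>>     h = math.ceil(h * 1.1)  # +10%
>> ```
>>
>> (Python sketch above, assuming Pillow for the scaling and the tesseract CLI
>> on your PATH; `--psm 7` treats the image as a single text line. Port it to
>> your favorite scripting language: bash, js, whatever)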
>>
>> collect the `txt` OCR results; rank them and see where your 'optimum
>> height' lands you. Then use that for your application.
>>
>>
>>
>> Afterthought / Side thought:
>>
>> I see you are grabbing a computer display screen and applying OCR to it.
>> A few thoughts pop up immediately given the source type:
>>
>> I see a rather organized screen, no noisy/chaotic background you get with
>> burned-in subtitles, for example. Food for thought.
>>
>> - doesn't it suffice to take the number (*digit*) images and compare them
>> against a (created) master set, using an image similarity metric? As it's
>> the machine rendering those numbers, they should be pretty consistent, save
>> for some anti-aliasing or non-pixel-accurate positioning in the renderer
>> resulting in (slightly) different pixel values / images for each digit.
>> (Feels like tesseract is an elephant gun for this. But then I probably
>> missed several cues and may be utterly wrong...)
>>
>> - in that same vein, taking it one step further: since it's output from a
>> computer, can't we hook into the software which produces these
>> images and get the raw digital numeric / scoreboard data from the software
>> straight away? Iff we can, we don't have the significant overhead and data
>> accuracy challenges that come with reversing anything using OCR: it's never
>> 100% accurate this way. (software protections and other obstructions
>> related to data commerce and ~ politics can keep us at a distance, where
>> screengrabbing+OCR becomes an optimum viable solution if we want to get
>> access to the data, but I would love to get away with less for the same (or
>> better) result. :-S )
>>
>> - is it me or am I seeing more of this machine ->
>> screengrab/scan/photograph, digitally or *analog* (phone snaps of other
>> phones' screens) -> machine OCR data transport queries lately ('22 / '23)?
>> Have I missed something?
>>
>> This looks like trade/score screens and at least the traders would have
>> *some* incentive to provide an API for this. (When you find the related
>> paywall insurmountable, grab+OCR is the way to go, alas, but it will always
>> be somewhat finicky.)
>>
>>
>>
>> Met vriendelijke groeten / Best regards,
>>
>> Ger Hobbelt
>>
>> --------------------------------------------------
>> web:    http://www.hobbelt.com/
>>         http://www.hebbut.net/
>> mail:   g...@hobbelt.com
>> mobile: +31-6-11 120 978
>> --------------------------------------------------
>>
>>
>> On Fri, Feb 17, 2023 at 8:08 AM Alex Porter <trumpetdu...@gmail.com>
>> wrote:
>>
>>> I am currently building a Python tool to read screenshots of an
>>> in-game scoreboard. The scoreboard looks like this:[image: ss_1.png]
>>>
>>> I am using OpenCV to analyse the scoreboard and can reliably slice the
>>> image into rows and extract each value from the scoreboard, giving an image,
>>> after processing, like this:[image: crop3.png]
>>>
>>> I am still having issues with tesseract accurately identifying the
>>> numbers. Sometimes it is inaccurate (identifying the wrong number) or
>>> gives no output at all. I have only whitelisted 0-9 when reading the
>>> numbers. Any help on pre-processing the image to increase accuracy or any
>>> other ideas would be much appreciated!
>>>
>>> I have also attached the Python code. It's quite messy in its current
>>> form so please forgive that if you decide to look!
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/19961d38-af02-4253-801d-4de53493cf54n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/19961d38-af02-4253-801d-4de53493cf54n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>

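The master-set comparison suggested in the quoted message can be sketched roughly as follows. This is a toy version under stated assumptions: digit crops are already the same size as the master images and are represented as plain 2D lists of grayscale values (0-255), and mean absolute pixel difference stands in for "an image similarity metric"; the thread does not name a specific metric. `mean_abs_diff` and `classify_digit` are hypothetical helper names.

```python
def mean_abs_diff(a, b):
    """Mean absolute per-pixel difference between two equally sized, non-empty grids."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def classify_digit(crop, masters):
    """Return (label, distance) for the master image closest to crop.

    masters: dict mapping a label such as "4" to its reference grid.
    """
    label = min(masters, key=lambda k: mean_abs_diff(crop, masters[k]))
    return label, mean_abs_diff(crop, masters[label])
```

A small distance means the crop is nearly pixel-identical to that master; in practice you would probably want a rejection threshold so that unknown glyphs are not forced onto the nearest digit.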
