Thanks Ger, this has been incredibly helpful! Reducing the image size for
OCR has dramatically increased the accuracy and reliability of my output.
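
For anyone landing on this thread later, here is a rough Python sketch of the
height sweep Ger describes in the quoted message below. It assumes Pillow and
pytesseract are installed. The function names, the `--psm 7` (single text
line) mode and the digit whitelist are illustrative choices, not something
taken from the original script, and `out.png` is just the placeholder name
from Ger's pseudocode.

```python
def height_schedule(start=8, stop=500):
    """Yield pixel heights from `start` up to (not including) `stop`, +10% per step."""
    if start < 1:
        raise ValueError("start must be at least 1")
    h = start
    while h < stop:
        yield h
        h = (h * 11 + 9) // 10  # integer form of ceil(h * 1.1)


def sweep(path, config="--psm 7 -c tessedit_char_whitelist=0123456789"):
    """OCR one cropped image at every height in the schedule; return {height: text}."""
    # Imported here so the pure-Python helpers work without Pillow/pytesseract.
    from PIL import Image
    import pytesseract

    img = Image.open(path)
    results = {}
    for h in height_schedule():
        w = max(1, round(img.width * h / img.height))  # keep the aspect ratio
        scaled = img.resize((w, h), Image.LANCZOS)
        results[h] = pytesseract.image_to_string(scaled, config=config).strip()
    return results


def correct_heights(results, expected):
    """Heights (ascending) at which the OCR text matched the known value."""
    return [h for h, txt in sorted(results.items()) if txt == expected]
```

Something like `correct_heights(sweep("out.png"), "4")` would then list the
heights at which the known digit was read correctly. Running it over several
crops and picking a height that works for all of them is probably more
robust than trusting a single image.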

On Wed, Feb 22, 2023 at 11:31 AM Ger Hobbelt <g...@hobbelt.com> wrote:

> Re the line pixel height research I recalled earlier: it's here:
> https://willus.com/blog.shtml?tesseract_accuracy and here:
> https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94
>
> I had forgotten I got it from this mailing list!
>
> On Sat, Feb 18, 2023, 18:39 Ger Hobbelt <g...@hobbelt.com> wrote:
>
>> Hi,
>>
>> Had a very quick look but got sidetracked into something else, so I
>> didn't write the tesseract test script I wanted, so TILAAEFTR (this is
>> left as an exercise for the reader). Here goes:
>>
>> your '4' output image is rather large for tesseract to treat it as a
>> 'single line'.
>>
>> tess is known to deliver different accuracies for (*wildly*) different
>> line sizes -- I seem to recall some research and graphs from 2019 where
>> accuracy went down for both too small (8-10px) and *way too high* (200+px),
>> producing a bit of a /skewed/ bathtub curve for the OCR error rate, so the
>> idea here is to rescale your extracted number images to a suitable size
>> before feeding them to the OCR engine.
>>
>> Test this remark/idea with a script:
>>
>> ```
>> let img = 'out.png'  // the '4', f.e.
>> for (let h = 8; h < 500; h = ceil( h * 1.1 /* = +10% */ )) {
>>   /* use imagemagick for scaling, f.e.? */
>>   rescale(img, height: h, unit: 'px') -> img2
>>   tesseract(img2) -> txt
>> }
>> ```
>>
>> (pseudocode above; write in your favorite scripting language: bash, js,
>> python, whatever)
>>
>> collect the `txt` OCR results; rank them and see where your 'optimum
>> height' lands you. Then use that for your application.
>>
>>
>>
>> Afterthought / Side thought:
>>
>> I see you are grabbing a computer display screen and applying OCR to it.
>> A few thoughts pop up immediately given the source type:
>>
>> I see a rather organized screen, no noisy/chaotic background you get with
>> burned-in subtitles, for example. Food for thought.
>>
>> - doesn't it suffice to take the number (*digit*) images and compare them
>> against a (created) master set, using an image similarity metric? As it's
>> the machine rendering those numbers, they should be pretty consistent, save
>> for some anti-aliasing or non-pixel-accurate positioning in the renderer
>> resulting in (slightly) different pixel values / images for each digit.
>> (Feels like tesseract is an elephant gun for this. But then I probably
>> missed several cues and may be utterly wrong...)
>>
>> - in that same vein, taking it one step further: since it's output from a
>> computer, can't we hook into the software which produces these
>> images and get the raw digital numeric / scoreboard data from the software
>> straight away? Iff we can, we don't have the significant overhead and data
>> accuracy challenges that come with reversing anything using OCR: it's never
>> 100% accurate this way. (Software protections and other obstructions
>> related to data commerce and ~ politics can keep us at a distance, where
>> screengrabbing+OCR becomes an optimum viable solution if we want to get
>> access to the data, but I would love to get away with less for the same (or
>> better) result. :-S )
>>
>> - is it me or am I seeing more of this machine ->
>> screengrab/scan/photograph, digitally or *analog* (phone snaps of other
>> phones' screens) -> machine OCR data transport queries lately ('22 / '23)?
>> Have I missed something?
>>
>> This looks like trade/score screens and at least the traders would have
>> *some* incentive to provide an API for this. (When you find the related
>> paywall insurmountable, grab+OCR is the way to go, alas, but it will always
>> be somewhat finicky.)
>>
>>
>>
>> Met vriendelijke groeten / Best regards,
>>
>> Ger Hobbelt
>>
>> --------------------------------------------------
>> web:    http://www.hobbelt.com/
>>         http://www.hebbut.net/
>> mail:   g...@hobbelt.com
>> mobile: +31-6-11 120 978
>> --------------------------------------------------
>>
>>
>> On Fri, Feb 17, 2023 at 8:08 AM Alex Porter <trumpetdu...@gmail.com>
>> wrote:
>>
>>> I am currently building a Python tool to read screenshots of an
>>> in-game scoreboard. The scoreboard looks like this: [image: ss_1.png]
>>>
>>> I am using OpenCV to analyse the scoreboard and can reliably slice the
>>> image into rows and extract each value from the scoreboard, giving an image,
>>> after processing, like this: [image: crop3.png]
>>>
>>> I am still having issues with tesseract accurately identifying the
>>> numbers. Sometimes it is inaccurate (identifying the wrong number) or gives
>>> no output at all. I have only whitelisted 0-9 when reading the
>>> numbers. Any help on pre-processing the image to increase accuracy or any
>>> other ideas would be much appreciated!
>>>
>>> I have also attached the Python code. It's quite messy in its current
>>> form so please forgive that if you decide to look!
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/19961d38-af02-4253-801d-4de53493cf54n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/19961d38-af02-4253-801d-4de53493cf54n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>
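
On Ger's first side thought above (comparing each digit crop against a
master set instead of running OCR), a minimal sketch of the idea could look
like the following. It is pure Python to stay self-contained: images are
plain lists of grayscale rows, and mean absolute difference stands in for
"an image similarity metric". The names `mean_abs_diff` and `classify_digit`
and the `max_diff` threshold are made up for illustration. This is a sketch
under those assumptions, not a tested solution for the scoreboard; with real
crops one would more likely load arrays and use something like OpenCV's
`cv2.matchTemplate`.

```python
def mean_abs_diff(a, b):
    """Mean absolute pixel difference between two equally sized grayscale grids."""
    if not a or not a[0]:
        raise ValueError("images must not be empty")
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def classify_digit(img, masters, max_diff=40.0):
    """Return the label of the closest master image, or None if none is close enough.

    `masters` maps a label such as "4" to a reference image of the same size.
    `max_diff` is an arbitrary cut-off on a 0-255 scale and would need tuning.
    """
    label, diff = min(
        ((lbl, mean_abs_diff(img, ref)) for lbl, ref in masters.items()),
        key=lambda pair: pair[1],
    )
    return label if diff <= max_diff else None


if __name__ == "__main__":
    # Toy 3x3 "glyphs"; real masters would be crops of each rendered digit.
    one = [[0, 255, 0], [0, 255, 0], [0, 255, 0]]
    seven = [[255, 255, 255], [0, 0, 255], [0, 0, 255]]
    slightly_blurred_one = [[10, 240, 0], [0, 250, 5], [0, 255, 0]]
    print(classify_digit(slightly_blurred_one, {"1": one, "7": seven}))
```

Returning `None` when nothing is close enough gives a natural point to fall
back to tesseract for crops the master set does not cover.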

