Re: [tesseract-ocr] Improve tesseract accuracy.

Ger Hobbelt Fri, 24 Feb 2023 04:09:28 -0800

:+1: Glad it works out so well for you!


Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------


On Thu, Feb 23, 2023 at 12:47 PM Alex Porter <trumpetdu...@gmail.com> wrote:

> Thanks Ger, this has been incredibly helpful! Reducing the image size for
> OCR has dramatically increased the accuracy and reliability of my output.
>
> On Wed, Feb 22, 2023 at 11:31 AM Ger Hobbelt <g...@hobbelt.com> wrote:
>
>> Re the research on line pixel height that I mentioned: it's here:
>> https://willus.com/blog.shtml?tesseract_accuracy and here:
>> https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94
>>
>> I had forgotten I got it from this mailing list!
>>
>> On Sat, Feb 18, 2023, 18:39 Ger Hobbelt <g...@hobbelt.com> wrote:
>>
>>> Hi,
>>>
>>> Had a very quick look but got sidetracked into something else, so I
>>> didn't write the tesseract test script I wanted, so this is left as an
>>> exercise for the reader. Here goes:
>>>
>>> your '4' output image is rather large for tesseract to treat it as a
>>> 'single line'.
>>>
>>> tess is known to deliver different accuracies for (*wildly*) different
>>> line sizes -- I seem to recall some research and graphs from 2019 where
>>> accuracy went down for both too small (8-10px) and *way too high* (200+px),
>>> producing a bit of a /skewed/ bathtub curve for the OCR error rate, so the
>>> idea here is to rescale your extracted number images to a suitable size
>>> before feeding them to the OCR engine.
>>>
>>> Test this remark/idea with a script:
>>>
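>>> ```
>>> import math, subprocess
>>> img = "out.png"  # the '4', e.g.
>>> h = 8
>>> while h < 500:
>>>     img2 = f"out_{h}px.png"
>>>     # use imagemagick for scaling: "x<h>" sets the height, keeps the aspect ratio
>>>     subprocess.run(["convert", img, "-resize", f"x{h}", img2], check=True)
>>>     # --psm 7 is an assumption here: treat the image as a single text line
>>>     txt = subprocess.run(["tesseract", img2, "stdout", "--psm", "7"],
>>>                          capture_output=True, text=True).stdout.strip()
>>>     print(h, repr(txt))
>>>     h = math.ceil(h * 1.1)  # = +10%
>>> ```
>>>
>>> (Python above, calling the imagemagick and tesseract command-line tools;
>>> rewrite in your favorite scripting language: bash, js, whatever)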
>>>
>>> collect the `txt` OCR results; rank them and see where your 'optimum
>>> height' lands you. Then use that for your application.
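
The "rank them" step above could be sketched roughly as follows, in Python. This is only one possible ranking, under the assumption that you know the correct reading of the test image (here the '4'): it looks for the longest stretch of consecutive sweep heights that all gave the right answer and returns the middle of that stretch, so the chosen height has some safety margin on both sides. The function name `best_height` and the sample data are made up for illustration; they are not from the thread.

```python
from typing import Dict, List, Optional


def best_height(results: Dict[int, str], expected: str) -> Optional[int]:
    """Return the middle height of the longest run of correct OCR results.

    `results` maps each tested image height (px) to the text tesseract
    returned for it; `expected` is the known-correct reading.
    Returns None when no height produced the expected text.
    """
    best_run: List[int] = []
    run: List[int] = []
    for h in sorted(results):
        if results[h].strip() == expected:
            run.append(h)
            if len(run) > len(best_run):
                best_run = list(run)
        else:
            run = []  # a wrong or empty reading breaks the streak
    if not best_run:
        return None
    return best_run[len(best_run) // 2]


if __name__ == "__main__":
    # Hypothetical sweep results, NOT measured data.
    sweep = {8: "", 9: "1", 10: "4", 11: "4", 13: "4",
             15: "4", 17: "4", 19: "A", 21: "4"}
    print(best_height(sweep, "4"))
```

Picking the middle of a stable range rather than the first correct height is a design choice: a height right at the edge of the working range is more likely to fail on the next screenshot.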
>>>
>>>
>>>
>>> Afterthought / Side thought:
>>>
>>> I see you are grabbing a computer display screen and applying OCR to it.
>>> A few thoughts pop up immediately given the source type:
>>>
>>> I see a rather organized screen, without the noisy/chaotic background you
>>> get with burned-in subtitles, for example. Food for thought.
>>>
>>> - doesn't it suffice to take the number (*digit*) images and compare
>>> them against a (created) master set, using an image similarity metric? As
>>> it's the machine rendering those numbers, they should be pretty consistent,
>>> save for some anti-aliasing or non-pixel-accurate positioning in the
>>> renderer resulting in (slightly) different pixel values / images for each
>>> digit. (Feels like tesseract is an elephant gun for this. But then I
>>> probably missed several cues and may be utterly wrong...)
>>>
>>> - in that same vein, taking it one step further: since it's output from a
>>> computer, can't we hook into the software which produces these
>>> images and get the raw digital numeric / scoreboard data from the software
>>> straight away? Iff we can, we don't have the significant overhead and data
>>> accuracy challenges that come with reversing anything using OCR: it's never
>>> 100% accurate this way. (software protections and other obstructions
>>> related to data commerce and ~ politics can keep us at a distance, where
>>> screengrabbing+OCR becomes an optimum viable solution if we want to get
>>> access to the data, but I would love to get away with less for the same (or
>>> better) result. :-S )
>>>
>>> - is it me or am I seeing more of this machine ->
>>> screengrab/scan/photograph, digitally or *analog* (phone snaps of other
>>> phones' screens) -> machine OCR data transport queries lately ('22 / '23)?
>>> Have I missed something?
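
The master-set comparison from the first bullet above could look something like the sketch below. It is a minimal illustration, not a tested solution for this scoreboard: images are plain nested lists of grayscale values (0..255), all crops are assumed to be already scaled to the same size as the master images, the metric is a simple mean absolute pixel difference, and the names `mean_abs_diff`, `classify_digit` and the `max_diff` threshold of 40 are all made up here. With OpenCV you would do the same on arrays, or try something like `cv2.matchTemplate` instead.

```python
from typing import Dict, Sequence

Image = Sequence[Sequence[int]]  # rows of grayscale pixel values, 0..255


def mean_abs_diff(a: Image, b: Image) -> float:
    """Mean absolute pixel difference; 0.0 means the images are identical."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def classify_digit(crop: Image, masters: Dict[str, Image],
                   max_diff: float = 40.0) -> str:
    """Return the label of the closest master image, or '' if none is close."""
    label, diff = min(
        ((name, mean_abs_diff(crop, ref)) for name, ref in masters.items()),
        key=lambda item: item[1],
    )
    return label if diff <= max_diff else ""


if __name__ == "__main__":
    # Toy 3x3 "glyphs", purely illustrative.
    masters = {
        "1": [[0, 255, 0], [0, 255, 0], [0, 255, 0]],
        "7": [[255, 255, 255], [0, 0, 255], [0, 0, 255]],
    }
    # A slightly anti-aliased rendering of the "1".
    crop = [[10, 240, 0], [0, 250, 5], [0, 255, 0]]
    print(classify_digit(crop, masters))
```

The threshold lets the function answer "don't know" for an empty or unexpected crop instead of forcing the nearest digit, which is easier to debug than a silently wrong number.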
>>>
>>> This looks like trade/score screens and at least the traders would have
>>> *some* incentive to provide an API for this. (When you find the related
>>> paywall insurmountable, grab+OCR is the way to go, alas, but it will always
>>> be somewhat finicky.)
>>>
>>>
>>>
>>> Met vriendelijke groeten / Best regards,
>>>
>>> Ger Hobbelt
>>>
>>>
>>> On Fri, Feb 17, 2023 at 8:08 AM Alex Porter <trumpetdu...@gmail.com>
>>> wrote:
>>>
>>>> I am currently building a Python tool to read screenshots of an
>>>> in-game scoreboard. The scoreboard looks like this: [image: ss_1.png]
>>>>
>>>> I am using OpenCV to analyse the scoreboard and can reliably slice the
>>>> image into rows and extract each value from the scoreboard, giving an
>>>> image, after processing, like this: [image: crop3.png]
>>>>
>>>> I am still having issues with tesseract accurately identifying the
>>>> numbers. Sometimes it is inaccurate (identifying the wrong number) or
>>>> gives no output at all. I have only whitelisted 0-9 when reading the
>>>> numbers. Any help on pre-processing the image to increase accuracy or any
>>>> other ideas would be much appreciated!
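
For readers who want to reproduce the digits-only setup described above, a rough sketch in Python follows. It is not the attached code, which is not part of this archive. It assumes the `tesseract` command-line tool is on the PATH, uses the `tessedit_char_whitelist` variable for the 0-9 whitelist, and assumes page segmentation mode 7 (single text line) suits the cropped values. The helper names `digits_only_cmd` and `read_number` are made up for illustration.

```python
import subprocess
from typing import List


def digits_only_cmd(image_path: str, psm: int = 7) -> List[str]:
    """Build a tesseract command line that only allows the characters 0-9."""
    return [
        "tesseract", image_path, "stdout",
        "--psm", str(psm),  # 7 = treat the image as a single text line
        "-c", "tessedit_char_whitelist=0123456789",
    ]


def read_number(image_path: str) -> str:
    """Run tesseract on one cropped value; '' means nothing was recognised."""
    result = subprocess.run(
        digits_only_cmd(image_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Print the command instead of running it, so this works without tesseract.
    print(" ".join(digits_only_cmd("crop3.png")))
```

Keeping the command construction separate from the call makes it easy to log exactly what was run when a crop comes back empty.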
>>>>
>>>> I have also attached the Python code. It's quite messy in its current
>>>> form so please forgive that if you decide to look!
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/19961d38-af02-4253-801d-4de53493cf54n%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/19961d38-af02-4253-801d-4de53493cf54n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>

