Re: [tesseract-ocr] Incorrect OCR of 4-digit number

Orsey Aehr Mon, 28 Feb 2022 05:01:50 -0800

I found that this PR reduced errors by around 75% in my case: 
https://github.com/tesseract-ocr/tesseract/pull/3476


On Monday, 28 February 2022 at 02:06:06 UTC+9 zdenop wrote:

> my 2 cents:
>
> First of all, create a public test case/repository focused on this 
> problem, e.g. different font families, font sizes, short text (like 
> 0swZuoU.png), long text, etc. This could be used for finding problems/bugs, 
> evaluating possible solutions, and maybe (re)training, so synthetic data 
> imitating real-world cases is fine. 
> I would suggest focusing on the most common fonts used on different 
> platforms, e.g. on Windows: Arial, Times New Roman, Courier New, Calibri, 
> Cambria, Consolas, Segoe UI; on Linux probably DejaVu, Liberation, Ubuntu; 
> not sure about Mac and iOS ;-)
> I would suggest using a column or paragraph style for the input image (e.g. to 
> avoid problems with document layout analysis such as tables, headers, footers...)
>
> Zdenko
>
>
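One way to lay out the test repository suggested above is to enumerate every font/size/text-length combination up front, so each synthetic image gets a predictable name. A minimal sketch follows; the font list comes from the message above, while the point sizes, the two text categories, and the file-naming scheme are assumptions chosen only for illustration.

```python
from itertools import product

# Fonts named in the message above, grouped by platform.
FONTS = {
    "windows": ["Arial", "Times New Roman", "Courier New", "Calibri",
                "Cambria", "Consolas", "Segoe UI"],
    "linux": ["DejaVu", "Liberation", "Ubuntu"],
}
# Assumed values, not taken from the thread.
FONT_SIZES_PT = [8, 10, 12, 16, 24]
TEXT_KINDS = ["short", "long"]


def iter_cases():
    """Yield one dict per (platform, font, size, text kind) combination."""
    for platform, fonts in FONTS.items():
        for font, size, kind in product(fonts, FONT_SIZES_PT, TEXT_KINDS):
            slug = font.lower().replace(" ", "-")
            yield {
                "platform": platform,
                "font": font,
                "size_pt": size,
                "text": kind,
                # Hypothetical naming scheme for the rendered image.
                "filename": f"{platform}_{slug}_{size}pt_{kind}.png",
            }


cases = list(iter_cases())
print(len(cases))  # 10 fonts * 5 sizes * 2 kinds = 100
```

Rendering the images themselves is left out on purpose; any text-to-image tool could consume this list.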
> On Sun, 27 Feb 2022 at 14:36, Chris McClelland <proph...@gmail.com> wrote:
>
>> So I did a similar analysis to Willus (see link posted by Zdenko), 
>> downscaling the images to try a range of heights for digits. Unfortunately 
>> my result is not as nice as Willus's (where he finds that the error rate 
>> drops to zero for capital-letter heights of 30-33 pixels). In my case I 
>> have 336 images each containing a column of ~30 #### numbers (dataset T) 
>> and 336 images each containing a column of ~30 #.# numbers (dataset D).
>>
>> The error rate for D seems to tend to zero for larger digit heights (i.e. 
>> more pixels). The most common error at smaller sizes seems to be a missing 
>> decimal point, e.g. input "1.2" produces output "12". To 
>> eliminate those errors, I need digits about 92 pixels high.
>>
>> The error rate for T is more complex. It has a broad trough in the 
>> digit-height range 20-48 pixels, with a perfect score at several isolated 
>> heights (20, 32, 38), but no contiguous range that produces a perfect score.
>>
>> Perhaps I could train it myself? Is 336*30*4 ~ 40,000 digits of training 
>> data enough to get meaningful results with OCR?
>>
>> Chris
>>
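The per-height comparison described above can be scripted. The sketch below scores OCR output against ground truth and separately counts the dropped-decimal-point failure ("1.2" read as "12"). The function names and the sample data are made up for illustration; they are not Chris's actual results or tooling.

```python
def is_dropped_decimal(expected: str, actual: str) -> bool:
    """True if `actual` equals `expected` with its decimal point(s) removed."""
    return "." in expected and actual == expected.replace(".", "")


def score(pairs):
    """Return (error_rate, dropped_decimal_count) for (expected, actual) pairs."""
    pairs = list(pairs)
    errors = [(e, a) for e, a in pairs if e != a]
    dropped = sum(is_dropped_decimal(e, a) for e, a in errors)
    rate = len(errors) / len(pairs) if pairs else 0.0
    return rate, dropped


# Hypothetical OCR results keyed by digit height in pixels.
results_by_height = {
    24: [("1.2", "12"), ("3.4", "3.4"), ("5.6", "56"), ("7.8", "7.3")],
    92: [("1.2", "1.2"), ("3.4", "3.4"), ("5.6", "5.6"), ("7.8", "7.8")],
}

for height, pairs in sorted(results_by_height.items()):
    rate, dropped = score(pairs)
    print(f"{height:3d} px: error rate {rate:.0%}, dropped decimal points: {dropped}")

# Size of the potential training set mentioned above (336 images * 30 rows * 4 digits).
print(336 * 30 * 4)  # 40320
```

Counting the dropped-decimal case separately makes it easier to see whether a given height fails on digits or only on punctuation.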
>> On Sunday, 27 February 2022 at 13:23:09 UTC zdenop wrote:
>>
>>> I do not know. The upscaling trick has been around since version 3.x. The 
>>> downscaling trick works from version 4.x on. 
>>> Just looking at Willus Dotkom's chart [1], I would guess there is some 
>>> design decision behind it... But without an explanation from the original 
>>> Google programmers, we can only guess or find a bug ;-)
>>>
>>> [1] 
>>> https://groups.google.com/group/tesseract-ocr/attach/51b840d4782db/tess4_error_rate.png?part=0.2&view=1
>>>
>>>
>>> Zdenko
>>>
>>>
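For the rescaling trick, the resize factor follows directly from the measured glyph height. A minimal sketch, assuming a target capital-letter height inside the 30-33 pixel range that this thread reports for Willus's chart; the function name and the default of 32 are illustrative only, and the actual resize would be done with an image library of your choice.

```python
def rescale_size(width: int, height: int, cap_height_px: float,
                 target_cap_px: float = 32.0) -> tuple[int, int]:
    """Return the image size that brings capital letters to target_cap_px.

    cap_height_px is the measured height of a capital letter (or digit)
    in the original image. Works for both upscaling and downscaling.
    """
    if cap_height_px <= 0:
        raise ValueError("cap_height_px must be positive")
    factor = target_cap_px / cap_height_px
    return max(1, round(width * factor)), max(1, round(height * factor))


print(rescale_size(600, 400, cap_height_px=64))  # (300, 200)
print(rescale_size(600, 400, cap_height_px=16))  # (1200, 800)
```

Whether 32 is a good target for digits-only input is an open question in this thread, so treat the default as a starting point for a sweep rather than a recommendation.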
>>> On Sun, 27 Feb 2022 at 11:27, Merlijn B.W. Wajer <mer...@archive.org> 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> On 27/02/2022 08:55, Zdenko Podobny wrote:
>>>> > tesseract fix_size.png -
>>>> > 
>>>> > 0326
>>>> > 0939
>>>> > 1552
>>>> > 2206
>>>> > 
>>>> > 
>>>> > See the doc for an explanation: 
>>>> > https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#rescaling
>>>>
>>>> Thanks for the suggestion, I'm also running into this problem in some 
>>>> cases. Is it possible that this is also some kind of segmentation bug? I 
>>>> wonder what Tesseract finds here in this clear image that causes it to 
>>>> produce an extra character.
>>>>
>>>> Regards,
>>>> Merlijn
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to tesseract-oc...@googlegroups.com.
>>>>
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/2435ccff-11e1-0848-6d57-600a4262d963%40archive.org.
>>>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/efe5e658-e12e-4f31-8795-c32b34264eb5n%40googlegroups.com.
