Re: [tesseract-ocr] quality of recognition of customer invoices

A Nederpelt Fri, 22 Sep 2023 04:50:55 -0700

Well, the question is why it chooses:
NLOO7900000B01
[image: Lambregts0001 - cleaned - btwnr.jpg]
That is the letter O twice, but the digit 0 (zero) five times.


Google vision result: "NL007900000B01"

Nuance / OMNIPage: "NL007900000B01"

Leadtools demo: "NL007900000B01"

I want to use Tesseract, but I guess I need things like a "second pass" or 
"preprocessing", no dictionary, etc.
So I would rather have maximum accuracy at 99.99% CPU usage than maximum speed.

Can somebody help me?
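
One possible form of such a "second pass" is to repair the OCR text after Tesseract has run, instead of changing the engine settings. The sketch below is only an illustration under stated assumptions: it assumes a Dutch VAT number always has the shape "NL", nine digits, "B", two digits, and that the only confusions worth repairing are O/o read for 0 and I/l read for 1. The names fix_vat_numbers, VAT_CANDIDATE and CONFUSABLE are hypothetical and not part of Tesseract.

```python
import re

# Letters that OCR commonly confuses with digits (assumed, not exhaustive).
CONFUSABLE = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})

# Assumed shape: "NL" + 9 digit-like characters + "B" + 2 digit-like characters.
VAT_CANDIDATE = re.compile(r"\bNL([0-9OoIl]{9})B([0-9OoIl]{2})\b")


def fix_vat_numbers(text: str) -> str:
    """Replace digit look-alikes, but only inside strings shaped like a VAT number."""

    def repair(match: re.Match) -> str:
        body = match.group(1).translate(CONFUSABLE)
        suffix = match.group(2).translate(CONFUSABLE)
        return "NL" + body + "B" + suffix

    return VAT_CANDIDATE.sub(repair, text)


if __name__ == "__main__":
    print(fix_vat_numbers("uw BTW nummer:: NLOO7900000B01"))
```

Because the substitution is limited to text that already matches the VAT shape, ordinary words containing the letter O elsewhere on the invoice are left alone. It does not help with errors such as "Tel £01", which would need a different rule.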

On Friday, 22 September 2023 at 13:25:21 UTC+2, desal...@gmail.com wrote:

> Apparently, version 4 doesn't support whitelisting. 
> https://groups.google.com/g/tesseract-ocr/c/IBbQIQpdSpE
> That is not good. 
> On Friday, September 22, 2023 at 2:23:39 PM UTC+3 Des Bw wrote:
>
>> Telling a zero from the letter O is deeply problematic, even for the 
>> human eye. Some fonts make it even harder. 
>> You can try the method described here, in case it helps: 
>> https://pyimagesearch.com/2021/09/06/whitelisting-and-blacklisting-characters-with-tesseract-and-python/
>> On Friday, September 22, 2023 at 9:43:51 AM UTC+3 powe...@gmail.com 
>> wrote:
>>
>>> I found the parameters:
>>> "C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned.txt" -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 :@."
>>> It is not working. The result is still "uw BTW nummer:: NLOO7900000B01".
>>>
>>> Any other ideas?
>>>
>>> On Thursday, 21 September 2023 at 22:25:12 UTC+2, 
>>> elvi...@gmail.com wrote:
>>>
>>>> Whitelist the digits so that the O will not confuse it. 
>>>>
>>>> You can also try --psm 13 if all of your texts are single lines.
>>>>
>>>> On Thu, Sep 21, 2023, 4:07 PM A Nederpelt <powe...@gmail.com> wrote:
>>>>
>>>>> Hi.
>>>>> I am trying to use the Tesseract engine instead of the Nuance engine.
>>>>> When I run tesseract.exe on the image, it returns a few strange 
>>>>> characters.
>>>>> It returns OO (the letter O twice) instead of 00:
>>>>>   "uw BTW nummer:: NLOO7900000B01"
>>>>> instead of
>>>>>   "uw BTW nummer:: NL007900000B01"
>>>>> and
>>>>> "Tel £01"
>>>>> instead of
>>>>> "Tel : 01"
>>>>> but "Tel : 0168-452452" is recognized ok.
>>>>>
>>>>> I see no improvement from following 
>>>>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md 
>>>>> because these are really clean documents.
>>>>>
>>>>> Am I missing some parameters? Like a second run, or a more accurate 
>>>>> run, etc.
>>>>> Maybe I should compile tesseract.exe myself with different, 
>>>>> higher-quality parameters?
>>>>>
>>>>> Thanks,
>>>>> Alwin
>>>>>
>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/60caf669-edb7-4517-9e07-8ad49f1b0d85n%40googlegroups.com.
