Re: [tesseract-ocr] quality of recognition of customer invoices

Des Bw Fri, 22 Sep 2023 05:03:58 -0700

That CPU usage is unusual. I have a pretty old Mac (from 2011), and 
Tesseract has been running on it just fine.
As for accuracy: if your project is limited in scale, the commercial 
tools would definitely perform better for you. But if you have 
long-running, extensive projects, Tesseract is worth the time it takes 
to develop (train) it.



On Friday, September 22, 2023 at 2:50:50 PM UTC+3 [email protected] wrote:

> Well, the question is why, for this image:
> [image: Lambregts0001 - cleaned - btwnr.jpg]
> it outputs
> NLOO7900000B01
> with the letter O twice and the digit 0 (zero) five times.
>
> Google vision result: "NL007900000B01"
>
> Nuance / OMNIPage: "NL007900000B01"
>
> Leadtools demo: "NL007900000B01"
>
> I want to use Tesseract, but I guess I need things like a "second pass" or 
> "preprocessing", no dictionary, etc.
> So I would rather have it use 99.99% CPU than be super fast.
>
> Can somebody help me?
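
On the "second pass" idea: a Dutch VAT number always has the shape NL + 9 digits + B + 2 digits, so one option is to correct the OCR output afterwards instead of fighting the engine. Below is a minimal sketch in Python. It is not part of Tesseract; the function name fix_dutch_vat and the set of letter-to-digit substitutions are assumptions made for illustration only.

```python
import re

# Dutch VAT numbers have the shape NL + 9 digits + B + 2 digits.
VAT_PATTERN = re.compile(r"^NL\d{9}B\d{2}$")

# Letters that are plausibly misread for digits (an assumption, extend as needed).
# The table is applied only at positions where a digit is expected.
LETTER_TO_DIGIT = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})


def fix_dutch_vat(raw):
    """Return a corrected VAT number, or None if it still does not fit the pattern."""
    text = re.sub(r"\s+", "", raw)  # drop any whitespace the OCR inserted
    if len(text) != 14:
        return None
    candidate = (
        text[:2].upper()                        # country prefix "NL"
        + text[2:11].translate(LETTER_TO_DIGIT)  # nine digits
        + text[11].upper()                       # the letter "B"
        + text[12:].translate(LETTER_TO_DIGIT)   # two trailing digits
    )
    return candidate if VAT_PATTERN.match(candidate) else None


print(fix_dutch_vat("NLOO7900000B01"))  # NL007900000B01
```

Returning None instead of guessing means anything that still does not fit the pattern can be flagged for manual review.
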
>
> On Friday, 22 September 2023 at 13:25:21 UTC+2, [email protected] wrote:
>
>> Apparently, version 4.0 doesn't support whitelisting with the LSTM 
>> engine (as far as I know it works again from 4.1 on). 
>> https://groups.google.com/g/tesseract-ocr/c/IBbQIQpdSpE
>> That is not good. 
>> On Friday, September 22, 2023 at 2:23:39 PM UTC+3 Des Bw wrote:
>>
>>> Telling a zero and an O apart is genuinely hard, even for the human 
>>> eye. Some fonts make it even harder. 
>>> You can try the method used here, in case it helps: 
>>> https://pyimagesearch.com/2021/09/06/whitelisting-and-blacklisting-characters-with-tesseract-and-python/
>>> On Friday, September 22, 2023 at 9:43:51 AM UTC+3 [email protected] 
>>> wrote:
>>>
>>>> I found the parameter:
>>>> "C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned.txt" -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 :@."
>>>> It is not working. The output is still "uw BTW nummer:: NLOO7900000B01".
>>>>
>>>> Any other ideas?
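
One thing that stands out in the command above: the whitelist still contains the letter O, so it cannot rule out the O/0 confusion. A possible variation is to crop the VAT field into its own image and allow only the characters a Dutch VAT number can contain. The sketch below just builds such a command line in Python. The file names are hypothetical, the helper names are made up for illustration, and whether the whitelist is honoured depends on the Tesseract version and engine, so treat it as something to try rather than a guaranteed fix.

```python
import subprocess

# Characters that can occur in a Dutch VAT number: N, L, B and the digits.
# The letter O is deliberately left out.
VAT_WHITELIST = "NLB0123456789"


def build_tesseract_args(image_path, output_base, whitelist=VAT_WHITELIST,
                         psm=13, exe="tesseract"):
    """Build the argument list for a single-line, whitelisted Tesseract run."""
    return [
        exe,
        image_path,
        output_base,  # Tesseract appends ".txt" to this name itself
        "--psm", str(psm),  # 13 = raw line, as suggested elsewhere in this thread
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_tesseract(args):
    """Run Tesseract; raises CalledProcessError if it exits with an error."""
    subprocess.run(args, check=True)


# Hypothetical file names, for illustration only.
args = build_tesseract_args("vat_field.png", "vat_field")
print(args)
```

Passing the arguments as a list avoids the quoting problems that file names with spaces cause on the Windows command line.
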
>>>>
>>>> On Thursday, 21 September 2023 at 22:25:12 UTC+2, 
>>>> [email protected] wrote:
>>>>
>>>>> Whitelist the digits so that the O will not confuse it. 
>>>>>
>>>>> You can also try --psm 13 if all of your texts are single lines.
>>>>>
>>>>> On Thu, Sep 21, 2023, 4:07 PM A Nederpelt <[email protected]> wrote:
>>>>>
>>>>>> Hi.
>>>>>> I am trying to use the Tesseract engine instead of the Nuance engine.
>>>>>> When I run tesseract.exe on the image, it returns a few strange 
>>>>>> characters.
>>>>>> OO (the letter O, twice) instead of 00:
>>>>>>   "uw BTW nummer:: NLOO7900000B01"
>>>>>> instead of
>>>>>>   "uw BTW nummer:: NL007900000B01"
>>>>>> and
>>>>>> "Tel £01"
>>>>>> instead of
>>>>>> "Tel : 01"
>>>>>> but "Tel : 0168-452452" is recognized ok.
>>>>>>
>>>>>> I see no improvement from following 
>>>>>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md 
>>>>>> because these are really clean documents.
>>>>>>
>>>>>> Am I missing some parameters? Like a second run, or a more accurate 
>>>>>> run, etc.
>>>>>> Maybe I should compile tesseract.exe myself with different, 
>>>>>> higher-quality parameters?
>>>>>>
>>>>>> Thanks,
>>>>>> Alwin
>>>>>>
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com
>>>>>>  
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>>
>>>>>

