Re: [tesseract-ocr] quality of recognition of customer invoices

A Nederpelt Fri, 22 Sep 2023 05:24:55 -0700

Well, I have approximately 3000 customers for our software at the moment. 
We run a lot of invoices through OCR; a single customer processes roughly 
10,000 documents a month. 
So open source is worth the effort. I want to use Tesseract, since it is free. 
I believe open source is the future.


So, can somebody help me optimize it?

By "lots of CPU usage" I mean that I accept it using more CPU if that is what 
a setting like "super quality" requires. If such a parameter exists, I want to 
use it.
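
One possible workaround for the O/0 mix-up while the recognition itself is 
still being tuned: a Dutch BTW number always has the shape NL + 9 digits + B + 
2 digits, so the OCR text can be repaired after the fact. The Python sketch 
below is only an illustration. The function name fix_vat_zeros and the regular 
expression are assumptions made for this example, not part of Tesseract, and 
the approach only helps for fields with a known fixed layout.

```python
import re

# Assumed layout of a Dutch VAT (BTW) number: "NL", nine digits, "B", two digits.
# The digit positions also accept the letter "O" so misread zeros still match.
VAT_SHAPE = re.compile(r"\bNL[0-9O]{9}B[0-9O]{2}\b")


def fix_vat_zeros(text: str) -> str:
    """Replace the letter O with the digit 0 inside NL VAT numbers found in text."""

    def repair(match: re.Match) -> str:
        raw = match.group(0)
        # Keep the "NL" prefix. In the remainder, "B" is the only legitimate
        # letter, so every "O" there is treated as a misread zero.
        return raw[:2] + raw[2:].replace("O", "0")

    return VAT_SHAPE.sub(repair, text)


print(fix_vat_zeros("uw BTW nummer:: NLOO7900000B01"))
# prints: uw BTW nummer:: NL007900000B01
```

Text that does not match the VAT layout, such as "Tel : 0168-452452", is left 
untouched, so this can run over the whole OCR output of an invoice.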

On Friday, 22 September 2023 at 14:03:53 UTC+2, desal...@gmail.com wrote:

> The CPU usage is unusual. I have a pretty old Mac (from 2011) and have been 
> running Tesseract on it just fine.
> As for accuracy: if your project is limited in scale, the commercial tools 
> would definitely perform better for you. But if you have long-lasting, 
> extensive projects, Tesseract is worth the time spent developing (training) 
> it. 
>
>
> On Friday, September 22, 2023 at 2:50:50 PM UTC+3 powe...@gmail.com wrote:
>
>> Well, the problem is why it chooses:
>> NLOO7900000B01
>> [image: Lambregts0001 - cleaned - btwnr.jpg]
>> That is the letter O twice, and then the digit 0 (zero) five times.
>>
>> Google vision result: "NL007900000B01"
>>
>> Nuance / OMNIPage: "NL007900000B01"
>>
>> Leadtools demo: "NL007900000B01"
>>
>> I want to use Tesseract, but I guess I need things like a "second pass", 
>> "preprocessing", no dictionary, etc.
>> So I would rather have a CPU usage of 99.99% than top speed.
>>
>> Can somebody help me?
>>
>> On Friday, 22 September 2023 at 13:25:21 UTC+2, desal...@gmail.com wrote:
>>
>>> Apparently, version 4 doesn't support whitelisting. 
>>> https://groups.google.com/g/tesseract-ocr/c/IBbQIQpdSpE
>>> That is not good. 
>>> On Friday, September 22, 2023 at 2:23:39 PM UTC+3 Des Bw wrote:
>>>
>>>> The difference between zero and O is deeply problematic, even for the 
>>>> human eye. Some fonts make it even harder. 
>>>> You can try the method used here, in case it helps: 
>>>> https://pyimagesearch.com/2021/09/06/whitelisting-and-blacklisting-characters-with-tesseract-and-python/
>>>> On Friday, September 22, 2023 at 9:43:51 AM UTC+3 powe...@gmail.com 
>>>> wrote:
>>>>
>>>>> I found the parameters:
>>>>> "C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" 
>>>>> "Lambregts0001 - cleaned.txt" -c 
>>>>> tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 :@."
>>>>> It is not working. The result is still "uw BTW nummer:: NLOO7900000B01"
>>>>>
>>>>> Any other ideas?
>>>>>
>>>>> On Thursday, 21 September 2023 at 22:25:12 UTC+2, elvi...@gmail.com wrote:
>>>>>
>>>>>> Whitelist the digits so that the O will not confuse it. 
>>>>>>
>>>>>> You can also try --psm 13 if all of your texts are a single line.
>>>>>>
>>>>>> On Thu, Sep 21, 2023, 4:07 PM A Nederpelt <powe...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi.
>>>>>>> I am trying to use the Tesseract engine instead of the Nuance engine.
>>>>>>> When I run tesseract.exe on the image, it currently returns a few 
>>>>>>> strange characters.
>>>>>>> Twice it reads the letter O instead of the digit 0:
>>>>>>>   "uw BTW nummer:: NLOO7900000B01"
>>>>>>> instead of
>>>>>>>   "uw BTW nummer:: NL007900000B01"
>>>>>>> and
>>>>>>> "Tel £01"
>>>>>>> instead of
>>>>>>> "Tel : 01"
>>>>>>> but "Tel : 0168-452452" is recognized ok.
>>>>>>>
>>>>>>> I see nothing to gain from 
>>>>>>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md 
>>>>>>> because the documents are already really clean.
>>>>>>>
>>>>>>> Am I missing some parameters, such as a second run or a more 
>>>>>>> accurate run?
>>>>>>> Or should I compile tesseract.exe myself with different, 
>>>>>>> higher-quality parameters?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Alwin
>>>>>>>
>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>> To view this discussion on the web visit 
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com.
>>>>>>>
>>>>>>

