Re: [tesseract-ocr] quality of recognition of customer invoices

Des Bw Fri, 22 Sep 2023 06:10:47 -0700

Shree is one of the most experienced, and definitely the most 
helpful, member of this group. I have also seen Zdenko answering some 
questions. You might have good luck with either of them.
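
For the O/0 mix-up itself, one cheap workaround is to repair the text after OCR rather than inside Tesseract. A minimal sketch in Python, assuming the usual Dutch BTW layout of "NL", nine digits, "B" and two digits. The function name fix_vat_numbers and the look-alike table are made up for illustration; they are not something Tesseract provides.

```python
import re

# Assumed shape of a Dutch BTW (VAT) number: "NL" + 9 digits + "B" + 2 digits.
# The digit positions also accept letters that OCR commonly confuses with digits.
_VAT_SHAPE = re.compile(r"\bNL([0-9OoIl]{9})B([0-9OoIl]{2})\b")

# Look-alike letters mapped to the digits they are usually mistaken for.
_TO_DIGIT = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})


def fix_vat_numbers(text: str) -> str:
    """Replace look-alike letters with digits, but only inside BTW-shaped tokens."""

    def repair(match: re.Match) -> str:
        body = match.group(1).translate(_TO_DIGIT)
        suffix = match.group(2).translate(_TO_DIGIT)
        return "NL" + body + "B" + suffix

    return _VAT_SHAPE.sub(repair, text)


print(fix_vat_numbers("uw BTW nummer:: NLOO7900000B01"))
```

Because the substitution only fires inside tokens that already have the BTW shape, ordinary words containing an O are left alone.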

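The whitelist and --psm suggestions from further down the thread can also be combined on a cropped image of just the BTW field. A rough sketch, assuming tesseract is on the PATH and that the installed version honours tessedit_char_whitelist (as far as I know 4.0 with the LSTM engine did not, while 4.1 and later do). The file name btwnr.jpg is hypothetical and the helper names are my own.

```python
import subprocess


def build_tesseract_cmd(image, whitelist=None, psm=13, exe="tesseract"):
    """Return the argument list for one Tesseract run that prints to stdout."""
    # Using "stdout" as the output base makes Tesseract print the text
    # instead of writing a .txt file.
    cmd = [exe, str(image), "stdout", "--psm", str(psm)]
    if whitelist:
        # Restrict recognition to the given characters.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


def ocr_field(image, **options):
    """Run Tesseract on a cropped field image and return the recognized text."""
    done = subprocess.run(build_tesseract_cmd(image, **options),
                          capture_output=True, text=True, check=True)
    return done.stdout.strip()


# Letters N, L, B plus digits only: the letter O is not in the list.
print(build_tesseract_cmd("btwnr.jpg", whitelist="NLB0123456789"))
```

Passing the arguments as a list avoids the quoting problems that file names with spaces cause on the Windows command line.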

On Friday, September 22, 2023 at 4:07:12 PM UTC+3 Des Bw wrote:

> If you have a source of income, you might be able to offer some compensation for 
> his/her time, and an experienced user or even a developer might help you 
> fine-tune the software for your needs. You could ask Shree whether he/she would be 
> interested. 
>
> On Friday, September 22, 2023 at 3:24:52 PM UTC+3 [email protected] wrote:
>
>> Well, I have approximately 3000 customers at the moment for our software. 
>> We run lots of invoices through OCR, e.g. one customer processes approx. 10,000 
>> documents a month. 
>> So open source is worth it. I want Tesseract, since it is free to use. 
>> I believe open source is the future.
>>
>> So, can somebody help me optimize it? 
>>
>> By "lots of CPU usage" I mean that it may use more CPU for some 
>> parameter like "super quality". I want to use such a parameter.
>>
>> On Friday, 22 September 2023 at 14:03:53 UTC+2, [email protected] wrote:
>>
>>> The CPU usage is unusual. I have a pretty old Mac (from 2011) and have been 
>>> running Tesseract on it just fine.
>>> As for accuracy, if your project is limited in scale, the 
>>> commercial tools would definitely perform better for you. But if you have 
>>> long-lasting, extensive projects, Tesseract is worth the time spent 
>>> on developing (training) it. 
>>>
>>>
>>> On Friday, September 22, 2023 at 2:50:50 PM UTC+3 [email protected] 
>>> wrote:
>>>
>>>> Well, the question is why, for this image, it outputs
>>>> NLOO7900000B01
>>>> [image: Lambregts0001 - cleaned - btwnr.jpg]
>>>> i.e. the letter O twice and the digit 0 (zero) five times.
>>>>
>>>> Google vision result: "NL007900000B01"
>>>>
>>>> Nuance / OMNIPage: "NL007900000B01"
>>>>
>>>> Leadtools demo: "NL007900000B01"
>>>>
>>>> I want to use Tesseract, but I guess I need things like a "second pass" 
>>>> or "preprocessing", no dictionary, etc.
>>>> So I would rather have a CPU usage of 99.99% than maximum speed.
>>>>
>>>> Can somebody help me?
>>>>
>>>> On Friday, 22 September 2023 at 13:25:21 UTC+2, [email protected] wrote:
>>>>
>>>>> Apparently, version 4 doesn't support whitelisting. 
>>>>> https://groups.google.com/g/tesseract-ocr/c/IBbQIQpdSpE
>>>>> That is not good. 
>>>>> On Friday, September 22, 2023 at 2:23:39 PM UTC+3 Des Bw wrote:
>>>>>
>>>>>> The difference between zero and O is deeply problematic, even for the 
>>>>>> human eye. Some fonts make it even harder. 
>>>>>> You can try the method used here: 
>>>>>> https://pyimagesearch.com/2021/09/06/whitelisting-and-blacklisting-characters-with-tesseract-and-python/
>>>>>> if that helps. 
>>>>>> On Friday, September 22, 2023 at 9:43:51 AM UTC+3 [email protected] 
>>>>>> wrote:
>>>>>>
>>>>>>> I found the parameters
>>>>>>> "C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - 
>>>>>>> cleaned.jpg" "Lambregts0001 - cleaned.txt" -c 
>>>>>>> tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
>>>>>>>  
>>>>>>> :@."
>>>>>>> It is not working; the output is still "uw BTW nummer:: NLOO7900000B01".
>>>>>>>
>>>>>>> Any other ideas?
>>>>>>>
>>>>>>> On Thursday, 21 September 2023 at 22:25:12 UTC+2, [email protected] wrote:
>>>>>>>
>>>>>>>> Whitelist the digits so that the O will not confuse it. 
>>>>>>>> You can also try --psm 13 if all of your texts are single lines.
>>>>>>>>
>>>>>>>> On Thu, Sep 21, 2023, 4:07 PM A Nederpelt <[email protected]> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi.
>>>>>>>>> I am trying to use the Tesseract engine instead of the Nuance 
>>>>>>>>> engine.
>>>>>>>>> When I run tesseract.exe on the image, it currently returns a few 
>>>>>>>>> strange characters.
>>>>>>>>> 2x OO instead of 00
>>>>>>>>>   "uw BTW nummer:: NLOO7900000B01"
>>>>>>>>> instead of
>>>>>>>>>   "uw BTW nummer:: NL007900000B01"
>>>>>>>>> and
>>>>>>>>> "Tel £01"
>>>>>>>>> instead of
>>>>>>>>> "Tel : 01"
>>>>>>>>> but "Tel : 0168-452452" is recognized ok.
>>>>>>>>>
>>>>>>>>> I see no improvement from following 
>>>>>>>>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md 
>>>>>>>>> because these are really clean documents.
>>>>>>>>>
>>>>>>>>> Am I missing some parameters? Like a second run, or a more accurate 
>>>>>>>>> run, etc.
>>>>>>>>> Maybe I should compile tesseract.exe myself with different, higher-quality 
>>>>>>>>> parameters?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Alwin
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>>> send an email to [email protected].
>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com
>>>>>>>>>  
>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>

