Re: [tesseract-ocr] quality of recognition of customer invoices

Zdenko Podobny Fri, 22 Sep 2023 07:39:10 -0700

I know there are (were) people at the forum that implemented Tesseract as
part of invoice processing - but as a commercial solution.


It is not as easy as it looks: there is a need for a custom solution for
text detection (e.g. skipping logos and other graphics, possible
handwriting). As far as I remember they created a new engine for amount
recognition - this is the most critical part of invoice processing.
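
Fixed-format fields are one place where such a custom step is easy to add. As a rough sketch only (not what the implementations mentioned above did), the VAT number from the quoted thread below could be post-corrected against its expected layout. The layout assumed here is the Dutch one, "NL" + 9 digits + "B" + 2 digits; the function name fix_nl_vat and the confusion table are hypothetical.

```python
import re
from typing import Optional

# Assumed layout of a Dutch VAT number: "NL" + 9 digits + "B" + 2 digits.
NL_VAT = re.compile(r"NL[0-9]{9}B[0-9]{2}")

# Letters that OCR commonly confuses with digits (illustrative, not exhaustive).
LETTER_TO_DIGIT = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})


def fix_nl_vat(raw: str) -> Optional[str]:
    """Return a corrected VAT number, or None if it still does not fit the layout."""
    text = raw.strip().replace(" ", "")
    if len(text) != 14:
        return None
    # Only the digit positions are rewritten; "NL" and "B" are left as read.
    candidate = (
        text[:2].upper()
        + text[2:11].translate(LETTER_TO_DIGIT)
        + text[11].upper()
        + text[12:].translate(LETTER_TO_DIGIT)
    )
    return candidate if NL_VAT.fullmatch(candidate) else None


print(fix_nl_vat("NLOO7900000B01"))
```

Anything that still fails the layout check comes back as None, so it can be sent to manual review instead of being accepted silently.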

A few years ago I had a discussion with a professional provider of such
services in Europe (they did not use Tesseract). They told me that they
try to avoid extracting data from invoices and insist on invoice data
exchange instead, because it is cheaper and more reliable...

Just my 2 cents on what you can expect and what problems you will need to
solve.
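
On the whitelist and --psm settings tried in the quoted thread below, here is a minimal sketch of passing them from Python, assuming the tesseract executable is on the PATH. The helper names build_tesseract_cmd and ocr_field are hypothetical; --psm, -c and the "stdout" output base are regular Tesseract command-line options. Whether the whitelist actually takes effect depends on the Tesseract version and engine in use, as noted in the thread.

```python
import subprocess
from typing import List, Optional


def build_tesseract_cmd(
    image_path: str,
    psm: int = 13,
    whitelist: Optional[str] = None,
    exe: str = "tesseract",
) -> List[str]:
    """Build the argument list for one Tesseract run that prints to stdout."""
    cmd = [exe, image_path, "stdout", "--psm", str(psm)]
    if whitelist:
        # Passed as a single list element, so no shell quoting is involved.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


def ocr_field(image_path: str, **kwargs) -> str:
    """Run Tesseract on one (ideally cropped) field image and return its text."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(build_tesseract_cmd("field.jpg", whitelist="0123456789"))
```

A digits-only whitelist only makes sense for an image cropped to a purely numeric field; the VAT number also contains the letters N, L and B.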

Zdenko


On Fri, 22 Sep 2023 at 14:24, A Nederpelt <powern...@gmail.com> wrote:

> Well, I have approximately 3000 customers at the moment for our software.
> We OCR lots of invoices, e.g. one customer processes approx. 10,000
> documents a month.
> So open source is worth it. I want Tesseract, since it is free to use.
> I believe open source is the future.
>
> So, can somebody help me optimize it?
>
> By "lots of CPU usage" I mean that it may need more CPU for some
> parameter like "super quality". I want to use that parameter.
>
> On Friday, 22 September 2023 at 14:03:53 UTC+2, desal...@gmail.com wrote:
>
>> The CPU usage is unusual. I have a pretty old Mac (from 2011) and have
>> been running Tesseract on it just fine.
>> As to the accuracy: if your project is limited in scale, the commercial
>> tools would definitely perform better for you. But if you have
>> long-lasting, extensive projects, Tesseract is worth the time spent on
>> developing (training) it.
>>
>>
>> On Friday, September 22, 2023 at 2:50:50 PM UTC+3 powe...@gmail.com
>> wrote:
>>
>>> Well, the problem is: why does it choose
>>> NLOO7900000B01
>>> [image: Lambregts0001 - cleaned - btwnr.jpg]
>>> with 2 times the letter O and 5 times a 0 (zero)?
>>>
>>> Google vision result: "NL007900000B01"
>>>
>>> Nuance / OMNIPage: "NL007900000B01"
>>>
>>> Leadtools demo: "NL007900000B01"
>>>
>>> I want to use Tesseract, but I guess I need things like a "second pass"
>>> or "preprocessing", no dictionary, etc.
>>> So I would rather have a CPU usage of 99.99% than super speed.
>>>
>>> Can somebody help me?
>>>
>>> On Friday, 22 September 2023 at 13:25:21 UTC+2,
>>> desal...@gmail.com wrote:
>>>
>>>> Apparently, version 4 doesn't support whitelisting.
>>>> https://groups.google.com/g/tesseract-ocr/c/IBbQIQpdSpE
>>>> That is not good.
>>>> On Friday, September 22, 2023 at 2:23:39 PM UTC+3 Des Bw wrote:
>>>>
>>>>> Telling zero and O apart is deeply problematic, even for the human
>>>>> eye. Some fonts make it even harder.
>>>>> You can try the method used here:
>>>>> https://pyimagesearch.com/2021/09/06/whitelisting-and-blacklisting-characters-with-tesseract-and-python/
>>>>> if that helps.
>>>>> On Friday, September 22, 2023 at 9:43:51 AM UTC+3 powe...@gmail.com
>>>>> wrote:
>>>>>
>>>>>> I found the parameters
>>>>>> "C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 -
>>>>>> cleaned.jpg" "Lambregts0001 - cleaned.txt" -c
>>>>>> tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
>>>>>> :@."
>>>>>> It is not working. The result is still "uw BTW nummer:: NLOO7900000B01"
>>>>>>
>>>>>> Any other ideas?
>>>>>>
>>>>>> On Thursday, 21 September 2023 at 22:25:12 UTC+2,
>>>>>> elvi...@gmail.com wrote:
>>>>>>
>>>>>>> Whitelist the digits so that the O will not confuse it.
>>>>>>>
>>>>>>> You can also try --psm 13 if all of your texts are single lines.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 21, 2023, 4:07 PM A Nederpelt <powe...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi.
>>>>>>>> I am trying to use the Tesseract engine instead of the Nuance
>>>>>>>> engine.
>>>>>>>> When I currently run tesseract.exe on the image, it returns a few
>>>>>>>> strange characters:
>>>>>>>> 2x OO instead of 00
>>>>>>>>   "uw BTW nummer:: NLOO7900000B01"
>>>>>>>> instead of
>>>>>>>>   "uw BTW nummer:: NL007900000B01"
>>>>>>>> and
>>>>>>>> "Tel £01"
>>>>>>>> instead of
>>>>>>>> "Tel : 01"
>>>>>>>> but "Tel : 0168-452452" is recognized ok.
>>>>>>>>
>>>>>>>> I see no improvement from following
>>>>>>>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
>>>>>>>> because these are really clean documents.
>>>>>>>>
>>>>>>>> Am I missing some parameters, like a second run or a more accurate
>>>>>>>> run?
>>>>>>>> Or should I compile tesseract.exe myself with different,
>>>>>>>> higher-quality parameters?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Alwin
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com
>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>>
