RE: [tesseract-ocr] quality of recognition of customer invoices

Art Rhyno Sun, 24 Sep 2023 05:18:26 -0700

It is not a “super quality” parameter, but one possible approach to critical 
numbers and other content where a dictionary is not helpful is to target 
individual characters. Tesseract will report each character along with a 
confidence value, either through the API or in hOCR output with "-c 
hocr_char_boxes=1". For glyphs whose confidence falls in a range like 90 to 98 
percent, it might be possible to get closer to 99 percent by using the glyph 
coordinates to extract each one and re-running it with single character 
recognition (PSM 10). This, of course, adds a lot more overhead, but it can 
help with tricky recognition, like distinguishing between "O" and "0".
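
As a rough sketch of the first half of that idea, the snippet below pulls 
per-character boxes and confidences out of hOCR text and keeps the glyphs in 
the 90 to 98 range. The function names, the sample fragment, and its numbers 
are made up for illustration, and the span layout the regex expects is an 
assumption that should be checked against real output from your Tesseract 
version.

```python
import html
import re

# Assumed layout of the per-character spans emitted with hocr_char_boxes=1, e.g.
# <span class='ocrx_cinfo' title='x_bboxes 10 5 22 30; x_conf 93.2'>O</span>
CINFO_RE = re.compile(
    r"<span class=['\"]ocrx_cinfo['\"] "
    r"title=['\"]x_bboxes (\d+) (\d+) (\d+) (\d+); x_conf ([\d.]+)['\"]>"
    r"(.*?)</span>"
)


def parse_char_boxes(hocr):
    """Return a list of (char, (left, top, right, bottom), confidence)."""
    chars = []
    for m in CINFO_RE.finditer(hocr):
        bbox = tuple(int(m.group(i)) for i in range(1, 5))
        conf = float(m.group(5))
        chars.append((html.unescape(m.group(6)), bbox, conf))
    return chars


def uncertain_glyphs(chars, low=90.0, high=98.0):
    """Keep glyphs with low <= confidence < high: candidates for a PSM 10 re-run."""
    return [c for c in chars if low <= c[2] < high]


# Illustrative fragment only; the coordinates and confidences are invented.
SAMPLE_HOCR = (
    "<span class='ocrx_cinfo' title='x_bboxes 10 5 22 30; x_conf 99.1'>N</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 24 5 36 30; x_conf 99.0'>L</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 38 5 50 30; x_conf 93.4'>O</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 52 5 64 30; x_conf 91.7'>O</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 66 5 78 30; x_conf 99.3'>7</span>"
)

if __name__ == "__main__":
    for char, bbox, conf in uncertain_glyphs(parse_char_boxes(SAMPLE_HOCR)):
        print(char, bbox, conf)
```

Each selected bounding box could then be cropped from the page image and run 
through Tesseract again with "--psm 10"; that step needs a local Tesseract 
install and is left out here.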

art

From: tesseract-ocr@googlegroups.com <tesseract-ocr@googlegroups.com> On Behalf 
Of A Nederpelt
Sent: Friday, September 22, 2023 8:25 AM
To: tesseract-ocr <tesseract-ocr@googlegroups.com>
Subject: Re: [tesseract-ocr] quality of recognition of customer invoices

Well, I have approximately 3000 customers for our software at the moment. We 
OCR a lot of invoices; for example, one customer processes approximately 10,000 
documents a month.
So open source is worth it. I want Tesseract, since it is free to use.
I believe open source is the future.

So, can somebody help me optimize it?

By lots of CPU usage I mean that if some parameter like "super quality" needs 
more CPU, I want to use that parameter.
On Friday, September 22, 2023 at 2:03:53 PM UTC+2 desal...@gmail.com wrote:
The CPU usage is unusual. I have a pretty old Mac (from 2011) and have been 
running Tesseract on it just fine.
As to accuracy: if your project is limited in scale, the commercial tools would 
definitely perform better for you. But if you have long-lasting, extensive 
projects, Tesseract is worth the time spent developing (training) it.

On Friday, September 22, 2023 at 2:50:50 PM UTC+3 powe...@gmail.com wrote:
Well, the problem is why it chooses:
NLOO7900000B01
[https://groups.google.com/group/tesseract-ocr/attach/34576be24307/Lambregts0001%20-%20cleaned%20-%20btwnr.jpg?part=0.1&view=1]
the letter O twice and the digit 0 (zero) five times

Google vision result: "NL007900000B01"

Nuance / OMNIPage: "NL007900000B01"

Leadtools demo: "NL007900000B01"

I want to use Tesseract, but I guess I need things like a "second pass" or 
"preprocessing", no dictionary, etc.
So I would rather have CPU usage of 99.99% than super speed.

Can somebody help me?

On Friday, September 22, 2023 at 1:25:21 PM UTC+2 desal...@gmail.com wrote:
Apparently, version 4 doesn't support white listing. 
https://groups.google.com/g/tesseract-ocr/c/IBbQIQpdSpE
That is not good.
On Friday, September 22, 2023 at 2:23:39 PM UTC+3 Des Bw wrote:
The difference between zero and O is deeply problematic, even for the human 
eye. Some fonts make it even harder.
You can try the method used here, if that helps: 
https://pyimagesearch.com/2021/09/06/whitelisting-and-blacklisting-characters-with-tesseract-and-python/
On Friday, September 22, 2023 at 9:43:51 AM UTC+3 powe...@gmail.com wrote:
I found the parameters:
"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned.txt" -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 :@."
It is not working. The result is still "uw BTW nummer:: NLOO7900000B01"

Any other ideas?

On Thursday, September 21, 2023 at 10:25:12 PM UTC+2 elvi...@gmail.com wrote:
Whitelist the digits so that the O will not confuse it.
You can also try --psm 13 if all of your text is on a single line.
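
A minimal sketch of what such a call might look like when driven from Python, 
assuming the numeric part of the field has already been cropped into its own 
image (the file name "digits_crop.png" and the helper build_tesseract_cmd are 
hypothetical). A digits-only whitelist would not fit the full VAT number, 
which also contains the letters N, L and B. As another reply in this thread 
points out, whether the whitelist is honoured may depend on the Tesseract 
version.

```python
import shlex


def build_tesseract_cmd(image, outputbase, whitelist=None, psm=None,
                        exe="tesseract"):
    """Build an argv list for the tesseract CLI: imagename outputbase [options]."""
    cmd = [exe, image, outputbase]
    if psm is not None:
        # --psm 13 treats the image as a single raw text line.
        cmd += ["--psm", str(psm)]
    if whitelist is not None:
        # Restrict recognition to the listed characters.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


# Hypothetical input: a crop containing only digits. "stdout" as the output
# base makes tesseract print the text instead of writing a file.
cmd = build_tesseract_cmd("digits_crop.png", "stdout",
                          whitelist="0123456789", psm=13)

if __name__ == "__main__":
    # Print the command; run it with subprocess.run(cmd) where Tesseract is installed.
    print(shlex.join(cmd))
```

Passing the arguments as a list rather than one shell string sidesteps the 
quoting problems that file names with spaces tend to cause.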

On Thu, Sep 21, 2023, 4:07 PM A Nederpelt <powe...@gmail.com> wrote:
Hi.
I am trying to use the Tesseract engine instead of the Nuance engine.
When I run tesseract.exe on the image, it returns a few strange characters.
2x OO instead of 00
  "uw BTW nummer:: NLOO7900000B01"
instead of
  "uw BTW nummer:: NL007900000B01"
and
"Tel £01"
instead of
"Tel : 01"
but "Tel : 0168-452452" is recognized ok.

I see no improvement from following 
https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md because 
these are really clean documents.

Am I missing some parameters, like a second run or a more accurate run?
Should I maybe compile tesseract.exe myself with different, higher-quality 
parameters?

Thanks,
Alwin
--
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com.
