Re: [tesseract-ocr] quality of recognition of customer invoices

A Nederpelt Mon, 25 Sep 2023 08:13:39 -0700

After removing the "L" from the picture, everything is OCR'd as zero: "uw 
BTW nummer:: N 007900000B01"
[image: Lambregts0001 - cleaned - bw 2a.jpg]
Is there some "extra" feature of Tesseract that I should disable?
It looks as if Tesseract "reasons" like this: "Is this character a zero or 
an 'O'? Hmm, there is text right before it, so it is more likely an 'O' 
than a zero."
Looking at the "x_conf" ENTRY:
2 boxes:
      <span class='ocrx_word' id='word_1_23' title='bbox 1625 1250 1636 1273; x_wconf 90'>
       <span class='ocrx_cinfo' title='x_bboxes 1625 1250 1636 1273; x_conf 98.790306'>N</span>
      </span>
      <span class='ocrx_word' id='word_1_24' title='bbox 1658 1249 1900 1274; x_wconf 90'>
       <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1676 1273; x_conf 98.878639'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1679 1250 1697 1273; x_conf 99.001717'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1699 1249 1716 1273; x_conf 99.028969'>7</span>
       <span class='ocrx_cinfo' title='x_bboxes 1718 1250 1736 1274; x_conf 98.723213'>9</span>
       <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1757 1273; x_conf 98.880974'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1759 1250 1777 1273; x_conf 99.03817'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1798 1273; x_conf 99.012474'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1819 1273; x_conf 98.878563'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1821 1250 1839 1273; x_conf 98.819069'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 99.029968'>B</span>
       <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 98.80172'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1900 1273; x_conf 98.938919'>1</span>
      </span>


1 box (previous mail)
      <span class='ocrx_word' id='word_1_24' title='bbox 1614 1250 1899 1273; x_wconf 75'>
       <span class='ocrx_cinfo' title='x_bboxes 1614 1250 1636 1273; x_conf 99.020287'>N</span>
       <span class='ocrx_cinfo' title='x_bboxes 1639 1250 1657 1273; x_conf 99.020271'>L</span>
       <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1675 1273; x_conf 98.428726'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 1678 1250 1695 1273; x_conf 98.632645'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 1699 1250 1716 1273; x_conf 98.987907'>7</span>
       <span class='ocrx_cinfo' title='x_bboxes 1719 1250 1736 1273; x_conf 99.028702'>9</span>
       <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1756 1273; x_conf 98.484917'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1760 1250 1777 1273; x_conf 99.03093'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1797 1273; x_conf 98.998169'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1818 1273; x_conf 99.012581'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1822 1250 1839 1273; x_conf 99.038429'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 98.716026'>B</span>
       <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 96.535439'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1899 1273; x_conf 98.847801'>1</span>
      </span>

As for the VAT number, correcting it afterwards is no problem at all, but I 
want the OCR engine to recognize it correctly in the first place. The 
quality of the picture is 100%.

I am looking for settings and don't want to dig into the source code. So 
does anybody have any ideas?



On Monday, 25 September 2023 at 12:58:31 UTC+2, g...@hobbelt.com wrote:

One more thing to test is disabling the dictionary that is used to help 
evaluate character recognition. I haven't checked this myself, so YMMV: I 
know the behaviour is quite different for the classic tesseract (v3) engine 
and the LSTM engine (a v4/v5 addition); IIRC the `dawg` parameters only 
affect the v3 engine, for example.
(Which reminds me: you might want to check v3 (`--oem 0` IIRC) output as 
well as LSTM output for BTW/VAT numbers specifically.)


Anyway:
python - Pytesseract disable dictionary (multiple configuration arguments) 
- Stack Overflow 
<https://stackoverflow.com/questions/71566701/pytesseract-disable-dictionary-multiple-configuration-arguments>

Extra parameters are mentioned here:
Disable dictionary-assisted OCR in tesseract C++ API - Stack Overflow 
<https://stackoverflow.com/questions/33005215/disable-dictionary-assisted-ocr-in-tesseract-c-api>

These parameters can be set via the CLI by using the `-c <PARAM>=<VALUE>` 
command line option. Use one param per `-c`, like this: `-c PARAM1=0 -c 
PARAM2=0 ...`
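
To make the "one param per `-c`" rule concrete, here is a minimal sketch that only builds the argument list for a tesseract call. The helper name `build_tesseract_args` and the file names are made up for illustration. The parameter names `load_system_dawg` and `load_freq_dawg` are the dictionary switches I recall from the linked Stack Overflow answers; whether they have any effect on the LSTM engine is untested here, so check `tesseract --print-parameters` for your build.

```python
from typing import Dict, List, Sequence


def build_tesseract_args(
    image: str,
    output_base: str,
    params: Dict[str, str],
    configs: Sequence[str] = (),
) -> List[str]:
    """Build a tesseract argument list with one `-c NAME=VALUE` pair per parameter.

    Hypothetical helper: it only assembles the arguments, it does not run anything.
    """
    args = ["tesseract", image, output_base]
    for name, value in params.items():
        args += ["-c", f"{name}={value}"]
    # Config names such as "hocr" go last, after all -c options.
    args += list(configs)
    return args


# Assumed parameter set: dictionary switches off, per-character hOCR boxes on.
args = build_tesseract_args(
    "invoice.jpg",
    "invoice",
    {"load_system_dawg": "0", "load_freq_dawg": "0", "hocr_char_boxes": "1"},
    ["hocr"],
)
print(" ".join(args))
```

The resulting list can be handed to `subprocess.run(args)` as-is, which also avoids quoting trouble with file names that contain spaces.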

However, keep in mind the very(!) valid remark of Zdenko: 'O', 'o' and '0' 
are already *hard* to differentiate for humans and classically this was 
solved by printing zeroes with a slash (Slashed zero - Wikipedia 
<https://en.wikipedia.org/wiki/Slashed_zero>), so DO NOT expect perfect 
results any time.

Your Dutch VAT ("BTW") codes can (and should!) be validated and corrected in 
post-processing, as they follow a very specific template 
(/NL[0-9]{9}B[0-9]{2}/); see Using and verifying VAT numbers | 
Business.gov.nl 
<https://business.gov.nl/regulation/using-checking-vat-numbers/> and 
Netherlands VAT Guide for Global Business - Rates & Compliance Requirements 
(fonoa.com) <https://www.fonoa.com/countries/netherlands>. Your software 
should be able to disambiguate the [Oo0] set at each position in "post".

Almost everyone focuses on *pre*processing (image processing to improve 
tesseract OCR accuracy - Stack Overflow 
<https://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy>
, Improving the quality of the output | tessdoc (tesseract-ocr.github.io) 
<https://tesseract-ocr.github.io/tessdoc/ImproveQuality#still-having-problems>, 
et al), but yours is a very *domain specific* issue: you are trying to read 
non-human text elements, such as VAT codes, for which *OCR postprocessing* 
of the (HOCR or similar) tesseract output might be used. This way one can 
improve OCR-ing of credit card "numbers" and the like: anything that is 
restricted by its format but not well suited for human-language 
dictionary- or Markov-chain-based 'grading'.

I **assume** (I haven't checked the tesseract internals about this) that 
you're getting 'NLOO' instead of 'NL00' because in human languages one would 
expect alphabetic characters to show up in a "word" with far higher 
probability than *numerical digits*. This is why it might help to fiddle 
with the parameters mentioned in the second SO link above, and when that 
doesn't work for you, there's always the blunt 'replace O by 0 (zero) and 
revalidate the VAT number' approach. :-)
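
The blunt "replace O by 0 and revalidate" approach could look roughly like the sketch below. It assumes the OCR output has already been narrowed down to the VAT token itself; the function name `normalize_dutch_vat` is made up, and the check is purely the format template quoted above, not an official validity check of the number.

```python
import re
from typing import Optional

# Format template from the thread: NL + 9 digits + B + 2 digits.
VAT_PATTERN = re.compile(r"NL[0-9]{9}B[0-9]{2}")


def normalize_dutch_vat(raw: str) -> Optional[str]:
    """Map O/o to 0 in the digit positions, then revalidate against the template.

    Returns the corrected code, or None when the result still does not fit,
    so the caller can flag the invoice for human inspection.
    """
    text = re.sub(r"\s+", "", raw).upper()
    if len(text) != 14:
        return None
    # Only positions 2..10 and 12..13 must be digits; leave N, L and B alone.
    digits_a = text[2:11].replace("O", "0")
    digits_b = text[12:].replace("O", "0")
    candidate = text[:2] + digits_a + text[11] + digits_b
    return candidate if VAT_PATTERN.fullmatch(candidate) else None


print(normalize_dutch_vat("NLOO7900000B01"))  # NL007900000B01
print(normalize_dutch_vat("N 007900000B01"))  # None (the L is missing)
```

Restricting the substitution to the digit positions keeps the fix from touching the letters, and a `None` result doubles as the "flag as invalid" signal.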

NEVER assume you'll get a 100% hit rate with OCR, though. The OCR-A and 
OCR-B fonts were once developed to get much closer to that number, but any 
pixels-to-text approach is bound to fail eventually, if only by Murphy's 
Law. There is plenty of opportunity for that along the way in your scenario. 
Caveat emptor.

(On a side note: I have seen quite a few folks trying this same thing 
(OCR-ing invoices and bills) over the last few years. All the companies I 
have worked at or been in contact with require B2B electronic data exchange 
(in TEXT format; often as XML, some as (text-carrying) PDF/A) once they 
process invoices in sufficient numbers to make human handling [deemed] too 
costly, for the above reason: a <100% success rate over time means human 
inspection/correction is required at random time intervals, which is an 
obnoxious process. Curiously, this gap between manual single entry and 
electronic data exchange for invoices apparently still is a business 
opportunity today.)

(On another side note: observe the particular typography used for the 
numerals in the BTW codes as displayed on the Dutch governmental gov.nl web 
site page (linked above): they clearly made a very conscious choice to use 
"lowercase digits" (typography - How to construct "lowercase digits" (i.e. 
text figures)? - Graphic Design Stack Exchange 
<https://graphicdesign.stackexchange.com/questions/8360/how-to-construct-lowercase-digits-i-e-text-figures>) 
for the numeric parts. This, however, would not *solve* the readability 
issue either, as now 'o' (lowercase Oh) and 0 (digit zero) are confused by 
human readers instead. Bottom line: always expect confusion, also by your 
OCR machine, so second-guessing, i.e. "post processing", is in order, if 
only for flagging any "BTW nummer" (VAT code) that turns up as being 
invalid. Heck, you'll need that anyway as a first-line defense against 
(basic) VAT fraud! ;-) )


If you really want to dig further beyond tesseract `-c` parameter tweaking 
& experimentation, there's the source code itself to tweak to provide you 
with a more abundant output, along the lines of 
https://github.com/sirfz/tesserocr/issues/166, 
https://github.com/tesseract-ocr/tesseract/issues/1465, et al. The key 
thought here is: your application of OCR might actually benefit (as it 
requires POST-processing) from being able to feed your postprocessor with a 
*set of choices per character* to help the postprocessor decide what's best 
in your particular (domain specific) case, i.e. almost walk in the other 
direction where "fixing diplopia" took tesseract till today. Would be an 
interesting research project, I'm sure.
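
To illustrate what a postprocessor could do with a *set of choices per character*, here is a small sketch. Stock tesseract output does not hand you such a list in this shape; the input structure, the confidence values and the names `allowed` and `pick_vat` are all hypothetical and only show the selection logic against the VAT template.

```python
from typing import List, Optional, Tuple

Choice = Tuple[str, float]  # (candidate character, confidence); hypothetical input


def allowed(pos: int, ch: str) -> bool:
    """Per-position rule derived from NL + 9 digits + B + 2 digits."""
    fixed = {0: "N", 1: "L", 11: "B"}
    if pos in fixed:
        return ch == fixed[pos]
    return ch in "0123456789"


def pick_vat(choices_per_char: List[List[Choice]]) -> Optional[str]:
    """Pick, per position, the most confident candidate that the template allows."""
    if len(choices_per_char) != 14:
        return None
    picked = []
    for pos, choices in enumerate(choices_per_char):
        valid = [(conf, ch) for ch, conf in choices if allowed(pos, ch)]
        if not valid:
            return None  # nothing fits here: flag for human inspection
        picked.append(max(valid)[1])
    return "".join(picked)


# Made-up alternatives: 'O' wins on raw confidence, but only '0' fits the format.
choices = [[(c, 0.99)] for c in "NLOO7900000B01"]
choices[2] = [("O", 0.62), ("0", 0.37)]
choices[3] = [("O", 0.55), ("0", 0.44)]
print(pick_vat(choices))  # NL007900000B01
```

The point of the design is that the format constraint, not the language model, gets the final say at each position.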



Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------


On Mon, Sep 25, 2023 at 9:46 AM A Nederpelt <powe...@gmail.com> wrote:

Well, the strange effect is that hOCR shows different characters.
"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned" -c hocr_char_boxes=1 hocr
Result: the character 'O' twice, and the rest are '0' (zero).
      <span class='ocrx_word' id='word_1_24' title='bbox 1614 1250 1899 1273; x_wconf 75'>
       <span class='ocrx_cinfo' title='x_bboxes 1614 1250 1636 1273; x_conf 99.020287'>N</span>
       <span class='ocrx_cinfo' title='x_bboxes 1639 1250 1657 1273; x_conf 99.020271'>L</span>
       <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1675 1273; x_conf 98.428726'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 1678 1250 1695 1273; x_conf 98.632645'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 1699 1250 1716 1273; x_conf 98.987907'>7</span>
       <span class='ocrx_cinfo' title='x_bboxes 1719 1250 1736 1273; x_conf 99.028702'>9</span>
       <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1756 1273; x_conf 98.484917'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1760 1250 1777 1273; x_conf 99.03093'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1797 1273; x_conf 98.998169'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1818 1273; x_conf 99.012581'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1822 1250 1839 1273; x_conf 99.038429'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 98.716026'>B</span>
       <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 96.535439'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1899 1273; x_conf 98.847801'>1</span>
      </span>

But in the picture they all look 100% the same, as shown before.

Then I converted the image to black and white, and copy/pasted the 
characters in the PDF (I still see no differences): I copied the character 
marked in red over the characters marked in orange...
"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned - bw 2.jpg" "Lambregts0001 - cleaned - bw 2" -c hocr_char_boxes=1 hocr
      <span class='ocrx_word' id='word_1_23' title='bbox 1614 1249 1900 1274; x_wconf 77'>
       <span class='ocrx_cinfo' title='x_bboxes 1614 1250 1636 1273; x_conf 99.039665'>N</span>
       <span class='ocrx_cinfo' title='x_bboxes 1638 1250 1657 1273; x_conf 99.031548'>L</span>
       <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1676 1273; x_conf 97.601151'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 1679 1250 1697 1273; x_conf 96.843338'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 1699 1249 1716 1273; x_conf 98.95182'>7</span>
       <span class='ocrx_cinfo' title='x_bboxes 1718 1250 1736 1274; x_conf 98.925072'>9</span>
       <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1757 1273; x_conf 98.905106'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1759 1250 1777 1273; x_conf 98.670326'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1798 1273; x_conf 98.658737'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1819 1273; x_conf 99.03775'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1821 1250 1839 1273; x_conf 99.0326'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 98.578423'>B</span>
       <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 98.561943'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1900 1273; x_conf 98.727348'>1</span>
      </span>
[image: Lambregts0001 - cleaned - bw 3.jpg]

The x_conf changes from 98 to 97 & 96

Any ideas ?

On Sunday, 24 September 2023 at 14:18:30 UTC+2, Art Rhyno wrote:

It is not a “super quality” parameter, but one possible approach to 
critical numbers and other types of content where a dictionary is not 
helpful is to target individual characters. Tesseract will provide 
individual characters and probabilities of accuracy for each, either using 
the API or in hocr with "-c hocr_char_boxes=1". With the glyph coordinates 
and something like a range between 90 and 98 percent probability, it might 
be possible to get closer to 99 per cent by extracting individual glyphs 
and using single character recognition (PSM 10). This, of course, adds a 
lot more overhead but it can help with tricky recognition, like 
distinguishing between "O" and "0".
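
As a rough sketch of the first half of this idea (finding the glyphs worth a second look), the snippet below pulls characters, boxes and confidences out of hOCR produced with `-c hocr_char_boxes=1`. It assumes the `ocrx_cinfo` markup exactly as it appears in this thread and uses a plain regex rather than a real HTML parser; the names `Glyph`, `parse_glyphs` and `needs_second_look` and the 98.0 threshold are illustrative only. Cropping each box and re-running it with PSM 10 is left out.

```python
import html
import re
from typing import List, NamedTuple, Tuple


class Glyph(NamedTuple):
    char: str
    bbox: Tuple[int, int, int, int]  # x0, y0, x1, y1 in image pixels
    conf: float


# Matches spans like the ones quoted in this thread; \s+ tolerates wrapped lines.
CINFO = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+);"
    r"\s*x_conf\s+([0-9.]+)'>(.*?)</span>",
    re.S,
)

CONFUSABLE = {"O", "o", "0"}


def parse_glyphs(hocr: str) -> List[Glyph]:
    """Extract per-character boxes and confidences from hOCR text."""
    glyphs = []
    for x0, y0, x1, y1, conf, text in CINFO.findall(hocr):
        bbox = (int(x0), int(y0), int(x1), int(y1))
        glyphs.append(Glyph(html.unescape(text), bbox, float(conf)))
    return glyphs


def needs_second_look(glyph: Glyph, threshold: float = 98.0) -> bool:
    """Flag glyphs that are low-confidence or belong to the O/o/0 set."""
    return glyph.conf < threshold or glyph.char in CONFUSABLE


sample = """
 <span class='ocrx_cinfo' title='x_bboxes 1639 1250 1657 1273; x_conf 
99.020271'>L</span>
 <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1675 1273; x_conf 
98.428726'>O</span>
 <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 
96.535439'>0</span>
"""
flagged = [g for g in parse_glyphs(sample) if needs_second_look(g)]
print([(g.char, g.bbox) for g in flagged])
```

Each flagged `bbox` is the region one would crop from the page image and feed back to tesseract in single-character mode.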

 

art

 

*From:* tesser...@googlegroups.com <tesser...@googlegroups.com> *On Behalf 
Of *A Nederpelt
*Sent:* Friday, September 22, 2023 8:25 AM
*To:* tesseract-ocr <tesser...@googlegroups.com>
*Subject:* Re: [tesseract-ocr] quality of recognition of customer invoices

 

Well, I have approximately 3000 customers at the moment for our software. 
We OCR lots of invoices; for example, one customer processes approx. 10,000 
documents a month. 

So open source is worth it. I want Tesseract, since it is free to use. 

I believe open source is the future.

So, can somebody help me optimize it?

By lots of CPU usage I mean that if some parameter like "super quality" 
needs more CPU, I want to use that parameter.

On Friday, 22 September 2023 at 14:03:53 UTC+2, desal...@gmail.com wrote:

The CPU usage is unusual. I have a pretty old Mac (from 2011) and have been 
running Tesseract on it just fine.

But, as to the accuracy: if your project is limited in scale, the 
commercial tools would definitely perform better for you. If you have 
long-lasting and extensive projects, though, Tesseract is worth spending 
your time on and developing (training) it. 

 

On Friday, September 22, 2023 at 2:50:50 PM UTC+3 powe...@gmail.com wrote:

Well, the problem is: why does it choose

NLOO7900000B01

with the character O twice, while the other zeros are recognized as 0 (zero)?

 

Google vision result: "NL007900000B01"

 

Nuance / OMNIPage: "NL007900000B01"

 

Leadtools demo: "NL007900000B01"

 

I want to use Tesseract, but I guess I need things like a "second pass" or 
"preprocessing", no dictionary, etc.

So I would rather have a CPU usage of 99.99% than super speed.

 

Can somebody help me ?

 

On Friday, 22 September 2023 at 13:25:21 UTC+2, desal...@gmail.com wrote:

Apparently, version 4 doesn't support white listing. 
https://groups.google.com/g/tesseract-ocr/c/IBbQIQpdSpE

That is not good. 

On Friday, September 22, 2023 at 2:23:39 PM UTC+3 Des Bw wrote:

Telling zero and O apart is genuinely hard, even for the human eye. 
Some fonts make it even harder. 

You can try the method used here: 
https://pyimagesearch.com/2021/09/06/whitelisting-and-blacklisting-characters-with-tesseract-and-python/

if that helps. 

On Friday, September 22, 2023 at 9:43:51 AM UTC+3 powe...@gmail.com wrote:

I found the parameters:

"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned.txt" -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 :@."
It is not working: "uw BTW nummer:: NLOO7900000B01"

 

Any other ideas ?

 

On Thursday, 21 September 2023 at 22:25:12 UTC+2, elvi...@gmail.com wrote:

White list the digits so that the O will not confuse it. 

You can also try --psm 13 if all of your texts are single line.

 

On Thu, Sep 21, 2023, 4:07 PM A Nederpelt <powe...@gmail.com> wrote:

Hi.

I am trying to use the Tesseract engine instead of the Nuance engine.

When I run tesseract.exe on the image, it returns a few strange 
characters.

2x OO instead of 00

  "uw BTW nummer:: NLOO7900000B01"

instead of

  "uw BTW nummer:: NL007900000B01"

and

"Tel £01"

instead of

"Tel : 01"

but "Tel : 0168-452452" is recognized ok.

 

I see no room for optimization using 
https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md 
because these are really clean documents.

 

Am I missing some parameters? Like a second run, or a more accurate run, etc.

Maybe I should compile tesseract.exe myself with different, higher-quality 
parameters?

 

Thanks,

Alwin

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an 
email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com
 
<https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com?utm_medium=email&utm_source=footer>
.

