RE: [tesseract-ocr] quality of recognition of customer invoices

Art Rhyno Mon, 25 Sep 2023 06:14:59 -0700

Not sure I am following; hocr is just an output format, so the results should 
be the same. The trick would be to use the coordinates to extract the glyphs 
for problem characters, like the two Os below, and then use single character 
mode on the resulting images. I put a simple demo of this approach here [1]. 
You would probably want to test whether the approach consistently catches 
problem characters and then use the API to get better performance in production.

art
---
1. https://github.com/OurDigitalWorld/tesschar
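
A minimal sketch of the glyph-extraction step described above, not the code from the demo in [1]. It only computes a padded crop box from an hocr character box and builds a single-character (PSM 10) command line. The helper names, the padding value, the page size, and the "glyph.png" file name are all hypothetical, and the optional whitelist may not be honoured by every Tesseract version, as noted elsewhere in this thread.

```python
def padded_box(bbox, image_size, pad=4):
    """Grow an (x0, y0, x1, y1) box by `pad` pixels, clamped to the image.

    A little margin around the glyph is an assumption of this sketch;
    the right amount would need testing on real invoices.
    """
    x0, y0, x1, y1 = bbox
    width, height = image_size
    return (max(0, x0 - pad), max(0, y0 - pad),
            min(width, x1 + pad), min(height, y1 + pad))


def single_char_command(glyph_path, whitelist=None, exe="tesseract"):
    """Build a tesseract command that treats the image as one character."""
    cmd = [exe, glyph_path, "stdout", "--psm", "10"]
    if whitelist:
        cmd += ["-c", "tessedit_char_whitelist=" + whitelist]
    return cmd


# Box of the first 'O' from the hocr output in this thread; the page size
# (2480 x 3508) is a made-up value for illustration.
box = padded_box((1658, 1250, 1675, 1273), image_size=(2480, 3508))
cmd = single_char_command("glyph.png", whitelist="0123456789")

# With Pillow installed, Image.open(page_path).crop(box).save("glyph.png")
# would write the glyph, and subprocess.run(cmd, capture_output=True,
# text=True).stdout would hold the re-recognized character.
```

Shelling out per glyph is slow, which is why the API is the better fit once the approach has proven itself.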

From: tesseract-ocr@googlegroups.com <tesseract-ocr@googlegroups.com> On Behalf Of A Nederpelt
Sent: Monday, September 25, 2023 3:46 AM
To: tesseract-ocr <tesseract-ocr@googlegroups.com>
Subject: Re: [tesseract-ocr] quality of recognition of customer invoices

Well, the strange thing is that hocr shows different characters.
"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned" -c hocr_char_boxes=1 hocr
The result is the character 'O' twice; the rest are '0' (zero).
      <span class='ocrx_word' id='word_1_24' title='bbox 1614 1250 1899 1273; x_wconf 75'>
       <span class='ocrx_cinfo' title='x_bboxes 1614 1250 1636 1273; x_conf 99.020287'>N</span>
       <span class='ocrx_cinfo' title='x_bboxes 1639 1250 1657 1273; x_conf 99.020271'>L</span>
       <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1675 1273; x_conf 98.428726'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 1678 1250 1695 1273; x_conf 98.632645'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 1699 1250 1716 1273; x_conf 98.987907'>7</span>
       <span class='ocrx_cinfo' title='x_bboxes 1719 1250 1736 1273; x_conf 99.028702'>9</span>
       <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1756 1273; x_conf 98.484917'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1760 1250 1777 1273; x_conf 99.03093'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1797 1273; x_conf 98.998169'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1818 1273; x_conf 99.012581'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1822 1250 1839 1273; x_conf 99.038429'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 98.716026'>B</span>
       <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 96.535439'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1899 1273; x_conf 98.847801'>1</span>
      </span>

But in the picture they all look 100% the same, as shown before.

Then I converted the image to black and white and copy/pasted the characters 
in the PDF (I still see no differences). I copied the character marked in red 
over the characters marked in orange...
"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned - bw 2.jpg" "Lambregts0001 - cleaned - bw 2" -c hocr_char_boxes=1 hocr
      <span class='ocrx_word' id='word_1_23' title='bbox 1614 1249 1900 1274; x_wconf 77'>
       <span class='ocrx_cinfo' title='x_bboxes 1614 1250 1636 1273; x_conf 99.039665'>N</span>
       <span class='ocrx_cinfo' title='x_bboxes 1638 1250 1657 1273; x_conf 99.031548'>L</span>
       <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1676 1273; x_conf 97.601151'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 1679 1250 1697 1273; x_conf 96.843338'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 1699 1249 1716 1273; x_conf 98.95182'>7</span>
       <span class='ocrx_cinfo' title='x_bboxes 1718 1250 1736 1274; x_conf 98.925072'>9</span>
       <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1757 1273; x_conf 98.905106'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1759 1250 1777 1273; x_conf 98.670326'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1798 1273; x_conf 98.658737'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1819 1273; x_conf 99.03775'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1821 1250 1839 1273; x_conf 99.0326'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 98.578423'>B</span>
       <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 98.561943'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1900 1273; x_conf 98.727348'>1</span>
      </span>
[inline image: image001.jpg]

The x_conf of the two 'O' characters drops from about 98 to about 97 and 96.

Any ideas?

On Sunday, September 24, 2023 at 14:18:30 UTC+2 Art Rhyno wrote:
It is not a “super quality” parameter, but one possible approach to critical 
numbers and other types of content where a dictionary is not helpful is to 
target individual characters. Tesseract will provide individual characters and 
probabilities of accuracy for each, either using the API or in hocr with "-c 
hocr_char_boxes=1". With the glyph coordinates and something like a range 
between 90 and 98 percent probability, it might be possible to get closer to 99 
per cent by extracting individual glyphs and using single character recognition 
(PSM 10). This, of course, adds a lot more overhead but it can help with tricky 
recognition, like distinguishing between "O" and "0".
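
As a rough illustration of the selection step described above, the sketch below pulls the per-character boxes and confidences out of hocr text and keeps the characters whose confidence falls inside a band. The function names and the regular expression are assumptions made for this example (a real pipeline might prefer an HTML parser or the API), and the band limits are tunable guesses rather than recommended values.

```python
import re

# One character span, as emitted with "-c hocr_char_boxes=1". The \s+ parts
# tolerate line wrapping inside the title attribute.
CINFO = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+);\s+"
    r"x_conf\s+([\d.]+)'>(.*?)</span>",
    re.S,
)


def parse_chars(hocr):
    """Return a list of (char, confidence, (x0, y0, x1, y1)) tuples."""
    chars = []
    for m in CINFO.finditer(hocr):
        bbox = tuple(int(m.group(i)) for i in range(1, 5))
        chars.append((m.group(6), float(m.group(5)), bbox))
    return chars


def suspects(chars, lo=90.0, hi=98.0):
    """Keep characters whose confidence lies in the band [lo, hi]."""
    return [c for c in chars if lo <= c[1] <= hi]


# A few character spans copied from the hocr output posted in this thread.
SAMPLE = """
<span class='ocrx_cinfo' title='x_bboxes 1639 1250 1657 1273; x_conf 99.020271'>L</span>
<span class='ocrx_cinfo' title='x_bboxes 1658 1250 1675 1273; x_conf 98.428726'>O</span>
<span class='ocrx_cinfo' title='x_bboxes 1678 1250 1695 1273; x_conf 98.632645'>O</span>
<span class='ocrx_cinfo' title='x_bboxes 1699 1250 1716 1273; x_conf 98.987907'>7</span>
<span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 96.535439'>0</span>
"""

chars = parse_chars(SAMPLE)
default_band = suspects(chars)         # 90 to 98
wider_band = suspects(chars, hi=98.7)  # upper limit raised for illustration
```

On these sample spans the default band keeps only the final '0' (96.5), because the two 'O' glyphs score about 98.4 and 98.6. Raising the upper limit to 98.7 catches both of them while still skipping the '7' and the 'L', so the band would have to be tuned against real documents.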

art

From: tesser...@googlegroups.com <tesser...@googlegroups.com> On Behalf Of A Nederpelt
Sent: Friday, September 22, 2023 8:25 AM
To: tesseract-ocr <tesser...@googlegroups.com>
Subject: Re: [tesseract-ocr] quality of recognition of customer invoices

Well, I have approximately 3000 customers for our software at the moment. We 
OCR a lot of invoices; for example, one customer processes approx. 10,000 
documents a month.
So open source is worth it. I want Tesseract, since it is free to use.
I believe open source is the future.

So, can somebody help me optimize it?

By lots of CPU usage I mean that I am fine with it using more CPU for some 
parameter like "super quality". I want to use that parameter.
On Friday, September 22, 2023 at 14:03:53 UTC+2 desal...@gmail.com wrote:
The CPU usage is unusual. I have a pretty old Mac (from 2011) and have been 
running Tesseract on it just fine.
As to accuracy: if your project is limited in scale, the commercial tools 
would definitely perform better for you. But if you have long-lasting, 
extensive projects, Tesseract is worth the time spent developing (training) 
it.

On Friday, September 22, 2023 at 2:50:50 PM UTC+3 powe...@gmail.com wrote:
Well, the problem is why it chooses:
NLOO7900000B01
[https://groups.google.com/group/tesseract-ocr/attach/34576be24307/Lambregts0001%20-%20cleaned%20-%20btwnr.jpg?part=0.1&view=1]
the character O twice and 0 (zero) five times

Google vision result: "NL007900000B01"

Nuance / OMNIPage: "NL007900000B01"

Leadtools demo: "NL007900000B01"

I want to use Tesseract, but I guess I need things like a "second pass" or 
"preprocessing", no dictionary, etc.
So I would rather have a CPU usage of 99.99% than super speed.

Can somebody help me?

On Friday, September 22, 2023 at 13:25:21 UTC+2 desal...@gmail.com wrote:
Apparently, version 4 doesn't support white listing. 
https://groups.google.com/g/tesseract-ocr/c/IBbQIQpdSpE
That is not good.
On Friday, September 22, 2023 at 2:23:39 PM UTC+3 Des Bw wrote:
The difference between zero and O is deeply problematic, even for the human 
eye. Some fonts make it even harder.
You can try the method used here: 
https://pyimagesearch.com/2021/09/06/whitelisting-and-blacklisting-characters-with-tesseract-and-python/
if that helps.
On Friday, September 22, 2023 at 9:43:51 AM UTC+3 powe...@gmail.com wrote:
I found the parameters:
"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned.txt" -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 :@."
It is not working; the result is still "uw BTW nummer:: NLOO7900000B01".

Any other ideas?

On Thursday, September 21, 2023 at 22:25:12 UTC+2 elvi...@gmail.com wrote:
White list the digits so that the O will not confuse it.
You can also try --psm 13 if all of your texts are single line.

On Thu, Sep 21, 2023, 4:07 PM A Nederpelt <powe...@gmail.com> wrote:
Hi.
I am trying to use the Tesseract engine instead of the Nuance engine.
When I currently run tesseract.exe on the image, it returns a few strange 
characters.
Twice the letter O ("OO") instead of two zeros ("00"):
  "uw BTW nummer:: NLOO7900000B01"
instead of
  "uw BTW nummer:: NL007900000B01"
and
"Tel £01"
instead of
"Tel : 01"
but "Tel : 0168-452452" is recognized ok.

I see no room for optimization using 
https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md because 
these are really clean documents.

Am I missing some parameters? Like a second run, or a more accurate run, etc.
Maybe compile tesseract.exe myself with different, higher-quality parameters?

Thanks,
Alwin
--
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com<https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com?utm_medium=email&utm_source=footer>.
