Re: [tesseract-ocr] quality of recognition of customer invoices

A Nederpelt Mon, 25 Sep 2023 00:46:11 -0700

Well, the strange thing is that the hOCR output shows different characters.
"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned" -c hocr_char_boxes=1 hocr
The result contains the letter 'O' twice; the rest are the digit '0' (zero).
      <span class='ocrx_word' id='word_1_24' title='bbox 1614 1250 1899 1273; x_wconf 75'>
       <span class='ocrx_cinfo' title='x_bboxes 1614 1250 1636 1273; x_conf 99.020287'>N</span>
       <span class='ocrx_cinfo' title='x_bboxes 1639 1250 1657 1273; x_conf 99.020271'>L</span>
       <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1675 1273; x_conf 98.428726'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 1678 1250 1695 1273; x_conf 98.632645'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 1699 1250 1716 1273; x_conf 98.987907'>7</span>
       <span class='ocrx_cinfo' title='x_bboxes 1719 1250 1736 1273; x_conf 99.028702'>9</span>
       <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1756 1273; x_conf 98.484917'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1760 1250 1777 1273; x_conf 99.03093'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1797 1273; x_conf 98.998169'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1818 1273; x_conf 99.012581'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1822 1250 1839 1273; x_conf 99.038429'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 98.716026'>B</span>
       <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 96.535439'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1899 1273; x_conf 98.847801'>1</span>
      </span>
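
For anyone who wants to inspect these per-character values programmatically, here is a minimal sketch using only the Python standard library. It assumes the hOCR looks exactly like the output above (single-quoted attributes, `ocrx_cinfo` spans with `x_bboxes` and `x_conf`); the names `Glyph`, `CINFO_RE`, and `parse_glyphs` are made up for this illustration and are not part of Tesseract.

```python
import html
import re
from typing import NamedTuple


class Glyph(NamedTuple):
    char: str
    bbox: tuple   # (left, top, right, bottom) in pixels
    conf: float   # x_conf as reported in the hOCR


# Matches the single-quoted ocrx_cinfo spans shown above. \s+ tolerates the
# line wraps that mail clients insert inside the title attribute.
CINFO_RE = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+);"
    r"\s*x_conf\s+([\d.]+)'>(.*?)</span>",
    re.DOTALL,
)


def parse_glyphs(hocr: str) -> list:
    """Return one Glyph per ocrx_cinfo span, in document order."""
    glyphs = []
    for x0, y0, x1, y1, conf, text in CINFO_RE.findall(hocr):
        bbox = (int(x0), int(y0), int(x1), int(y1))
        glyphs.append(Glyph(html.unescape(text), bbox, float(conf)))
    return glyphs


# Three spans copied from the output above, including a mail-style line wrap.
SAMPLE = """
<span class='ocrx_cinfo' title='x_bboxes 1639 1250 1657 1273; x_conf
99.020271'>L</span>
<span class='ocrx_cinfo' title='x_bboxes 1658 1250 1675 1273; x_conf
98.428726'>O</span>
<span class='ocrx_cinfo' title='x_bboxes 1699 1250 1716 1273; x_conf
98.987907'>7</span>
"""

for glyph in parse_glyphs(SAMPLE):
    print(glyph.char, glyph.bbox, glyph.conf)
```

A regular expression is enough for this fixed layout; hOCR produced by other tools or with other options may need a real HTML parser.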


But in the picture they all look 100% the same, as shown before.

Then I converted the image to black and white and copy/pasted the
characters in the PDF (I still see no differences). I copied the
character marked in red over the ones marked in orange...
"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned - bw 2.jpg" "Lambregts0001 - cleaned - bw 2" -c hocr_char_boxes=1 hocr
      <span class='ocrx_word' id='word_1_23' title='bbox 1614 1249 1900 1274; x_wconf 77'>
       <span class='ocrx_cinfo' title='x_bboxes 1614 1250 1636 1273; x_conf 99.039665'>N</span>
       <span class='ocrx_cinfo' title='x_bboxes 1638 1250 1657 1273; x_conf 99.031548'>L</span>
       <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1676 1273; x_conf 97.601151'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 1679 1250 1697 1273; x_conf 96.843338'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 1699 1249 1716 1273; x_conf 98.95182'>7</span>
       <span class='ocrx_cinfo' title='x_bboxes 1718 1250 1736 1274; x_conf 98.925072'>9</span>
       <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1757 1273; x_conf 98.905106'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1759 1250 1777 1273; x_conf 98.670326'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1798 1273; x_conf 98.658737'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1819 1273; x_conf 99.03775'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1821 1250 1839 1273; x_conf 99.0326'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 98.578423'>B</span>
       <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 98.561943'>0</span>
       <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1900 1273; x_conf 98.727348'>1</span>
      </span>
[image: Lambregts0001 - cleaned - bw 3.jpg]

The x_conf of the two 'O' characters drops from about 98 to 97 and 96.

Any ideas?
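
The suggestion quoted below (pick glyphs whose confidence falls in a range such as 90 to 98 and re-run each one with PSM 10) could be sketched roughly as follows. This is only an outline under stated assumptions: the helper names are hypothetical, the page size of 2480 x 3508 pixels is a made-up example, and the cropping step itself is left out because it depends on whichever image library is used. The `--psm 10` option, the `stdout` output base, and the `tessedit_char_whitelist` variable are existing Tesseract options, although whether the whitelist takes effect depends on the Tesseract version, as discussed further down the thread.

```python
def uncertain_glyphs(glyphs, low=90.0, high=98.0):
    """Keep (char, bbox, conf) triples whose confidence lies in [low, high]."""
    return [g for g in glyphs if low <= g[2] <= high]


def padded_box(bbox, pad, page_width, page_height):
    """Grow a (left, top, right, bottom) box by pad pixels, clamped to the page."""
    left, top, right, bottom = bbox
    return (max(left - pad, 0), max(top - pad, 0),
            min(right + pad, page_width), min(bottom + pad, page_height))


def single_char_command(glyph_image, tesseract="tesseract", whitelist=None):
    """Build a command line that treats the image as a single character."""
    cmd = [tesseract, glyph_image, "stdout", "--psm", "10"]
    if whitelist:
        cmd += ["-c", "tessedit_char_whitelist=" + whitelist]
    return cmd


# (character, bbox, x_conf) triples taken from the second run above.
glyphs = [
    ("O", (1658, 1250, 1676, 1273), 97.601151),
    ("O", (1679, 1250, 1697, 1273), 96.843338),
    ("7", (1699, 1249, 1716, 1273), 98.95182),
]

for i, (char, bbox, conf) in enumerate(uncertain_glyphs(glyphs)):
    # Hypothetical page size; use the real image dimensions in practice.
    box = padded_box(bbox, pad=4, page_width=2480, page_height=3508)
    # Cropping `box` out of the page into glyph_<i>.png is not shown here.
    command = single_char_command(f"glyph_{i}.png", whitelist="0123456789")
    print(char, conf, box, command)
```

Restricting the whitelist to digits only makes sense for positions that are known to be numeric; the resulting command lists can be passed to `subprocess.run` once the glyph images exist.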

On Sunday, 24 September 2023 at 14:18:30 UTC+2, Art Rhyno wrote:

> It is not a “super quality” parameter, but one possible approach to 
> critical numbers and other types of content where a dictionary is not 
> helpful is to target individual characters. Tesseract will provide 
> individual characters and probabilities of accuracy for each, either using 
> the API or in hocr with "-c hocr_char_boxes=1". With the glyph coordinates 
> and something like a range between 90 and 98 percent probability, it might 
> be possible to get closer to 99 per cent by extracting individual glyphs 
> and using single character recognition (PSM 10). This, of course, adds a 
> lot more overhead but it can help with tricky recognition, like 
> distinguishing between "O" and "0".
>
>  
>
> art
>
>  
>
> *From:* tesser...@googlegroups.com <tesser...@googlegroups.com> *On 
> Behalf Of *A Nederpelt
> *Sent:* Friday, September 22, 2023 8:25 AM
> *To:* tesseract-ocr <tesser...@googlegroups.com>
> *Subject:* Re: [tesseract-ocr] quality of recognition of customer invoices
>
>  
>
> Well, I have approximately 3000 customers for our software at the moment.
> We run OCR on lots of invoices; for example, one customer processes
> approx. 10,000 documents a month.
>
> So open source is worth it. I want Tesseract, since it is free to use.
>
> I believe open source is the future.
>
> So, can somebody help me optimize it?
>
> By "lots of CPU usage" I mean that it may use more CPU for some
> parameter like "super quality". I want to use that parameter.
>
> On Friday, 22 September 2023 at 14:03:53 UTC+2, desal...@gmail.com wrote:
>
> The CPU usage is unusual. I have a pretty old Mac (from 2011) and have
> been running Tesseract on it just fine.
>
> As to accuracy: if your project is limited in scale, the commercial
> tools would definitely perform better for you. But if you have
> long-lasting, extensive projects, Tesseract is worth the time spent
> developing (training) it.
>
>  
>
> On Friday, September 22, 2023 at 2:50:50 PM UTC+3 powe...@gmail.com wrote:
>
> Well, the problem is: why does it choose
>
> NLOO7900000B01
>
> with the letter O twice and the digit 0 (zero) five times?
>
>  
>
> Google vision result: "NL007900000B01"
>
>  
>
> Nuance / OMNIPage: "NL007900000B01"
>
>  
>
> Leadtools demo: "NL007900000B01"
>
>  
>
> I want to use Tesseract, but I guess I need things like a "second pass"
> or "preprocessing", no dictionary, etc.
>
> So I would rather have a CPU usage of 99.99% than super speed.
>
> Can somebody help me?
>
>  
>
> On Friday, 22 September 2023 at 13:25:21 UTC+2, desal...@gmail.com wrote:
>
> Apparently, version 4 doesn't support white listing. 
> https://groups.google.com/g/tesseract-ocr/c/IBbQIQpdSpE
>
> That is not good. 
>
> On Friday, September 22, 2023 at 2:23:39 PM UTC+3 Des Bw wrote:
>
> The difference between zero and O is deeply problematic for the human
> eye. Some fonts make it even harder.
>
> You can try the method used here: 
> https://pyimagesearch.com/2021/09/06/whitelisting-and-blacklisting-characters-with-tesseract-and-python/
>
> if that helps. 
>
> On Friday, September 22, 2023 at 9:43:51 AM UTC+3 powe...@gmail.com wrote:
>
> I found the parameters:
>
> "C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned.txt" -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 :@."
>
> It is not working; the result is still "uw BTW nummer:: NLOO7900000B01".
>
> Any other ideas?
>
>  
>
> On Thursday, 21 September 2023 at 22:25:12 UTC+2, elvi...@gmail.com wrote:
>
> Whitelist the digits so that the O will not confuse it.
>
> You can also try --psm 13 if all of your texts are single line.
>
>  
>
> On Thu, Sep 21, 2023, 4:07 PM A Nederpelt <powe...@gmail.com> wrote:
>
> Hi.
>
> I am trying to use the Tesseract engine instead of the Nuance engine.
>
> When I run tesseract.exe on the image, it returns a few strange
> characters.
>
> 2x OO instead of 00
>
>   "uw BTW nummer:: NLOO7900000B01"
>
> instead of
>
>   "uw BTW nummer:: NL007900000B01"
>
> and
>
> "Tel £01"
>
> instead of
>
> "Tel : 01"
>
> but "Tel : 0168-452452" is recognized ok.
>
>  
>
> I see no room for optimization using
> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
> because these are really clean documents.
>
> Am I missing some parameters, like a second run or a more accurate run?
>
> Could I maybe compile tesseract.exe myself with different,
> higher-quality parameters?
>
>  
>
> Thanks,
>
> Alwin
>
> -- 
> You received this message because you are subscribed to the Google Groups 
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to tesseract-oc...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com
>  
> <https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
>
