Well, the strange effect is that the hOCR output shows different characters.

"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned" -c hocr_char_boxes=1 hocr

The hOCR result contains the character 'O' twice; the rest are '0' (zero):

<span class='ocrx_word' id='word_1_24' title='bbox 1614 1250 1899 1273; x_wconf 75'>
 <span class='ocrx_cinfo' title='x_bboxes 1614 1250 1636 1273; x_conf 99.020287'>N</span>
 <span class='ocrx_cinfo' title='x_bboxes 1639 1250 1657 1273; x_conf 99.020271'>L</span>
 <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1675 1273; x_conf 98.428726'>O</span>
 <span class='ocrx_cinfo' title='x_bboxes 1678 1250 1695 1273; x_conf 98.632645'>O</span>
 <span class='ocrx_cinfo' title='x_bboxes 1699 1250 1716 1273; x_conf 98.987907'>7</span>
 <span class='ocrx_cinfo' title='x_bboxes 1719 1250 1736 1273; x_conf 99.028702'>9</span>
 <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1756 1273; x_conf 98.484917'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1760 1250 1777 1273; x_conf 99.03093'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1797 1273; x_conf 98.998169'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1818 1273; x_conf 99.012581'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1822 1250 1839 1273; x_conf 99.038429'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 98.716026'>B</span>
 <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 96.535439'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1899 1273; x_conf 98.847801'>1</span>
</span>
But in the picture they all look 100% the same, as shown before. I then converted the picture to black and white and copy/pasted the characters on the PDF (I still see no differences). I copied the character marked in red over the characters marked in orange...

"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned - bw 2.jpg" "Lambregts0001 - cleaned - bw 2" -c hocr_char_boxes=1 hocr

<span class='ocrx_word' id='word_1_23' title='bbox 1614 1249 1900 1274; x_wconf 77'>
 <span class='ocrx_cinfo' title='x_bboxes 1614 1250 1636 1273; x_conf 99.039665'>N</span>
 <span class='ocrx_cinfo' title='x_bboxes 1638 1250 1657 1273; x_conf 99.031548'>L</span>
 <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1676 1273; x_conf 97.601151'>O</span>
 <span class='ocrx_cinfo' title='x_bboxes 1679 1250 1697 1273; x_conf 96.843338'>O</span>
 <span class='ocrx_cinfo' title='x_bboxes 1699 1249 1716 1273; x_conf 98.95182'>7</span>
 <span class='ocrx_cinfo' title='x_bboxes 1718 1250 1736 1274; x_conf 98.925072'>9</span>
 <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1757 1273; x_conf 98.905106'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1759 1250 1777 1273; x_conf 98.670326'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1798 1273; x_conf 98.658737'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1819 1273; x_conf 99.03775'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1821 1250 1839 1273; x_conf 99.0326'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 98.578423'>B</span>
 <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 98.561943'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1900 1273; x_conf 98.727348'>1</span>
</span>

[image: Lambregts0001 - cleaned - bw 3.jpg]

The x_conf of the two 'O' characters changes from 98 to 97 and 96.

Any ideas?
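A minimal sketch of how the per-character hOCR output above could be read programmatically, assuming the `ocrx_cinfo` spans look exactly like the ones pasted in this message. The function names (`parse_char_boxes`, `by_confidence`, `by_ambiguity`) and the thresholds are illustrative choices, not part of Tesseract. Note that a confidence range alone may not be enough: in the first run the two misread 'O' glyphs scored above 98, so the sketch also offers a selection by ambiguous character.

```python
import html
import re

# Matches one character entry as written with `-c hocr_char_boxes=1`
# (layout assumed from the output pasted in this thread).
CINFO = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([0-9.]+)'>(.*?)</span>"
)


def parse_char_boxes(hocr):
    """Return a list of (char, (x0, y0, x1, y1), confidence) tuples."""
    return [
        (
            html.unescape(m.group(6)),
            tuple(int(m.group(i)) for i in range(1, 5)),
            float(m.group(5)),
        )
        for m in CINFO.finditer(hocr)
    ]


def by_confidence(chars, low=90.0, high=98.0):
    """Indices of characters whose confidence lies in [low, high)."""
    return [i for i, (_ch, _bbox, conf) in enumerate(chars) if low <= conf < high]


def by_ambiguity(chars, ambiguous=frozenset("O0")):
    """Indices of characters that belong to an easily confused set."""
    return [i for i, (ch, _bbox, _conf) in enumerate(chars) if ch in ambiguous]


# First seven characters of the black-and-white run quoted above.
SAMPLE = """
<span class='ocrx_cinfo' title='x_bboxes 1614 1250 1636 1273; x_conf 99.039665'>N</span>
<span class='ocrx_cinfo' title='x_bboxes 1638 1250 1657 1273; x_conf 99.031548'>L</span>
<span class='ocrx_cinfo' title='x_bboxes 1658 1250 1676 1273; x_conf 97.601151'>O</span>
<span class='ocrx_cinfo' title='x_bboxes 1679 1250 1697 1273; x_conf 96.843338'>O</span>
<span class='ocrx_cinfo' title='x_bboxes 1699 1249 1716 1273; x_conf 98.95182'>7</span>
<span class='ocrx_cinfo' title='x_bboxes 1718 1250 1736 1274; x_conf 98.925072'>9</span>
<span class='ocrx_cinfo' title='x_bboxes 1739 1250 1757 1273; x_conf 98.905106'>0</span>
"""

chars = parse_char_boxes(SAMPLE)
print(by_confidence(chars))  # [2, 3]
print(by_ambiguity(chars))   # [2, 3, 6]
```

The bounding box of each selected index could then be cropped from the image and recognized again as a single character (`--psm 10`); that step is left out here because it depends on the imaging library in use.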
On Sunday, 24 September 2023 at 14:18:30 UTC+2, Art Rhyno wrote:

> It is not a “super quality” parameter, but one possible approach to critical numbers and other types of content where a dictionary is not helpful is to target individual characters. Tesseract will provide individual characters and probabilities of accuracy for each, either using the API or in hOCR with "-c hocr_char_boxes=1". With the glyph coordinates and something like a range between 90 and 98 percent probability, it might be possible to get closer to 99 percent by extracting individual glyphs and using single character recognition (PSM 10). This, of course, adds a lot more overhead, but it can help with tricky recognition, like distinguishing between "O" and "0".
>
> art
>
> From: tesser...@googlegroups.com <tesser...@googlegroups.com> On Behalf Of A Nederpelt
> Sent: Friday, September 22, 2023 8:25 AM
> To: tesseract-ocr <tesser...@googlegroups.com>
> Subject: Re: [tesseract-ocr] quality of recognition of customer invoices
>
> Well, I have approximately 3000 customers at the moment for our software. We run OCR on lots of invoices, i.e. one customer processes approx. 10,000 documents a month.
>
> So open source is worth it. I want Tesseract, since it is free to use.
>
> I believe open source is the future.
>
> So, can somebody help me optimize it?
>
> By lots of CPU usage I mean that if some parameter like "super quality" needs more CPU, I want to use that parameter.
>
> On Friday, 22 September 2023 at 14:03:53 UTC+2, desal...@gmail.com wrote:
>
> The CPU usage is unusual. I have a pretty old Mac (from 2011) and have been running Tesseract on it just fine.
>
> But as to the accuracy: if your project is limited in scale, the commercial tools would definitely perform better for you. If you have long-lasting and extensive projects, Tesseract is worth the time spent on developing (training) it.
>
> On Friday, September 22, 2023 at 2:50:50 PM UTC+3 powe...@gmail.com wrote:
>
> Well, the problem is why it chooses:
>
> NLOO7900000B01
>
> 2 times the character O and 5 times a 0 (zero)
>
> Google vision result: "NL007900000B01"
>
> Nuance / OMNIPage: "NL007900000B01"
>
> Leadtools demo: "NL007900000B01"
>
> I want to use Tesseract, but I guess I need things like a "second pass" or "preprocessing", no dictionary, etc.
>
> So I would rather have a CPU usage of 99.99% than super speed.
>
> Can somebody help me?
>
> On Friday, 22 September 2023 at 13:25:21 UTC+2, desal...@gmail.com wrote:
>
> Apparently, version 4 doesn't support whitelisting.
> https://groups.google.com/g/tesseract-ocr/c/IBbQIQpdSpE
>
> That is not good.
>
> On Friday, September 22, 2023 at 2:23:39 PM UTC+3 Des Bw wrote:
>
> The difference between zero and O is deeply problematic, even for the human eye. Some fonts make it even harder.
>
> You can try the method used here:
> https://pyimagesearch.com/2021/09/06/whitelisting-and-blacklisting-characters-with-tesseract-and-python/
>
> if that helps.
>
> On Friday, September 22, 2023 at 9:43:51 AM UTC+3 powe...@gmail.com wrote:
>
> I found the parameters:
>
> "C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned.txt" -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 :@."
>
> It is not working; the result is still "uw BTW nummer:: NLOO7900000B01".
>
> Any other ideas?
>
> On Thursday, 21 September 2023 at 22:25:12 UTC+2, elvi...@gmail.com wrote:
>
> Whitelist the digits so that the O will not confuse it.
>
> You can also try --psm 13 if all of your texts are single lines.
>
> On Thu, Sep 21, 2023, 4:07 PM A Nederpelt <powe...@gmail.com> wrote:
>
> Hi.
>
> I am trying to use the Tesseract engine instead of the Nuance engine.
>
> When I currently run tesseract.exe on the image, it returns a few strange characters.
>
> Twice OO instead of 00:
>
> "uw BTW nummer:: NLOO7900000B01"
>
> instead of
>
> "uw BTW nummer:: NL007900000B01"
>
> and
>
> "Tel £01"
>
> instead of
>
> "Tel : 01"
>
> but "Tel : 0168-452452" is recognized OK.
>
> I see no room for optimization using
> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
> because these are really clean documents.
>
> Am I missing some parameters? Like a second run, or a more accurate run, etc.
>
> Maybe I should compile tesseract.exe myself with different, higher-quality parameters?
>
> Thanks,
>
> Alwin
>
> --
> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com
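For the specific field discussed in this thread, one workaround outside Tesseract is to post-correct the recognized string against the known layout of a Dutch VAT number ("NL", nine digits, "B", two digits). This is a swapped-in technique, not something proposed by the participants, and the function name `normalize_nl_vat` is hypothetical. A minimal sketch, assuming the only confusion to repair is the letter "O" read in place of the digit "0":

```python
import re

# Layout of a Dutch VAT number: "NL" + 9 digits + "B" + 2 digits.
VAT_NL = re.compile(r"NL\d{9}B\d{2}")


def normalize_nl_vat(text):
    """Return the corrected VAT number, or None if the layout does not fit.

    Only positions that must hold digits are touched, so the letters in
    "NL" and the "B" separator are never rewritten.
    """
    s = text.strip().upper()
    if len(s) != 14 or not s.startswith("NL") or s[11] != "B":
        return None
    # Assumption: an "O" in a digit position is a misread zero.
    digits = s[2:11].replace("O", "0")
    suffix = s[12:].replace("O", "0")
    fixed = "NL" + digits + "B" + suffix
    return fixed if VAT_NL.fullmatch(fixed) else None


print(normalize_nl_vat("NLOO7900000B01"))  # NL007900000B01
```

A correction like this only applies once the field has been located on the invoice; it does not change what Tesseract recognizes elsewhere on the page.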