After removing the "L" from the picture, everything is OCR'd as zero: "uw BTW nummer:: N 007900000B01"

[image: Lambregts0001 - cleaned - bw 2a.jpg]

Is there some "extra" feature of Tesseract that I should disable? It looks as if Tesseract "thinks": "Is this character a zero or an 'O'? Hmm, there is text before it, so it is more likely an 'O' than a zero."

Looking at the "x_conf" entries:

2 boxes:

<span class='ocrx_word' id='word_1_23' title='bbox 1625 1250 1636 1273; x_wconf 90'>
 <span class='ocrx_cinfo' title='x_bboxes 1625 1250 1636 1273; x_conf 98.790306'>N</span>
</span>
<span class='ocrx_word' id='word_1_24' title='bbox 1658 1249 1900 1274; x_wconf 90'>
 <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1676 1273; x_conf 98.878639'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1679 1250 1697 1273; x_conf 99.001717'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1699 1249 1716 1273; x_conf 99.028969'>7</span>
 <span class='ocrx_cinfo' title='x_bboxes 1718 1250 1736 1274; x_conf 98.723213'>9</span>
 <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1757 1273; x_conf 98.880974'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1759 1250 1777 1273; x_conf 99.03817'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1798 1273; x_conf 99.012474'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1819 1273; x_conf 98.878563'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1821 1250 1839 1273; x_conf 98.819069'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 99.029968'>B</span>
 <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 98.80172'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1900 1273; x_conf 98.938919'>1</span>
</span>

1 box (previous mail):

<span class='ocrx_word' id='word_1_24' title='bbox 1614 1250 1899 1273; x_wconf 75'>
 <span class='ocrx_cinfo' title='x_bboxes 1614 1250 1636 1273; x_conf 99.020287'>N</span>
 <span class='ocrx_cinfo' title='x_bboxes 1639 1250 1657 1273; x_conf 99.020271'>L</span>
 <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1675 1273; x_conf 98.428726'>O</span>
 <span class='ocrx_cinfo' title='x_bboxes 1678 1250 1695 1273; x_conf 98.632645'>O</span>
 <span class='ocrx_cinfo' title='x_bboxes 1699 1250 1716 1273; x_conf 98.987907'>7</span>
 <span class='ocrx_cinfo' title='x_bboxes 1719 1250 1736 1273; x_conf 99.028702'>9</span>
 <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1756 1273; x_conf 98.484917'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1760 1250 1777 1273; x_conf 99.03093'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1797 1273; x_conf 98.998169'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1818 1273; x_conf 99.012581'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1822 1250 1839 1273; x_conf 99.038429'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 98.716026'>B</span>
 <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 96.535439'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1899 1273; x_conf 98.847801'>1</span>
</span>

As for the VAT number, changing it afterwards is no problem at all, but I want the OCR engine itself to recognize it correctly. The quality of the picture is 100%. I am looking for settings, and don't want to dig into the source code.

So has anybody got any ideas?

On Monday, 25 September 2023 at 12:58:31 UTC+2, g...@hobbelt.com wrote:

One more thing to test is disabling the dictionary used to help evaluate character recognition.
I haven't checked this myself, so YMMV -- I know the behaviour is quite different for the classic tesseract (v3) engine and the LSTM engine (added in v4/v5); IIRC the `dawg` parameters, for example, only affect the v3 engine. (Which reminds me: you might want to check the v3 (`--oem 0` IIRC) output as well as the LSTM output for BTW/VAT numbers specifically.)

Anyway:

python - Pytesseract disable dictionary (multiple configuration arguments) - Stack Overflow <https://stackoverflow.com/questions/71566701/pytesseract-disable-dictionary-multiple-configuration-arguments>

Extra parameters are mentioned here:

Disable dictionary-assisted OCR in tesseract C++ API - Stack Overflow <https://stackoverflow.com/questions/33005215/disable-dictionary-assisted-ocr-in-tesseract-c-api>

These parameters can be set via the CLI by using the `-c <PARAM>=<VALUE>` command line option, one param per `-c`, hence like this: `-c PARAM1=0 -c PARAM2=0 ...`

However, keep in mind the very(!) valid remark of Zdenko: 'O', 'o' and '0' are already *hard* to differentiate for humans, and classically this was solved by printing zeroes with a slash (Slashed zero - Wikipedia <https://en.wikipedia.org/wiki/Slashed_zero>), so DO NOT expect perfect results at any time.

Your Dutch VAT ("BTW") codes can (and should!)
be validated/corrected-in-post as they follow a very specific template (/NL[0-9]{9}B[0-9]{2}/) (Using and verifying VAT numbers | Business.gov.nl <https://business.gov.nl/regulation/using-checking-vat-numbers/> & Netherlands VAT Guide for Global Business - Rates & Compliance Requirements (fonoa.com) <https://www.fonoa.com/countries/netherlands>), and your software should be able to disambiguate the [Oo0] set at each position in "post".

While almost everyone is focusing on *pre*processing (image processing to improve tesseract OCR accuracy - Stack Overflow <https://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy>, Improving the quality of the output | tessdoc (tesseract-ocr.github.io) <https://tesseract-ocr.github.io/tessdoc/ImproveQuality#still-having-problems>, et al), yours is a very *domain specific* issue: you are trying to read non-human text elements, such as VAT codes, for which *OCR postprocessing* of the (HOCR or similar) tesseract output might be used. This way one can improve OCR-ing credit card "numbers" and the like: anything that's restricted by its format but not well suited for human-language dictionary- or Markov-chain-based 'grading'.

(I **assume** (haven't checked tesseract internals about this) you're getting 'NLOO' instead of 'NL00' because in human languages one would expect alphabetic characters to show up in a "word" with far higher probability than *numerical digits* -- this is why it might help to fiddle with the parameters mentioned in the second SO link above -- and when that doesn't work for you, there's always the blunt 'replace O by 0 (zero) and revalidate VAT number' approach. :-) )

NEVER assume you'll get a 100% hit rate with OCR, though. OCR-A and OCR-B fonts were once developed to get much closer to that number, but ultimately any pixels-to-text approach is bound to fail at some point, if only by Murphy's Law.
Plenty of opportunity for that one along the way in your scenario. Caveat emptor.

(On a side note: I have seen quite a few folks trying this same thing (OCR-ing invoices and bills) over the last few years. All the companies I have worked at or been in contact with require B2B electronic data exchange (in TEXT format; often as XML, some as (text-carrying) PDF/A) once they process invoices in sufficient numbers to make human handling [deemed] too costly, for the above reason: <100% success rate over time, ergo human inspection/correction required at random time intervals --> obnoxious process. Curiously, this gap between manual single entry and electronic data exchange for invoices apparently still is a business opportunity today.)

(On another side note: observe the particular typography used for the numerals in the BTW codes as displayed on the Dutch governmental gov.nl web page (linked above): they clearly made a very conscious choice to use "lowercase digits" (typography - How to construct "lowercase digits" (i.e. text figures)? - Graphic Design Stack Exchange <https://graphicdesign.stackexchange.com/questions/8360/how-to-construct-lowercase-digits-i-e-text-figures>) for the numeric parts. This, however, would not *solve* the readability issue either, as now 'o' (lowercase Oh) and 0 (digit zero) are confused by human readers instead. Bottom line: always expect confusion, also by your OCR machine, so second-guessing, i.e. "post processing", is in order, if only for flagging any "BTW nummer" (VAT code) that turns up as being invalid. Heck, you'll need that anyway as a first-line defense against (basic) VAT fraud! ;-) )

If you really want to dig further, beyond tesseract `-c` parameter tweaking & experimentation, there's the source code itself to tweak to provide you with more abundant output, along the lines of https://github.com/sirfz/tesserocr/issues/166, https://github.com/tesseract-ocr/tesseract/issues/1465, et al.
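
The blunt "replace O by 0 and revalidate" post-processing described above could look roughly like the sketch below. It only checks the /NL[0-9]{9}B[0-9]{2}/ template quoted in this message; the function name `normalize_dutch_vat` and the decision to return `None` for anything that cannot be repaired are illustrative assumptions, and no checksum or registry lookup is attempted.

```python
import re
from typing import Optional

# Template quoted in the thread: "NL", nine digits, "B", two digits.
VAT_TEMPLATE = re.compile(r"NL[0-9]{9}B[0-9]{2}")

# Look-alike substitution applied ONLY at positions that must be digits.
# Only the O -> 0 confusion comes from the thread; extend with care.
DIGIT_LOOKALIKES = str.maketrans({"O": "0"})


def normalize_dutch_vat(raw: str) -> Optional[str]:
    """Return a template-conforming VAT code, or None if it cannot be repaired.

    Hypothetical helper: it fixes O/0 confusion at the digit positions and
    then re-validates against the template. A None result means the value
    should be flagged for human inspection.
    """
    text = "".join(raw.split()).upper()  # drop whitespace, unify case
    if len(text) != 14:
        return None
    candidate = (
        text[:2]                                  # "NL" prefix, left untouched
        + text[2:11].translate(DIGIT_LOOKALIKES)  # nine digit positions
        + text[11]                                # "B" separator, left untouched
        + text[12:].translate(DIGIT_LOOKALIKES)   # two digit positions
    )
    return candidate if VAT_TEMPLATE.fullmatch(candidate) else None


if __name__ == "__main__":
    print(normalize_dutch_vat("NLOO7900000B01"))  # repaired
    print(normalize_dutch_vat("N 007900000B01"))  # cannot be repaired
```

Restricting the substitution to the digit positions keeps the letters in "NL" and "B" intact, and anything that still fails the template afterwards is a candidate for manual review rather than silent correction.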
The key thought here is: your application of OCR might actually benefit (as it requires POST-processing) from being able to feed your postprocessor with a *set of choices per character* to help the postprocessor decide what's best in your particular (domain specific) case, i.e. almost walking in the opposite direction from where "fixing diplopia" has taken tesseract so far. Would be an interesting research project, I'm sure.

Best regards,

Ger Hobbelt

--------------------------------------------------
web: http://www.hobbelt.com/
 http://www.hebbut.net/
mail: g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------

On Mon, Sep 25, 2023 at 9:46 AM A Nederpelt <powe...@gmail.com> wrote:

Well, the strange effect is that the hOCR output shows different characters.

"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned" -c hocr_char_boxes=1 hocr

Result: 2 times the character 'O' and the rest is '0' (zero).

<span class='ocrx_word' id='word_1_24' title='bbox 1614 1250 1899 1273; x_wconf 75'>
 <span class='ocrx_cinfo' title='x_bboxes 1614 1250 1636 1273; x_conf 99.020287'>N</span>
 <span class='ocrx_cinfo' title='x_bboxes 1639 1250 1657 1273; x_conf 99.020271'>L</span>
 <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1675 1273; x_conf 98.428726'>O</span>
 <span class='ocrx_cinfo' title='x_bboxes 1678 1250 1695 1273; x_conf 98.632645'>O</span>
 <span class='ocrx_cinfo' title='x_bboxes 1699 1250 1716 1273; x_conf 98.987907'>7</span>
 <span class='ocrx_cinfo' title='x_bboxes 1719 1250 1736 1273; x_conf 99.028702'>9</span>
 <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1756 1273; x_conf 98.484917'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1760 1250 1777 1273; x_conf 99.03093'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1797 1273; x_conf 98.998169'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1818 1273; x_conf 99.012581'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1822 1250 1839 1273; x_conf 99.038429'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 98.716026'>B</span>
 <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 96.535439'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1899 1273; x_conf 98.847801'>1</span>
</span>

But in the picture they all look 100% the same, as shown before.

Then I converted the picture to black and white and copy/pasted the characters in the PDF (I still see no differences). I copied the character marked in red over the characters marked in orange...

"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned - bw 2.jpg" "Lambregts0001 - cleaned - bw 2" -c hocr_char_boxes=1 hocr

<span class='ocrx_word' id='word_1_23' title='bbox 1614 1249 1900 1274; x_wconf 77'>
 <span class='ocrx_cinfo' title='x_bboxes 1614 1250 1636 1273; x_conf 99.039665'>N</span>
 <span class='ocrx_cinfo' title='x_bboxes 1638 1250 1657 1273; x_conf 99.031548'>L</span>
 <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1676 1273; x_conf 97.601151'>O</span>
 <span class='ocrx_cinfo' title='x_bboxes 1679 1250 1697 1273; x_conf 96.843338'>O</span>
 <span class='ocrx_cinfo' title='x_bboxes 1699 1249 1716 1273; x_conf 98.95182'>7</span>
 <span class='ocrx_cinfo' title='x_bboxes 1718 1250 1736 1274; x_conf 98.925072'>9</span>
 <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1757 1273; x_conf 98.905106'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1759 1250 1777 1273; x_conf 98.670326'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1798 1273; x_conf 98.658737'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1819 1273; x_conf 99.03775'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1821 1250 1839 1273; x_conf 99.0326'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 98.578423'>B</span>
 <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 98.561943'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1900 1273; x_conf 98.727348'>1</span>
</span>

[image: Lambregts0001 - cleaned - bw 3.jpg]

The x_conf of the two 'O' characters changes from 98 to 97 & 96.

Any ideas?

On Sunday, 24 September 2023 at 14:18:30 UTC+2, Art Rhyno wrote:

It is not a “super quality” parameter, but one possible approach to critical numbers and other types of content where a dictionary is not helpful is to target individual characters. Tesseract will provide individual characters and a confidence value for each, either through the API or in hOCR with "-c hocr_char_boxes=1". With the glyph coordinates, and for characters whose confidence falls in something like the 90 to 98 percent range, it might be possible to get closer to 99 percent by extracting the individual glyphs and running single-character recognition (PSM 10) on them. This, of course, adds a lot more overhead, but it can help with tricky recognition, like distinguishing between "O" and "0".

art

*From:* tesser...@googlegroups.com <tesser...@googlegroups.com> *On Behalf Of* A Nederpelt
*Sent:* Friday, September 22, 2023 8:25 AM
*To:* tesseract-ocr <tesser...@googlegroups.com>
*Subject:* Re: [tesseract-ocr] quality of recognition of customer invoices

Well, I have approximately 3000 customers for our software at the moment. We OCR lots of invoices, e.g. one customer processes approx. 10,000 documents a month. So open source is worth it. I want Tesseract, since it is free to use. I believe open source is the future.

So, can somebody help me optimize it?

By "lots of CPU usage" I mean: if some parameter like "super quality" needs more CPU, I want to use that parameter.

On Friday, 22 September 2023 at 14:03:53 UTC+2, desal...@gmail.com wrote:

The CPU usage is unusual. I have a pretty old Mac (from 2011) and have been running Tesseract on it just fine. But, as to the accuracy: if your project is limited in scale, the commercial tools would definitely perform better for you.
But if you have long-lasting and extensive projects, Tesseract is worth spending your time on and developing (training) it.

On Friday, September 22, 2023 at 2:50:50 PM UTC+3 powe...@gmail.com wrote:

Well, the question is why it chooses:

NLOO7900000B01

2 times the character O and 5 times a 0 (zero).

Google vision result: "NL007900000B01"
Nuance / OMNIPage: "NL007900000B01"
Leadtools demo: "NL007900000B01"

I want to use Tesseract, but I guess I need things like a "second pass" or "preprocessing", no dictionary, etc. So I would rather have a CPU usage of 99.99% than super speed.

Can somebody help me?

On Friday, 22 September 2023 at 13:25:21 UTC+2, desal...@gmail.com wrote:

Apparently, version 4 doesn't support whitelisting.
https://groups.google.com/g/tesseract-ocr/c/IBbQIQpdSpE
That is not good.

On Friday, September 22, 2023 at 2:23:39 PM UTC+3 Des Bw wrote:

The difference between zero and O is deeply problematic for the human eye. Some fonts make it even harder. You can try the method used here, if that helps: https://pyimagesearch.com/2021/09/06/whitelisting-and-blacklisting-characters-with-tesseract-and-python/

On Friday, September 22, 2023 at 9:43:51 AM UTC+3 powe...@gmail.com wrote:

I found the parameter:

"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned.txt" -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 :@."

It is not working: "uw BTW nummer:: NLOO7900000B01"

Any other ideas?

On Thursday, 21 September 2023 at 22:25:12 UTC+2, elvi...@gmail.com wrote:

Whitelist the digits so that the O will not confuse it. You can also try --psm 13 if all of your texts are a single line.

On Thu, Sep 21, 2023, 4:07 PM A Nederpelt <powe...@gmail.com> wrote:

Hi. I am trying to use the Tesseract engine instead of the Nuance engine. When I currently run tesseract.exe on the image, it returns a few strange characters.
2x OO instead of 00: "uw BTW nummer:: NLOO7900000B01" instead of "uw BTW nummer:: NL007900000B01", and "Tel £01" instead of "Tel : 01", but "Tel : 0168-452452" is recognized OK.

I see nothing to optimize using https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md because these are really clean documents.

Am I missing some parameters? Like a second run, or a more accurate run, etc. Should I maybe compile tesseract.exe myself with different, higher-quality parameters?

Thanks,
Alwin

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com.
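
Art Rhyno's suggestion earlier in the thread (use the per-character boxes and confidences from "-c hocr_char_boxes=1" to pick glyphs for a second look) can be sketched as follows. This is a minimal illustration, not Tesseract's API: the `Glyph` type, the function names and the 98.0 cut-off are assumptions, and the regular expression only matches the single-quoted `ocrx_cinfo` spans exactly as they appear in the hOCR excerpts quoted in this thread. Cropping each flagged box and re-running it with single-character recognition (PSM 10) is a separate step that is not shown.

```python
import re
from typing import List, NamedTuple, Tuple


class Glyph(NamedTuple):
    """One recognized character from an hOCR ocrx_cinfo span (hypothetical type)."""
    char: str
    bbox: Tuple[int, int, int, int]  # x0, y0, x1, y1 in image pixels
    conf: float                      # x_conf value as reported in the hOCR


# Matches spans in the format shown in the thread's hOCR excerpts.
CINFO = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([0-9.]+)'>(.*?)</span>"
)


def parse_glyphs(hocr: str) -> List[Glyph]:
    """Extract every character, its box and its confidence from hOCR text."""
    return [
        Glyph(
            char=m.group(6),
            bbox=(int(m.group(1)), int(m.group(2)), int(m.group(3)), int(m.group(4))),
            conf=float(m.group(5)),
        )
        for m in CINFO.finditer(hocr)
    ]


def recheck_candidates(
    glyphs: List[Glyph], ambiguous: str = "Oo0", max_conf: float = 98.0
) -> List[Glyph]:
    """Select glyphs worth a second pass.

    A glyph is selected when it belongs to the look-alike set or when its
    confidence is below max_conf (an assumed cut-off, tune it on real data).
    """
    return [g for g in glyphs if g.char in ambiguous or g.conf < max_conf]
```

In the excerpts above the wrongly read 'O' glyphs score about as high as the correct zeros, which is why the sketch selects by look-alike set as well as by confidence.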
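
The "one param per `-c`" advice from Ger Hobbelt's message can be turned into a small command builder, sketched below. The helper name `tesseract_argv` is hypothetical. `hocr_char_boxes`, `--psm` and `--oem` appear in this thread; `load_system_dawg` and `load_freq_dawg` are given here as examples of the dictionary-related `dawg` parameters and should be checked against the Stack Overflow answers linked above. Whether they change anything for the LSTM engine is, as noted in the thread, untested.

```python
from typing import Dict, List, Optional


def tesseract_argv(
    image: str,
    output_base: str,
    params: Dict[str, str],
    psm: Optional[int] = None,
    oem: Optional[int] = None,
    config: Optional[str] = "hocr",
    exe: str = "tesseract",
) -> List[str]:
    """Build a tesseract command line with one `-c NAME=VALUE` pair per parameter.

    Hypothetical helper: it only assembles the argument list, it does not
    run anything. The optional config file name goes last, as in the
    commands quoted in the thread.
    """
    argv = [exe, image, output_base]
    if oem is not None:
        argv += ["--oem", str(oem)]
    if psm is not None:
        argv += ["--psm", str(psm)]
    for name, value in params.items():
        argv += ["-c", f"{name}={value}"]
    if config:
        argv.append(config)
    return argv


# Example: dictionary-related parameters off (assumed names), per-character hOCR on.
example = tesseract_argv(
    "Lambregts0001 - cleaned.jpg",
    "Lambregts0001 - cleaned",
    {"load_system_dawg": "0", "load_freq_dawg": "0", "hocr_char_boxes": "1"},
)
```

The resulting list can be passed to `subprocess.run(example, check=True)`; using a list instead of a single string avoids quoting problems with the spaces in the file names.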