I'm not sure I follow: hocr is just an output format, so the recognition results should be the same. The trick would be to use the coordinates to extract the glyphs for problem characters, like the two Os below, and then run single character mode (PSM 10) on the resulting images. I put a simple demo of this approach here [1]. You would probably want to test whether the approach consistently catches problem characters, and then use the API to get better performance in production.
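A minimal sketch of the idea in Python, not the code from the demo in [1]. It parses the `ocrx_cinfo` spans that `-c hocr_char_boxes=1` writes, picks out glyphs worth a second look, and re-reads one cropped glyph with PSM 10. The names `Glyph`, `parse_glyphs`, `find_suspects` and `reread_glyph` are hypothetical, and so are the selection rule (the letter "O", or x_conf below 97) and the padding. The last function assumes Pillow and pytesseract are installed and that the text is dark on a light background.

```python
import html
import re
from typing import List, NamedTuple, Tuple


class Glyph(NamedTuple):
    char: str
    bbox: Tuple[int, int, int, int]  # x0, y0, x1, y1 in image pixels
    conf: float                      # x_conf as reported in the hOCR


# Matches the single-quoted ocrx_cinfo spans shown in this thread.
CINFO = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([0-9.]+)'>(.+?)</span>"
)


def parse_glyphs(hocr: str) -> List[Glyph]:
    """Return every character box found in an hOCR string."""
    return [
        Glyph(
            html.unescape(m.group(6)),
            (int(m.group(1)), int(m.group(2)), int(m.group(3)), int(m.group(4))),
            float(m.group(5)),
        )
        for m in CINFO.finditer(hocr)
    ]


def find_suspects(glyphs: List[Glyph], confusable: str = "O",
                  max_conf: float = 97.0) -> List[Glyph]:
    """Assumed rule: flag confusable characters and anything below max_conf."""
    return [g for g in glyphs if g.char in confusable or g.conf < max_conf]


def reread_glyph(image_path: str, glyph: Glyph, pad: int = 10) -> str:
    """Crop one glyph and run Tesseract on it in single-character mode."""
    # Third-party imports live here so the helpers above need only the stdlib.
    from PIL import Image, ImageOps
    import pytesseract

    with Image.open(image_path) as img:
        crop = img.convert("L").crop(glyph.bbox)
        # A white border gives the recognizer some margin around the glyph.
        crop = ImageOps.expand(crop, border=pad, fill=255)
        return pytesseract.image_to_string(crop, config="--psm 10").strip()


# A few character boxes copied from the hocr output quoted below.
SAMPLE = (
    "<span class='ocrx_cinfo' title='x_bboxes 1639 1250 1657 1273; x_conf 99.020271'>L</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 1658 1250 1675 1273; x_conf 98.428726'>O</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 1678 1250 1695 1273; x_conf 98.632645'>O</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 1699 1250 1716 1273; x_conf 98.987907'>7</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 96.535439'>0</span>"
)

if __name__ == "__main__":
    for g in find_suspects(parse_glyphs(SAMPLE)):
        print(g.char, g.bbox, g.conf)
```

Note that both Os in the sample score above 98, so a confidence range on its own would probably miss them. That is why the sketch also flags by character, and why the selection rule needs testing against real invoices before relying on it.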
art

---
1. https://github.com/OurDigitalWorld/tesschar

From: tesseract-ocr@googlegroups.com <tesseract-ocr@googlegroups.com> On Behalf Of A Nederpelt
Sent: Monday, September 25, 2023 3:46 AM
To: tesseract-ocr <tesseract-ocr@googlegroups.com>
Subject: Re: [tesseract-ocr] quality of recognition of customer invoices

Well, the strange effect is that hocr shows different characters.

"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned" -c hocr_char_boxes=1 hocr

The hocr result contains the character 'O' twice, and the rest are '0' (zero).

<span class='ocrx_word' id='word_1_24' title='bbox 1614 1250 1899 1273; x_wconf 75'>
  <span class='ocrx_cinfo' title='x_bboxes 1614 1250 1636 1273; x_conf 99.020287'>N</span>
  <span class='ocrx_cinfo' title='x_bboxes 1639 1250 1657 1273; x_conf 99.020271'>L</span>
  <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1675 1273; x_conf 98.428726'>O</span>
  <span class='ocrx_cinfo' title='x_bboxes 1678 1250 1695 1273; x_conf 98.632645'>O</span>
  <span class='ocrx_cinfo' title='x_bboxes 1699 1250 1716 1273; x_conf 98.987907'>7</span>
  <span class='ocrx_cinfo' title='x_bboxes 1719 1250 1736 1273; x_conf 99.028702'>9</span>
  <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1756 1273; x_conf 98.484917'>0</span>
  <span class='ocrx_cinfo' title='x_bboxes 1760 1250 1777 1273; x_conf 99.03093'>0</span>
  <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1797 1273; x_conf 98.998169'>0</span>
  <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1818 1273; x_conf 99.012581'>0</span>
  <span class='ocrx_cinfo' title='x_bboxes 1822 1250 1839 1273; x_conf 99.038429'>0</span>
  <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 98.716026'>B</span>
  <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 96.535439'>0</span>
  <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1899 1273; x_conf 98.847801'>1</span>
</span>

But in the picture they all look 100% the same, as shown before.
I then converted the picture to black and white and copy/pasted the characters in the PDF (I still see no differences). I copied the red-marked character over the orange-marked ones...

"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned - bw 2.jpg" "Lambregts0001 - cleaned - bw 2" -c hocr_char_boxes=1 hocr

<span class='ocrx_word' id='word_1_23' title='bbox 1614 1249 1900 1274; x_wconf 77'>
  <span class='ocrx_cinfo' title='x_bboxes 1614 1250 1636 1273; x_conf 99.039665'>N</span>
  <span class='ocrx_cinfo' title='x_bboxes 1638 1250 1657 1273; x_conf 99.031548'>L</span>
  <span class='ocrx_cinfo' title='x_bboxes 1658 1250 1676 1273; x_conf 97.601151'>O</span>
  <span class='ocrx_cinfo' title='x_bboxes 1679 1250 1697 1273; x_conf 96.843338'>O</span>
  <span class='ocrx_cinfo' title='x_bboxes 1699 1249 1716 1273; x_conf 98.95182'>7</span>
  <span class='ocrx_cinfo' title='x_bboxes 1718 1250 1736 1274; x_conf 98.925072'>9</span>
  <span class='ocrx_cinfo' title='x_bboxes 1739 1250 1757 1273; x_conf 98.905106'>0</span>
  <span class='ocrx_cinfo' title='x_bboxes 1759 1250 1777 1273; x_conf 98.670326'>0</span>
  <span class='ocrx_cinfo' title='x_bboxes 1780 1250 1798 1273; x_conf 98.658737'>0</span>
  <span class='ocrx_cinfo' title='x_bboxes 1801 1250 1819 1273; x_conf 99.03775'>0</span>
  <span class='ocrx_cinfo' title='x_bboxes 1821 1250 1839 1273; x_conf 99.0326'>0</span>
  <span class='ocrx_cinfo' title='x_bboxes 1840 1250 1862 1273; x_conf 98.578423'>B</span>
  <span class='ocrx_cinfo' title='x_bboxes 1865 1250 1882 1273; x_conf 98.561943'>0</span>
  <span class='ocrx_cinfo' title='x_bboxes 1889 1250 1900 1273; x_conf 98.727348'>1</span>
</span>

[inline image: image001.jpg]

The x_conf for the two Os changes from 98 to 97 and 96. Any ideas?

On Sunday, September 24, 2023 at 14:18:30 UTC+2, Art Rhyno wrote:

It is not a "super quality" parameter, but one possible approach to critical numbers and other types of content where a dictionary is not helpful is to target individual characters.
Tesseract will provide individual characters and a confidence value for each, either through the API or in hocr with "-c hocr_char_boxes=1". With the glyph coordinates and something like a range between 90 and 98 percent confidence, it might be possible to get closer to 99 percent by extracting individual glyphs and using single character recognition (PSM 10). This, of course, adds a lot more overhead, but it can help with tricky recognition, like distinguishing between "O" and "0".

art

From: tesser...@googlegroups.com <tesser...@googlegroups.com> On Behalf Of A Nederpelt
Sent: Friday, September 22, 2023 8:25 AM
To: tesseract-ocr <tesser...@googlegroups.com>
Subject: Re: [tesseract-ocr] quality of recognition of customer invoices

Well, I have approximately 3000 customers at the moment for our software. We OCR a lot of invoices; for example, one customer processes approx. 10,000 documents a month. So open source is worth it. I want Tesseract, since it is free to use. I believe open source is the future. So, can somebody help me optimize it? By lots of CPU usage I mean that if some parameter like "super quality" needs more CPU, I want to use that parameter.

On Friday, September 22, 2023 at 14:03:53 UTC+2, desal...@gmail.com wrote:

The CPU usage is unusual. I have a pretty old Mac (from 2011) and have been running Tesseract on it quite fine. But as to the accuracy: if your project is limited in scale, the commercial tools would definitely perform better for you. If you have long-lasting, extensive projects, Tesseract is worth spending your time on and developing (training).
On Friday, September 22, 2023 at 2:50:50 PM UTC+3 powe...@gmail.com wrote:

Well, the problem is why it chooses:

NLOO7900000B01
[https://groups.google.com/group/tesseract-ocr/attach/34576be24307/Lambregts0001%20-%20cleaned%20-%20btwnr.jpg?part=0.1&view=1]

That is the character O 2 times and 0 (zero) 5 times.

Google vision result: "NL007900000B01"
Nuance / OMNIPage: "NL007900000B01"
Leadtools demo: "NL007900000B01"

I want to use Tesseract, but I guess I need things like a "second pass" or "preprocessing", no dictionary, etc. So I would rather have a CPU usage of 99.99% than super speed. Can somebody help me?

On Friday, September 22, 2023 at 13:25:21 UTC+2, desal...@gmail.com wrote:

Apparently, version 4 doesn't support whitelisting.
https://groups.google.com/g/tesseract-ocr/c/IBbQIQpdSpE
That is not good.

On Friday, September 22, 2023 at 2:23:39 PM UTC+3 Des Bw wrote:

The difference between zero and O is deeply problematic for the human eye. Some fonts make it even harder. You can try the method used here, if that helps:
https://pyimagesearch.com/2021/09/06/whitelisting-and-blacklisting-characters-with-tesseract-and-python/

On Friday, September 22, 2023 at 9:43:51 AM UTC+3 powe...@gmail.com wrote:

I found the parameters:

"C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned.txt" -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 :@."

It is not working:

"uw BTW nummer:: NLOO7900000B01"

Any other ideas?

On Thursday, September 21, 2023 at 22:25:12 UTC+2, elvi...@gmail.com wrote:

Whitelist the digits so that the O will not confuse it. You can also try --psm 13 if all of your texts are single line.

On Thu, Sep 21, 2023, 4:07 PM A Nederpelt <powe...@gmail.com> wrote:

Hi. I am trying to use the Tesseract engine instead of the Nuance engine. When I currently run tesseract.exe on the image, it returns a few strange characters.
2x OO instead of 00:
"uw BTW nummer:: NLOO7900000B01" instead of "uw BTW nummer:: NL007900000B01"
and "Tel £01" instead of "Tel : 01", but "Tel : 0168-452452" is recognized OK.

I see no improvement from https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md because these are really clean documents. Am I missing some parameters? Like a second run, or a more accurate run, etc. Maybe compile tesseract.exe myself with different, higher-quality parameters?

Thanks,
Alwin

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com.