Well, the question is why, for this image, it returns NLOO7900000B01 [image: Lambregts0001 - cleaned - btwnr.jpg]: the letter O twice and then the digit 0 (zero) five times.
Google vision result: "NL007900000B01"
Nuance / OMNIPage: "NL007900000B01"
Leadtools demo: "NL007900000B01"

I want to use Tesseract, but I guess I need things like a "second pass" or "preprocessing", no dictionary, etc. So I would rather accept 99.99% CPU usage than have super speed.
Can somebody help me?

On Friday, September 22, 2023 at 13:25:21 UTC+2, desal...@gmail.com wrote:

> Apparently, version 4 doesn't support white listing.
> https://groups.google.com/g/tesseract-ocr/c/IBbQIQpdSpE
> That is not good.
>
> On Friday, September 22, 2023 at 2:23:39 PM UTC+3 Des Bw wrote:
>
>> The difference between zero and O is deeply problematic for the human eye. Some fonts make it even harder.
>> You can try the method used here:
>> https://pyimagesearch.com/2021/09/06/whitelisting-and-blacklisting-characters-with-tesseract-and-python/
>> if that helps.
>>
>> On Friday, September 22, 2023 at 9:43:51 AM UTC+3 powe...@gmail.com wrote:
>>
>>> I found the parameters:
>>> "C:\Program Files\Tesseract-OCR\tesseract.exe" "..\Lambregts0001 - cleaned.jpg" "Lambregts0001 - cleaned.txt" -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 :@."
>>> It is not working. "uw BTW nummer:: NLOO7900000B01"
>>>
>>> Any other ideas?
>>>
>>> On Thursday, September 21, 2023 at 22:25:12 UTC+2, elvi...@gmail.com wrote:
>>>
>>>> Whitelist the digits so that the O will not confuse it.
>>>> You can also try --psm 13 if all of your texts are single line.
>>>>
>>>> On Thu, Sep 21, 2023, 4:07 PM A Nederpelt <powe...@gmail.com> wrote:
>>>>
>>>>> Hi.
>>>>> I am trying to use the tesseract engine instead of the nuance engine.
>>>>> When I currently run tesseract.exe on the image, it returns a few strange characters.
>>>>> 2x OO instead of 00:
>>>>> "uw BTW nummer:: NLOO7900000B01"
>>>>> instead of
>>>>> "uw BTW nummer:: NL007900000B01"
>>>>> and
>>>>> "Tel £01"
>>>>> instead of
>>>>> "Tel : 01"
>>>>> but "Tel : 0168-452452" is recognized ok.
>>>>>
>>>>> I see nothing to optimize using
>>>>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
>>>>> because these are really clean documents.
>>>>>
>>>>> Am I missing some parameters? Like a second run, or a more accurate run, etc.
>>>>> Maybe compile tesseract.exe myself with different, higher-quality parameters?
>>>>>
>>>>> Thanks,
>>>>> Alwin
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/6f5f957e-4f33-419f-aba6-2e8a3f6f8d92n%40googlegroups.com.
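
Since the whitelist did not change the result, one possible workaround is to repair the O/0 confusion after OCR instead of inside Tesseract. The sketch below is only an illustration, not something proposed in the thread: it assumes the BTW number always has the layout "NL", 9 digits, "B", 2 digits, and the helper name `fix_vat_numbers` and the look-alike mapping are made up for this example. It only touches text that already has that shape, so phone numbers and ordinary words are left alone.

```python
import re

# Letters that OCR output may contain in place of digits.
# This mapping is an assumption for the example, not a Tesseract feature.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})

# Assumed layout of a Dutch VAT (BTW) number: "NL", 9 digits, "B", 2 digits.
# The digit positions also accept the look-alike letters so they can be repaired.
CANDIDATE = re.compile(r"NL([0-9OoIl]{9})B([0-9OoIl]{2})")


def fix_vat_numbers(text: str) -> str:
    """Replace look-alike letters with digits inside VAT-number-shaped tokens."""

    def repair(match: re.Match) -> str:
        body = match.group(1).translate(LOOKALIKES)
        suffix = match.group(2).translate(LOOKALIKES)
        return f"NL{body}B{suffix}"

    return CANDIDATE.sub(repair, text)


if __name__ == "__main__":
    # OCR line reported in the thread.
    print(fix_vat_numbers("uw BTW nummer:: NLOO7900000B01"))
    # prints: uw BTW nummer:: NL007900000B01
```

This does not explain why Tesseract picks O over 0 in the first place; it only normalizes fields whose format is known in advance, so it would not help with the "Tel £01" misread.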