Hello Zdenko,

1) Can I assume that you used the latest version of tesseract to produce the output you posted? To install the latest version, do I first need to *uninstall* the older version that I have on my PC?

2) How do I create a custom image segmentation?

Thanks,
Hylton

On Sat, Oct 3, 2020 at 12:21 PM Zdenko Podobny <zde...@gmail.com> wrote:

> 1. try the latest version
> 2. try playing with psm: e.g. tesseract 20201002.png - --psm 11 --dpi 300
> produces:
>
> 8 27 26 10 04 03 01
> N29 19 16 14 09 03
> 131 27 25 18 12 03
> N21 18 16 13 07 04
> N32 232112 10 07
> N 36 34 30 27 21 01
> X35 3417 13 10 08
> N36 33 29 28 14 09
> R 33 32 31 21 06 01
> - oe ————
> —— — ——— —— a = —
> R 37 27 19 09 05 03
> -———
> Fra anny
> 156136
> -——
> 3198(19): ‘on iam mn
> 10:52:25 28.11.19 1 09
>
> ... custom image segmentation would help too (and then to OCR each "cell"
> individually)
>
> Zdenko
>
> On Sat, 3 Oct 2020 at 7:06, H Brenner <hyltonbren...@gmail.com> wrote:
>
>> Hi,
>>
>> I have tesseract 3.02 on a Windows 10 PC.
>>
>> I am trying to recognise text on a form scanned with a camera that has
>> numbers mostly in tabular form with a small amount of Hebrew characters
>> plus one English "graphical" word. I processed the photo to remove a pink
>> background pattern, and to enhance the text in the image (the original -
>> minus the pink pattern - produced the same results)
>>
>> [image: 3198Rfat.png]
>>
>> The Hebrew text on the bottom 2 lines is cut off on the right, but this
>> does not matter to me.
>>
>> Only the numbers are of interest to me in the output.
>>
>> I am running tesseract in Python using the pytesseract wrapper, and I am
>> running the following command:
>>
>> - Imaj=Image.open(ImgPath) # ImgPath is the full path to the .png file.
>> - print('\n\n','v'*20,'\n', pytesseract.image_to_string(Imaj),'\n','^'*20,'\n\n') # use eng default
>>
>> I believe this corresponds to the command-line:
>>
>> - tesseract ImgPath out (I used the actual path)
>>
>> The output that I get is the following:
>>
>> - 7547512723 2
>> -
>> - 1334718913
>> - 0000000000
>> - 3927010465.
>> - 4483273819..
>> - 0.|..1|.|.1ln/_1|.7_n/.01
>> - 0556107919..
>> - 1|11n/Tln/_nJ110._O...|__
>> - 6978344327..
>> - n/..|9._..l9._Q.:1Jn.o3n/___
>> - _/0._1|.|9._n0EunD3./:
>> - n/L232333333““
>> -
>> - A —:1 qnnwn N
>> -
>> - 156138
>> -
>> - ::§1§§?13:?76fi-fi333ii‘ifi1
>> - 10:52:25 29.11.19 :1 ma‘
>>
>> Most of it is meaningless gibberish to me. Only the highlighted text is
>> recognised correctly.
>>
>> When I ran it with the Hebrew language selected, it produced similar
>> results, but with *some* of the Hebrew characters and only the "156138"
>> recognised correctly.
>>
>> Running tesseract manually (English) in a 'CMD' window produced the
>> attached file 'out.txt'.
>>
>> I suspect that the font used in the form is the problem - the form was
>> not printed on a normal Windows, Mac or Linux computer.
>>
>> Which fonts were used to create heb.traineddata? Is there a way for me to
>> display them?
>>
>> Do I have to train tesseract with the font in the form?
>>
>> Any help will be appreciated!
>>
>> Thanks!
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/a6602b5e-307e-406d-8650-510e8c2225e6n%40googlegroups.com
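
A possible starting point for the "custom image segmentation" idea from Zdenko's reply: split the page into cells with simple projection profiles (count the dark pixels per row, then per column inside each row band) and OCR each cell on its own. This is only a sketch under assumptions, not Zdenko's actual method. The helper names (ink_runs, segment_cells, ocr_cells), the threshold of 128 and the gap sizes are made up for illustration and would need tuning for the real scan. ocr_cells assumes that Pillow and pytesseract are installed. "--psm 7" (treat the image as a single text line) and tessedit_char_whitelist are standard Tesseract options, but the whitelist is reportedly not honoured by the LSTM engine in 4.0, so results may depend on the installed version.

```python
from typing import List, Tuple

Span = Tuple[int, int]            # half-open index range [start, stop)
Box = Tuple[int, int, int, int]   # (left, top, right, bottom), as used by PIL crop


def ink_runs(profile: List[int], min_ink: int = 1, min_gap: int = 1) -> List[Span]:
    """Find runs where profile >= min_ink.

    Runs separated by fewer than min_gap blank entries are merged, so that
    the digits of one number are not split into separate cells.
    """
    runs: List[Span] = []
    start = None
    for i, value in enumerate(profile):
        if value >= min_ink:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Span] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


def segment_cells(ink: List[List[int]], min_row_gap: int = 1,
                  min_col_gap: int = 1) -> List[List[Box]]:
    """Split a binary image (1 = dark pixel) into rows of cell boxes."""
    width = len(ink[0]) if ink else 0
    row_profile = [sum(row) for row in ink]
    rows: List[List[Box]] = []
    for top, bottom in ink_runs(row_profile, min_gap=min_row_gap):
        # Column profile restricted to the current row band.
        col_profile = [sum(ink[y][x] for y in range(top, bottom))
                       for x in range(width)]
        rows.append([(left, top, right, bottom)
                     for left, right in ink_runs(col_profile, min_gap=min_col_gap)])
    return rows


def ocr_cells(image_path: str, threshold: int = 128,
              min_col_gap: int = 15) -> List[List[str]]:
    """OCR each detected cell separately (assumes Pillow and pytesseract)."""
    from PIL import Image
    import pytesseract

    gray = Image.open(image_path).convert("L")
    width, height = gray.size
    data = gray.tobytes()  # one byte per pixel in mode "L", row by row
    # Pure-Python thresholding: slow on large scans, but dependency-free.
    ink = [[1 if data[y * width + x] < threshold else 0 for x in range(width)]
           for y in range(height)]
    # Assumed settings: single text line per cell, digits only.
    config = "--psm 7 -c tessedit_char_whitelist=0123456789"
    return [[pytesseract.image_to_string(gray.crop(box), config=config).strip()
             for box in row]
            for row in segment_cells(ink, min_col_gap=min_col_gap)]
```

Calling ocr_cells with the full path to the .png would return one list of strings per detected row. min_col_gap is the knob that matters most here: too small and each digit becomes its own cell, too large and neighbouring columns are merged.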