Hi Zdenko,

Per your suggestion I have installed the latest version of tesseract (version 5), and I experimented with the psm setting.
I get the best result using --psm 11, as you did; other psm values give poor results. Even with --psm 11, though, the output is still not good.

How do I create custom image segmentation?

Thank you in advance for your help.

Hylton

On Saturday, October 3, 2020 at 12:21:10 PM UTC+3 zdenop wrote:

> 1. try the latest version
> 2. try playing with psm: e.g. tesseract 20201002.png - --psm 11 --dpi 300
> produces:
>
> 8 27 26 10 04 03 01
>
> N29 19 16 14 09 03
>
> 131 27 25 18 12 03
>
> N21 18 16 13 07 04
>
> N32 232112 10 07
>
> N 36 34 30 27 21 01
>
> X35 3417 13 10 08
>
> N36 33 29 28 14 09
>
> R 33 32 31 21 06 01
>
> - oe ————
>
> —— — ——— —— a = —
>
> R 37 27 19 09 05 03
>
> -———
>
> Fra anny
>
> 156136
>
> -——
>
> 3198(19): ‘on iam mn
>
> 10:52:25 28.11.19 1 09
>
> ... custom image segmentation would help too (and then OCR each "cell"
> individually)
>
> Zdenko
>
> On Sat, 3 Oct 2020 at 7:06, H Brenner <hylton...@gmail.com> wrote:
>
>> Hi,
>>
>> I have tesseract 3.02 on a Windows 10 PC.
>>
>> I am trying to recognise text on a form photographed with a camera. The
>> form contains numbers, mostly in tabular form, with a small amount of
>> Hebrew characters plus one English "graphical" word. I processed the photo
>> to remove a pink background pattern and to enhance the text in the image
>> (the original, minus the pink pattern, produced the same results).
>>
>> [image: 3198Rfat.png]
>>
>> The Hebrew text on the bottom 2 lines is cut off on the right, but this
>> does not matter to me.
>>
>> Only the numbers are of interest to me in the output.
>>
>> I am running tesseract in Python using the pytesseract wrapper, and I am
>> running the following commands:
>>
>> - Imaj=Image.open(ImgPath) # ImgPath is the full path to the .png file.
>> - print('\n\n','v'*20,'\n', pytesseract.image_to_string(Imaj),'\n','^'*20,'\n\n') # use eng default
>>
>> I believe this corresponds to the command line:
>>
>> - tesseract ImgPath out (I used the actual path)
>>
>> The output that I get is the following:
>>
>> - 7547512723 2
>> -
>> - 1334718913
>> - 0000000000
>> - 3927010465.
>> - 4483273819..
>> - 0.|..1|.|.1ln/_1|.7_n/.01
>> - 0556107919..
>> - 1|11n/Tln/_nJ110._O...|__
>> - 6978344327..
>> - n/..|9._..l9._Q.:1Jn.o3n/___
>> - _/0._1|.|9._n0EunD3./:
>> - n/L232333333““
>> -
>> - A —:1 qnnwn N
>> -
>> - 156138
>> -
>> - ::§1§§?13:?76fi-fi333ii‘ifi1
>> - 10:52:25 29.11.19 :1 ma‘
>>
>> Most of it is meaningless gibberish to me. Only the highlighted text is
>> recognised correctly.
>>
>> When I ran it with the Hebrew language selected, it produced similar
>> results, but with *some* of the Hebrew characters and only the "156138"
>> recognised correctly.
>>
>> Running tesseract manually (English) in a 'CMD' window produced the
>> attached file 'out.txt'.
>>
>> I suspect that the font used in the form is the problem: the form was
>> not printed on a normal Windows, Mac or Linux computer.
>>
>> Which fonts were used to create heb.traineddata? Is there a way for me to
>> display them?
>>
>> Do I have to train tesseract with the font in the form?
>>
>> Any help will be appreciated!
>>
>> Thanks!
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/a6602b5e-307e-406d-8650-510e8c2225e6n%40googlegroups.com
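For reference, the "custom image segmentation" idea quoted above (split the table into cells, then OCR each cell on its own) could be started along the lines below. This is only a rough sketch under assumptions, not a tested recipe for this form: it assumes the image has already been binarised into a list of rows of 0/1 values (1 = ink), that table rows and columns are separated by blank gaps, and the names find_bands and segment_cells are made up for illustration. It uses projection profiles, which is one simple segmentation technique and not necessarily what Zdenko had in mind.

```python
def find_bands(profile, min_gap=1):
    """Return (start, end) pairs, end exclusive, of runs of non-zero values.

    Runs separated by fewer than min_gap zero entries are merged into one.
    """
    bands = []
    start = None
    gap = 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The run ended just before the current stretch of zeros.
                bands.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # The run reaches the end of the profile (minus trailing zeros).
        bands.append((start, len(profile) - gap))
    return bands


def segment_cells(pixels, row_gap=1, col_gap=1):
    """Split a binary image (list of rows of 0/1) into cell boxes.

    Returns (left, top, right, bottom) tuples, right/bottom exclusive,
    ordered row by row and left to right within each row.
    """
    # Horizontal projection: amount of ink in each pixel row.
    row_profile = [sum(row) for row in pixels]
    boxes = []
    for top, bottom in find_bands(row_profile, row_gap):
        band = pixels[top:bottom]
        # Vertical projection, restricted to this text row.
        col_profile = [sum(col) for col in zip(*band)]
        for left, right in find_bands(col_profile, col_gap):
            boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    demo = [
        [0, 0, 0, 0, 0, 0],
        [0, 1, 1, 0, 1, 0],
        [0, 1, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 1],
    ]
    for box in segment_cells(demo):
        print(box)
```

Each box has the (left, upper, right, lower) layout that Pillow's Image.crop expects, so a crop could then be passed to pytesseract.image_to_string with a single-line mode such as config='--psm 7'. On a real photo, col_gap would probably need to be larger than the spacing between the two digits of a number, otherwise every digit becomes its own cell.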