Thanks a lot for your answer! After playing around, the issue is that apparently the whitelist and blacklist aren't supported in this scenario and make tesseract return nothing. I don't really understand why, because it works fine in another scenario (whole-picture recognition, before slicing into smaller parts). Regarding documentation, I have had a hard time finding information on tesseract-ocr.github.io or in the GitHub docs about these two options and *how they behave when put together.* Maybe it is in a corner or a detail I missed; either way, if anyone stumbles on this topic in the future, it might be helpful to reference it better in the docs. Besides the character type definitions, I can't find much about it: https://tesseract-ocr.github.io/tessapi/3.x/a00624.html#aba81894cd2dc9f32e71da97cabad5580
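For anyone landing here later, here is a minimal sketch of how the config string could be built so that each option reaches tesseract as its own argument. It rests on assumptions that have not been confirmed as the cause of the empty output: each config variable is passed with its own `-c` flag, `--oem` takes two dashes like `--psm`, and whitespace or quote characters inside a character list would break the argument apart when the config string is tokenized. `build_tesseract_config` is a hypothetical helper, not part of pytesseract.

```python
def build_tesseract_config(whitelist="", blacklist="", psm=10, oem=3):
    """Build a config string for pytesseract (hypothetical helper).

    Assumption: every config variable needs its own `-c NAME=VALUE` pair,
    and the value must survive whitespace-based tokenization unchanged.
    """
    parts = ["--psm", str(psm), "--oem", str(oem)]
    variables = (
        ("tessedit_char_whitelist", whitelist),
        ("tessedit_char_blacklist", blacklist),
    )
    for name, chars in variables:
        if not chars:
            continue  # skip options that were not requested
        # Reject characters that would split or alter the argument
        # once the config string is tokenized shell-style.
        if any(c.isspace() or c in "\"'\\" for c in chars):
            raise ValueError(f"{name} contains whitespace, a quote or a backslash")
        parts += ["-c", f"{name}={chars}"]
    return " ".join(parts)


config = build_tesseract_config("ABCDEF+-")
print(config)
```

The resulting string would then be passed as `config=` to `pytesseract.image_to_string`. The sketch does not settle how the whitelist and blacklist interact when both are set; it only makes sure each one is handed over as a separate `-c` option.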
Sorry if it sounds a bit dumb, but again, I'm a newbie at OCR and image recognition, and I like newbie-friendly tools ;)

On Wednesday, February 14, 2024 at 07:02:36 UTC+1, zdenop wrote:

> Works like a charm: just read and follow documentation carefully:
>
> >tesseract e_I_read_documetation_carefully.png - --psm 10
> D
> >tesseract d_I_read_documetation_carefully.png - --psm 10
> E
> >tesseract d-I_read_documetation_carefully.png - --psm 10
> D-
>
> Zdenko
>
> On Wed, Feb 14, 2024 at 2:14, dev 313153 <dev3...@gmail.com> wrote:
>
>> Hello,
>> I managed to implement dynamic parsing to get rid of the OSD issues I had.
>> However, I am stuck on recognizing a single uppercase letter. I tried many
>> different preprocessing configurations but can't find the right one, even
>> with PSM set to 10, and I don't really know what else I could try.
>> Any help is appreciated.
>>
>> Here is a code snippet for testing with the attached pictures:
>> import cv2
>> import os
>> import pytesseract
>> import numpy as np
>>
>> pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
>>
>> for pic in ["e.png","d-.png","d.png"]:
>>     img=cv2.imread(pic)
>>
>>     #Preprocessing
>>     img = cv2.resize(img, (70, 90), interpolation=cv2.INTER_NEAREST)
>>     norm_img = np.zeros((img.shape[0], img.shape[1]))
>>     img = cv2.normalize(img, norm_img, 0, 255, cv2.NORM_MINMAX)
>>     img = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 15)
>>     img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
>>     img = cv2.bitwise_not(img)
>>     img = cv2.threshold(img,127,255,cv2.THRESH_BINARY) [1]
>>     cv2.imwrite("processed-"+pic, img)
>>
>>     # Tesseract OCR
>>     text = pytesseract.image_to_string(img, lang='eng', config='-c tessedit_char_whitelist=\\ ABCDEF+- tessedit_char_blacklist=\\=!,*%^$°:. --psm 10 -oem 3')
>>     print(str(text).replace("\n", " "))
>>
>> On Wednesday, February 7, 2024 at 06:39:37 UTC+1, dev 313153 wrote:
>>
>>> Hello,
>>> I am very new to tesseract, as well as to image processing in general.
>>> I have screenshots from which I want to extract text for further
>>> processing. I played around with tesseract after checking the Improve
>>> Quality URL and was able to extract what I need (most of the time).
>>> For example, in the attached screenshots, I want to extract the names of
>>> the stats together with the letter that follows them, but it doesn't
>>> always work. Sometimes the letter isn't extracted, and sometimes it is,
>>> but the OSD considers that it belongs to another level or row, and it is
>>> output before the stats names when I use image_to_string.
>>> I also tried playing with the oem and psm settings, without much
>>> improvement.
>>>
>>> I attached some examples of image_to_string output for different
>>> pictures, as well as the images and the Python code I'm using as a test
>>> bench.
>>>
>>> I am getting a bit desperate, so I am considering the following approaches:
>>> - training my own dataset for this need; having sufficient data
>>> shouldn't be an issue over time, but I have zero experience with this
>>> kind of thing.
>>> - looking for the coordinates of the stats names, then cropping the
>>> picture around them to make sure tesseract focuses on them and extracts
>>> them properly (sounds like a chore code-wise, but doable I think).
>>>
>>> Let me know what you think about it or if you have any improvements to
>>> suggest.
>>> Best Regards,
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/c8939207-9d9d-4d7a-8950-c3bab691fc88n%40googlegroups.com.