Yes, sure. The input file is a snippet showing a capital letter followed by 9 digits. The correct user pattern, according to [1], is:

``\A\d\d\d\d\d\d\d\d\d``

The result of Tesseract (psm 8) is fully correct. Nevertheless, user patterns are not working in the way described above. For instance, I tried to extract only the capital letter with a user pattern (not with a whitelist), namely:

\A

In this case, Tesseract still returns the capital letter and all digits. I've attached my input file and the corresponding Python snippet for reading and processing the image with tesserocr [2].

[1] https://github.com/tesseract-ocr/tesseract/blob/main/src/dict/trie.h#L197
[2] https://github.com/sirfz/tesserocr

On Fri, 1 March 2024 at 18:59, René JM Clais <reneclai...@gmail.com> wrote:

> Can you send an example of an input document and the output of Tesseract,
> as well as what you would expect when using the pattern file?
>
> On Thu, 29 Feb 2024 at 21:40, Roman Seidel <roman.seide...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am currently trying to use user patterns with the PyTessBaseAPI from
>> tesserocr [1].
>>
>> What I've done is initialize the API with:
>>
>> with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4.00/tessdata', lang=LANGUAGE, psm=int(psm), oem=int(TOEM)) as api:
>>
>> and set the user patterns file with:
>>
>> api.SetVariable('user_patterns_file', '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>>
>> where the user patterns file contains a pattern, e.g.:
>>
>> \A\A\A
>>
>> (which means three capital letters).
>>
>> The results are the same regardless of whether I set user_patterns_file
>> or not. This brings me to the question of whether tesserocr supports
>> user (and word) patterns.
>>
>> My versions:
>>
>> tesserocr 2.6.2
>> tesseract 5.3.3
>> leptonica-1.83.1
>> libpng 1.6.34 : zlib 1.2.11
>>
>> Thanks a lot for your help and best wishes,
>> Roman

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAL%3DSc5v%3DLm8Bf_5qE2yaFGb7sY99%3DLceSWTqEk8DMMR_GYWjeg%40mail.gmail.com.
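The pattern from this message can also be checked independently of Tesseract, which helps separate "the pattern is wrong" from "the pattern file is not being applied". The following is a minimal sketch under stated assumptions: `matches_user_pattern` is a hypothetical helper, not part of tesserocr or Tesseract, and it handles only the two character classes used in this thread (`\A` for an upper-case letter and `\d` for a digit, as described in trie.h [1]). It is a post-hoc check on the recognized text and says nothing about how Tesseract applies patterns internally.

```python
def matches_user_pattern(text, pattern):
    """Return True if every character of `text` fits `pattern`, position by position.

    Hypothetical helper for illustration only. Supported classes:
    \\A (upper-case letter) and \\d (digit); other characters are literals.
    """
    class_checks = {"A": str.isupper, "d": str.isdigit}
    checks = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            cls = pattern[i + 1]
            if cls not in class_checks:
                raise ValueError("unsupported pattern class: \\" + cls)
            checks.append(class_checks[cls])
            i += 2
        else:
            # Any other character must match literally.
            literal = pattern[i]
            checks.append(lambda ch, literal=literal: ch == literal)
            i += 1
    # The whole text must be consumed: same length, every position accepted.
    return len(text) == len(checks) and all(
        check(ch) for check, ch in zip(checks, text)
    )


if __name__ == "__main__":
    pattern = r"\A\d\d\d\d\d\d\d\d\d"
    for candidate in ("A123456789", "a123456789", "A12345"):
        print(candidate, matches_user_pattern(candidate, pattern))
```

Used as a filter on the `text` values returned by `GetUTF8Text()`, this would accept a ten-character result such as one capital letter plus nine digits and reject it against the single-class pattern `\A`, which is the behavior expected from the pattern file in the message above.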
Attachment: deu.patterns (binary data)
import numpy as np
from PIL import Image
import tesserocr
from tesserocr import PyTessBaseAPI, RIL, PSM, OEM


def read_image(input_image):
    # Load the image as an RGB numpy array
    image = np.asarray(Image.open(input_image).convert('RGB'))
    return image


def detect_text(image, psm, whitelist):
    # Convert the array back to a PIL image so tesseract can read it
    img_arr = np.array(image, dtype=np.uint8)
    new_image = Image.fromarray(img_arr)
    DPI = '300'
    # Note: MeanTextConf() returns an integer in 0..100, so a threshold of 0.5
    # only filters out boxes with zero confidence
    CONF = 0.5
    LANGUAGE = 'deu'
    TOEM = 0
    box_list = []
    # 11 0
    # with PyTessBaseAPI(lang='deu', psm=PSM.SPARSE_TEXT, oem=OEM.TESSERACT_ONLY) as api:
    with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4.00/tessdata', lang=LANGUAGE,
                       psm=int(psm), oem=int(TOEM)) as api:
        # api.SetImageBytes(image.tobytes(), image.shape[1], image.shape[0], 1, image.shape[1])
        api.SetImage(new_image)
        api.SetVariable("tessedit_char_whitelist", str(whitelist))
        api.SetVariable("user_defined_dpi", DPI)
        # user patterns
        # api.SetVariable('user_patterns_file', '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
        boxes = api.GetComponentImages(RIL.WORD, True)
        # print('Found {} textline image components.'.format(len(boxes)))
        for i, (im, box, _, _) in enumerate(boxes):
            # im is a PIL image object
            # box is a dict with x, y, w and h keys
            api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
            text = api.GetUTF8Text()
            text = text.replace("\n", "")
            conf = api.MeanTextConf()
            # beautify data
            data = {
                'text': text,
                'x': box['x'],
                'y': box['y'],
                'w': box['w'],
                'h': box['h'],
                'c': conf}
            if conf >= CONF:
                print(u"Box[{0}]: x={x}, y={y}, w={w}, h={h}, "
                      "confidence: {1}, text: {2}".format(i, conf, text, **box))
                box_list.append(data)
    return box_list


def main():
    print(tesserocr.tesseract_version())
    print(tesserocr.get_languages())
    input_image = '/home/roman/Dev_d/playground/user_patterns/betriebsstaette.png'
    image = read_image(input_image)
    # box_list = detect_text(image, 8, "abcdefghijklmnopqrstuvwxyzäöüABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜß0123456789,.;- ")
    box_list = detect_text(image, 8, "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜß0123456789")
    # print(f"box list: {box_list}")


if __name__ == "__main__":
    main()