Maybe I am wrong, but it looks to me like you are expecting something from user-patterns that it never promises to provide. What we know / have experienced:
- user-patterns extends the Tesseract legacy engine dictionary.
- Putting a word/pattern into the legacy engine dictionary never guarantees that the word is recognized correctly (see the remark at https://tesseract-ocr.github.io/tessdoc/APIExample-user_patterns.html).
- Somebody (I cannot find the details, as it was a long time ago) ran tests and found that the legacy engine dictionary has only a limited effect. For "non-word" text (like "codes" with mixed letters and digits), people usually turn off the dictionary.
- Some users prefer to use the legacy engine for "codes" instead of LSTM.

As far as I know, nobody has run tests on LSTM and dictionaries, e.g. an investigation of whether user-patterns also affects the LSTM engine (for LSTM there are new dictionary components: lstm-punc-dawg, lstm-word-dawg, lstm-number-dawg) ...

Zdenko

On Sun, 3 Mar 2024 at 23:02, Roman Seidel <roman.seide...@gmail.com> wrote:

> To be more precise with my questions:
>
> - Is the user-patterns functionality implemented in the tesserocr
>   Python API of Tesseract?
> - What exactly is the syntax for specifying user patterns with the
>   tesserocr Python API? Is SetVariable() correct, and how are the path
>   (Linux) and the attribute specified?
> - Is there a default path where it looks for the *.patterns /
>   *.user-patterns file?
>
> With the attached code from my last message, I have tested different
> combinations with and without a whitelist, different attributes, and
> different path notations, none of which was successful.
>
> If I use the following notation for user patterns, it has no effect on the
> results, regardless of the entries in the *.patterns file:
>
> api.SetVariable('user_patterns_file',
> '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>
> Has anyone successfully used user patterns with the tesserocr
> Python API of Tesseract?
>
> Best wishes and thanks, Roman
>
> On Sat, 2 Mar 2024 at 13:08, Zdenko Podobny <zde...@gmail.com> wrote:
>
>> Can you please elaborate on:
>>
>> Nevertheless, user patterns is not working in the way described above.
>>
>> Zdenko
>>
>> On Sat, 2 Mar 2024 at 10:45, Roman Seidel <roman.seide...@gmail.com> wrote:
>>
>>> Yes, sure. The input file is a snippet with a capital letter followed by
>>> 9 digits. The correct user pattern, according to [1], is:
>>>
>>> ``\A\d\d\d\d\d\d\d\d\d``
>>>
>>> The result of Tesseract (psm 8) is fully correct. Nevertheless, user
>>> patterns is not working in the way described above.
>>>
>>> For instance, I have tried to extract only the capital letter with
>>> user patterns (not with a whitelist), using the pattern:
>>>
>>> \A
>>>
>>> In this case, Tesseract returns the capital letter and all the digits.
>>>
>>> I have attached my input file and the corresponding Python snippet for
>>> reading and processing the image with tesserocr from [2].
>>>
>>> [1] https://github.com/tesseract-ocr/tesseract/blob/main/src/dict/trie.h#L197
>>> [2] https://github.com/sirfz/tesserocr
>>>
>>> On Fri, 1 Mar 2024 at 18:59, René JM Clais <reneclai...@gmail.com> wrote:
>>>
>>>> Can you send an example of an input document, the output of
>>>> Tesseract, and what you would expect to get when using the pattern
>>>> file?
>>>>
>>>> On Thu, 29 Feb 2024 at 21:40, Roman Seidel <roman.seide...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am currently trying to use user-patterns with the PyTessBaseAPI from
>>>>> tesserocr [1].
>>>>>
>>>>> What I've done is to initialize the API with:
>>>>>
>>>>> with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4.00/tessdata',
>>>>> lang=LANGUAGE, psm=int(psm), oem=int(TOEM)) as api:
>>>>>
>>>>> and to set the user patterns file with:
>>>>>
>>>>> api.SetVariable('user_patterns_file',
>>>>> '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>>>>>
>>>>> where the user patterns file contains a pattern, e.g.:
>>>>>
>>>>> \A\A\A
>>>>>
>>>>> (which means three capital letters).
>>>>>
>>>>> The results are the same regardless of whether I use the
>>>>> user_patterns_file argument or not. This brings me to the question of
>>>>> whether tesserocr supports user (and word) patterns.
>>>>>
>>>>> My versions:
>>>>>
>>>>> tesserocr 2.6.2
>>>>> tesseract 5.3.3
>>>>> leptonica-1.83.1
>>>>> libpng 1.6.34 : zlib 1.2.11
>>>>>
>>>>> Thanks a lot for your help and best wishes,
>>>>> Roman
>>>>>
>>>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/CAL%3DSc5uQAOGF7dD%2BtP2xt93Phv9OYy6anDGLdar4gxZxEDwjYQ%40mail.gmail.com
> .

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8z_JjZ1_%2BUaRVPDKG8bpZp-S%3DcQdJ98qW0YXap2Xh5H1A%40mail.gmail.com.
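
Since a user pattern never guarantees what the engine returns, one option is to check the OCR result against the same pattern after recognition. The snippet below is only a rough sketch of that idea, not something Tesseract or tesserocr does for you. The function names are made up for illustration, the character classes are a plain-regex approximation of the ones described in trie.h (Tesseract decides them from the language's unicharset), and treating `\*` as "one or more" is an assumption.

```python
import re

# Approximate regex equivalents of the pattern classes described in trie.h.
# Assumption: these are stand-ins; Tesseract uses the unicharset instead.
_CLASS_MAP = {
    "c": r"[^\W\d_]",  # alphabetic character
    "d": r"\d",        # digit
    "n": r"[^\W_]",    # digit or alphabetic character
    "p": r"[^\w\s]",   # punctuation (rough approximation)
    "a": r"[a-z]",     # lowercase letter (ASCII only in this sketch)
    "A": r"[A-Z]",     # uppercase letter (ASCII only in this sketch)
}


def user_pattern_to_regex(pattern):
    """Translate one line of a user-patterns file into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "*":
                if not parts:
                    raise ValueError("\\* must follow a character or class")
                # Assumption: the preceding item may repeat, at least once.
                parts[-1] += "+"
            elif nxt in _CLASS_MAP:
                parts.append(_CLASS_MAP[nxt])
            else:
                # Unknown escape: treat the escaped character literally.
                parts.append(re.escape(nxt))
            i += 2
        else:
            # Any other character must match itself.
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


def matches_user_pattern(text, pattern):
    """Return True if the whole OCR result fits the user pattern."""
    return user_pattern_to_regex(pattern).fullmatch(text) is not None


if __name__ == "__main__":
    # Pattern from this thread: one capital letter followed by 9 digits.
    id_pattern = r"\A\d\d\d\d\d\d\d\d\d"
    for candidate in ("A123456789", "A12345678", "a123456789"):
        print(candidate, matches_user_pattern(candidate, id_pattern))
```

A check like this only tells you whether a result fits the expected shape; it does not change what the engine recognizes, so a failed match still has to be handled separately (e.g. flagged for review).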