One correction: I checked the example at the URL mentioned below with the Tesseract executable and the tessdata repository. The result is that user-patterns also affects the LSTM engine. This can easily be tested by generating output without user patterns (Arial.txt):
tesseract Arial.png Arial

And with patterns:

tesseract Arial.png Arial.pat --user-patterns my.patterns
tesseract Arial.png Arial.pat.oem0 --user-patterns my.patterns --oem 0
tesseract Arial.png Arial.pat.oem1 --user-patterns my.patterns --oem 1
tesseract Arial.png Arial.pat.oem2 --user-patterns my.patterns --oem 2

Zdenko


On Sun, 10 Mar 2024 at 17:32, Zdenko Podobny <zde...@gmail.com> wrote:

> Maybe I am wrong, but it looks to me like you are expecting from
> user-patterns something it never promises to provide.
> What we know/experienced:
>
>    - user-patterns extends the Tesseract legacy engine dictionary.
>    - putting a word/pattern into the Tesseract legacy engine dictionary
>    never guarantees that the word is recognized correctly (see the remark at
>    https://tesseract-ocr.github.io/tessdoc/APIExample-user_patterns.html)
>    - somebody (I cannot find details as it was a long time ago) ran
>    tests and found that the Tesseract legacy engine dictionary has limited
>    effect. For "nonword" text (like "codes" with mixed letters and digits)
>    people usually turn off the dictionary.
>    - some users prefer to use the legacy engine for "codes" instead of
>    LSTM
>
> As far as I know, nobody has run tests regarding LSTM and dictionaries, e.g.
> an investigation of whether user-patterns also affects the LSTM engine (as
> for LSTM there are new dictionary
> components lstm-punc-dawg, lstm-word-dawg, lstm-number-dawg) ...
>
>
> Zdenko
>
>
> On Sun, 3 Mar 2024 at 23:02, Roman Seidel <roman.seide...@gmail.com> wrote:
>
>> To be more precise with my questions:
>>
>> - Is the user-patterns functionality implemented in the tesserocr
>> Python API of Tesseract?
>> - What exactly is the syntax for specifying user patterns with the
>> tesserocr Python API? Is SetVariable() correct, and how are the path
>> (Linux) and the attribute specified?
>> - Is there a default path where it looks for the *.patterns /
>> *.user-patterns file?
>>
>> With the attached code from my last message, I've tested different
>> combinations with and without a whitelist, with different
>> attributes and path notations, none of which was successful.
>>
>> If I use the following notation for user patterns, it has no effect on
>> the results, regardless of the entries in the *.patterns file:
>>
>> api.SetVariable('user_patterns_file',
>> '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>>
>> Has anyone (successfully) used user patterns with the tesserocr
>> Python API of Tesseract?
>>
>> Best wishes and thanks, Roman
>>
>>
>> On Sat, 2 Mar 2024 at 13:08, Zdenko Podobny <zde...@gmail.com> wrote:
>>
>>> Can you please elaborate on:
>>>
>>> Nevertheless, user patterns is not working in the way described above.
>>>
>>>
>>>
>>> Zdenko
>>>
>>>
>>> On Sat, 2 Mar 2024 at 10:45, Roman Seidel <roman.seide...@gmail.com>
>>> wrote:
>>>
>>>> Yes, sure, the input file is a snippet with a capital letter followed
>>>> by 9 digits. The correct user pattern, according to [1], is:
>>>>
>>>> ``\A\d\d\d\d\d\d\d\d\d``
>>>>
>>>> The result of Tesseract (psm 8) is fully correct. Nevertheless, user
>>>> patterns is not working in the way described above.
>>>>
>>>> For instance, I have tried to extract only the capital letter with
>>>> user patterns (not with a whitelist), using:
>>>>
>>>> \A
>>>>
>>>> In this case, the capital letter and all digits are returned by
>>>> Tesseract.
>>>>
>>>> I've attached my input file and the corresponding Python snippet for
>>>> reading and processing the image with tesserocr from [2].
>>>>
>>>>
>>>> [1]
>>>> https://github.com/tesseract-ocr/tesseract/blob/main/src/dict/trie.h#L197
>>>> [2] https://github.com/sirfz/tesserocr
>>>>
>>>>
>>>>
>>>> On Fri, 1 Mar 2024 at 18:59, René JM Clais <reneclai...@gmail.com>
>>>> wrote:
>>>>
>>>>> Can you send an example of an input document and the output of
>>>>> Tesseract, as well as what you would expect when using the pattern
>>>>> file?
>>>>>
>>>>> On Thu, 29 Feb 2024 at 21:40, Roman Seidel <roman.seide...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am currently trying to use user-patterns with the PyTessBaseAPI from
>>>>>> tesserocr [1].
>>>>>>
>>>>>> What I've done is to initialize the API with:
>>>>>>
>>>>>> with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4.00/tessdata',
>>>>>> lang=LANGUAGE, psm=int(psm), oem=int(TOEM)) as api:
>>>>>>
>>>>>> setting the user patterns file with:
>>>>>>
>>>>>> api.SetVariable('user_patterns_file',
>>>>>> '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>>>>>>
>>>>>> where the user patterns file contains a pattern, e.g.:
>>>>>>
>>>>>> \A\A\A
>>>>>>
>>>>>> (which means three capital letters).
>>>>>>
>>>>>>
>>>>>> The results are the same whether or not I use the user_patterns_file
>>>>>> argument. This brings me to the question of whether tesserocr
>>>>>> supports user (and word) patterns.
>>>>>>
>>>>>> My versions:
>>>>>>
>>>>>> tesserocr 2.6.2
>>>>>> tesseract 5.3.3
>>>>>> leptonica-1.83.1
>>>>>> libpng 1.6.34 : zlib 1.2.11
>>>>>>
>>>>>> Thanks a lot for your help and best wishes,
>>>>>> Roman
>>>>>>
>>>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/CAL%3DSc5uQAOGF7dD%2BtP2xt93Phv9OYy6anDGLdar4gxZxEDwjYQ%40mail.gmail.com
>> .
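
One possible reason the SetVariable() call in the original message has no effect, offered as an assumption rather than a confirmed diagnosis: Tesseract loads its dictionaries during initialization, and the command-line --user-patterns option supplies user_patterns_file at that stage, so setting it afterwards may come too late. The sketch below passes the variable at construction time instead. It assumes the installed tesserocr accepts a `variables` dict in the PyTessBaseAPI constructor; the helper names `init_variables` and `ocr_with_patterns` are hypothetical, the tessdata path is the one from the message, and `lang="deu"` is a guess based on the deu.patterns file name.

```python
from typing import Dict


def init_variables(patterns_path: str) -> Dict[str, str]:
    """Variables to hand to Tesseract at init time (hypothetical helper)."""
    return {"user_patterns_file": patterns_path}


def ocr_with_patterns(image_path: str, patterns_path: str,
                      tessdata: str = "/usr/share/tesseract-ocr/4.00/tessdata",
                      lang: str = "deu") -> str:
    """Run OCR on one image with the patterns file supplied during init.

    Assumption: PyTessBaseAPI accepts `variables=`; check the installed
    tesserocr version before relying on this.
    """
    # Imported lazily so init_variables() can be used without tesserocr.
    from tesserocr import PSM, PyTessBaseAPI

    # PSM.SINGLE_WORD corresponds to the "psm 8" mentioned in the thread.
    with PyTessBaseAPI(path=tessdata, lang=lang, psm=PSM.SINGLE_WORD,
                       variables=init_variables(patterns_path)) as api:
        api.SetImageFile(image_path)
        return api.GetUTF8Text()
```

Calling `ocr_with_patterns("snippet.png", "/home/roman/Dev_d/playground/user_patterns/deu.patterns")` once with the patterns file and once with an empty one would show whether the init-time route changes the output; `snippet.png` is a placeholder name for the attached input image.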
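
The comparison described at the top of this message can also be scripted. The following is a minimal sketch, not a definitive test harness: it rebuilds the same five command lines, runs them, and prints which output lines differ from the baseline. The function names `build_command`, `changed_lines` and `run_comparison` are hypothetical; the file names Arial.png and my.patterns are the ones from the message and are assumed to be in the current directory.

```python
import difflib
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(image: str, out_base: str,
                  patterns: Optional[str] = None,
                  oem: Optional[int] = None) -> List[str]:
    """Build one tesseract command line like those in the message."""
    cmd = ["tesseract", image, out_base]
    if patterns is not None:
        cmd += ["--user-patterns", patterns]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    return cmd


def changed_lines(baseline: str, other: str) -> List[str]:
    """Return only the removed ("- ") and added ("+ ") lines between two texts."""
    diff = difflib.ndiff(baseline.splitlines(), other.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


def run_comparison(image: str = "Arial.png",
                   patterns: str = "my.patterns") -> None:
    """Run the baseline and each pattern variant, then print the differences."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # Baseline without user patterns; tesseract writes Arial.txt.
    subprocess.run(build_command(image, "Arial"), check=True)
    baseline = Path("Arial.txt").read_text(encoding="utf-8")
    # Output base name -> engine mode (None keeps the default engine).
    variants = {"Arial.pat": None, "Arial.pat.oem0": 0,
                "Arial.pat.oem1": 1, "Arial.pat.oem2": 2}
    for out_base, oem in variants.items():
        subprocess.run(build_command(image, out_base, patterns, oem),
                       check=True)
        text = Path(out_base + ".txt").read_text(encoding="utf-8")
        print(out_base, changed_lines(baseline, text) or "no change")
```

Calling `run_comparison()` performs the runs. Because of `check=True`, a failing variant (for example --oem 0 with traineddata that lacks the legacy model) stops the script with an error instead of being silently skipped.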