Maybe I am wrong, but it looks to me like you are expecting from
user-patterns something it never promises to provide.
What we know/experienced:

   - user-patterns extends the Tesseract legacy engine dictionary.
   - Adding a word/pattern to the Tesseract legacy engine dictionary never
   guarantees that the word is recognized correctly (see the remark at
   https://tesseract-ocr.github.io/tessdoc/APIExample-user_patterns.html).
   - Somebody (I cannot find the details, as it was a long time ago) ran tests
   and found that the Tesseract legacy engine dictionary has a limited
   effect. For "non-word" text (like "codes" with mixed letters and digits),
   people usually turn off the dictionary.
   - Some users prefer to use the legacy engine for "codes" instead of LSTM.
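
To make the point about expectations concrete, here is a minimal sketch in
plain Python (not part of Tesseract or tesserocr) that translates the pattern
syntax documented in trie.h into a regular expression. The function name
user_pattern_to_regex and the ASCII character classes are my own assumptions
for illustration; Tesseract itself uses its UNICHARSET and treats patterns as
dictionary entries, not as a filter on the output.

```python
import re
import string

# Pattern classes as listed in the comment in src/dict/trie.h.
# The regex classes are ASCII approximations (an assumption of this sketch),
# not Tesseract's UNICHARSET lookups.
_CLASSES = {
    "c": "[A-Za-z]",                              # \c: alphabetic character
    "d": "[0-9]",                                 # \d: digit
    "n": "[A-Za-z0-9]",                           # \n: letter or digit
    "p": "[" + re.escape(string.punctuation) + "]",  # \p: punctuation
    "a": "[a-z]",                                 # \a: lowercase letter
    "A": "[A-Z]",                                 # \A: uppercase letter
}


def user_pattern_to_regex(pattern):
    """Translate one line of a user-patterns file into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "*":
                # \* repeats the preceding character/class zero or more times
                if not parts:
                    raise ValueError("\\* needs a preceding element")
                parts[-1] += "*"
            elif nxt in _CLASSES:
                parts.append(_CLASSES[nxt])
            else:
                # any other escaped character is taken literally
                parts.append(re.escape(nxt))
            i += 2
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


code_rx = user_pattern_to_regex(r"\A\d\d\d\d\d\d\d\d\d")
print(bool(code_rx.fullmatch("K123456789")))
```

If the output must strictly follow a format, a check like
code_rx.fullmatch(text) on the recognized text is something you can enforce
yourself after OCR; a pattern such as \A in the user-patterns file will not
truncate or reject a longer result.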

As far as I know, nobody has tested LSTM with dictionaries, e.g. an
investigation of whether user-patterns also affect the LSTM engine (for LSTM
there are new dictionary components: lstm-punc-dawg, lstm-word-dawg,
lstm-number-dawg) ...
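
If somebody wants to run such a test, a comparison along these lines could be
a starting point. It is only a sketch under assumptions: dictionary-related
parameters such as user_patterns_file appear to be read when the engine is
initialized, so the sketch passes the file through the `variables` argument of
PyTessBaseAPI (which, as far as I can tell from the tesserocr README, sets
variables at init time) instead of calling SetVariable() afterwards. The
helper names are hypothetical, and I have not verified that this changes the
result for either engine.

```python
def build_init_kwargs(patterns_path, tessdata_path, lang="eng", oem=0):
    """Collect PyTessBaseAPI keyword arguments (hypothetical helper).

    patterns_path=None means "initialize without a user-patterns file".
    """
    kwargs = {
        "path": tessdata_path,
        "lang": lang,
        "psm": 8,    # single word, as in the original post
        "oem": oem,  # 0 = legacy engine only, 1 = LSTM only
    }
    if patterns_path is not None:
        # Assumption: variables passed here are applied during Init().
        kwargs["variables"] = {"user_patterns_file": patterns_path}
    return kwargs


def compare_patterns_effect(image_path, patterns_path, tessdata_path,
                            lang="eng", oem=0):
    """Run OCR once without and once with the patterns file."""
    # Imported lazily so the helper above can be used without tesserocr.
    from tesserocr import PyTessBaseAPI

    results = {}
    for label, path in (("without", None), ("with", patterns_path)):
        kwargs = build_init_kwargs(path, tessdata_path, lang=lang, oem=oem)
        with PyTessBaseAPI(**kwargs) as api:
            api.SetImageFile(image_path)
            results[label] = api.GetUTF8Text().strip()
    return results
```

Running compare_patterns_effect() once with oem=0 and once with oem=1 on the
same image would show whether the file makes any difference per engine. Note
that oem=0 needs a traineddata file that still contains the legacy components
(the "fast" and "best" models are LSTM only).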


Zdenko


On Sun, 3 Mar 2024 at 23:02, Roman Seidel <roman.seide...@gmail.com> wrote:

> To be more precise with my questions:
>
> - Is the user-patterns functionality implemented in the tesserocr
> Python API of tesseract?
> - What exactly is the syntax for specifying user patterns with the tesserocr
> Python API? Is SetVariable() correct, and how are the path (Linux) and the
> attribute specified?
> - Is there a default path where it looks for the *.patterns /
> *.user-patterns file?
>
> With the attached code from my last message, I've tested different
> combinations with/without a whitelist, different attributes, and
> different path notations, none of which was successful.
>
> If I use the following notation for user patterns, it has no effect on the
> results, regardless of the entries in the *.patterns file:
>
> api.SetVariable('user_patterns_file',
> '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>
> Has anyone (successfully) used user patterns with the tesserocr
> Python API of tesseract?
>
> best wishes and thanks, Roman
>
>
> On Sat, 2 Mar 2024 at 13:08, Zdenko Podobny <zde...@gmail.com>
> wrote:
>
>> Can you please elaborate on:
>>
>> Nevertheless, user patterns is not working in the way described above.
>>
>>
>>
>> Zdenko
>>
>>
>> On Sat, 2 Mar 2024 at 10:45, Roman Seidel <roman.seide...@gmail.com> wrote:
>>
>>> Yes, sure, the input file is a snippet with a capital letter followed by
>>> 9 digits. The correct user pattern, corresponding to [1] is:
>>>
>>> ``\A\d\d\d\d\d\d\d\d\d``
>>>
>>> The result of Tesseract (psm 8) is fully correct. Nevertheless, user
>>> patterns is not working in the way described above.
>>>
>>> For instance, I have tried to extract only the capital character with
>>> user patterns (not with whitelist), which is:
>>>
>>> \A
>>>
>>> In this case, the capital letter and all digits are returned by
>>> Tesseract.
>>>
>>> I've attached my input file and the corresponding Python snippet for
>>> reading and processing the image with tesserocr from [2].
>>>
>>>
>>> [1]
>>> https://github.com/tesseract-ocr/tesseract/blob/main/src/dict/trie.h#L197
>>> [2] https://github.com/sirfz/tesserocr
>>>
>>>
>>>
>>> On Fri, 1 Mar 2024 at 18:59, René JM Clais <
>>> reneclai...@gmail.com> wrote:
>>>
>>>> Can you send an example of an input document, the output of
>>>> tesseract, and what you expect to get using the pattern
>>>> file?
>>>>
>>>> On Thu, 29 Feb 2024 at 21:40, Roman Seidel <roman.seide...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am currently trying to use user-patterns with the PyTessBaseAPI from
>>>>> tesserocr [1].
>>>>>
>>>>> What I've done is to initialize the API with:
>>>>>
>>>>> with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4.00/tessdata',
>>>>>                    lang=LANGUAGE, psm=int(psm), oem=int(TOEM)) as api:
>>>>>
>>>>> setting the user patterns file with:
>>>>>
>>>>> api.SetVariable('user_patterns_file',
>>>>> '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>>>>>
>>>>> Where the user patterns file contains a pattern, e.g.:
>>>>>
>>>>> \A\A\A
>>>>>
>>>>> (which means three capital letters).
>>>>>
>>>>>
>>>>> The results are the same regardless of whether I use the
>>>>> user_patterns_file argument or not. This brings me to the question:
>>>>> does tesserocr support user (and word) patterns?
>>>>>
>>>>> My versions:
>>>>>
>>>>> tesserocr 2.6.2
>>>>> tesseract 5.3.3
>>>>>  leptonica-1.83.1
>>>>>   libpng 1.6.34 : zlib 1.2.11
>>>>>
>>>>> Thanks a lot for your help and best wishes,
>>>>> Roman
>>>>>
>>>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/CAL%3DSc5uQAOGF7dD%2BtP2xt93Phv9OYy6anDGLdar4gxZxEDwjYQ%40mail.gmail.com
> <https://groups.google.com/d/msgid/tesseract-ocr/CAL%3DSc5uQAOGF7dD%2BtP2xt93Phv9OYy6anDGLdar4gxZxEDwjYQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
>

