One correction:

I checked the example at the URL mentioned below with the Tesseract
executable and the tessdata repository. The result is that user patterns
also affect LSTM. This can easily be tested by generating output
without user patterns (Arial.txt):

tesseract Arial.png Arial

And with patterns:
tesseract Arial.png Arial.pat --user-patterns my.patterns
tesseract Arial.png Arial.pat.oem0 --user-patterns my.patterns --oem 0
tesseract Arial.png Arial.pat.oem1 --user-patterns my.patterns --oem 1
tesseract Arial.png Arial.pat.oem2 --user-patterns my.patterns --oem 2
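
To make the comparison repeatable, the commands above can be wrapped in a small script. This is only a sketch: the file names Arial.png and my.patterns are the ones from the commands above, while the helper names (build_command, run_all, changed_by_patterns) are made up for illustration. It assumes the tesseract executable is on PATH and that the installed traineddata includes the legacy model, which --oem 0 needs.

```python
import subprocess
from pathlib import Path

# Output base names and extra arguments, mirroring the commands above.
VARIANTS = {
    "Arial": [],
    "Arial.pat": ["--user-patterns", "my.patterns"],
    "Arial.pat.oem0": ["--user-patterns", "my.patterns", "--oem", "0"],
    "Arial.pat.oem1": ["--user-patterns", "my.patterns", "--oem", "1"],
    "Arial.pat.oem2": ["--user-patterns", "my.patterns", "--oem", "2"],
}


def build_command(image, outbase, extra):
    """Return the argv list for one tesseract run."""
    return ["tesseract", image, outbase, *extra]


def run_all(image="Arial.png"):
    """Run every variant and return {outbase: recognized text}.

    Requires tesseract on PATH; tesseract writes its text to <outbase>.txt.
    """
    texts = {}
    for outbase, extra in VARIANTS.items():
        subprocess.run(build_command(image, outbase, extra), check=True)
        texts[outbase] = Path(outbase + ".txt").read_text(encoding="utf-8")
    return texts


def changed_by_patterns(texts, baseline="Arial"):
    """List the runs whose text differs from the run without patterns."""
    return sorted(
        name
        for name, text in texts.items()
        if name != baseline and text != texts[baseline]
    )
```

Calling changed_by_patterns(run_all()) then lists the runs whose output differs from the run without patterns; if Arial.pat.oem1 is in that list, the patterns had an effect on the LSTM engine for that image.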

Zdenko


On Sun, 10 Mar 2024 at 17:32, Zdenko Podobny <zde...@gmail.com> wrote:

> Maybe I am wrong, but it looks to me like you are expecting from
> user-patterns something it never promises to provide.
> What we know/experienced:
>
>    - user-patterns extends the Tesseract legacy engine dictionary.
>    - putting a word/pattern into the Tesseract legacy engine dictionary
>    never guarantees that the word is recognized correctly (see the remark at
>    https://tesseract-ocr.github.io/tessdoc/APIExample-user_patterns.html)
>    - somebody (I cannot find the details, as it was a long time ago) ran
>    tests and found that the Tesseract legacy engine dictionary has limited
>    effect. For "non-word" text (like "codes" with mixed letters and digits),
>    people usually turn off the dictionary.
>    - some users prefer to use the Legacy engine for "codes" instead of
>    LSTM
>
> As far as I know, nobody has run tests regarding LSTM and dictionaries,
> e.g. an investigation of whether user-patterns also affect the LSTM engine
> (for LSTM there are new dictionary
> components: lstm-punc-dawg, lstm-word-dawg, lstm-number-dawg) ...
>
>
> Zdenko
>
>
> On Sun, 3 Mar 2024 at 23:02, Roman Seidel <roman.seide...@gmail.com> wrote:
>
>> To be more precise with my questions:
>>
>> - Is the user-patterns functionality implemented in the tesserocr
>> Python API of Tesseract?
>> - What is the exact syntax for specifying user patterns with the tesserocr
>> Python API? Is SetVariable() correct, and how are the path (Linux) and the
>> attribute specified?
>> - Is there a default path where the *.patterns /
>> *.user-patterns file is looked for?
>>
>> With the attached code from my last message, I've tested different
>> combinations with and without a whitelist, with different
>> attributes and path notations, none of which was successful.
>>
>> If I use the following notation for user patterns, it has no effect on
>> the results, regardless of the entries in the *.patterns file:
>>
>> api.SetVariable('user_patterns_file',
>> '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>>
>> Has anyone (successfully) used user patterns with the tesserocr
>> Python API of Tesseract?
>>
>> best wishes and thanks, Roman
>>
>>
>> On Sat, 2 Mar 2024 at 13:08, Zdenko Podobny <
>> zde...@gmail.com> wrote:
>>
>>> Can you please elaborate on:
>>>
>>> Nevertheless, user patterns is not working in the way described above.
>>>
>>>
>>>
>>> Zdenko
>>>
>>>
>>> On Sat, 2 Mar 2024 at 10:45, Roman Seidel <roman.seide...@gmail.com>
>>> wrote:
>>>
>>>> Yes, sure, the input file is a snippet with a capital letter followed
>>>> by 9 digits. The correct user pattern, corresponding to [1] is:
>>>>
>>>> ``\A\d\d\d\d\d\d\d\d\d``
>>>>
>>>> The result of Tesseract (psm 8) is fully correct. Nevertheless, user
>>>> patterns is not working in the way described above.
>>>>
>>>> For instance, I have tried to extract only the capital character with
>>>> user patterns (not with whitelist), which is:
>>>>
>>>> \A
>>>>
>>>> In this case, the capital letter and all digits are given back by
>>>> tesseract.
>>>>
>>>> I've attached my input file and the corresponding Python snippet for
>>>> reading and processing the image with tesserocr from [2].
>>>>
>>>>
>>>> [1]
>>>> https://github.com/tesseract-ocr/tesseract/blob/main/src/dict/trie.h#L197
>>>> [2] https://github.com/sirfz/tesserocr
>>>>
>>>>
>>>>
>>>> On Fri, 1 Mar 2024 at 18:59, René JM Clais <
>>>> reneclai...@gmail.com> wrote:
>>>>
>>>>> Can you send an example of an input document, the output of
>>>>> Tesseract, and what you would expect when using the pattern
>>>>> file?
>>>>>
>>>>> On Thu, 29 Feb 2024 at 21:40, Roman Seidel <roman.seide...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am currently trying to use user-patterns with the PyTessBaseAPI from
>>>>>> tesserocr [1].
>>>>>>
>>>>>> What I've done is to initialize the API with:
>>>>>>
>>>>>> with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4.00/tessdata',
>>>>>> lang=LANGUAGE, psm=int(psm), oem=int(TOEM)) as api:
>>>>>>
>>>>>> setting the user patterns file with:
>>>>>>
>>>>>> api.SetVariable('user_patterns_file',
>>>>>> '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>>>>>>
>>>>>> Where the user patterns file contains a pattern, e.g.:
>>>>>>
>>>>>> \A\A\A
>>>>>>
>>>>>> (which means three characters in capital letters).
>>>>>>
>>>>>>
>>>>>> The results are the same regardless of whether I use the
>>>>>> user_patterns_file argument or not. This brings me to the question of
>>>>>> whether tesserocr supports user (and word) patterns.
>>>>>>
>>>>>> My versions:
>>>>>>
>>>>>> tesserocr 2.6.2
>>>>>> tesseract 5.3.3
>>>>>>  leptonica-1.83.1
>>>>>>   libpng 1.6.34 : zlib 1.2.11
>>>>>>
>>>>>> Thanks a lot for your help and best wishes,
>>>>>> Roman
>>>>>>
>>>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/CAL%3DSc5uQAOGF7dD%2BtP2xt93Phv9OYy6anDGLdar4gxZxEDwjYQ%40mail.gmail.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/CAL%3DSc5uQAOGF7dD%2BtP2xt93Phv9OYy6anDGLdar4gxZxEDwjYQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
>> .
>>
>
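
On the SetVariable() question quoted above: the command-line runs pass --user-patterns when Tesseract starts, whereas the quoted snippet sets user_patterns_file after the API has already been initialized. As far as I understand, the dictionaries are loaded during initialization, so one thing worth trying is to hand the variable to the constructor instead. The sketch below assumes that tesserocr's PyTessBaseAPI accepts a variables dict applied at init time (please check this against your tesserocr version); the helper names are made up for illustration.

```python
from pathlib import Path


def write_patterns(path, patterns):
    """Write one pattern per line and return the absolute file path."""
    target = Path(path)
    target.write_text("\n".join(patterns) + "\n", encoding="utf-8")
    return str(target.resolve())


def init_kwargs(tessdata, lang, patterns_file, psm=8, oem=1):
    """Build constructor arguments for PyTessBaseAPI.

    Assumption: the `variables` keyword exists in the installed tesserocr
    and is applied while Tesseract initializes, i.e. before the
    dictionaries are loaded.
    """
    return {
        "path": tessdata,
        "lang": lang,
        "psm": psm,
        "oem": oem,
        "variables": {"user_patterns_file": patterns_file},
    }


def recognize(image_path, **kwargs):
    """Run OCR on one image; tesserocr is imported only when this is called."""
    from tesserocr import PyTessBaseAPI

    with PyTessBaseAPI(**kwargs) as api:
        api.SetImageFile(image_path)
        return api.GetUTF8Text()
```

For example, recognize("snippet.png", **init_kwargs("/usr/share/tesseract-ocr/4.00/tessdata", "deu", write_patterns("deu.patterns", [r"\A\d\d\d\d\d\d\d\d\d"]))), where snippet.png is a placeholder for the input image. Even when the file is picked up, a pattern only influences recognition; it does not filter the output the way a whitelist does.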

