Re: [tesseract-ocr] Re: Creating Starter Traineddata

Dellu Bw Sat, 20 Jan 2024 05:19:27 -0800

You need to look the character up in the Unicode character list.

On Sat, Jan 20, 2024, 3:50 PM Simon <smong5...@gmail.com> wrote:


> Hey thanks for the response!
>
> How exactly do I add characters to the unicharset?
>
> Typically the unicharset has to follow a specific pattern (
> Tesseract-unicharset_uni-mannheim
> <https://digi.bib.uni-mannheim.de/tesseract/manuals/unicharset.5.html>)
>
> Here is an example of the Latin unicharset:
>
> ⇆ 0 24,76,166,249,122,224,6,30,136,256 Common 1600 10 1600 ⇆ # ⇆ [21c6 ]
>
> If I want to add for example this character "⌖" how would I know what
> numbers I need to put for the glyph information?
>
> And also, what do the "10" and "[21c6]" mean?
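
To make the fields concrete, here is a minimal Python sketch that splits the quoted line into named fields. The field names and their order are my reading of the unicharset(5) page linked above, so treat them as assumptions rather than anything Tesseract exposes; parse_unicharset_line and hex_code_points are hypothetical helpers. Read that way, the "10" is the direction field (a value from ICU's UCharDirection enum, where 10 means "other neutral"), and "[21c6 ]" belongs to the trailing comment: it is the character's Unicode code point in hex.

```python
# Field order as I read it from the unicharset(5) manual page (an assumption).
FIELD_NAMES = ["char", "properties", "glyph_metrics", "script",
               "other_case", "direction", "mirror", "normed_form"]

# Names for the ten comma-separated glyph metrics (same caveat applies).
METRIC_NAMES = ["min_bottom", "max_bottom", "min_top", "max_top",
                "min_width", "max_width", "min_bearing", "max_bearing",
                "min_advance", "max_advance"]


def parse_unicharset_line(line):
    """Split one unicharset entry into a dict of named fields (hypothetical helper)."""
    # Everything after " # " is a comment: the glyph plus its hex code points.
    entry, _, comment = line.partition(" # ")
    fields = dict(zip(FIELD_NAMES, entry.split()))
    metrics = [int(value) for value in fields["glyph_metrics"].split(",")]
    fields["glyph_metrics"] = dict(zip(METRIC_NAMES, metrics))
    for key in ("properties", "other_case", "direction", "mirror"):
        fields[key] = int(fields[key])
    fields["comment"] = comment.strip()
    return fields


def hex_code_points(text):
    """Return the code points of text in the style of the bracketed comment."""
    return " ".join(format(ord(ch), "x") for ch in text)


line = "⇆ 0 24,76,166,249,122,224,6,30,136,256 Common 1600 10 1600 ⇆ # ⇆ [21c6 ]"
entry = parse_unicharset_line(line)
print(entry["direction"])    # 10
print(hex_code_points("⌖"))  # 2316
```

So for "⌖" the bracketed part would be [2316 ]. The sketch deliberately ignores entries whose character is itself "#" or whitespace; those would need more careful splitting.
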
>
>
>
>
> elvi...@gmail.com schrieb am Freitag, 19. Januar 2024 um 16:22:24 UTC+1:
>
>> Yes, you need to add them before you create the starter model. You can
>> edit the Latin.unicharset before you run the combine command.
>>
>> On Fri, Jan 19, 2024, 5:27 PM Simon <smon...@gmail.com> wrote:
>>
>>> OK, somehow I had "no entry point found" errors in the DLL files.
>>> Reinstalling Tesseract solved the problem.
>>>
>>> Now I have run into another interesting problem.
>>>
>>> combine_lang_model --input_unicharset
>>> C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/Latin.unicharset
>>> --script_dir
>>> C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng --lang
>>> --output_dir C:/Users/LCAdmin/Documents/FineTuning/output
>>>
>>> When I run this command, Tesseract tries to load many unicharsets, and I
>>> don't understand why. What is the reason for loading all these
>>> unicharsets:
>>>
>>> Failed to load script unicharset
>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Latin.unicharset
>>> Failed to load script unicharset
>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Inherited.unicharset
>>> Failed to load script unicharset
>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Unknown.unicharset
>>> Failed to load script unicharset
>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Greek.unicharset
>>> Failed to load script unicharset
>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Armenian.unicharset
>>> Failed to load script unicharset
>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Arabic.unicharset
>>> Failed to load script unicharset
>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Devanagari.unicharset
>>> Failed to load script unicharset
>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Gujarati.unicharset
>>> Failed to load script unicharset
>>> from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Bopomofo.unicharset
>>>
>>> when I only want to train the English model?
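
One thing the log itself shows: every failing path has the form <script_dir>/<Script>.unicharset, and --script_dir points at langdata/eng, while the Latin.unicharset passed as --input_unicharset sits one level up in langdata. My guess (not confirmed against the tool's source) is that the other scripts are requested because entries in the input unicharset are tagged with them. Below is a small Python sketch for checking which script files a candidate directory actually contains before running the command; find_script_unicharsets is a hypothetical helper, and the demo builds a throwaway directory tree rather than touching real langdata.

```python
import tempfile
from pathlib import Path


def find_script_unicharsets(script_dir, scripts):
    """Return (found, missing) script names for <script_dir>/<Script>.unicharset."""
    base = Path(script_dir)
    found = [name for name in scripts if (base / f"{name}.unicharset").is_file()]
    missing = [name for name in scripts if name not in found]
    return found, missing


# Demo on a temporary tree that mimics the layout seen in the quoted paths:
# langdata/Latin.unicharset exists, langdata/eng/ has no script unicharsets.
with tempfile.TemporaryDirectory() as tmp:
    langdata = Path(tmp) / "langdata"
    (langdata / "eng").mkdir(parents=True)
    (langdata / "Latin.unicharset").write_text("", encoding="utf-8")

    scripts = ["Latin", "Greek"]
    result_eng = find_script_unicharsets(langdata / "eng", scripts)
    result_langdata = find_script_unicharsets(langdata, scripts)

print(result_eng)       # ([], ['Latin', 'Greek'])
print(result_langdata)  # (['Latin'], ['Greek'])
```

If the same check on the real directories shows the script files under langdata, then passing that directory as --script_dir is the first thing I would try.
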
>>>
>>> Another question also arose:
>>> When I want to train some new characters, do I have to add them to the
>>> Latin.unicharset before I create the starter traineddata, or do I just add
>>> them to the generated unicharset after I have created the starter
>>> traineddata?
>>>
>>> Simon schrieb am Freitag, 19. Januar 2024 um 10:38:24 UTC+1:
>>>
>>>> Here is a link to the Website of Uni Mannheim: COMBINE_LANG_MODEL -
>>>> generate starter traineddata
>>>> <https://digi.bib.uni-mannheim.de/tesseract/manuals/combine_lang_model.1.html>
>>>>
>>>> Unfortunately, the command doesn't create any files, and after running
>>>> it I get no feedback on why it didn't work properly.
>>>> Even when I purposely use nonexistent paths, I still get no error
>>>> message!
>>>>
>>>> PS C:\Windows\system32> combine_lang_model --input_unicharset
>>>> C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/Latin.unicharset
>>>> --script_dir
>>>> C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng  --lang eng
>>>> --wordlist
>>>> C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/eng.wordlist
>>>> --output_dir C:/Users/LCAdmin/Documents/FineTuning/output
>>>> PS C:\Users\LCAdmin\Documents\FineTuning>
>>>>
>>>> PS C:\Users\LCAdmin\Documents\FineTuning> combine_lang_model
>>>> --input_unicharset tesstutorial/langdata/Latin.unicharset --script_dir
>>>> tesstutorial/langdata/eng  --lang eng --wordlist
>>>> asdfasfdef/langdata/eng/eng.wordlist --output_dir output
>>>> PS C:\Users\LCAdmin\Documents\FineTuning>
>>>>
>>>> Does anyone have an idea how I can get log messages or anything else
>>>> that would tell me why it didn't work?
>>>>
>>>>
>>>>
>>>> Simon schrieb am Donnerstag, 18. Januar 2024 um 11:11:52 UTC+1:
>>>>
>>>>> Hello everybody,
>>>>>
>>>>> I have a question regarding "Fine Tuning +- a few characters".
>>>>>
>>>>> In general, the instructions at
>>>>> https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#fine-tuning-for--a-few-characters
>>>>> say that you have to make a starter traineddata from the unicharset, but
>>>>> is this also required if I want to fine-tune?
>>>>>
>>>>> Furthermore, I have no idea how to create a starter traineddata. I read
>>>>> the "Creating Starter Traineddata" chapter, but it is still unclear to me
>>>>> how to do that. Since the site is meant to be a tutorial, I expected
>>>>> step-by-step instructions.
>>>>>
>>>>> Can anyone help me with this?
>>>>>
>>>>> I am a newbie at Tesseract training, so I would appreciate any help.
>>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/31a0381f-f407-43d7-a9a1-8450394c20fcn%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/31a0381f-f407-43d7-a9a1-8450394c20fcn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>> --

