[tesseract-ocr] Re: How to start from scratch (new language) in Tesseract 5

Des Bw Thu, 16 Nov 2023 10:10:57 -0800

 

Hi Jephthah,



*Creating a starter traineddata:*



You need: 

1. *unicharset*: you can prepare it by hand, for example by taking the 
English sample and modifying it. 

2. *script*: if the language is written in the Latin script, download 
Latin.unicharset from the langdata_lstm GitHub repo (
https://github.com/tesseract-ocr/langdata_lstm). If the language uses 
Cyrillic 
<https://github.com/tesseract-ocr/langdata_lstm/blob/main/Cyrillic.unicharset>, 
download the respective script file instead. 

*The following are optional:*


*3. word*: a word list, if you want to add one. 

*4. number*: patterns in which numbers appear, if you have any. 

*5. punc*: patterns in which punctuation appears, if you have any. 

(A sixth one is the radical-stroke file. You can download it from the repo 
above, but in my experience Tesseract creates it automatically.) 
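
If you do not have a word list yet, one way to get a first draft is to 
count the words in your own training text. The sketch below is only an 
illustration and not part of Tesseract: the function names are made up, and 
it assumes the word list is a plain UTF-8 file with one word per line and 
that words in your language are separated by whitespace. 

```python
from collections import Counter
import unicodedata


def _strip_punct(token: str) -> str:
    """Remove Unicode punctuation from both ends of a token."""
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[start:end]


def build_wordlist(text: str, min_count: int = 1) -> list[str]:
    """Return the unique words in text, most frequent first.

    Assumption: words are whitespace-separated. Scripts written without
    spaces need a different tokenizer.
    """
    tokens = (_strip_punct(t) for t in text.split())
    counts = Counter(t for t in tokens if t)
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, count in ranked if count >= min_count]


def write_wordlist(words: list[str], path: str) -> None:
    """Write one word per line, UTF-8, with Unix line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        fh.write("\n".join(words) + "\n")


# Usage (hypothetical file names):
#   text = open("my_corpus.txt", encoding="utf-8").read()
#   write_wordlist(build_wordlist(text, min_count=2), "jep.word")
```

Raising min_count is a cheap way to keep one-off typos in the corpus out of 
the list. 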


Assume the name of your language is *Jephthah*, with *jep* as its language 
code. Name the files as follows: 

jep.unicharset

jep.word

jep.punc

jep.number


Put these files together in one folder (call it *langModel* for 
simplicity). Inside the *langModel* folder, create two more folders, 
*script* and *myOutput*; the script file you downloaded goes into *script*. 
Then point your terminal to the langModel folder and run: 

*combine_lang_model --input_unicharset jep.unicharset --script_dir 
script --output_dir myOutput --lang jep --words jep.word --puncs jep.punc 
--numbers jep.number*
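
The same steps can be scripted. The following is a minimal sketch, not an 
official tool: the function name is hypothetical, and it assumes the folder 
and file names used above (langModel, script, myOutput, jep.word, jep.punc, 
jep.number). It creates the two subfolders and assembles the 
combine_lang_model command, adding the optional flags only for files that 
actually exist. 

```python
from pathlib import Path
import shlex


def build_command(root: Path, lang: str) -> list[str]:
    """Create the script/ and myOutput/ folders under root and return the
    combine_lang_model argument list for the given language code."""
    (root / "script").mkdir(parents=True, exist_ok=True)
    (root / "myOutput").mkdir(parents=True, exist_ok=True)

    cmd = [
        "combine_lang_model",
        "--input_unicharset", f"{lang}.unicharset",
        "--script_dir", "script",
        "--output_dir", "myOutput",
        "--lang", lang,
    ]
    # The word, punctuation and number files are optional, so their flags
    # are only added when the file is present in the langModel folder.
    optional = {
        "--words": f"{lang}.word",
        "--puncs": f"{lang}.punc",
        "--numbers": f"{lang}.number",
    }
    for flag, name in optional.items():
        if (root / name).is_file():
            cmd += [flag, name]
    return cmd


# Usage:
#   cmd = build_command(Path("langModel"), "jep")
#   print(shlex.join(cmd))          # inspect the command first
#   import subprocess
#   subprocess.run(cmd, cwd="langModel", check=True)   # needs Tesseract's
#                                                      # training tools on PATH
```

The paths in the command are relative, which is why the run has to happen 
from inside the langModel folder (cwd="langModel" in the usage note). 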


That will produce a traineddata file, *jep.traineddata*, inside the myOutput 
folder. That is your starter traineddata. 
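
To confirm that the file was really written, a small search helps. This is 
only a sketch with a made-up function name. It looks through myOutput 
recursively, on the assumption that the file may sit either directly in 
that folder or in a subfolder of it, which is worth checking on your own 
installation. 

```python
from pathlib import Path


def find_traineddata(output_dir: str, lang: str) -> list[Path]:
    """Return every <lang>.traineddata found anywhere under output_dir."""
    return sorted(Path(output_dir).rglob(f"{lang}.traineddata"))


# Usage:
#   for path in find_traineddata("myOutput", "jep"):
#       print(path, path.stat().st_size, "bytes")
# An empty result means combine_lang_model did not produce the file.
# If Tesseract's tools are installed, "combine_tessdata -d <file>" lists the
# components packed into the traineddata file.
```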

On Thursday, November 16, 2023 at 6:39:28 PM UTC+3 israel...@gmail.com 
wrote:

> Hi Des,
>
> I am attempting to walk the same path you just walked and was hoping you 
> could provide me with information on where to start. I want to train / 
> create a new language in tesseract that would recognize texts of that 
> language. How do I create the files you mentioned above? Is there a central 
> wiki with all the info I need to get started? What were the biggest 
> challenges you faced and in your opinion is it feasible to attempt to 
> create a new language?
>
> Thank you for your help
>
> On Sunday, September 10, 2023 at 2:49:15 p.m. UTC-2:30 desal...@gmail.com 
> wrote:
>
>> I am trying to train a new language. I have prepared all the 
>> necessary files as per the manual. I have also combined them into a 
>> traineddata file using the *combine_lang_model* command. 
>>
>> - I also have my training files, such as the text files, box files and 
>> .lstmf files, inside the oro-ground-truth folder. 
>>
>>
>> But I am having trouble proceeding from there. All the instructions for 
>> training from scratch talk about using tesstrain.sh, which the manual 
>> calls unsupported and outdated. 
>>
>> - What should I do? Can you guys help me please?
>>
>
