Hi Jephthah,

If you are trying to train a new language, your first step is to produce a starter traineddata. Once you have the starter model, you can then prepare the training material (such as text lines or sentences from the language).
In this email, I will share how to produce a starter traineddata (starter model), as far as I understand it.

*Creating a starter traineddata:*

You need:

1. lang.unicharset: you can prepare it by hand. You can take the English sample and modify it. This file contains all the characters of the language.
2. script: if the language is written in Latin, you can download the Latin script from the tesseract GitHub repo (https://github.com/tesseract-ocr/langdata_lstm). If the language uses Cyrillic <https://github.com/tesseract-ocr/langdata_lstm/blob/main/Cyrillic.unicharset>, you download that script instead.
3. *Radical Stroke*: you can download it from the repo. But I think tesseract can also produce it automatically.

The following are *optional*:

4. *word*: if you want to add a word list, you can create one.
5. *number*: if you have patterns where numbers appear.
6. *punc*: if you have patterns where punctuation appears.

Assume your language is *English* (code eng). You are going to name those files as:

eng.unicharset
eng.word
eng.punc
eng.num

Put these files together in one folder (call it *langModel* for simplicity). Create two more folders, *script* and *myOutput*, inside the *langModel* folder; the downloaded script and Radical Stroke files go into *script*. Then point your terminal to the langModel folder and run

*combine_lang_model --input_unicharset eng.unicharset --script_dir script --output_dir myOutput --lang eng --words eng.word --puncs eng.punc --numbers eng.num*

That will produce a traineddata file, eng.traineddata, inside the myOutput folder (depending on your tesseract version, it may be written to a subfolder named after the language, e.g. myOutput/eng/). That is your starter traineddata/model. You will train from it once you have your ground truth texts.

On 16 Nov 2023 at 6:39:28 PM, Jephthah Anga <israeljay...@gmail.com> wrote:

> Hi Des,
>
> I am attempting to walk the same path you just walked and was hoping you could provide me with information on where to start. I want to train / create a new language in tesseract that would recognize texts of that language.
> How do I create the files you mentioned above? Is there a central wiki with all the info I need to get started? What were the biggest challenges you faced, and in your opinion is it feasible to attempt to create a new language?
>
> Thank you for your help
>
> On Sunday, September 10, 2023 at 2:49:15 p.m. UTC-2:30 desal...@gmail.com wrote:
>
>> I am trying to train a new language. I have prepared all the necessary files as per the manual. I have also combined them into a traineddata file using the *combine_lang_model* command.
>>
>> - I also have my training files, such as the text files, box files and .lstmf files, inside the oro-ground-truth folder.
>>
>> But I am having trouble proceeding from there. All the instructions for training from scratch talk about using tesstrain.sh, which the manual calls unsupported and outdated.
>>
>> - What should I do? Can you guys help me please?
>
> --
> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/78655442-7c94-4404-b609-ba5deaf345aen%40googlegroups.com.

--
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CA%2BLi4kCm%2BEDiTs3213L-qU_WF%3DvirvF_28V4snx57iCbLOk6tg%40mail.gmail.com.
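For reference, the folder layout and the combine_lang_model call described in the reply at the top of this thread could be put into a small script along these lines. This is only a sketch under a few assumptions: the language code eng, the folder names langModel, script and myOutput, and the list file names eng.word, eng.punc and eng.num are the illustrative ones from the email, and the variable names LANG_CODE, MODEL_DIR and CMD are made up for the example. The script only runs the tool when it is installed and the unicharset exists; otherwise it prints the command it would run.

```shell
#!/usr/bin/env bash
# Sketch only: lays out the folders from the email and assembles the
# combine_lang_model call. LANG_CODE and MODEL_DIR are illustrative values.
set -euo pipefail

LANG_CODE="eng"        # assumed language code; replace with your own
MODEL_DIR="langModel"  # folder holding the unicharset and the optional lists

# Folders the command refers to: script files go in script/, results in myOutput/.
mkdir -p "$MODEL_DIR/script" "$MODEL_DIR/myOutput"

# Required arguments, relative to MODEL_DIR because the command is run from there.
CMD=(combine_lang_model
  --input_unicharset "$LANG_CODE.unicharset"
  --script_dir script
  --output_dir myOutput
  --lang "$LANG_CODE")

# Add each optional list only if the file is actually present.
if [ -f "$MODEL_DIR/$LANG_CODE.word" ]; then CMD+=(--words "$LANG_CODE.word"); fi
if [ -f "$MODEL_DIR/$LANG_CODE.punc" ]; then CMD+=(--puncs "$LANG_CODE.punc"); fi
if [ -f "$MODEL_DIR/$LANG_CODE.num" ]; then CMD+=(--numbers "$LANG_CODE.num"); fi

# Run only when the tool and the unicharset are available; otherwise just print.
if command -v combine_lang_model >/dev/null 2>&1 \
   && [ -f "$MODEL_DIR/$LANG_CODE.unicharset" ]; then
  (cd "$MODEL_DIR" && "${CMD[@]}")
else
  echo "Would run inside $MODEL_DIR: ${CMD[*]}"
fi
```

Building the command as an array keeps the optional --words, --puncs and --numbers arguments out of the call when the corresponding files have not been prepared, which matches the email's point that those three lists are optional.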