Re: [tesseract-ocr] Re: How to start from scratch (new language) in Tesseract 5

Dellu Bw Thu, 16 Nov 2023 10:00:11 -0800

 Hi Jephthah,
If you are trying to train a new language, your first step is to produce a
starter traineddata. Once you have the starter model, you can then produce
the training material (such as text lines, i.e. sentences from the language).

In this email, I will share how to produce a starter traineddata
(starter model), as far as I understand it.

*Creating a starter traineddata: *

You need:

   1. lang.unicharset: you can prepare it by hand. You can take the English
   sample and modify it. This file contains all the characters of the
   language.
   2. script: if the language is written in the Latin script, you can
   download Latin.unicharset from the tesseract langdata_lstm GitHub repo (
   https://github.com/tesseract-ocr/langdata_lstm). If the language uses
   Cyrillic, download Cyrillic.unicharset
   <https://github.com/tesseract-ocr/langdata_lstm/blob/main/Cyrillic.unicharset>
   instead.
   3. *Radical Stroke*: you can download it (radical-stroke.txt) from the
   same repo. But, I think tesseract can also automatically produce it.

   The following are *optional*:
   4. *word*: if you want to add a word list, you can create one.
   5. *number*: if you have patterns in which numbers appear.
   6. *punc*: if you have patterns in which punctuation appears.
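
As a rough sketch of the download step for items 2 and 3, the script below fetches the script unicharset and the radical-stroke file into a local folder. The variable names (BASE_URL, SCRIPT_NAME, SCRIPT_DIR, FETCH) are made up for this example, and the raw.githubusercontent.com address is an assumption derived from the repo link above. Nothing is downloaded unless FETCH=1 is set, so it can be run safely as a dry run first.

```shell
#!/bin/sh
# Sketch: fetch the script-level files needed for a starter traineddata.
# BASE_URL is an assumed raw-file address for the langdata_lstm repo.
BASE_URL="https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/main"
SCRIPT_NAME="${SCRIPT_NAME:-Latin}"          # use Cyrillic for Cyrillic-script languages
SCRIPT_DIR="${SCRIPT_DIR:-langModel/script}" # hypothetical target folder

mkdir -p "$SCRIPT_DIR"

for name in "$SCRIPT_NAME.unicharset" radical-stroke.txt; do
    url="$BASE_URL/$name"
    if [ "${FETCH:-0}" = "1" ]; then
        # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects
        curl -fsSL -o "$SCRIPT_DIR/$name" "$url"
    else
        echo "would download $url -> $SCRIPT_DIR/$name"
    fi
done
```

Run it as `FETCH=1 SCRIPT_NAME=Cyrillic sh fetch_script.sh` (the script file name is arbitrary) to actually download the Cyrillic files.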

Assume your language is *English* (language code eng): you are going to
name those files as:

eng.unicharset

eng.word

eng.punc

eng.number
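
If you do not have a word list yet, one way to get a first draft is to split a plain-text sample of the language into unique words. This is only a minimal sketch: corpus.txt is a hypothetical file name, the demo corpus is there just so the script runs on its own, and punctuation attached to words is not stripped.

```shell
#!/bin/sh
# Sketch: build a simple word list (one word per line) from a text sample.
CORPUS="${CORPUS:-corpus.txt}"    # hypothetical plain-text sample of the language
WORDLIST="${WORDLIST:-eng.word}"  # output name matches the optional word file

# Demo corpus so the sketch is self-contained; replace it with real text.
[ -e "$CORPUS" ] || printf 'the cat saw the dog\nthe dog ran\n' > "$CORPUS"

# Put each whitespace-separated token on its own line, drop empty lines,
# and keep one copy of every distinct word.
tr -s '[:space:]' '\n' < "$CORPUS" \
    | sed '/^$/d' \
    | LC_ALL=C sort -u > "$WORDLIST"

echo "wrote $(wc -l < "$WORDLIST" | tr -d ' ') words to $WORDLIST"
```

You will probably want to review the result by hand and remove tokens that are not real words.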

Put these files together in one folder (call it *langModel* for
simplicity). Create two more folders, *script* and *myOutput*, inside the
*langModel* folder; *script* holds the downloaded script unicharset and
radical-stroke files. Then point your terminal to the langModel folder
and run *combine_lang_model --input_unicharset eng.unicharset --script_dir
script --output_dir myOutput --lang eng --words eng.word --puncs eng.punc
--numbers eng.number*
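
Put together, the folder layout and the command might look like the sketch below. It assumes the file names used in this email, creates empty placeholder files only where none exist (so the layout is visible), and runs combine_lang_model only when the tool is installed and eng.unicharset actually has content.

```shell
#!/bin/sh
# Sketch of the langModel layout described above, followed by the
# combine_lang_model call. File names follow the email's example.
mkdir -p langModel/script langModel/myOutput

(
    cd langModel || exit 1

    # Placeholders only; the real content has to be prepared beforehand.
    for f in eng.unicharset eng.word eng.punc eng.number; do
        [ -e "$f" ] || : > "$f"
    done

    # Skip the call if the tool is missing or the unicharset is still empty.
    if command -v combine_lang_model >/dev/null 2>&1 && [ -s eng.unicharset ]; then
        combine_lang_model \
            --input_unicharset eng.unicharset \
            --script_dir script \
            --output_dir myOutput \
            --lang eng \
            --words eng.word \
            --puncs eng.punc \
            --numbers eng.number
    else
        echo "combine_lang_model not run: tool missing or eng.unicharset empty"
    fi
)
```

The subshell keeps the `cd` local, so your terminal stays in the directory you started from.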

That will produce a traineddata file, eng.traineddata, inside the myOutput
folder. That is your starter traineddata/model. You will continue training
from it once you have your ground truth texts.
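
To check that the starter model was really written, something like the following sketch can help. It searches the output folder recursively, since the file may sit in a subfolder, and lists the model's components with combine_tessdata -d when that tool is available. OUT_DIR and LANG_CODE are names invented for this example.

```shell
#!/bin/sh
# Sketch: locate the starter traineddata and list what it contains.
OUT_DIR="${OUT_DIR:-langModel/myOutput}"  # hypothetical output folder
LANG_CODE="${LANG_CODE:-eng}"

# Search recursively in case the file sits in a per-language subfolder.
model=""
if [ -d "$OUT_DIR" ]; then
    model=$(find "$OUT_DIR" -type f -name "$LANG_CODE.traineddata" | head -n 1)
fi

if [ -z "$model" ]; then
    echo "no $LANG_CODE.traineddata found under $OUT_DIR"
elif command -v combine_tessdata >/dev/null 2>&1; then
    # -d prints the components stored inside the traineddata file.
    combine_tessdata -d "$model"
else
    echo "found $model (combine_tessdata not installed, skipping listing)"
fi
```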

On 16 Nov 2023 at 6:39:28 PM, Jephthah Anga <israeljay...@gmail.com> wrote:

> Hi Des,
>
> I am attempting to walk the same path you just walked and was hoping you
> could provide me with information on where to start. I want to train /
> create a new language in tesseract so that it recognizes texts in that
> language. How do I create the files you mentioned above? Is there a central
> wiki with all the info I need to get started? What were the biggest
> challenges you faced, and in your opinion is it feasible to attempt to
> create a new language?
>
> Thank you for your help
>
> On Sunday, September 10, 2023 at 2:49:15 p.m. UTC-2:30 desal...@gmail.com
> wrote:
>
>> I am trying to train a new language. I have prepared all the
>> necessary files as per the manual. I have also combined them into a
>> traineddata file using the *combine_lang_model* command.
>>
>> - I also have my training files, such as the text files, box files and
>> .lstmf files, inside the oro-ground-truth folder.
>>
>>
>> But I am having trouble proceeding from there. All the instructions for
>> training from scratch talk about using tesstrain.sh, which the manual
>> calls unsupported and outdated.
>>
>> - What should I do? Can you guys help me please?
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/78655442-7c94-4404-b609-ba5deaf345aen%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/78655442-7c94-4404-b609-ba5deaf345aen%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

