Re: [tesseract-ocr] Advice on training for Old Amharic texts

Menelik Berhan Sun, 14 Jan 2024 05:07:13 -0800

And yes, I'm using Ubuntu 20.04 on Windows with WSL.

On Sun, Jan 14, 2024 at 4:06 PM Menelik Berhan <menelikber...@gmail.com>
wrote:


> Yes, I'm in Addis.
> My PC is not that powerful either, but I could find a couple of good
> desktop PCs for the training.
>
> It would be my pleasure to meet in person. I have some questions about the
> training process that I'll ask when we meet.
>
> I'm free almost all day after 10 a.m. (EAT) (ketewatu arat seat, i.e. four
> o'clock in the morning by local time reckoning). Let me know the time and
> place of your convenience.
>
> Thanks
>
> On Sun, Jan 14, 2024 at 3:22 PM Dellu Bw <elvia...@gmail.com> wrote:
>
>> Hi Menelik, are you in Addis?
>> I have figured out most of the workings of Tesseract, but I got stuck
>> because of the electric blackouts and my underpowered PC. I feel that we
>> can train everything in Ethiopic script (Ge'ez, Amharic, Tigrinya and every
>> other language) in one sweep. I have about 8 GB of data to train Amharic,
>> but my PC just cannot handle it. We can meet in person and generate
>> (collect) more data to include the other Ethiopic languages and train it.
>> (Sorry, I am writing on my phone.)
>>
>> On Sun, Jan 14, 2024, 3:14 PM Dellu Bw <elvia...@gmail.com> wrote:
>>
>>> Most of the guides written for version 4 actually work for version 5. The
>>> changes are minimal. It is better to keep version 5 because it seems to
>>> perform better. Are you using Linux?
>>>
>>> On Sat, Jan 13, 2024, 4:08 PM Menelik Berhan <menelikber...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for your swift reply. It would be my pleasure to collaborate
>>>> with you.
>>>>
>>>> I've noticed that there are extensive guides and tutorials on training
>>>> Tesseract 4.x, and I wanted to switch to the 4.x version.
>>>> What would be the trade-off if I used Tesseract 4.x instead of 5.x?
>>>>
>>>> Thanks for your time!!!
>>>>
>>>>
>>>> On Saturday, January 13, 2024 at 12:49:36 PM UTC+3 elvi...@gmail.com
>>>> wrote:
>>>>
>>>>> I spent some time trying to improve the default model for Amharic. The
>>>>> default model has a couple of characters missing. As I have noted in many
>>>>> posts in this forum, training by removing the top layer is the best method
>>>>> to introduce new characters.
>>>>>
>>>>> But I really struggled because the training was degrading the base
>>>>> (default) model. I am also short of processing power.
>>>>> Tesseract 5.3 also has some flaws that make it hard to use in
>>>>> third-world countries (electric blackouts).
>>>>>
>>>>> Dear Menelik, we might need to put our hands together on this.
>>>>>
>>>>> On Sat, Jan 13, 2024, 11:21 AM Menelik Berhan <meneli...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> *Background*
>>>>>> I'm trying to use tesseract 5.3.3 on scanned old books written in
>>>>>> Amharic (which uses Ethiopic script).
>>>>>>
>>>>>> *Major Shortcomings of amh.traineddata from tesseract*
>>>>>>
>>>>>> *Difference in type of Ethiopic script:* there are Ethiopic script
>>>>>> characters in old Amharic texts that are not in the unicharset of
>>>>>> amh.traineddata.
>>>>>>
>>>>>> *Difference in punctuation styles:* the old texts use some punctuation
>>>>>> marks not used in modern Amharic. For some marks that are still used in
>>>>>> modern Amharic, the old texts follow a different spacing pattern: they
>>>>>> always put a space between a punctuation character and both the
>>>>>> preceding and the following word, while modern texts put no space
>>>>>> between a punctuation character and the preceding word.
>>>>>>
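One way to make a modern training_text follow the old spacing convention described above is to normalise the spacing around Ethiopic punctuation before rendering line images. The sketch below is only an illustration: the function name is hypothetical, it assumes the punctuation of interest lies in U+1361 to U+1368, and it keeps runs of consecutive marks together, which may or may not match the books in question.

```python
import re

# Ethiopic punctuation from U+1361 (wordspace) to U+1368 (paragraph separator).
# Consecutive marks are treated as one run so they are not split apart.
_PUNCT_RUN = re.compile(r"\s*([\u1361-\u1368]+)\s*")


def space_out_punctuation(line: str) -> str:
    """Put exactly one space on each side of every run of Ethiopic punctuation."""
    spaced = _PUNCT_RUN.sub(r" \1 ", line)
    # Collapse any doubled whitespace and trim the ends of the line.
    return re.sub(r"\s+", " ", spaced).strip()


if __name__ == "__main__":
    # Hypothetical sample: two letters separated by an Ethiopic full stop.
    print(space_out_punctuation("\u1200\u1362\u1208"))
```

The same function can be mapped over every line of a corpus before it is passed to the image generator.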
>>>>>> *Very narrow training_text & wordlist (based on
>>>>>> tesseract/langdata_lstm)*
>>>>>> The amh.training_text & amh.wordlist text files used by tesseract
>>>>>> (the ones from langdata_lstm) are very small. To give you an idea: for
>>>>>> tir.traineddata (another language which uses Ethiopic script), the
>>>>>> tir.training_text from langdata_lstm has more than 400,000 lines, while
>>>>>> the amh.training_text has only around 400 lines.
>>>>>>
>>>>>> *Other challenges*
>>>>>>
>>>>>>    - The old Amharic books use a font that's not in use (or
>>>>>>    available).
>>>>>>    - The old Amharic books contain many Ge'ez words (Ge'ez is a
>>>>>>    liturgical language, like Latin, that uses Ethiopic script).
>>>>>>    - The old Amharic books mostly use Ge'ez numerals, while modern
>>>>>>    Amharic texts use Arabic numerals.
>>>>>>
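Since modern corpora rarely contain Ge'ez numerals, one possible way to get them into synthetic training text is to convert Arabic numerals programmatically. The sketch below is an assumption-laden illustration, not an established tool: the function name is hypothetical, it only covers 1 to 9999, and it follows the common convention of writing 100 as the bare hundred sign rather than "one hundred".

```python
# Ge'ez digits one..nine are U+1369..U+1371, tens ten..ninety are
# U+1372..U+137A, and the hundred sign is U+137B.
ONES = [""] + [chr(0x1369 + i) for i in range(9)]
TENS = [""] + [chr(0x1372 + i) for i in range(9)]
HUNDRED = chr(0x137B)


def _pair(n: int) -> str:
    """Render a value in 0..99 as a tens sign followed by a ones sign."""
    return TENS[n // 10] + ONES[n % 10]


def to_geez_numeral(n: int) -> str:
    """Convert an integer in 1..9999 to Ge'ez numerals (larger values unsupported)."""
    if not 1 <= n <= 9999:
        raise ValueError("this sketch only handles 1..9999")
    hundreds, rest = divmod(n, 100)
    out = ""
    if hundreds:
        # A multiplier of one is left implicit before the hundred sign.
        out += ("" if hundreds == 1 else _pair(hundreds)) + HUNDRED
    return out + _pair(rest)


if __name__ == "__main__":
    for year in (1885, 1999, 2016):
        print(year, to_geez_numeral(year))
```

Numbers of 10,000 and above need the ten-thousand sign (U+137C) and extra grouping rules, which this sketch deliberately leaves out.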
>>>>>> *WHAT I'VE DONE SO FAR*
>>>>>> As an experiment, I've tried to fine-tune amh.traineddata_best (using
>>>>>> `make training`) for 10,000 iterations with close to 300 line images &
>>>>>> texts (from sample pages of some old Amharic books) and using files from
>>>>>> langdata_lstm.
>>>>>>
>>>>>> The resulting traineddata has a very satisfactory improvement in
>>>>>> addressing some of the challenges mentioned above, especially those
>>>>>> regarding punctuation chars.
>>>>>>
>>>>>> But it still fails to solve the problems I have with some characters
>>>>>> (the ones not present in the unicharset of amh.traineddata) and fails for
>>>>>> almost all Ge'ez numerals (even though the training sample pages have many
>>>>>> Ge'ez numerals).
>>>>>>
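To see exactly which characters the fine-tuned model cannot produce, the ground-truth transcriptions can be compared against the model's unicharset. The sketch below assumes the unicharset is available as a text file (for example after unpacking the traineddata with `combine_tessdata -u`), that its first line is an entry count, and that every following line starts with the symbol. The function names are hypothetical.

```python
# Unicode blocks that hold Ethiopic characters.
ETHIOPIC_BLOCKS = [
    (0x1200, 0x137F),  # Ethiopic
    (0x1380, 0x139F),  # Ethiopic Supplement
    (0x2D80, 0x2DDF),  # Ethiopic Extended
    (0xAB00, 0xAB2F),  # Ethiopic Extended-A
]


def ethiopic_chars(text):
    """Return the set of Ethiopic characters that occur in text."""
    return {
        ch for ch in text
        if any(lo <= ord(ch) <= hi for lo, hi in ETHIOPIC_BLOCKS)
    }


def read_unicharset_symbols(lines):
    """Collect the first token of every line after the count line (assumed layout)."""
    symbols = set()
    for line in list(lines)[1:]:
        parts = line.split()
        if parts:
            symbols.add(parts[0])
    return symbols


def missing_from_unicharset(ground_truth_text, unicharset_symbols):
    """List Ethiopic characters used in the ground truth but absent from the unicharset."""
    return sorted(ethiopic_chars(ground_truth_text) - set(unicharset_symbols))


if __name__ == "__main__":
    # Hypothetical in-memory data standing in for real files.
    unicharset = ["3", "NULL 0 ...", "\u1200 1 ...", "\u1208 1 ..."]
    print(missing_from_unicharset("\u1200\u1208 \u1369", read_unicharset_symbols(unicharset)))
```

Any character this reports cannot be emitted by plain fine-tuning of the existing output layer, which is consistent with the failures described above.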
>>>>>> *WHAT I'M PLANNING TO DO*
>>>>>> First, I want to train tesseract with large training_text & wordlist
>>>>>> files, and also a complete unicharset file.
>>>>>> Then, fine-tune the resulting traineddata based on sample line images
>>>>>> from the old books.
>>>>>>
>>>>>> *QUESTIONS (for now. I'll definitely add more questions later)*
>>>>>> Is there another path I should take that would get me to where I want?
>>>>>>
>>>>>> *Regarding training tesseract with large training_text & wordlist
>>>>>> files, and also a complete unicharset file:*
>>>>>>
>>>>>>    - How to prepare the training_text & wordlist files? (What should the
>>>>>>    text files contain?)
>>>>>>    - How to prepare the unicharset file, and how to pass it to the
>>>>>>    `make training` command?
>>>>>>
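For the wordlist question above, a minimal sketch of one possible approach is to derive the list from the training text itself, ordered by frequency. It assumes the wordlist is plain text with one word per line (as the langdata_lstm files appear to be) and that splitting on whitespace and Ethiopic punctuation is good enough for tokenising. The function name and the `min_count` parameter are hypothetical.

```python
import re
from collections import Counter

# Split on whitespace and on Ethiopic punctuation (U+1361..U+1368).
_SPLIT = re.compile(r"[\s\u1361-\u1368]+")


def build_wordlist(lines, min_count=1):
    """Return distinct words, most frequent first, ties broken alphabetically."""
    counts = Counter()
    for line in lines:
        counts.update(word for word in _SPLIT.split(line) if word)
    words = [word for word, count in counts.items() if count >= min_count]
    return sorted(words, key=lambda word: (-counts[word], word))


if __name__ == "__main__":
    # Hypothetical two-line corpus; write the result one word per line.
    corpus = ["\u1200\u1208 \u1230\u1362", "\u1200\u1208"]
    print("\n".join(build_wordlist(corpus)))
```

Raising `min_count` is a simple way to drop one-off tokens, such as OCR noise, from a large scraped corpus.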
>>>>>>
>>>>>> *Regarding generating a text, image(tif) and box file from
>>>>>> training_text:*
>>>>>>
>>>>>> I've looked up Python scripts to do this job, but I have questions about
>>>>>> the proper values for these params in text2image:
>>>>>> --font (what criteria should I use to select the list of fonts),
>>>>>> --leading, --xsize, --ysize, --char_spacing, --exposure,
>>>>>> --unicharset_file and --margin.
>>>>>>
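To make the parameters listed above concrete, here is a small sketch that assembles a text2image command line without running it. The numeric values are placeholders for experimentation, not recommended settings, and the font name and paths in the usage example are hypothetical. The `--text`, `--outputbase` and `--fonts_dir` flags are added because text2image needs an input, an output base name and a font directory.

```python
def text2image_command(text_path, output_base, font, fonts_dir, *,
                       leading=32, xsize=3600, ysize=480,
                       char_spacing=0.0, exposure=0, margin=100,
                       unicharset_file=None):
    """Build the argument list for one text2image run (all values are placeholders)."""
    cmd = [
        "text2image",
        f"--text={text_path}",
        f"--outputbase={output_base}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
        f"--leading={leading}",
        f"--xsize={xsize}",
        f"--ysize={ysize}",
        f"--char_spacing={char_spacing}",
        f"--exposure={exposure}",
        f"--margin={margin}",
    ]
    if unicharset_file is not None:
        cmd.append(f"--unicharset_file={unicharset_file}")
    return cmd


if __name__ == "__main__":
    # Hypothetical paths and font; run with subprocess.run(cmd, check=True).
    cmd = text2image_command("amh.training_text", "out/amh.line0",
                             "Abyssinica SIL", "/usr/share/fonts")
    print(" ".join(cmd))
```

Keeping the command construction in one function makes it easy to loop over several fonts and exposure values and compare the rendered lines against the scanned pages.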
>>>>>> I've noticed from the tesstrain repo for tesseract 5 that the line images
>>>>>> are tightly cropped (with minimal margin around the text line). Is the
>>>>>> same property (minimal margins) required/desired of the line images
>>>>>> generated using text2image from the training_text?
>>>>>>
>>>>>> *THANKS FOR YOUR TIME !!!*
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/9bda9bc4-b07a-491b-b8fc-fbb25b54c368n%40googlegroups.com
>>>>>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAEQfXZU9BsBjtYuCg1XY1H25zco%2Bid6k7%3DtkRnW5gmEgainyCQ%40mail.gmail.com.
