And yes, I'm using Ubuntu 20.04 on Windows with WSL.

On Sun, Jan 14, 2024 at 4:06 PM Menelik Berhan <menelikber...@gmail.com> wrote:
> Yes, I'm in Addis.
> My PC is not that powerful either, but I could find a couple of good
> desktop PCs for the training.
>
> It would be my pleasure to meet in person. I have some questions about
> the training process that I'll ask when we meet.
>
> I'm free almost all day after 10 a.m. (EAT) (ketewatu arat seat, i.e.
> from 4 in the morning, local time). Let me know the time and place of
> your convenience.
>
> Thanks
>
> On Sun, Jan 14, 2024 at 3:22 PM Dellu Bw <elvia...@gmail.com> wrote:
>
>> Hi Menelik, are you in Addis?
>> I have figured out most of the workings of Tesseract. I really fell
>> into a trap because of the electric blackouts and the underpowered PC.
>> I feel that we can train everything Ethiopic (Geez, Amharic, Tigrinya
>> and every other) in one sweep. I have about 8 GB of data to train
>> Amharic, but my PC just cannot handle it. We can meet in person and
>> generate (collect) more data to include the other Ethiopic languages
>> and train it.
>> (Sorry, I am writing on my phone.)
>>
>> On Sun, Jan 14, 2024, 3:14 PM Dellu Bw <elvia...@gmail.com> wrote:
>>
>>> Most of the guides written for version 4 actually work for version 5.
>>> The changes are minimal. It is better to keep version 5 because it
>>> seems to perform better. Are you using Linux?
>>>
>>> On Sat, Jan 13, 2024, 4:08 PM Menelik Berhan <menelikber...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for your swift reply. It would be my pleasure to collaborate
>>>> with you.
>>>>
>>>> I've noticed that there are extensive guides and tutorials on
>>>> training Tesseract 4.x, and I wanted to switch to version 4.x.
>>>> I wanted to ask what the trade-off would be if I used Tesseract 4.x
>>>> instead of 5.x.
>>>>
>>>> Thanks for your time!!!
>>>>
>>>> On Saturday, January 13, 2024 at 12:49:36 PM UTC+3 elvi...@gmail.com
>>>> wrote:
>>>>
>>>>> I spent some time trying to improve the default model of Amharic.
>>>>> The default model has a couple of characters missing.
>>>>> As I have noted in many posts in this forum, training by removing
>>>>> the top layer is the best method to introduce new characters.
>>>>>
>>>>> But I really struggled because the training is deteriorating the
>>>>> base (default) model. I am also short of processing power.
>>>>> Tesseract 5.3 also has some flaws which make it hard to use in
>>>>> third-world countries (electric blackouts).
>>>>>
>>>>> Dear Menelik, we might need to put our hands together on this.
>>>>>
>>>>> On Sat, Jan 13, 2024, 11:21 AM Menelik Berhan <meneli...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> *Background*
>>>>>> I'm trying to use Tesseract 5.3.3 on scanned old books written in
>>>>>> Amharic (which uses the Ethiopic script).
>>>>>>
>>>>>> *Major Shortcomings of amh.traineddata from tesseract*
>>>>>>
>>>>>> *Difference in type of Ethiopic script:* there are Ethiopic script
>>>>>> characters in old Amharic texts that are not in the unicharset of
>>>>>> amh.traineddata.
>>>>>>
>>>>>> *Difference in punctuation styles:* the old texts use some
>>>>>> punctuation marks not used in modern Amharic, and for some that
>>>>>> are used in modern Amharic, the old texts follow a different
>>>>>> pattern (mostly the spacing between word and punctuation
>>>>>> character: the old texts always put a space between punctuation
>>>>>> characters and both the preceding and following words, whereas in
>>>>>> modern texts there is no space between these punctuation
>>>>>> characters and the preceding word).
>>>>>>
>>>>>> *Very narrow training_text & wordlist (based on
>>>>>> tesseract/langdata_lstm)*
>>>>>> The amh.training_text & amh.wordlist text files used by tesseract
>>>>>> (the ones from langdata_lstm) are very small.
>>>>>> (To give you an idea: for tir.traineddata (another language which
>>>>>> uses the Ethiopic script), the tir.training_text from
>>>>>> langdata_lstm has more than 400,000 lines, while the
>>>>>> amh.training_text has only around 400 lines.)
>>>>>>
>>>>>> *Other challenges*
>>>>>>
>>>>>> - The old Amharic books use a font that's no longer in use (or
>>>>>>   available).
>>>>>> - The old Amharic books contain many Ge'ez words (Ge'ez is a
>>>>>>   liturgical language, like Latin, which uses the Ethiopic script).
>>>>>> - The old Amharic books mostly use Ge'ez numerals, while modern
>>>>>>   Amharic texts use Arabic numerals.
>>>>>>
>>>>>> *WHAT I'VE DONE SO FAR*
>>>>>> As an experiment I've tried to fine-tune amh.traineddata_best
>>>>>> (using `make training`) with close to 300 line images & texts
>>>>>> (from sample pages of some old Amharic books) and using files from
>>>>>> langdata_lstm (for 10,000 iterations).
>>>>>>
>>>>>> The resulting traineddata shows a very satisfactory improvement in
>>>>>> addressing some of the challenges mentioned above, especially
>>>>>> those regarding punctuation characters.
>>>>>>
>>>>>> But it still fails to solve the problems I have with some
>>>>>> characters (the ones not present in the unicharset of
>>>>>> amh.traineddata) and fails for almost all Ge'ez numerals (even
>>>>>> though the training sample pages have many Ge'ez numerals).
>>>>>>
>>>>>> *WHAT I'M PLANNING TO DO*
>>>>>> First I want to train tesseract with large training_text &
>>>>>> wordlist files, and also a complete unicharset file.
>>>>>> Then fine-tune the resulting traineddata on sample line images
>>>>>> from the old books.
>>>>>>
>>>>>> *QUESTIONS (for now. I'll definitely add more questions later)*
>>>>>> Is there another path I should take that would get me to where I
>>>>>> want?
>>>>>>
>>>>>> *Regarding training tesseract with large training_text & wordlist
>>>>>> files, and also a complete unicharset file:*
>>>>>>
>>>>>> - How to prepare the training_text & wordlist file?
>>>>>>   (What should the text files contain?)
>>>>>> - How to prepare the unicharset file, and how to pass it to the
>>>>>>   `make training` command?
>>>>>>
>>>>>> *Regarding generating a text, image (tif) and box file from
>>>>>> training_text:*
>>>>>>
>>>>>> I've looked up Python scripts to do this job, but have questions
>>>>>> about the proper values for these params in text2image:
>>>>>> --font (what criteria should I use to select the list of fonts),
>>>>>> --leading, --xsize, --ysize, --char_spacing, --exposure,
>>>>>> --unicharset_file and --margin.
>>>>>>
>>>>>> I've noticed from the tesstrain repo for tesseract 5 that the line
>>>>>> images are tightly cropped (with minimal margin around the text
>>>>>> line). Is the same property (minimal margins) required/desired of
>>>>>> the line images generated using text2image from the training_text?
>>>>>>
>>>>>> *THANKS FOR YOUR TIME !!!*
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/9bda9bc4-b07a-491b-b8fc-fbb25b54c368n%40googlegroups.com
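One of the problems raised in the thread is that the old books contain characters that are not in the unicharset of amh.traineddata. Before preparing a larger training_text, it may help to list exactly which characters are missing. The following is only a minimal sketch, not part of Tesseract or tesstrain. It assumes the unicharset has already been extracted to a plain-text file (for example with `combine_tessdata -u`), that the first line of that file is the entry count, and that every following line starts with the unichar followed by a space. The function names and file names are hypothetical. It compares single code points, which is probably adequate for Ethiopic syllables and Ge'ez numerals but would not handle multi-code-point unichars.

```python
from collections import Counter
from pathlib import Path


def load_unicharset(path):
    """Return the set of unichars listed in a unicharset text file.

    Assumption: line 1 holds the number of entries, and each later line
    begins with the unichar, followed by a space and its properties.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def missing_characters(text, unichars):
    """Count the non-whitespace characters of `text` absent from `unichars`."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return {ch: n for ch, n in counts.items() if ch not in unichars}


# Example with hypothetical file names:
# unichars = load_unicharset("amh.lstm-unicharset")
# text = Path("old_books_ground_truth.txt").read_text(encoding="utf-8")
# report = missing_characters(text, unichars)
# for ch, n in sorted(report.items(), key=lambda kv: -kv[1]):
#     print(f"U+{ord(ch):04X}  {ch}  {n}")
```

Characters that show up in this report (for instance Ge'ez numerals) are candidates for inclusion in the unicharset and training_text before any further fine-tuning is attempted.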