Most of the guides written for version 4 also work for version 5; the changes are minimal. It is better to stay on version 5 because it seems to perform better. Are you using Linux?
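On the characters missing from the unicharset (mentioned in the quoted messages below), a quick check of which characters in your ground-truth text are absent from a given unicharset could look like the sketch below. It rests on assumptions: that the unicharset is a plain-text file whose first line is the entry count and whose later lines each start with the unichar followed by a space, and that each Ethiopic character is a single code point. The function names `parse_unicharset` and `missing_chars` are made up for illustration and are not part of tesseract.

```python
def parse_unicharset(lines):
    """Collect the unichars listed in the lines of a unicharset file.

    Assumption: line 1 holds the number of entries, and every later line
    starts with the unichar, followed by a space and its properties.
    """
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def load_unicharset(path):
    """Read a unicharset file from disk (hypothetical helper)."""
    with open(path, encoding="utf-8") as fh:
        return parse_unicharset(fh.read().splitlines())


def missing_chars(text, unichars):
    """Return the non-whitespace characters of `text` not in `unichars`."""
    seen = {ch for ch in text if not ch.isspace()}
    return sorted(seen - unichars)


if __name__ == "__main__":
    # Tiny in-memory unicharset with made-up property columns.
    demo = ["3", "NULL 0 NULL 0", "ሀ 1 0,255,0,255 NULL 1", "ለ 1 0,255,0,255 NULL 2"]
    print(missing_chars("ሀለ ፩", parse_unicharset(demo)))
```

If I recall correctly, the unicharset can be unpacked from a traineddata file with `combine_tessdata -u`. Running the check over all of your ground-truth lines should show whether the Ge'ez numbers and the older characters are in the starting model at all.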

On Sat, Jan 13, 2024, 4:08 PM Menelik Berhan <menelikber...@gmail.com> wrote:

> Thanks for your swift reply. It would be my pleasure to collaborate with
> you.
>
> I've noticed that there are extensive guides and tutorials on training
> tesseract 4.x, and I wanted to switch to version 4.x.
> What would the trade-off be if I used tesseract 4.x instead of 5.x?
>
> Thanks for your time!!!
>
> On Saturday, January 13, 2024 at 12:49:36 PM UTC+3 elvi...@gmail.com
> wrote:
>
>> I spent some time trying to improve the default model for Amharic. The
>> default model has a couple of characters missing. As I have noted in many
>> posts in this forum, training by removing the top layer is the best method
>> to introduce new characters.
>>
>> But I really struggled because the training was deteriorating the base
>> (default) model. I am also short of processing power.
>> Tesseract 5.3 also has some flaws which make it hard to use in
>> third-world countries (electric blackouts).
>>
>> Dear Menelik, we might need to put our hands together on this.
>>
>> On Sat, Jan 13, 2024, 11:21 AM Menelik Berhan <meneli...@gmail.com>
>> wrote:
>>
>>> *Background*
>>> I'm trying to use tesseract 5.3.3 on scanned old books written in
>>> Amharic (which uses Ethiopic script).
>>>
>>> *Major Shortcomings of amh.traineddata from tesseract*
>>>
>>> *Difference in type of Ethiopic script:* there are Ethiopic script
>>> characters in old Amharic texts that are not in the unicharset of
>>> amh.traineddata.
>>>
>>> *Difference in punctuation styles:* the old texts use some punctuation
>>> marks not used in modern Amharic, and for some that are used in modern
>>> Amharic, the old texts follow a different pattern, mostly in the spacing
>>> between word and punctuation character: the old texts always put a space
>>> between a punctuation character and both the preceding and following
>>> words, whereas in modern texts there is no space between the punctuation
>>> character and the preceding word.
>>>
>>> *Very narrow training_text & wordlist (based on tesseract/langdata_lstm)*
>>> The amh.training_text & amh.wordlist text files used by tesseract (the
>>> ones from langdata_lstm) are very small. To give you an idea: for
>>> tir.traineddata (another language which uses Ethiopic script) the
>>> tir.training_text from langdata_lstm has more than 400,000 lines, while
>>> amh.training_text has only around 400 lines.
>>>
>>> *Other challenges*
>>>
>>> - The old Amharic books use a font that's no longer in use (or available).
>>> - The old Amharic books contain many Ge'ez words (Ge'ez is a liturgical
>>> language, like Latin, which uses Ethiopic script).
>>> - The old Amharic books mostly use Ge'ez numbers, while modern
>>> Amharic texts use Arabic numbers.
>>>
>>> *WHAT I'VE DONE SO FAR*
>>> As an experiment I've tried to fine-tune amh.traineddata_best (using
>>> `make training`) with close to 300 line images & texts (from sample pages
>>> of some old Amharic books) and using files from langdata_lstm (for 10,000
>>> iterations).
>>>
>>> The resulting traineddata shows a very satisfactory improvement on
>>> some of the challenges mentioned above, especially those regarding
>>> punctuation chars.
>>>
>>> But it still fails to solve the problems I have with some characters (the
>>> ones not present in the unicharset of amh.traineddata) and fails for almost
>>> all Ge'ez numbers (even though the training sample pages have many Ge'ez
>>> nums).
>>>
>>> *WHAT I'M PLANNING TO DO*
>>> First I want to train tesseract with large training_text & wordlist
>>> files, and also a complete unicharset file.
>>> Then I want to fine-tune the resulting traineddata on sample line images
>>> from the old books.
>>>
>>> *QUESTIONS (for now. I'll definitely add more questions later)*
>>> Is there another path I should take that would get me to where I want?
>>>
>>> *Regarding training tesseract with large training_text & wordlist files,
>>> and also a complete unicharset file:*
>>>
>>> - How should I prepare the training_text & wordlist files? (What should
>>> the text files contain?)
>>> - How should I prepare the unicharset file, and how do I pass it to the
>>> `make training` command?
>>>
>>> *Regarding generating text, image (tif) and box files from
>>> training_text:*
>>>
>>> I've looked up python scripts to do this job, but I have questions about
>>> the proper values for these params in text2image:
>>> --font (what criteria should I use to select the list of fonts),
>>> --leading, --xsize, --ysize, --char_spacing, --exposure,
>>> --unicharset_file and --margin.
>>>
>>> I've noticed from the tesstrain repo for tesseract 5 that the line images
>>> are tightly cropped (with minimal margin around the text line). Is the
>>> same property (minimal margins) required/desired of the line images
>>> generated using text2image from the training_text?
>>>
>>> *THANKS FOR YOUR TIME !!!*
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/9bda9bc4-b07a-491b-b8fc-fbb25b54c368n%40googlegroups.com