A segmentation fault usually means you are not using the tessdata_best model as START_MODEL.
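Both failure modes that come up in this thread (a start model that is not where TESSDATA points, and a missing `bidi` module) can be caught before `make` runs. Below is a minimal preflight sketch in Python. The function name `preflight` and its parameters are hypothetical and not part of tesstrain. It only checks that `<START_MODEL>.traineddata` exists in the given directory and that the listed modules can be imported; it cannot tell a tessdata_best model from a tessdata_fast one.

```python
import importlib.util
from pathlib import Path


def preflight(tessdata_dir, start_model, required_modules=("bidi",)):
    """Return a list of problems found; an empty list means both checks passed.

    Hypothetical helper, not part of tesstrain. It does NOT verify that the
    model is a tessdata_best (float) model, only that the file is present.
    """
    problems = []

    # The makefile looks for <START_MODEL>.traineddata inside TESSDATA.
    model = Path(tessdata_dir).expanduser() / f"{start_model}.traineddata"
    if not model.is_file():
        problems.append(f"missing start model: {model}")

    # generate_wordstr_box.py imports bidi.algorithm (package: python-bidi).
    for name in required_modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing Python module: {name}")

    return problems


if __name__ == "__main__":
    for problem in preflight("~/tessdata_best", "ben"):
        print(problem)
```

For the training command quoted further down in this thread, the call would be `preflight("~/tessdata_best", "ben")`. Anything it prints should be fixed before starting `make ... training`.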
On Thu, Jan 7, 2021, 20:13 Soumik Ranjan Dasgupta <ranjansou...@gmail.com> wrote:

> Sorry, I attached the wrong log file. Please find the new one attached.
>
> On Thu, Jan 7, 2021 at 8:09 PM Soumik Ranjan Dasgupta <ranjansou...@gmail.com> wrote:
>
>> Hi Shree,
>>
>> I installed the bidi module. The error went away, but the training does
>> not happen again. Please find the log and training script attached.
>> FYI I am using the makefile from the master branch. Do I need to change
>> it to the makefile from ben branch instead?
>>
>> On Thu, Jan 7, 2021 at 5:26 PM Shree Devi Kumar <shreesh...@gmail.com> wrote:
>>
>>> ModuleNotFoundError: No module named 'bidi
>>>
>>> Install python-bidi
>>>
>>> On Thu, Jan 7, 2021, 15:45 Soumik Ranjan Dasgupta <ranjansou...@gmail.com> wrote:
>>>
>>>> Hi Shreeshrii,
>>>>
>>>> I took your command exactly as it is and ran it (made sure the
>>>> tessdata_best directory is present in $HOME
>>>> with best ben.traineddata) and ran into an extremely weird error.
>>>> Here is the log:
>>>>
>>>> find data/ben-ground-truth -name '*.gt.txt' | xargs cat | sort | uniq > "data/ben/all-gt"
>>>> combine_tessdata -u /root/tessdata_best/ben.traineddata data/ben/ben
>>>> Version string:4.00.00alpha:ben:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx64Lrx64Lfx512O1c1]
>>>> 0:config:size=377, offset=192
>>>> 17:lstm:size=10605707, offset=569
>>>> 18:lstm-punc-dawg:size=3154, offset=10606276
>>>> 19:lstm-word-dawg:size=427618, offset=10609430
>>>> 20:lstm-number-dawg:size=426, offset=11037048
>>>> 21:lstm-unicharset:size=6866, offset=11037474
>>>> 22:lstm-recoder:size=1003, offset=11044340
>>>> 23:version:size=80, offset=11045343
>>>> Extracting tessdata components from /root/tessdata_best/ben.traineddata
>>>> Wrote data/ben/ben.config
>>>> Wrote data/ben/ben.lstm
>>>> Wrote data/ben/ben.lstm-punc-dawg
>>>> Wrote data/ben/ben.lstm-word-dawg
>>>> Wrote data/ben/ben.lstm-number-dawg
>>>> Wrote data/ben/ben.lstm-unicharset
>>>> Wrote data/ben/ben.lstm-recoder
>>>> Wrote data/ben/ben.version
>>>> unicharset_extractor --output_unicharset "data/ben/my.unicharset" --norm_mode 2 "data/ben/all-gt"
>>>> Bad box coordinates in boxfile string! কি জানি কেন প্রদ্যুম্নের বার বার মনে আসছিল সেই জীর্ণ পরিচ্ছদপরা
>>>> Extracting unicharset from plain text file data/ben/all-gt
>>>> Wrote unicharset file data/ben/my.unicharset
>>>> merge_unicharsets data/ben/ben.lstm-unicharset data/ben/my.unicharset "data/ben/unicharset"
>>>> Loaded unicharset of size 111 from file data/ben/ben.lstm-unicharset
>>>> Loaded unicharset of size 76 from file data/ben/my.unicharset
>>>> Wrote unicharset file data/ben/unicharset.
>>>> PYTHONIOENCODING=utf-8 python3 generate_wordstr_box.py -i "data/ben-ground-truth/24-022.tif" -t "data/ben-ground-truth/24-022.gt.txt" > "data/ben-ground-truth/24-022.box"
>>>> Traceback (most recent call last):
>>>>   File "generate_wordstr_box.py", line 7, in <module>
>>>>     import bidi.algorithm
>>>> ModuleNotFoundError: No module named 'bidi'
>>>> Makefile:207: recipe for target 'data/ben-ground-truth/24-022.box' failed
>>>> make: *** [data/ben-ground-truth/24-022.box] Error 1
>>>>
>>>> I should mention that I double-checked the 24-022.gt.txt and 24-022.tif
>>>> files and both of them are valid. Any reason why this might be happening?
>>>> How can I fix this?
>>>>
>>>> On Saturday, January 2, 2021 at 11:01:27 AM UTC+5:30 shree wrote:
>>>>
>>>>> Soumik,
>>>>>
>>>>> I have uploaded the bash scripts and the generated reports and graphs
>>>>> to the `ben` branch in my fork of the tesstrain repo. See
>>>>>
>>>>> https://github.com/Shreeshrii/tesstrain/tree/ben
>>>>> and
>>>>> https://github.com/Shreeshrii/tesstrain/commit/a6474ef2dbbac47803d13b6f92fdcf8c9dc3107b
>>>>>
>>>>> Results for the validation data (not seen by lstmtraining for either
>>>>> training or eval) show an improvement over both ben and script/Bengali.
>>>>>
>>>>> To improve results further, check the groundtruth transcription for any
>>>>> missing words, normalize the text and try with some more training data.
>>>>>
>>>>> On Fri, Jan 1, 2021 at 6:41 PM Shree Devi Kumar <shree...@gmail.com> wrote:
>>>>>
>>>>>> nohup make MODEL_NAME=ben START_MODEL=ben LANG_TYPE=Indic GROUND_TRUTH_DIR=data/ben-ground-truth TESSDATA=$HOME/tessdata_best DEBUG_INTERVAL=-1 training MAX_ITERATIONS=50000 >> data/ben.log &
>>>>>>
>>>>>> Graphs are created using the training log file as well as validation
>>>>>> log files. Some of these require using PRs which have not yet been merged
>>>>>> in tesstrain repo.
>>>>>>
>>>>>> See
>>>>>> https://github.com/tesseract-ocr/tesstrain/pulls
>>>>>>
>>>>>> For evaluation reports, I used
>>>>>> https://github.com/eddieantonio/ocreval
>>>>>>
>>>>>> On Fri, Jan 1, 2021 at 12:09 PM Soumik Ranjan Dasgupta <ranjan...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Shreeshrii,
>>>>>>>
>>>>>>> Can you please tell me the training command used? Also, how can I
>>>>>>> create the graphs and these other documents?
>>>>>>>
>>>>>>> On Sat, 26 Dec 2020, 18:37 Shree Devi Kumar, <shree...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Soumik,
>>>>>>>>
>>>>>>>> I used your groundtruth and trained using ben as the START_MODEL.
>>>>>>>> I got the best results on the validation set of images at around 5000
>>>>>>>> iterations. See the attached accuracy report and CER graph.
>>>>>>>>
>>>>>>>> On Thu, Dec 24, 2020 at 8:36 PM Soumik Ranjan Dasgupta <ranjan...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>> I wanted to fine-tune the ben.traineddata model using some
>>>>>>>>> ancient texts that were supposedly printed with typeset. I have roughly
>>>>>>>>> 1k lines of text and tried the normal fine-tuning approach with
>>>>>>>>> around 25k iterations.
>>>>>>>>> What surprised me most was that even after packing the
>>>>>>>>> traineddata (character error was around 4%) and testing on an unseen
>>>>>>>>> image, the performance was exactly the same. Not a single character was
>>>>>>>>> different!
>>>>>>>>> You can find the traineddata, training data, the logs and the
>>>>>>>>> source code at this link:
>>>>>>>>>
>>>>>>>>> https://github.com/srdg/unarchived_ben_tess/releases/tag/v0.0.4-alpha
>>>>>>>>>
>>>>>>>>> Can anyone tell me exactly what I am doing wrong here? Do I need
>>>>>>>>> to change any training parameter, increase my training data, or
>>>>>>>>> do something else entirely?
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Soumik
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1fc044d1-b0ae-45d5-9041-e6fbf8ec5089n%40googlegroups.com
>>>>>>>>
>>>>>>>> --
>>>>>>>> ____________________________________________________________
>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com