Re: [tesseract-ocr] Tesseract Performance

Shree Devi Kumar Thu, 07 Jan 2021 07:40:20 -0800

A segmentation fault usually means you are not using the tessdata_best model
as START_MODEL.


On Thu, Jan 7, 2021, 20:13 Soumik Ranjan Dasgupta <ranjansou...@gmail.com>
wrote:

> Sorry, I attached the wrong log file. Please find the new one attached.
>
> On Thu, Jan 7, 2021 at 8:09 PM Soumik Ranjan Dasgupta <
> ranjansou...@gmail.com> wrote:
>
>> Hi Shree,
>>
>> I installed the bidi module. That error went away, but training still does
>> not run. Please find the log and training script attached.
>> FYI, I am using the Makefile from the master branch. Do I need to switch
>> to the Makefile from the ben branch instead?
>>
>> On Thu, Jan 7, 2021 at 5:26 PM Shree Devi Kumar <shreesh...@gmail.com>
>> wrote:
>>
>>> ModuleNotFoundError: No module named 'bidi'
>>>
>>> Install python-bidi
>>>
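A minimal sketch for checking up front whether the Python modules a training script imports are available, so the failure shows up before make is started. The helper name missing_modules is hypothetical; the only module known from the traceback in this thread is bidi, which the python-bidi package provides.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found.

    Only top-level names (for example "bidi", not "bidi.algorithm") should be
    passed, because find_spec() imports parent packages for dotted names.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "bidi" is the module named in the traceback quoted in this thread.
    missing = missing_modules(["bidi"])
    if missing:
        print("Missing modules:", ", ".join(missing))
        print("For 'bidi', try: pip3 install python-bidi")
    else:
        print("All required modules are importable.")
```

Run it with the same python3 interpreter that the Makefile invokes, since a module installed for a different interpreter will still be reported as missing.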
>>> On Thu, Jan 7, 2021, 15:45 Soumik Ranjan Dasgupta <
>>> ranjansou...@gmail.com> wrote:
>>>
>>>> Hi Shreeshrii,
>>>>
>>>> I ran your command exactly as given (after making sure the tessdata_best
>>>> directory is present in $HOME with the best ben.traineddata) and ran
>>>> into a very strange error. Here is the log:
>>>>
>>>> find data/ben-ground-truth -name '*.gt.txt' | xargs cat | sort | uniq > "data/ben/all-gt"
>>>> combine_tessdata -u /root/tessdata_best/ben.traineddata data/ben/ben
>>>> Version string:4.00.00alpha:ben:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx64Lrx64Lfx512O1c1]
>>>> 0:config:size=377, offset=192
>>>> 17:lstm:size=10605707, offset=569
>>>> 18:lstm-punc-dawg:size=3154, offset=10606276
>>>> 19:lstm-word-dawg:size=427618, offset=10609430
>>>> 20:lstm-number-dawg:size=426, offset=11037048
>>>> 21:lstm-unicharset:size=6866, offset=11037474
>>>> 22:lstm-recoder:size=1003, offset=11044340
>>>> 23:version:size=80, offset=11045343
>>>> Extracting tessdata components from /root/tessdata_best/ben.traineddata
>>>> Wrote data/ben/ben.config
>>>> Wrote data/ben/ben.lstm
>>>> Wrote data/ben/ben.lstm-punc-dawg
>>>> Wrote data/ben/ben.lstm-word-dawg
>>>> Wrote data/ben/ben.lstm-number-dawg
>>>> Wrote data/ben/ben.lstm-unicharset
>>>> Wrote data/ben/ben.lstm-recoder
>>>> Wrote data/ben/ben.version
>>>> unicharset_extractor --output_unicharset "data/ben/my.unicharset" --norm_mode 2 "data/ben/all-gt"
>>>> Bad box coordinates in boxfile string!  কি জানি কেন প্রদ্যুম্নের বার বার মনে আসছিল সেই জীর্ণ পরিচ্ছদপরা
>>>> Extracting unicharset from plain text file data/ben/all-gt
>>>> Wrote unicharset file data/ben/my.unicharset
>>>> merge_unicharsets data/ben/ben.lstm-unicharset data/ben/my.unicharset "data/ben/unicharset"
>>>> Loaded unicharset of size 111 from file data/ben/ben.lstm-unicharset
>>>> Loaded unicharset of size 76 from file data/ben/my.unicharset
>>>> Wrote unicharset file data/ben/unicharset.
>>>> PYTHONIOENCODING=utf-8 python3 generate_wordstr_box.py -i "data/ben-ground-truth/24-022.tif" -t "data/ben-ground-truth/24-022.gt.txt" > "data/ben-ground-truth/24-022.box"
>>>> Traceback (most recent call last):
>>>>   File "generate_wordstr_box.py", line 7, in <module>
>>>>     import bidi.algorithm
>>>> ModuleNotFoundError: No module named 'bidi'
>>>> Makefile:207: recipe for target 'data/ben-ground-truth/24-022.box' failed
>>>> make: *** [data/ben-ground-truth/24-022.box] Error 1
>>>>
>>>> I should mention I double checked the 24-022.gt.txt and 24-022.tif
>>>> files and both of them are valid. Any reason why this might be happening?
>>>> How can I fix this?
>>>> On Saturday, January 2, 2021 at 11:01:27 AM UTC+5:30 shree wrote:
>>>>
>>>>> Soumik,
>>>>>
>>>>> I have uploaded the bash scripts and the generated reports and graphs
>>>>> to the `ben` branch in my fork of the tesstrain repo. See
>>>>>
>>>>> https://github.com/Shreeshrii/tesstrain/tree/ben
>>>>> and
>>>>>
>>>>> https://github.com/Shreeshrii/tesstrain/commit/a6474ef2dbbac47803d13b6f92fdcf8c9dc3107b
>>>>>
>>>>> Results for the validation data (not seen by lstmtraining for either
>>>>> training or eval) show an improvement over both ben and script/Bengali.
>>>>>
>>>>> To improve results further, check groundtruth transcription for any
>>>>> missing words, normalize the text and try with some more training data.
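One possible reading of "normalize the text" is sketched below, assuming that Unicode NFC plus whitespace cleanup is an acceptable normalization for the ground-truth lines. This is an assumption, not the procedure used for the reports in this thread, and it is not the same thing as the --norm_mode option seen in the log. The function names are hypothetical.

```python
import unicodedata


def normalize_line(text):
    """NFC-normalize one ground-truth line and collapse runs of whitespace.

    str.split() only splits on whitespace, so zero-width joiners and
    non-joiners inside Bengali words are left untouched.
    """
    composed = unicodedata.normalize("NFC", text)
    return " ".join(composed.split())


def find_unnormalized(lines):
    """Return (index, original, normalized) for every line that would change."""
    changed = []
    for index, line in enumerate(lines):
        fixed = normalize_line(line)
        if fixed != line:
            changed.append((index, line, fixed))
    return changed
```

Running find_unnormalized over the contents of the *.gt.txt files gives a list of lines to review by hand before rewriting anything, which also helps with spotting missing words.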
>>>>>
>>>>>
>>>>> On Fri, Jan 1, 2021 at 6:41 PM Shree Devi Kumar <shree...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> nohup make MODEL_NAME=ben START_MODEL=ben LANG_TYPE=Indic
>>>>>>  GROUND_TRUTH_DIR=data/ben-ground-truth TESSDATA=$HOME/tessdata_best
>>>>>> DEBUG_INTERVAL=-1 training MAX_ITERATIONS=50000 >> data/ben.log &
>>>>>>
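A small sketch of how the make invocation above could be assembled from a script, with a check that the start model file is actually present in the TESSDATA directory. The variable names (MODEL_NAME, START_MODEL, LANG_TYPE, GROUND_TRUTH_DIR, TESSDATA, DEBUG_INTERVAL, MAX_ITERATIONS) and the training target are taken from the command above; the function name build_training_command is hypothetical. The check only confirms that the file exists and cannot tell a tessdata_best model from any other traineddata file.

```python
import os


def build_training_command(model_name, start_model, tessdata_dir,
                           ground_truth_dir, lang_type="Indic",
                           max_iterations=50000, debug_interval=-1):
    """Build the argument list for the tesstrain `make training` call.

    Raises FileNotFoundError if <tessdata_dir>/<start_model>.traineddata
    does not exist, so the problem is reported before make starts.
    """
    start_file = os.path.join(tessdata_dir, start_model + ".traineddata")
    if not os.path.isfile(start_file):
        raise FileNotFoundError(start_file)
    return [
        "make",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"LANG_TYPE={lang_type}",
        f"GROUND_TRUTH_DIR={ground_truth_dir}",
        f"TESSDATA={tessdata_dir}",
        f"DEBUG_INTERVAL={debug_interval}",
        f"MAX_ITERATIONS={max_iterations}",
        "training",
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) from the tesstrain checkout, with stdout redirected to a log file if the output should be kept as in the nohup command above.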
>>>>>> Graphs are created from the training log file as well as the validation
>>>>>> log files. Some of these require PRs that have not yet been merged
>>>>>> into the tesstrain repo.
>>>>>>
>>>>>> See
>>>>>> https://github.com/tesseract-ocr/tesstrain/pulls
>>>>>>
>>>>>> For Evaluation reports, I used
>>>>>> https://github.com/eddieantonio/ocreval
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 1, 2021 at 12:09 PM Soumik Ranjan Dasgupta <
>>>>>> ranjan...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Shreeshrii,
>>>>>>>
>>>>>>> Can you please tell me the training command you used? Also, how can I
>>>>>>> create the graphs and the other documents?
>>>>>>>
>>>>>>> On Sat, 26 Dec 2020, 18:37 Shree Devi Kumar, <shree...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Soumik,
>>>>>>>>
>>>>>>>> I used your ground truth and trained using ben as the START_MODEL.
>>>>>>>> I got the best results on the validation set of images at around 5000
>>>>>>>> iterations. See the attached accuracy report and CER graph.
>>>>>>>>
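For readers unfamiliar with the metric, here is a minimal sketch of a character error rate (CER) computed as Levenshtein edit distance divided by the reference length. It assumes characters are counted as Unicode code points, which for Bengali is not the same as counting visible glyphs. The figures reported by lstmtraining or ocreval may be computed differently, so this illustrates the idea and does not reproduce the attached report.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Comparing the OCR output of the base model and the fine-tuned model against the same ground-truth line with this function shows directly whether fine-tuning changed anything on that line.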
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 24, 2020 at 8:36 PM Soumik Ranjan Dasgupta <
>>>>>>>> ranjan...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>> I wanted to fine-tune the ben.traineddata model using some
>>>>>>>>> ancient texts that were supposedly typeset and printed. I have roughly
>>>>>>>>> 1k lines of text and tried the normal fine-tuning approach with
>>>>>>>>> around 25k iterations.
>>>>>>>>> What surprised me most was that, even after packing the
>>>>>>>>> traineddata (character error was around 4%) and testing on an unseen
>>>>>>>>> image, the performance was exactly the same. Not a single character
>>>>>>>>> was different!
>>>>>>>>> You can find the traineddata, training data, the logs and the
>>>>>>>>> source code at this link:
>>>>>>>>>
>>>>>>>>> https://github.com/srdg/unarchived_ben_tess/releases/tag/v0.0.4-alpha
>>>>>>>>>
>>>>>>>>> Can anyone tell me exactly what I am doing wrong here? Do I need
>>>>>>>>> to change any training parameter, increase my training data, or
>>>>>>>>> do something else entirely?
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Soumik
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1fc044d1-b0ae-45d5-9041-e6fbf8ec5089n%40googlegroups.com
>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1fc044d1-b0ae-45d5-9041-e6fbf8ec5089n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> ____________________________________________________________
>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>

