Re: [tesseract-ocr] Re: v4.1.1 - Segmentation fault on train data generation; all .lstmf files are exactly 1GB

Sim Tov Thu, 23 Sep 2021 05:22:54 -0700

The reason I use v4.1.1 is that it is the version supplied with the 
recently released stable Debian 11. It will remain that way for the next 
2 years (approx.).


So my question is whether it is OK to use the .lstmf files I got so far 
for training, or whether their generation must finish properly first. In 
other words, if I stop the process myself in the middle, is the .lstmf 
file OK? Is there a way to check its consistency?
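
A rough first check is possible from the file size alone. The sketch below 
is Python using only the standard library; the function name 
`suspicious_lstmf_files` and the two size limits are assumptions made for 
illustration and are not part of Tesseract. It lists .lstmf files whose 
size is exactly 1 GiB (2^30 bytes) or exactly 10^9 bytes, which would 
suggest that they stopped at a size boundary instead of ending naturally. 
It does not parse the .lstmf format, so a file that is not flagged is not 
thereby proven consistent.

```python
from pathlib import Path

GIB = 1 << 30   # 1,073,741,824 bytes (binary gigabyte)
GB = 10 ** 9    # 1,000,000,000 bytes (decimal gigabyte)


def suspicious_lstmf_files(directory, limits=(GIB, GB)):
    """Return (path, size) pairs for .lstmf files whose size equals a limit.

    This is only a size heuristic: it never opens the files and knows
    nothing about the .lstmf format itself.
    """
    hits = []
    for path in sorted(Path(directory).glob("*.lstmf")):
        size = path.stat().st_size
        if size in limits:
            hits.append((path, size))
    return hits


if __name__ == "__main__":
    import sys

    # Directory to scan; defaults to the current directory.
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, size in suspicious_lstmf_files(directory):
        print(f"{path}: {size} bytes (exactly at a 1 GB boundary)")
```

Saved under a name of your choice, it would be run with the directory that 
holds the .lstmf files as its only argument, for example the temporary 
directory from the log.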

On Wednesday, September 22, 2021 at 11:14:01 AM UTC+3 zdenop wrote:

> And what about testing the latest code?
> "tesstrain.sh" training is not supported anymore, and for creating issues 
> you must use the latest code anyway.
>
> Zdenko
>
>
> st 22. 9. 2021 o 9:20 Sim Tov <smn...@gmail.com> napísal(a):
>
>> Maybe it is just a bug and I need to open an issue?
>>
>> On Monday, September 20, 2021 at 2:52:18 PM UTC+3 Sim Tov wrote:
>>
>>> Hello,
>>>
>>> I use v4.1.1 on Linux (Debian 11) and am trying to generate training and 
>>> evaluation data. The commands I used were:
>>>
>>> train:
>>>
>>> /usr/share/tesseract-ocr/tesstrain.sh --fonts_dir FontsRashi/Working 
>>> --lang heb --linedata_only --noextract_font_properties --langdata_dir 
>>> ./langdata  --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata/ 
>>> --output_dir output/train --fontlist 'BenOr Rashi' 'Guttman Rashi Bold'
>>>
>>> and
>>>
>>> evaluate:
>>>
>>> /usr/share/tesseract-ocr/tesstrain.sh --fonts_dir FontsRashi/Working 
>>> --lang heb --linedata_only --noextract_font_properties --langdata_dir 
>>> ./langdata  --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata/ 
>>> --output_dir output/evaluate --fontlist 'Guttman Rashi'
>>>
>>> After several days of running, both commands stopped with errors like 
>>> this for each of the 3 fonts:
>>>
>>> Page 8365
>>> Loaded 386170/386170 lines (1-386170) of document 
>>> /tmp/heb-2021-09-16.1dB/heb.Guttman_Rashi.exp0.lstmf
>>> Page 8366
>>> Loaded 386216/386216 lines (1-386216) of document 
>>> /tmp/heb-2021-09-16.1dB/heb.Guttman_Rashi.exp0.lstmf
>>> /usr/share/tesseract-ocr/tesstrain_utils.sh: line 72:  2271 Segmentation 
>>> fault      "${cmd}" "$@" 2>&1
>>>       2272 Done                    | tee -a ${LOG_FILE}
>>> ERROR: Program tesseract failed. Abort.
>>>
>>> Interestingly, heb.Guttman_Rashi.exp0.lstmf and both other .lstmf files 
>>> were exactly 1 GB in size...
>>>
>>> Does it have something to do with what is written here:
>>>
>>>
>>> https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951
>>>
>>> "The text is divided by language automatically, so there is a separate 
>>> stream for each of the Devanagari-based languages (as there is for the 
>>> Latin-based languages) and *clipped to 1GB* for each language."
>>>
>>> 1. So is this segmentation fault expected behavior?
>>>
>>> 2. What should I do now? Should I rerun the commands hoping that they 
>>> will finish properly or should I copy those .lstmf files that I got so far 
>>> to the train/evaluate directories and start training?
>>>
>>> 3. Both output/evaluate and output/train directories remained empty 
>>> after the commands above failed. What files should be there at the end so I 
>>> can start the training process?
>>>
>>>
>>> Thank you in advance!
>>>
>>> tesseract --version
>>> tesseract 4.1.1
>>>  leptonica-1.79.0
>>>   libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : 
>>> libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
>>>  Found AVX
>>>  Found FMA
>>>  Found SSE
>>>  Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 
>>> liblz4/1.9.3 libzstd/1.4.8
>>>
>>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/cec596a9-cdfb-4a68-ab49-d275f27a82a5n%40googlegroups.com
>>
>

