Do not create files manually. If "make training" does not work, it means:

1. you are missing a dependency, or your input data is wrong;
2. you also missed the error message for 1.

I strongly suggest you start the training from the beginning (including cloning tesstrain) and pay attention to all messages:

    git clone --depth 1 https://github.com/tesseract-ocr/tesstrain.git
    cd tesstrain
    make tesseract-langdata
    mkdir tessdata_best
    wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P tessdata_best
    unzip ocrd-testset.zip -d data/ocrd-ground-truth
    make training MODEL_NAME=ocrd TESSDATA=tessdata_best MAX_ITERATIONS=10000

Zdenko

On Mon, 5 Jun 2023 at 4:22, Madhav Pandey <mad.develope...@gmail.com> wrote:

> Hi Zdenop,
>
> Apologies. I got your name wrong in the thread.
>
> Can you please help me resolve this issue? The make training command
> was not creating the all-gt file, so I created it manually and kept it
> in the MODEL_NAME directory.
>
> I created it by copying all the single lines from the text files into
> the all-gt file. I am not sure if this is the right approach. Please
> correct me if I am wrong here.
>
> Now, after doing this, I am getting this error:
>
> python3 shuffle.py 0 "data/Apex/all-lstmf"
> Traceback (most recent call last):
>   File "/Users/madpande/Code/git/tesseract_tutorial/tesstrain/shuffle.py", line 24, in <module>
>     fd0 = open(sys.argv[2], 'r')
> FileNotFoundError: [Errno 2] No such file or directory: 'data/Apex/all-lstmf'
>
> I am pretty sure I am missing something here. Please help!
>
> Thanks!
>
> On Thursday, 1 June 2023 at 23:39:01 UTC-6 Madhav Pandey wrote:
>
>> Hi Zdenko,
>>
>> At what step in the Makefile is the all-gt file created? I am still
>> unable to move forward with the custom model training.
>>
>> Any help would be greatly appreciated. Thanks!
>>
>> On Wednesday, 26 April 2023 at 09:47:55 UTC-6 zdenop wrote:
>>
>>> make training TESSDATA=./usr/local/share/tessdata
>>> unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"
>>>
>>> Failed to read data from: data/foo/all-gt....
>>>
>>> This indicates you already ran a training that failed...
>>> Clean your training and start it once again. Pay attention to why
>>> "data/foo/all-gt" is not created (there will be an error message).
>>>
>>> Zdenko
>>>
>>> On Wed, 26 Apr 2023 at 2:07, Madhav Pandey <mad.dev...@gmail.com> wrote:
>>>
>>>> @zdenop
>>>>
>>>> This is the entire training output:
>>>>
>>>> ```
>>>> make training TESSDATA=./usr/local/share/tessdata
>>>> unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"
>>>> Failed to read data from: data/foo/all-gt
>>>> Wrote unicharset file data/foo/unicharset
>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.box"
>>>> set -x; \
>>>> tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.box"
>>>> set -x; \
>>>> tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.box"
>>>> set -x; \
>>>> tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif" data/foo-ground-truth/alexis_ruhe01_1852_0035_019 --psm 13 lstm.train
>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif data/foo-ground-truth/alexis_ruhe01_1852_0035_019 --psm 13 lstm.train
>>>> python3 shuffle.py 0 "data/foo/all-lstmf"
>>>> Traceback (most recent call last):
>>>>   File "/Users/m/Code/git/tesstrain/shuffle.py", line 24, in <module>
>>>>     fd0 = open(sys.argv[2], 'r')
>>>> FileNotFoundError: [Errno 2] No such file or directory: 'data/foo/all-lstmf'
>>>> make: *** [data/foo/all-lstmf] Error 1
>>>> ```
>>>>
>>>> For this run, I have just 3 text and tif files.
>>>>
>>>> I followed the macOS installation section of this page:
>>>> https://tesseract-ocr.github.io/tessdoc/Compiling.html#macos and
>>>> installed everything that is mentioned there.
>>>>
>>>> Do I have to install anything else before running the training?
>>>>
>>>> On Tuesday, 25 April 2023 at 00:27:28 UTC-6 zdenop wrote:
>>>>
>>>>> Did you install all the necessary dependencies?
>>>>> Did you check and fix all errors (before this error) in the
>>>>> training output?
>>>>>
>>>>> Zdenko
>>>>>
>>>>> On Tue, 25 Apr 2023 at 8:21, Madhav Pandey <mad.dev...@gmail.com> wrote:
>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> I am relatively new to tesseract and OCR as a whole.
>>>>>>
>>>>>> I have been trying to set up model training locally using the guide
>>>>>> https://github.com/tesseract-ocr/tesstrain/blob/main/README.md
>>>>>>
>>>>>> I have copied the sample training data into the `data/foo` directory,
>>>>>> but when I run `make training`, I always end up getting this error:
>>>>>>
>>>>>> ```
>>>>>> Failed to read data from: data/foo/all-gt
>>>>>> Wrote unicharset file data/foo/unicharset
>>>>>> python3 shuffle.py 0 "data/foo/all-lstmf"
>>>>>> Traceback (most recent call last):
>>>>>>   File "shuffle.py", line 24, in <module>
>>>>>>     fd0 = open(sys.argv[2], 'r')
>>>>>> FileNotFoundError: [Errno 2] No such file or directory: 'data/foo/all-lstmf'
>>>>>> make: *** [data/foo/all-lstmf] Error 1
>>>>>> ```
>>>>>>
>>>>>> Can someone please help resolve this error?
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/249216fc-70e5-4e40-a630-d4202fd24a36n%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/249216fc-70e5-4e40-a630-d4202fd24a36n%40googlegroups.com?utm_medium=email&utm_source=footer>.
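Before re-running the clean training suggested at the top of this message, it may help to check the input data itself. The following is only a minimal sketch, not part of tesstrain: the function name `check_ground_truth` is hypothetical, the directory layout `data/<MODEL_NAME>-ground-truth` is taken from the log output quoted in this thread, and the rules it checks (each `.gt.txt` transcript paired with a `.tif` or `.png` image of the same base name, each transcript holding a single line of text) are assumptions based on the tesstrain README.

```python
from pathlib import Path

# Assumption: only these image types are looked for in this sketch.
IMAGE_EXTENSIONS = (".tif", ".png")


def check_ground_truth(gt_dir):
    """Return a list of problems found in a ground-truth directory.

    Hypothetical helper for illustration; it is not part of tesstrain.
    """
    gt_dir = Path(gt_dir)
    if not gt_dir.is_dir():
        return [f"missing directory: {gt_dir}"]

    problems = []
    transcripts = sorted(gt_dir.glob("*.gt.txt"))
    if not transcripts:
        problems.append(f"no *.gt.txt files in {gt_dir}")

    # Every transcript needs an image with the same base name
    # and exactly one non-empty line of text.
    for txt in transcripts:
        base = txt.name[: -len(".gt.txt")]
        if not any((gt_dir / (base + ext)).is_file() for ext in IMAGE_EXTENSIONS):
            problems.append(f"no image for {txt.name}")
        text_lines = [
            line
            for line in txt.read_text(encoding="utf-8").splitlines()
            if line.strip()
        ]
        if len(text_lines) != 1:
            problems.append(f"{txt.name} should hold exactly one line of text")

    # Every image needs a transcript with the same base name.
    for ext in IMAGE_EXTENSIONS:
        for img in sorted(gt_dir.glob("*" + ext)):
            base = img.name[: -len(ext)]
            if not (gt_dir / (base + ".gt.txt")).is_file():
                problems.append(f"no transcript for {img.name}")

    return problems
```

Calling `check_ground_truth("data/foo-ground-truth")` from the tesstrain directory returns an empty list when nothing suspicious is found. It only inspects file names and transcript contents, so it cannot replace reading the error messages that "make training" prints.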