> Compute CTC targets failed!

That error is due to empty image files. The text2image script is riddled with bugs: it creates null boxes as well as null images.
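One way to look for such files before training is sketched below. The helper name `find_empty_files`, the default directory, and the suffix list are assumptions made for illustration and are not part of tesstrain. It only catches zero-byte files; an image that decodes fine but contains no text would need a separate check.

```python
from pathlib import Path


def find_empty_files(gt_dir, suffixes=(".tif", ".box", ".gt.txt")):
    """Return sorted paths under gt_dir that end with one of the
    given suffixes and are exactly 0 bytes long."""
    root = Path(gt_dir)
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file()
        and p.name.endswith(tuple(suffixes))
        and p.stat().st_size == 0
    )


if __name__ == "__main__":
    import sys

    # Hypothetical default; pass your own ground-truth directory as argv[1].
    target = sys.argv[1] if len(sys.argv) > 1 else "data/ocr-ground-truth"
    for path in find_empty_files(target):
        print(path)
```

Any path it prints is a candidate to regenerate or remove before running the training again.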
On Thursday, June 8, 2023 at 4:31:24 AM UTC+3 tesseract-ocr wrote:

> We were able to fix this issue.
>
> Our training set contained files with the extension .tiff. The expected extension was .tif.
>
> But now we are seeing this error:
>
> Compute CTC targets failed!
>
> Do you have any knowledge of what might be happening here?
>
> Thanks!
>
> On Wednesday, 7 June 2023 at 16:29:22 UTC-6 Madhav Pandey wrote:
>
>> Thank you so much for this. It works on the default dataset provided.
>>
>> However, when I try to work on Hindi text, I get the following error:
>>
>> unicharset_extractor --output_unicharset "data/ocr/unicharset" --norm_mode 2 "data/ocr/all-gt"
>> Bad box coordinates in boxfile string! उसकी गाड़ी बड़ी थी और मेरी दाढ़ी
>> Extracting unicharset from plain text file data/ocr/all-gt
>> Wrote unicharset file data/ocr/unicharset
>> make: *** No rule to make target 'data/ocr-ground-truth/1.lstmf', needed by 'data/ocr/all-lstmf'. Stop.
>>
>> Going through some of your responses on similar issues, you mentioned checking the data format. Can you please specify the requirements for the ground-truth data?
>> I have the text file and the tiff file for the single-line text images. What are the other requirements?
>>
>> Thanks!
>>
>> On Tuesday, 6 June 2023 at 02:03:17 UTC-6 zdenop wrote:
>>
>>> Do not create files manually.
>>> If "make training" does not work, it means:
>>>
>>> 1. you are missing some dependency or the input data are wrong
>>> 2. you also missed the error message for 1.
>>>
>>> I strongly suggest you start training from the beginning (including cloning tesstrain) and pay attention to all messages:
>>>
>>> git clone --depth 1 https://github.com/tesseract-ocr/tesstrain.git
>>> cd tesstrain
>>> make tesseract-langdata
>>> mkdir tessdata_best
>>> wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P tessdata_best
>>> unzip ocrd-testset.zip -d data/ocrd-ground-truth
>>> make training MODEL_NAME=ocrd TESSDATA=tessdata_best MAX_ITERATIONS=10000
>>>
>>> Zdenko
>>>
>>> On Mon, 5 Jun 2023 at 4:22, Madhav Pandey <mad.dev...@gmail.com> wrote:
>>>
>>>> Hi Zdenop,
>>>>
>>>> Apologies. I got your name wrong in the thread.
>>>>
>>>> Can you please help me resolve this issue? The make training command was not creating the all-gt file, so I created it manually and placed it in the MODEL_NAME directory.
>>>>
>>>> I created it by copying all the single lines from the text files into the all-gt file. I am not sure if this is the right approach. Please correct me if I am wrong here.
>>>>
>>>> Now, after doing this, I am getting this error:
>>>>
>>>> python3 shuffle.py 0 "data/Apex/all-lstmf"
>>>> Traceback (most recent call last):
>>>>   File "/Users/madpande/Code/git/tesseract_tutorial/tesstrain/shuffle.py", line 24, in <module>
>>>>     fd0 = open(sys.argv[2], 'r')
>>>> FileNotFoundError: [Errno 2] No such file or directory: 'data/Apex/all-lstmf'
>>>>
>>>> I am pretty sure I am missing something here. Please help!
>>>>
>>>> Thanks!
>>>>
>>>> On Thursday, 1 June 2023 at 23:39:01 UTC-6 Madhav Pandey wrote:
>>>>
>>>>> Hi Zdenko,
>>>>>
>>>>> At what step in the makefile is the all-gt file created? I am still unable to move forward with the custom model training.
>>>>>
>>>>> Any help would be greatly appreciated. Thanks!
>>>>>
>>>>> On Wednesday, 26 April 2023 at 09:47:55 UTC-6 zdenop wrote:
>>>>>
>>>>>> make training TESSDATA=./usr/local/share/tessdata
>>>>>> unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"
>>>>>>
>>>>>> Failed to read data from: data/foo/all-gt....
>>>>>>
>>>>>> This indicates you have already run a training that failed...
>>>>>> Clean your training and start it once again. Pay attention to why "data/foo/all-gt" is not created (there will be an error message).
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>> On Wed, 26 Apr 2023 at 2:07, Madhav Pandey <mad.dev...@gmail.com> wrote:
>>>>>>
>>>>>>> @zdenop
>>>>>>>
>>>>>>> This is the entire training output:
>>>>>>>
>>>>>>> ```
>>>>>>> make training TESSDATA=./usr/local/share/tessdata
>>>>>>> unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"
>>>>>>> Failed to read data from: data/foo/all-gt
>>>>>>> Wrote unicharset file data/foo/unicharset
>>>>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.box"
>>>>>>> set -x; \
>>>>>>> tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
>>>>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
>>>>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.box"
>>>>>>> set -x; \
>>>>>>> tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif"
>>>>>>> data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
>>>>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
>>>>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.box"
>>>>>>> set -x; \
>>>>>>> tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif" data/foo-ground-truth/alexis_ruhe01_1852_0035_019 --psm 13 lstm.train
>>>>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif data/foo-ground-truth/alexis_ruhe01_1852_0035_019 --psm 13 lstm.train
>>>>>>> python3 shuffle.py 0 "data/foo/all-lstmf"
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/Users/m/Code/git/tesstrain/shuffle.py", line 24, in <module>
>>>>>>>     fd0 = open(sys.argv[2], 'r')
>>>>>>> FileNotFoundError: [Errno 2] No such file or directory: 'data/foo/all-lstmf'
>>>>>>> make: *** [data/foo/all-lstmf] Error 1
>>>>>>> ```
>>>>>>>
>>>>>>> For this run, I just have 3 text and tif files.
>>>>>>>
>>>>>>> I followed the macOS installation section from this page: https://tesseract-ocr.github.io/tessdoc/Compiling.html#macos and installed everything that is mentioned there.
>>>>>>>
>>>>>>> Do I have to install anything else before running the training?
>>>>>>>
>>>>>>> On Tuesday, 25 April 2023 at 00:27:28 UTC-6 zdenop wrote:
>>>>>>>
>>>>>>>> Did you install all the necessary dependencies?
>>>>>>>> Did you check and fix all errors (before this error) in the training output?
>>>>>>>>
>>>>>>>> Zdenko
>>>>>>>>
>>>>>>>> On Tue, 25 Apr 2023 at 8:21, Madhav Pandey <mad.dev...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Everyone,
>>>>>>>>>
>>>>>>>>> I am relatively new to Tesseract and OCR as a whole.
>>>>>>>>>
>>>>>>>>> I have been trying to set up local model training using the guide https://github.com/tesseract-ocr/tesstrain/blob/main/README.md
>>>>>>>>>
>>>>>>>>> I have copied the sample training data into the `data/foo` directory, but when I run `make training`, I always end up getting this error:
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> Failed to read data from: data/foo/all-gt
>>>>>>>>> Wrote unicharset file data/foo/unicharset
>>>>>>>>> python3 shuffle.py 0 "data/foo/all-lstmf"
>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>   File "shuffle.py", line 24, in <module>
>>>>>>>>>     fd0 = open(sys.argv[2], 'r')
>>>>>>>>> FileNotFoundError: [Errno 2] No such file or directory: 'data/foo/all-lstmf'
>>>>>>>>> make: *** [data/foo/all-lstmf] Error 1
>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> Can someone please help resolve this error?
>>>>>>>>>
>>>>>>>>> Thank you!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/249216fc-70e5-4e40-a630-d4202fd24a36n%40googlegroups.com.