Can I train on my own images? I'm going to build a French receipt scanner, so I need to train on all of them to increase accuracy. What do you suggest, Zdenop?
On Saturday, October 28, 2023 at 11:58:10 AM UTC+2 zdenop wrote:

> It does not work on Windows (directly), but it works on Linux, so use WSL if
> you really need training.
> Or wait until somebody finds a fix for Windows (or send the fix yourself: this
> is an open source project, so everybody should contribute ;-) )
>
> Zdenko
>
>
> On Fri, 27 Oct 2023 at 17:32, Dev Solution <develop...@gmail.com> wrote:
>
>>
>> I just tried to run all of these commands, but I got an error:
>> https://prnt.sc/lLHeR27J2U65
>>
>> On Tuesday, June 6, 2023 at 10:03:17 AM UTC+2 zdenop wrote:
>>
>>> Do not create files manually.
>>> If "make training" does not work, it means:
>>>
>>> 1. you are missing some dependency, or the input data are wrong
>>> 2. you also missed the error message for 1.
>>>
>>> I strongly suggest you start training from the beginning
>>> (including cloning tesstrain) and pay attention to all messages:
>>>
>>> git clone --depth 1 https://github.com/tesseract-ocr/tesstrain.git
>>> cd tesstrain
>>> make tesseract-langdata
>>> mkdir tessdata_best
>>> wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P tessdata_best
>>> unzip ocrd-testset.zip -d data/ocrd-ground-truth
>>> make training MODEL_NAME=ocrd TESSDATA=tessdata_best MAX_ITERATIONS=10000
>>>
>>>
>>> Zdenko
>>>
>>>
>>> On Mon, 5 Jun 2023 at 4:22, Madhav Pandey <mad.dev...@gmail.com> wrote:
>>>
>>>> Hi Zdenop,
>>>>
>>>> Apologies. I got your name wrong in the thread.
>>>>
>>>> Can you please help me resolve this issue? The make training
>>>> command was not creating the all-gt file, so I created it manually and
>>>> kept it in the MODEL_NAME directory.
>>>>
>>>> I created it by copying all the single lines from the text files and
>>>> storing them in the all-gt file. I am not sure if this is the
>>>> right approach. Please correct me if I am wrong here.
>>>>
>>>> Now after doing this, I am getting this error:
>>>>
>>>> python3 shuffle.py 0 "data/Apex/all-lstmf"
>>>> Traceback (most recent call last):
>>>>   File "/Users/madpande/Code/git/tesseract_tutorial/tesstrain/shuffle.py", line 24, in <module>
>>>>     fd0 = open(sys.argv[2], 'r')
>>>> FileNotFoundError: [Errno 2] No such file or directory: 'data/Apex/all-lstmf'
>>>>
>>>>
>>>> I am pretty sure I am missing something here. Please help!
>>>>
>>>> Thanks!
>>>>
>>>> On Thursday, 1 June 2023 at 23:39:01 UTC-6 Madhav Pandey wrote:
>>>>
>>>>> Hi Zdenko,
>>>>>
>>>>> At what step in the Makefile is the all-gt file created? I am still
>>>>> unable to move forward with the custom model training.
>>>>>
>>>>> Any help would be greatly appreciated. Thanks!
>>>>>
>>>>> On Wednesday, 26 April 2023 at 09:47:55 UTC-6 zdenop wrote:
>>>>>
>>>>>> make training TESSDATA=./usr/local/share/tessdata
>>>>>> unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"
>>>>>>
>>>>>> Failed to read data from: data/foo/all-gt....
>>>>>>
>>>>>>
>>>>>> This indicates you already ran a training that failed...
>>>>>> Clean your training and start it once again. Pay attention to why
>>>>>> "data/foo/all-gt" is not created (there will be an error message).
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>>
>>>>>> On Wed, 26 Apr 2023 at 2:07, Madhav Pandey <mad.dev...@gmail.com> wrote:
>>>>>>
>>>>>>> @zdenop
>>>>>>>
>>>>>>> This is the entire training output:
>>>>>>>
>>>>>>> ```
>>>>>>> make training TESSDATA=./usr/local/share/tessdata
>>>>>>> unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"
>>>>>>> Failed to read data from: data/foo/all-gt
>>>>>>> Wrote unicharset file data/foo/unicharset
>>>>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.box"
>>>>>>> set -x; \
>>>>>>> tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
>>>>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
>>>>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.box"
>>>>>>> set -x; \
>>>>>>> tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
>>>>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
>>>>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.box"
>>>>>>> set -x; \
>>>>>>> tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif" data/foo-ground-truth/alexis_ruhe01_1852_0035_019 --psm 13 lstm.train
>>>>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif data/foo-ground-truth/alexis_ruhe01_1852_0035_019 --psm 13 lstm.train
>>>>>>> python3 shuffle.py 0 "data/foo/all-lstmf"
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/Users/m/Code/git/tesstrain/shuffle.py", line 24, in <module>
>>>>>>>     fd0 = open(sys.argv[2], 'r')
>>>>>>> FileNotFoundError: [Errno 2] No such file or directory: 'data/foo/all-lstmf'
>>>>>>> make: *** [data/foo/all-lstmf] Error 1
>>>>>>> ```
>>>>>>>
>>>>>>> For this run, I just have 3 text and tif files.
>>>>>>>
>>>>>>> I did follow the macOS installation section from this page:
>>>>>>> https://tesseract-ocr.github.io/tessdoc/Compiling.html#macos and
>>>>>>> installed everything that is mentioned there.
>>>>>>>
>>>>>>> Do I have to install anything else before running the training?
>>>>>>>
>>>>>>> On Tuesday, 25 April 2023 at 00:27:28 UTC-6 zdenop wrote:
>>>>>>>
>>>>>>>> Did you install all the necessary dependencies?
>>>>>>>> Did you check and fix all errors (before this error) in the training
>>>>>>>> output?
>>>>>>>>
>>>>>>>> Zdenko
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, 25 Apr 2023 at 8:21, Madhav Pandey <mad.dev...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Everyone,
>>>>>>>>>
>>>>>>>>> I am relatively new to Tesseract and OCR as a whole.
>>>>>>>>>
>>>>>>>>> I have been trying to set up model training locally using the guide
>>>>>>>>> https://github.com/tesseract-ocr/tesstrain/blob/main/README.md
>>>>>>>>>
>>>>>>>>> I have copied the sample training data into the `data/foo`
>>>>>>>>> directory, but when I run `make training`, I always end up
>>>>>>>>> getting this error:
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> Failed to read data from: data/foo/all-gt
>>>>>>>>> Wrote unicharset file data/foo/unicharset
>>>>>>>>> python3 shuffle.py 0 "data/foo/all-lstmf"
>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>   File "shuffle.py", line 24, in <module>
>>>>>>>>>     fd0 = open(sys.argv[2], 'r')
>>>>>>>>> FileNotFoundError: [Errno 2] No such file or directory: 'data/foo/all-lstmf'
>>>>>>>>> make: *** [data/foo/all-lstmf] Error 1
>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> Can someone please help resolve this error?
>>>>>>>>>
>>>>>>>>> Thank you!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/249216fc-70e5-4e40-a630-d4202fd24a36n%40googlegroups.com
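
The quoted log reads line images and transcriptions from `data/foo-ground-truth` as `.tif` / `.gt.txt` pairs, and the advice in the thread is to find the first real error rather than create `all-gt` by hand. One way to rule out bad input data before running `make training` might be a small pre-flight check like the sketch below. It is not part of tesstrain: the function name `check_ground_truth`, the accepted image suffixes, and the flat directory layout are assumptions made for illustration. It only reads files and never writes any training artifact.

```python
from pathlib import Path

# Hypothetical pre-flight helper; it is NOT part of tesstrain.
# Assumption: ground truth lives in one flat directory as <name>.tif (or .png)
# plus <name>.gt.txt, matching the file names in the quoted log.
IMAGE_SUFFIXES = {".tif", ".png"}
TEXT_SUFFIX = ".gt.txt"


def check_ground_truth(gt_dir):
    """Return three sorted lists of base names:
    images without a transcription, transcriptions without an image,
    and transcriptions that are empty or whitespace-only."""
    gt_dir = Path(gt_dir)
    images = set()
    texts = {}
    for path in gt_dir.iterdir():
        if not path.is_file():
            continue
        if path.name.endswith(TEXT_SUFFIX):
            # Strip the full ".gt.txt" suffix to get the shared base name.
            texts[path.name[: -len(TEXT_SUFFIX)]] = path
        elif path.suffix.lower() in IMAGE_SUFFIXES:
            images.add(path.stem)

    text_names = set(texts)
    missing_text = sorted(images - text_names)
    missing_image = sorted(text_names - images)
    empty_text = sorted(
        name
        for name, path in texts.items()
        if not path.read_text(encoding="utf-8").strip()
    )
    return missing_text, missing_image, empty_text
```

Calling `check_ground_truth("data/foo-ground-truth")` from the tesstrain directory returns the three lists; if all of them are empty, the input pairs are at least complete, and the remaining suspects are a missing dependency or an earlier error message in the `make training` output. Images with compound suffixes (for example `.bin.png`) are not handled by this sketch.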