You can try custom images: see the example ocrd-testset.zip <https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip> and follow the example from https://github.com/tesseract-ocr/tesstrain/blob/main/README.md:

unzip ocrd-testset.zip -d data/ocrd-ground-truth
make training MODEL_NAME=ocrd START_MODEL=deu_latf TESSDATA=~/tessdata_best MAX_ITERATIONS=10000

Zdenko


On Sat, 28 Oct 2023 at 17:37, Dev Solution <developermax...@gmail.com> wrote:

> Can I train on my custom images? I'm going to build a French receipts scanner,
> so I need to train on all of these to increase accuracy. What do you suggest?
> Zdenop
>
> On Saturday, October 28, 2023 at 11:58:10 AM UTC+2 zdenop wrote:
>
>> It does not work on Windows (directly) but it works on Linux => use WSL
>> if you really need training.
>> Or wait until somebody finds a fix for Windows (or send the fix; this is
>> an open source project so everybody should contribute ;-) )
>>
>> Zdenko
>>
>>
>> On Fri, 27 Oct 2023 at 17:32, Dev Solution <develop...@gmail.com> wrote:
>>
>>>
>>> I just tried to run all these commands, but I got an error:
>>> https://prnt.sc/lLHeR27J2U65
>>>
>>> On Tuesday, June 6, 2023 at 10:03:17 AM UTC+2 zdenop wrote:
>>>
>>>> Do not create files manually.
>>>> If "make training" does not work it means:
>>>>
>>>> 1. you are missing some dependency or the input data are wrong
>>>> 2. you also missed the error message for 1.
>>>>
>>>> I strongly suggest you start training from the beginning
>>>> (including cloning tesstrain) and pay attention to all messages:
>>>>
>>>> git clone --depth 1 https://github.com/tesseract-ocr/tesstrain.git
>>>> cd tesstrain
>>>> make tesseract-langdata
>>>> mkdir tessdata_best
>>>> wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P tessdata_best
>>>> unzip ocrd-testset.zip -d data/ocrd-ground-truth
>>>> make training MODEL_NAME=ocrd TESSDATA=tessdata_best MAX_ITERATIONS=10000
>>>>
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Mon, 5 Jun 2023 at 04:22, Madhav Pandey <mad.dev...@gmail.com> wrote:
>>>>
>>>>> Hi Zdenop,
>>>>>
>>>>> Apologies. I got your name wrong in the thread.
>>>>>
>>>>> Can you please help me in resolving this issue?
>>>>> Because the make training command was not creating the all-gt file, I
>>>>> manually created it and kept it in the MODEL_NAME directory.
>>>>>
>>>>> The way I created it was by copying all the single lines from the
>>>>> text files and storing them in the all-gt file. I am not sure if this is
>>>>> the right approach. Please correct me if I am wrong here.
>>>>>
>>>>> Now after doing this, I am getting this error:
>>>>>
>>>>> python3 shuffle.py 0 "data/Apex/all-lstmf"
>>>>> Traceback (most recent call last):
>>>>>   File "/Users/madpande/Code/git/tesseract_tutorial/tesstrain/shuffle.py", line 24, in <module>
>>>>>     fd0 = open(sys.argv[2], 'r')
>>>>> FileNotFoundError: [Errno 2] No such file or directory: 'data/Apex/all-lstmf'
>>>>>
>>>>>
>>>>> I am pretty sure I am missing something here. Please help!
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Thursday, 1 June 2023 at 23:39:01 UTC-6 Madhav Pandey wrote:
>>>>>
>>>>>> Hi Zdenko,
>>>>>>
>>>>>> At what step in the Makefile is the all-gt file created? I am still
>>>>>> unable to move forward with the custom model training.
>>>>>>
>>>>>> Any help would be greatly appreciated. Thanks!
>>>>>>
>>>>>> On Wednesday, 26 April 2023 at 09:47:55 UTC-6 zdenop wrote:
>>>>>>
>>>>>>> make training TESSDATA=./usr/local/share/tessdata
>>>>>>> unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"
>>>>>>>
>>>>>>> Failed to read data from: data/foo/all-gt....
>>>>>>>
>>>>>>>
>>>>>>> This indicates you have already run a training that failed...
>>>>>>> Clean your training and start it once again. Pay attention to why
>>>>>>> "data/foo/all-gt" is not created (there will be an error message).
>>>>>>>
>>>>>>> Zdenko
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 26 Apr 2023 at 02:07, Madhav Pandey <mad.dev...@gmail.com> wrote:
>>>>>>>
>>>>>>>> @zdenop
>>>>>>>>
>>>>>>>> This is the entire training output:
>>>>>>>>
>>>>>>>> ```
>>>>>>>> make training TESSDATA=./usr/local/share/tessdata
>>>>>>>> unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"
>>>>>>>> Failed to read data from: data/foo/all-gt
>>>>>>>> Wrote unicharset file data/foo/unicharset
>>>>>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.box"
>>>>>>>> set -x; \
>>>>>>>> tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
>>>>>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
>>>>>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.box"
>>>>>>>> set -x; \
>>>>>>>> tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
>>>>>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
>>>>>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.box"
>>>>>>>> set -x; \
>>>>>>>> tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif" data/foo-ground-truth/alexis_ruhe01_1852_0035_019 --psm 13 lstm.train
>>>>>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif data/foo-ground-truth/alexis_ruhe01_1852_0035_019 --psm 13 lstm.train
>>>>>>>> python3 shuffle.py 0 "data/foo/all-lstmf"
>>>>>>>> Traceback (most recent call last):
>>>>>>>>   File "/Users/m/Code/git/tesstrain/shuffle.py", line 24, in <module>
>>>>>>>>     fd0 = open(sys.argv[2], 'r')
>>>>>>>> FileNotFoundError: [Errno 2] No such file or directory: 'data/foo/all-lstmf'
>>>>>>>> make: *** [data/foo/all-lstmf] Error 1
>>>>>>>> ```
>>>>>>>>
>>>>>>>> For this run, I just have 3 text and tif files.
>>>>>>>>
>>>>>>>> I did follow the macOS installation section from this page:
>>>>>>>> https://tesseract-ocr.github.io/tessdoc/Compiling.html#macos and
>>>>>>>> installed everything that is mentioned there.
>>>>>>>>
>>>>>>>> Do I have to install anything else before running the training?
>>>>>>>>
>>>>>>>> On Tuesday, 25 April 2023 at 00:27:28 UTC-6 zdenop wrote:
>>>>>>>>
>>>>>>>>> Did you install all the necessary dependencies?
>>>>>>>>> Did you check and fix all errors (before this error) in the training
>>>>>>>>> output?
>>>>>>>>>
>>>>>>>>> Zdenko
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, 25 Apr 2023 at 08:21, Madhav Pandey <mad.dev...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Everyone,
>>>>>>>>>>
>>>>>>>>>> I am relatively new to tesseract and OCR as a whole.
>>>>>>>>>>
>>>>>>>>>> I have been trying to do the setup for training a model
>>>>>>>>>> locally using the guide
>>>>>>>>>> https://github.com/tesseract-ocr/tesstrain/blob/main/README.md
>>>>>>>>>>
>>>>>>>>>> I have copied the sample training data into the `data/foo`
>>>>>>>>>> directory, but when I run `make training`, I always end up
>>>>>>>>>> getting this error:
>>>>>>>>>>
>>>>>>>>>> ```
>>>>>>>>>> Failed to read data from: data/foo/all-gt
>>>>>>>>>> Wrote unicharset file data/foo/unicharset
>>>>>>>>>> python3 shuffle.py 0 "data/foo/all-lstmf"
>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>   File "shuffle.py", line 24, in <module>
>>>>>>>>>>     fd0 = open(sys.argv[2], 'r')
>>>>>>>>>> FileNotFoundError: [Errno 2] No such file or directory: 'data/foo/all-lstmf'
>>>>>>>>>> make: *** [data/foo/all-lstmf] Error 1
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>> Can someone please help resolve this error?
>>>>>>>>>>
>>>>>>>>>> Thank you!
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/249216fc-70e5-4e40-a630-d4202fd24a36n%40googlegroups.com
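For the all-gt / all-lstmf errors discussed in this thread, it may help to check the ground-truth directory before running make training. The following is only a minimal sketch, not part of tesstrain: the helper names check_ground_truth and transcription_for are hypothetical, and the list of image extensions is an assumption based on my reading of the tesstrain README (line images as .tif, .png, .bin.png or .nrm.png, each with a single-line transcription of the same base name ending in .gt.txt, all placed in data/MODEL_NAME-ground-truth).

```python
from pathlib import Path

# Assumed image extensions (my reading of the tesstrain README). Longest first,
# so that "x.bin.png" maps to "x.gt.txt" and not "x.bin.gt.txt".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")


def transcription_for(image):
    """Return the expected .gt.txt path for a line image, or None."""
    for suffix in IMAGE_SUFFIXES:
        if image.name.endswith(suffix):
            return image.with_name(image.name[: -len(suffix)] + ".gt.txt")
    return None


def check_ground_truth(gt_dir):
    """Hypothetical helper: list problems found in a ground-truth directory."""
    gt_dir = Path(gt_dir)
    if not gt_dir.is_dir():
        return [f"missing directory: {gt_dir}"]
    problems = []
    # Pair every line image with the transcription it should have.
    pairs = [(p, transcription_for(p)) for p in sorted(gt_dir.iterdir())]
    pairs = [(image, gt) for image, gt in pairs if gt is not None]
    if not pairs:
        problems.append(f"no line images found in {gt_dir}")
    for image, gt in pairs:
        if not gt.is_file():
            problems.append(f"{image.name}: missing {gt.name}")
            continue
        # A transcription is expected to hold exactly one line of text.
        lines = gt.read_text(encoding="utf-8").strip().splitlines()
        if len(lines) != 1:
            problems.append(f"{gt.name}: expected 1 line of text, found {len(lines)}")
    return problems
```

For the run quoted above this would be called as check_ground_truth("data/foo-ground-truth"). An empty list only means the image/transcription pairs look consistent; it does not rule out other causes such as a missing dependency, so the make output still needs to be read from the first error onward.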