
Zdenko Podobny Wed, 27 Mar 2024 09:54:35 -0700

You can try custom images; see the example ocrd-testset.zip
<https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip> and
follow the example from
https://github.com/tesseract-ocr/tesstrain/blob/main/README.md:


unzip ocrd-testset.zip -d data/ocrd-ground-truth
make training MODEL_NAME=ocrd START_MODEL=deu_latf TESSDATA=~/tessdata_best MAX_ITERATIONS=10000


Zdenko
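
Before running "make training" on custom images, it can help to confirm that
every line image has a matching transcription. The following is a minimal
sketch, not part of tesstrain: the function names (base_name, find_unpaired)
are hypothetical, and it assumes a flat ground-truth directory in which each
image (.tif, .png, .bin.png or .nrm.png) is paired with a file of the same
base name ending in .gt.txt, as the tesstrain README describes.

```python
from pathlib import Path

# Image suffixes listed in the tesstrain README; the longer ones come first so
# that "x.bin.png" is reduced to "x" rather than "x.bin".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".tif", ".png")
GT_SUFFIX = ".gt.txt"


def base_name(path, suffixes):
    """Return the file name with the first matching suffix removed, else None."""
    for suffix in suffixes:
        if path.name.endswith(suffix):
            return path.name[: -len(suffix)]
    return None


def find_unpaired(ground_truth_dir):
    """Return (images without a .gt.txt, .gt.txt files without an image).

    Both results are sorted lists of base names. Only the top level of the
    directory is inspected (an assumption of this sketch).
    """
    images, transcripts = set(), set()
    for path in Path(ground_truth_dir).iterdir():
        if not path.is_file():
            continue
        gt = base_name(path, (GT_SUFFIX,))
        if gt is not None:
            transcripts.add(gt)
            continue
        img = base_name(path, IMAGE_SUFFIXES)
        if img is not None:
            images.add(img)
    return sorted(images - transcripts), sorted(transcripts - images)
```

Calling find_unpaired("data/ocrd-ground-truth") returns two lists; if both are
empty, every image has a transcription and vice versa. It does not check
dependencies or the content of the files, so the messages printed by
"make training" still need to be read.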


On Sat, 28 Oct 2023 at 17:37, Dev Solution <developermax...@gmail.com> wrote:

> Can I train on my own custom images? I'm going to build a scanner for French
> receipts, so I need to train on all of them to increase accuracy. What do you
> suggest, Zdenop?
>
> On Saturday, October 28, 2023 at 11:58:10 AM UTC+2 zdenop wrote:
>
>> It does not work on Windows (directly), but it works on Linux => use WSL
>> if you really need training.
>> Or wait until somebody finds a fix for Windows (or send the fix yourself:
>> this is an open source project, so everybody should contribute ;-) )
>>
>> Zdenko
>>
>>
>> On Fri, 27 Oct 2023 at 17:32, Dev Solution <develop...@gmail.com> wrote:
>>
>>>
>>> I just tried to run all these commands, but I got an error:
>>> https://prnt.sc/lLHeR27J2U65
>>>
>>> On Tuesday, June 6, 2023 at 10:03:17 AM UTC+2 zdenop wrote:
>>>
>>>> Do not create files manually.
>>>> If "make training" does not work, it means:
>>>>
>>>>    1. you are missing some dependency or the input data are wrong
>>>>    2. you also missed the error message for 1.
>>>>
>>>> I strongly suggest you start the training from the beginning
>>>> (including cloning tesstrain) and pay attention to all messages:
>>>>
>>>> git clone --depth 1 https://github.com/tesseract-ocr/tesstrain.git
>>>> cd tesstrain
>>>> make tesseract-langdata
>>>> mkdir tessdata_best
>>>> wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P tessdata_best
>>>> unzip ocrd-testset.zip -d data/ocrd-ground-truth
>>>> make training MODEL_NAME=ocrd TESSDATA=tessdata_best MAX_ITERATIONS=10000
>>>>
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Mon, 5 Jun 2023 at 4:22, Madhav Pandey <mad.dev...@gmail.com> wrote:
>>>>
>>>>> Hi Zdenop,
>>>>>
>>>>> Apologies. I got your name wrong in the thread.
>>>>>
>>>>> Can you please help me resolve this issue? The make training command
>>>>> was not creating the all-gt file, so I created it manually and put it
>>>>> in the MODEL_NAME directory.
>>>>>
>>>>> I created it by copying all the single lines from the text files into
>>>>> the all-gt file. I am not sure if this is the right approach. Please
>>>>> correct me if I am wrong here.
>>>>>
>>>>> Now, after doing this, I am getting this error:
>>>>>
>>>>> python3 shuffle.py 0 "data/Apex/all-lstmf"
>>>>> Traceback (most recent call last):
>>>>>   File "/Users/madpande/Code/git/tesseract_tutorial/tesstrain/shuffle.py", line 24, in <module>
>>>>>     fd0 = open(sys.argv[2], 'r')
>>>>> FileNotFoundError: [Errno 2] No such file or directory: 'data/Apex/all-lstmf'
>>>>>
>>>>>
>>>>> I am pretty sure I am missing something here. Please help!
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Thursday, 1 June 2023 at 23:39:01 UTC-6 Madhav Pandey wrote:
>>>>>
>>>>>> Hi Zdenko,
>>>>>>
>>>>>> At what step in the Makefile is the all-gt file created? I am still
>>>>>> unable to move forward with the custom model training.
>>>>>>
>>>>>> Any help would be greatly appreciated. Thanks!
>>>>>>
>>>>>> On Wednesday, 26 April 2023 at 09:47:55 UTC-6 zdenop wrote:
>>>>>>
>>>>>>> make training TESSDATA=./usr/local/share/tessdata
>>>>>>> unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"
>>>>>>>
>>>>>>> Failed to read data from: data/foo/all-gt....
>>>>>>>
>>>>>>>
>>>>>>> This indicates you have already run a training that failed...
>>>>>>> Clean your training and start it once again. Pay attention to why
>>>>>>> "data/foo/all-gt" is not created (there will be an error message).
>>>>>>>
>>>>>>> Zdenko
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 26 Apr 2023 at 2:07, Madhav Pandey <mad.dev...@gmail.com> wrote:
>>>>>>>
>>>>>>>> @zdenop
>>>>>>>>
>>>>>>>> This is the entire training output:
>>>>>>>>
>>>>>>>> ```
>>>>>>>> make training TESSDATA=./usr/local/share/tessdata
>>>>>>>> unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"
>>>>>>>> Failed to read data from: data/foo/all-gt
>>>>>>>> Wrote unicharset file data/foo/unicharset
>>>>>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.box"
>>>>>>>> set -x; \
>>>>>>>>         tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
>>>>>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
>>>>>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.box"
>>>>>>>> set -x; \
>>>>>>>>         tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
>>>>>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
>>>>>>>> PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.box"
>>>>>>>> set -x; \
>>>>>>>>         tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif" data/foo-ground-truth/alexis_ruhe01_1852_0035_019 --psm 13 lstm.train
>>>>>>>> + tesseract data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif data/foo-ground-truth/alexis_ruhe01_1852_0035_019 --psm 13 lstm.train
>>>>>>>> python3 shuffle.py 0 "data/foo/all-lstmf"
>>>>>>>> Traceback (most recent call last):
>>>>>>>>   File "/Users/m/Code/git/tesstrain/shuffle.py", line 24, in <module>
>>>>>>>>     fd0 = open(sys.argv[2], 'r')
>>>>>>>> FileNotFoundError: [Errno 2] No such file or directory: 'data/foo/all-lstmf'
>>>>>>>> make: *** [data/foo/all-lstmf] Error 1
>>>>>>>> ```
>>>>>>>>
>>>>>>>> For this run, I just have 3 text and tif files.
>>>>>>>>
>>>>>>>> I followed the macOS installation section of this page:
>>>>>>>> https://tesseract-ocr.github.io/tessdoc/Compiling.html#macos and
>>>>>>>> installed everything that is mentioned there.
>>>>>>>>
>>>>>>>> Do I have to install anything else before running the training?
>>>>>>>>
>>>>>>>> On Tuesday, 25 April 2023 at 00:27:28 UTC-6 zdenop wrote:
>>>>>>>>
>>>>>>>>> Did you install all the necessary dependencies?
>>>>>>>>> Did you check and fix all errors (before this error) in the training
>>>>>>>>> output?
>>>>>>>>>
>>>>>>>>> Zdenko
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, 25 Apr 2023 at 8:21, Madhav Pandey <mad.dev...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Everyone,
>>>>>>>>>>
>>>>>>>>>> I am relatively new to Tesseract and OCR as a whole.
>>>>>>>>>>
>>>>>>>>>> I have been trying to set up model training locally using the
>>>>>>>>>> guide
>>>>>>>>>> https://github.com/tesseract-ocr/tesstrain/blob/main/README.md
>>>>>>>>>>
>>>>>>>>>> I have copied the sample training data into the `data/foo`
>>>>>>>>>> directory, but when I run `make training`, I always end up
>>>>>>>>>> getting this error:
>>>>>>>>>>
>>>>>>>>>> ```
>>>>>>>>>> Failed to read data from: data/foo/all-gt
>>>>>>>>>> Wrote unicharset file data/foo/unicharset
>>>>>>>>>> python3 shuffle.py 0 "data/foo/all-lstmf"
>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>   File "shuffle.py", line 24, in <module>
>>>>>>>>>>     fd0 = open(sys.argv[2], 'r')
>>>>>>>>>> FileNotFoundError: [Errno 2] No such file or directory: 'data/foo/all-lstmf'
>>>>>>>>>> make: *** [data/foo/all-lstmf] Error 1
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>> Can someone please help resolve this error?
>>>>>>>>>>
>>>>>>>>>> Thank you!
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/249216fc-70e5-4e40-a630-d4202fd24a36n%40googlegroups.com
>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/249216fc-70e5-4e40-a630-d4202fd24a36n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>> .
>>>>>>>>>>

Re: [tesseract-ocr] Getting Error: No such file or directory: 'data/foo/all-lstmf'