Shreeshrii
<https://github.com/tesseract-ocr/tesstrain/commits?author=Shreeshrii>,
bertsky <https://github.com/tesseract-ocr/tesstrain/commits?author=bertsky> and
many others from the tesseract community have invested a lot of time in improving
training and documentation (e.g. tesstrain.sh was abandoned and
replaced with Python training). This is a community project, so any
improvements (code, documentation) are welcome. We try to collect and keep
the best information in our GitHub repositories.

IMO training requires an understanding of the OCR process and the training
process (e.g. why do I need to run training?). For example, training on images
like alphabet-numbs.png is useless, and it is quite common that users get
worse results after retraining than with the standard trained data from the
tesseract repository.

Zdenko


On Wed, 31 Aug 2022 at 15:57, Adrian Paul Ciobanita <adrian.cioban...@gmail.com>
wrote:

> I don't think the GitHub link is very helpful, tbh. I've had this
> issue with training something particular for my case since 2020. I've not
> had much time lately, but there's still no clean and easy tutorial for
> retraining that correctly describes how to create and use the
> ground truth files with the jTessBoxEditor boxes, and which one is which. The
> wording of the documentation is written by people who are biased towards
> the tool and know the ins and outs of it.
>
> With this context in mind, it's no wonder there are so many questions
> and requests for help on "how to retrain/fine tune".
>
> Thank you for taking the time to respond. I am more than happy to write
> such a tutorial/example and whatnot, but it's hard even for me to do it and
> understand it, let alone have the knowledge to pull it off. If someone
> would be interested in showing me the "starting" steps, as explained earlier
> (the ground truth, unpacking the trained data, repackaging it, correctly
> using the bounding boxes), I'm willing to help out others and answer any
> questions others might have.
>
> Thank you for your contribution and help, genuinely!
>
> On Wed, Aug 31, 2022, 16:40 John Alway <jal...@gmail.com> wrote:
>
>> "First of all: if you follow any tutorial on the internet - report the
>> problem to the author of the tutorial.
>> Next: use official documentation for training. I see there are a bunch of
>> folks just "generating content" - to gain an audience. Without insight and
>> therefore also without support, using old/outdated information..."
>>
>>   People are trying to find a nice, easy tutorial to help them get
>> through the forest. I think that's the bottom line. Thanks for the link.
>>
>> "Tesseract 4 was released 29 Oct 2018. Almost 4 years ago! The recent
>> tesseract version is 5.2 and the training process was also improved:
>> https://github.com/tesseract-ocr/tesstrain"
>>
>>   I understand this, but I'm using C# .NET, and I don't think version 5
>> is available in C#. Unless I'm mistaken? There are costly packages, such
>> as IronOcr, which uses tesseract 5, but there is no way I can take that
>> route.
>>
>>   Regards,
>>  ...John
>>
>>
>> On Wed, Aug 31, 2022 at 3:27 AM Zdenko Podobny <zde...@gmail.com> wrote:
>>
>>> First of all: if you follow any tutorial on the internet - report the
>>> problem to the author of the tutorial.
>>> Next: use official documentation for training. I see there are a bunch
>>> of folks just "generating content" - to gain an audience. Without insight
>>> and therefore also without support, using old/outdated information...
>>> Tesseract 4 was released 29 Oct 2018. Almost 4 years ago! The recent
>>> tesseract version is 5.2 and the training process was also improved:
>>> https://github.com/tesseract-ocr/tesstrain
>>>
>>> Zdenko
>>>
>>>
>>> On Wed, 31 Aug 2022 at 0:18, John Alway <jal...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I've been following a tutorial on YouTube titled "Tesseract OCR -
>>>> Lesson 2: Training Tesseract for new font" here:
>>>> https://www.youtube.com/watch?v=1v8BPw0Dn0I&ab_channel=TheCode
>>>>
>>>> I'm using tesseract 4.0 on Windows 10.
>>>>
>>>> I went through the steps he used, and everything seems to go smoothly
>>>> until I get to the actual training. When I run "mftraining" the program
>>>> hangs. It seems to get stuck and doesn't indicate why or what it's doing.
>>>>
>>>> I'm using a set of fonts in an image. I have the full alphabet, upper
>>>> and lower case, and the numbers 0 to 9 on the png image. I've attached the
>>>> image. Unlike him, I'm using English. I don't know the font, so I'm
>>>> just calling it tiktok to give it a name. My training file is called
>>>> eng.tiktok.exp0.tr
>>>>
>>>> I used jTessBoxEditor to correct mistakes and set the box sizes and
>>>> positions precisely.
>>>>
>>>>
>>>> When I run this command:
>>>>  mftraining -F font_properties -U unicharset -O eng.unicharset eng.tiktok.exp0.tr
>>>>
>>>> The program just hangs. I've waited over twenty minutes.
>>>>
>>>> Should I wait longer? What could cause it to hang?
>>>>
>>>>
>>>>
>>>> Thanks!
>>>> ...John
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/534c3f74-420b-4c96-83dd-609bcb002f81n%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/534c3f74-420b-4c96-83dd-609bcb002f81n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
