Re: [tesseract-ocr] Re: Training from Scratch

Zdenko Podobny Thu, 23 Nov 2023 09:58:53 -0800

On Thu, 23 Nov 2023 at 10:28, Des Bw <[email protected]> wrote:

> If the original model lacks the ∠ symbol, fine tuning is not going to add
> it for you.



Really???
Tesseract documentation
<https://github.com/tesseract-ocr/tessdoc/blob/2f4d1e47094acbe3e046144573c928d740595f55/tess4/TrainingTesseract-4.00.md#fine-tuning-for-impact>:
Fine tuning is the process of training an existing model on new data
without changing any part of the network, although you *can* now add
characters to the character set. (See Fine Tuning for ± a few characters
<https://github.com/tesseract-ocr/tessdoc/blob/2f4d1e47094acbe3e046144573c928d740595f55/tess4/TrainingTesseract-4.00.md#fine-tuning-for--a-few-characters>
).
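
For reference, the "Fine Tuning for ± a few characters" section linked above boils down to two commands: extract the LSTM model from the existing traineddata, then run lstmtraining with both the new and the old traineddata so the character set can be remapped. Below is a minimal sketch that only builds those command lines (it does not run them). The flag names are the ones used in that documentation section; the directory layout, the output name "angularity" and the helper names are assumptions made for illustration.

```python
import shlex
from pathlib import Path


def extract_lstm_command(base_traineddata, work_dir, lang="eng"):
    """Command that extracts the LSTM model from an existing traineddata file."""
    return [
        "combine_tessdata", "-e",
        str(base_traineddata),
        str(Path(work_dir) / f"{lang}.lstm"),
    ]


def fine_tune_command(work_dir, base_traineddata, lang="eng", max_iterations=3600):
    """Command that fine-tunes the extracted model with an extended character set.

    Assumed (hypothetical) layout inside work_dir:
      <lang>.lstm                  model extracted from the base traineddata
      <lang>/<lang>.traineddata    starter traineddata whose unicharset includes the new symbol
      <lang>.training_files.txt    list of .lstmf training files
    """
    work = Path(work_dir)
    return [
        "lstmtraining",
        "--model_output", str(work / "angularity"),  # hypothetical output prefix
        "--continue_from", str(work / f"{lang}.lstm"),
        "--traineddata", str(work / lang / f"{lang}.traineddata"),
        "--old_traineddata", str(base_traineddata),  # lets lstmtraining map old to new characters
        "--train_listfile", str(work / f"{lang}.training_files.txt"),
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    base = "tessdata_best/eng.traineddata"  # assumed location of the base model
    print(shlex.join(extract_lstm_command(base, "train")))
    print(shlex.join(fine_tune_command("train", base)))
```

The iteration count of 3600 is the value from the documentation example, not a recommendation for this data set.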



> We have all gone through that process. To introduce a new character,
> removing the top layer and training from there is the most
> effective approach.
>
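
For comparison, "removing the top layer and training from there" corresponds in the same training documentation to lstmtraining with --append_index and --net_spec. The sketch below only assembles such a command. The index 5 and the "Lfx256" layer follow the documentation's example for its own model and may not fit another network; the output name, the paths and the class-count placeholder are assumptions.

```python
import shlex
from pathlib import Path


def replace_top_layer_command(work_dir, new_traineddata, num_classes,
                              lang="eng", append_index=5, max_iterations=3000):
    """Command that keeps the lower layers of an existing model and retrains new top layers.

    num_classes is a placeholder for the output size written into the net spec;
    it is assumed here to correspond to the size of the new unicharset.
    """
    work = Path(work_dir)
    net_spec = f"[Lfx256 O1c{num_classes}]"  # new layers appended above append_index
    return [
        "lstmtraining",
        "--continue_from", str(work / f"{lang}.lstm"),
        "--traineddata", str(new_traineddata),
        "--append_index", str(append_index),
        "--net_spec", net_spec,
        "--model_output", str(work / "newtop"),  # hypothetical output prefix
        "--train_listfile", str(work / f"{lang}.training_files.txt"),
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    cmd = replace_top_layer_command("train", "train/eng/eng.traineddata", 112)
    print(shlex.join(cmd))
```

Unlike plain fine tuning, this discards the weights of the replaced layers, so it generally needs more training data and iterations.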
> On Thursday, November 23, 2023 at 12:15:56 PM UTC+3 [email protected]
> wrote:
>
>> If I need to train new characters that are not recognized by a default
>> model, is fine tuning in this case the right approach?
>> One of these characters is the one for angularity:  ∠
>>
>> These symbols appear in technical drawings and should be recognised in
>> those. E.g. in the scenario in the following picture, tesseract should
>> recognize this symbol.
>>
>>
>>
>> [image: angularity.png]
>>
>> Also here is one of the pngs I tried to train with:
>> [image: angularity_0_r0.jpg]
>> They all look pretty similar to this one. The things that change are the
>> angle, the proportion and the thickness of the lines. All examples have this
>> 64x64 pixel box around them.
>>
>>
>> Is fine tuning the right approach for this scenario? I can only find
>> information on fine tuning for specific fonts. Also, for fine tuning the
>> "tesstrain" repository would not be needed, as it is used for training from
>> scratch, correct?
>> [email protected] wrote on Wednesday, 22 November 2023 at 15:27:02
>> UTC+1:
>>
>>> From my limited experience, you need a lot more data than that to train
>>> from scratch. If you can't produce more data than that, you might first try to
>>> fine tune, and then train by removing the top layer of the best model.
>>>
>>> On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 [email protected]
>>> wrote:
>>>
>>>> Since it is not really possible to combine my traineddata trained from
>>>> scratch with an existing one, I have decided to also train my model on
>>>> numbers. Therefore I wrote a script which synthetically generates
>>>> ground truth data with text2image.
>>>> This script uses dozens of different fonts and creates numbers in the
>>>> following formats:
>>>> X.XXX
>>>> X.XX
>>>> X,XX
>>>> X,XXX
>>>> I generated 10,000 files to train the numbers. Unfortunately, numbers
>>>> are recognized pretty poorly with the best model (most of the time
>>>> only "0.", "0" or "0," is recognized).
>>>> So I wanted to ask whether this is too little training (ground truth) data
>>>> for proper recognition when I train several fonts.
>>>> Thanks in advance for your help.
>>>>
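
A small sketch of how ground truth strings for the four formats listed above could be generated before passing them to text2image. It is not the poster's script; the function names and the seed are assumptions. It cycles through the formats so each is equally represented and draws every digit position uniformly, which is one thing worth checking when a model ends up recognizing little more than "0.".

```python
import random

# The four number formats from the message; "X" marks a digit position.
FORMATS = ("X.XXX", "X.XX", "X,XX", "X,XXX")


def random_number_text(fmt, rng):
    """Replace every 'X' in fmt with a random digit and keep the separators."""
    return "".join(str(rng.randrange(10)) if ch == "X" else ch for ch in fmt)


def make_ground_truth(count, seed=0):
    """Return `count` strings, cycling through FORMATS in order."""
    rng = random.Random(seed)  # fixed seed keeps the data set reproducible
    return [random_number_text(FORMATS[i % len(FORMATS)], rng) for i in range(count)]


if __name__ == "__main__":
    for line in make_ground_truth(8):
        print(line)
```

Each string would then be written to its own text file and rendered with text2image in the chosen fonts.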
>>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/fb4a1b27-db44-49a6-adfa-ada9e13030aan%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/fb4a1b27-db44-49a6-adfa-ada9e13030aan%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

