Re: [tesseract-ocr] Re: Armenian.traineddata hye language tesseract

Des Bw Fri, 20 Oct 2023 05:49:09 -0700

I have exactly the same problem as you, and I am not a Tesseract specialist 
either. I have been experimenting with various setups. 
Training from a layer (cutting off the top layer and retraining it) seems to 
be the best option for introducing a missing character. But I am still 
struggling, because I am not getting the same accuracy as the default Best 
model.
- I have been training on 400,000 text lines. The model gives good accuracy 
on the synthetic data, but terrible output on scanned documents.  
Training Tesseract is a very daunting task. I spent many weeks on it and did 
not get satisfactory results. You need to experiment with various setups and 
see the outcomes.
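
For anyone who wants to try training from a layer, here is a minimal sketch of how the lstmtraining call could be put together. It only builds and prints the command, it does not run it. All file paths are hypothetical, the helper name layer_retrain_command is made up for this example, and the append_index and net_spec defaults are assumptions that have to match the network of the model you start from. It also assumes the base hye.lstm was first extracted from the Best model with combine_tessdata -e, and that the new traineddata was built with a unicharset containing the missing character.

```python
import shlex
from typing import List


def layer_retrain_command(
    base_lstm: str,
    new_traineddata: str,
    train_listfile: str,
    model_output: str,
    append_index: int = 5,            # assumption: keep layers below this index
    net_spec: str = "[Lfx256 O1c1]",  # assumption: new top layers to train
    max_iterations: int = 10000,      # assumption: tune for your data
) -> List[str]:
    """Build (but do not run) an lstmtraining command line.

    The command keeps the lower layers of base_lstm up to append_index
    and trains the layers described by net_spec on top of them.
    """
    if append_index < 0:
        raise ValueError("append_index must be non-negative")
    return [
        "lstmtraining",
        "--continue_from", base_lstm,
        "--traineddata", new_traineddata,
        "--append_index", str(append_index),
        "--net_spec", net_spec,
        "--train_listfile", train_listfile,
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = layer_retrain_command(
        base_lstm="extracted/hye.lstm",
        new_traineddata="train/hye/hye.traineddata",
        train_listfile="train/hye.training_files.txt",
        model_output="output/hye_layer",
    )
    # Print a shell-quoted command so it can be reviewed before running.
    print(shlex.join(cmd))
```

Printing the command first makes it easy to check the paths and the network spec against the training documentation before starting a long run.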


On Friday, October 20, 2023 at 3:43:04 PM UTC+3 Des Bw wrote:

>
>    - Fine tune. Starting with an existing trained language, train on your 
>    specific additional data. This may work for problems that are close to the 
>    existing training data, but different in some subtle way, like a 
>    particularly unusual font. May work with even a small amount of training 
>    data.
>    - Cut off the top layer (or some arbitrary number of layers) from the 
>    network and retrain a new top layer using the new data. If fine tuning 
>    doesn’t work, this is most likely the next best option. Cutting off the 
>    top layer could still work for training a completely new language or 
>    script, if you start with the most similar looking script.
>    - Retrain from scratch. This is a daunting task, unless you have a 
>    very representative and sufficiently large training set for your problem. 
>    If not, you are likely to end up with an over-fitted network that does 
>    really well on the training data, but not on the actual data.
>
> https://tesseract-ocr.github.io/tessdoc/tess5/TrainingTesseract-5.html
>
>
> On Friday, October 20, 2023 at 1:44:40 PM UTC+3 [email protected] wrote:
>
>> I have no idea what you mean by 'cut off the top layer'.
>> Can I find documentation about this process somewhere? 
>> I am a Tesseract user, not (yet) a Tesseract specialist.
>>
>> On Sun, Oct 15, 2023 at 08:39, Des Bw <[email protected]> wrote:
>>
>>> Check the conversation in this forum where Schree trained the Norwegian 
>>> data to include the missing letter Æ. I used this method to train for 
>>> Amharic, and it worked for me. 
>>> Basically, the method is to cut off the top layer of the network and 
>>> train from there. 
>>> Fine tuning doesn't work for adding missing letters. 
>>>
>>> On Sunday, October 8, 2023 at 9:38:57 PM UTC+3 [email protected] wrote:
>>>
>>>> I found that the official hye.traineddata does not include the և 
>>>> letter. 
>>>> Has anyone else experienced the same problem? If so, what is the 
>>>> workaround?
>>>>
>>>> Thanks for any answer. 
>>>>
>>>>

