Re: [tesseract-ocr] Re: Would training Tesseract with different binarization filters affect eng.traineddata?

Mitya Sat, 05 Apr 2025 09:02:41 -0700

Hi *Lorenzo*, thanks for reaching out!
I decided to train on a single source image (without any filters), but I am 
still hitting a major issue, presumably in the set of commands used to train 
the model or (highly likely) in the step where we update eng.traineddata or 
interfere with the checkpoints!
Could you please take a look?


*To all:* please take a look as well and kindly reply in the follow-up topic:

https://groups.google.com/u/1/g/tesseract-ocr/c/X0dWjze9twc

Best Regards,
Mitya
On Friday, 28 March 2025 at 22:08:08 UTC+7, Lorenzo Blz wrote:

> Hi Mitya,
> Tesseract is trained on black text on a white background, so I think it is 
> not a good idea to use inverted samples (it is usually quite simple to 
> invert the source image if it is a negative).
>
> All the tesseract models, the .traineddata files, are independent from 
> each other so when you train a new model the base model is not affected.
>
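One way to check this independence on your own machine is to fingerprint the base model before and after a training run. The following is a minimal sketch, not part of any Tesseract tooling: the helper name `file_sha256` is hypothetical, and it only assumes that the base model is an ordinary file on disk.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large model files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Record the digest of the base eng.traineddata before training and compare it with the digest afterwards. If the training output is written under a different name, the two digests should be identical.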
> Otsu may be a good pre-processing step, just check visually that it is 
> working as expected. Simple thresholding might be better; it really 
> depends on the input.
>
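To make the Otsu step and the black-on-white polarity mentioned above concrete, here is a minimal pure-Python sketch. It is an illustration only: the function names are hypothetical, the image is assumed to be a list of rows of 8-bit gray values, and the polarity check rests on the assumption that text covers less area than the background. In practice a library routine (for example OpenCV's `cv2.threshold` with `cv2.THRESH_OTSU`) would normally do the thresholding.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a flat iterable of 8-bit gray values."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the lower class
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize_black_on_white(rows):
    """Binarize a grayscale image so that text is black (0) on white (255)."""
    flat = [value for row in rows for value in row]
    threshold = otsu_threshold(flat)
    binary = [[255 if value > threshold else 0 for value in row] for row in rows]
    # Assumption: text covers less area than the background, so if most
    # pixels came out black the input was probably light-on-dark: invert it.
    dark = sum(1 for row in binary for value in row if value == 0)
    if dark * 2 > len(flat):
        binary = [[255 - value for value in row] for row in binary]
    return binary
```

Whatever routine is used, the same function should be applied to the training images and to the images passed to Tesseract at recognition time, and the result is worth checking visually.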
> The important thing is to use training samples that are as similar as 
> possible to the real text that you will process, and to apply exactly the 
> same preprocessing. This holds both for the images and for the text content, 
> i.e. do not train only on upper case or random text if your real text is 
> lowercase in a specific language.
>
> If I understand correctly, you are using only three samples just for 
> testing the workflow. In this case I would use exactly the same samples for 
> training and evaluation. If you use 3 samples for training and three 
> different ones for eval the model will focus too much on the three training 
> samples (overfitting badly) and the eval result will get worse than the 
> original model.
>
> For real training use as many samples as possible (1000? 10000?) and 
> randomly sample from these a subset to use for eval.
>
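The random train/eval split suggested above can be sketched as follows. This is a hedged illustration rather than an official recipe: the function name `split_train_eval`, the 10% default and the fixed seed are all assumptions chosen for the example.

```python
import random


def split_train_eval(samples, eval_fraction=0.1, seed=0):
    """Shuffle sample paths reproducibly and return (train, eval) lists."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    # Sort first so the result depends only on the seed, not on input order.
    shuffled = sorted(samples)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one sample for evaluation.
    eval_count = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[eval_count:], shuffled[:eval_count]
```

The two lists can then be written out, presumably one path per line, to files such as the list.train and list.eval named in the command quoted further down. For a three-sample workflow test, the advice above is to skip the split and use the same samples in both files.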
>
> Bye
>
> Lorenzo
>
>
> On Fri, 28 Mar 2025 at 08:13, Mitya <mityah...@gmail.com> wrote:
>
>> To summarize: I'm asking whether it is a good idea to train an LSTM model 
>> on preprocessed images with filters such as Otsu, binary thresholding and 
>> others applied. I also could not find a guideline on what exactly a sample 
>> and its corresponding image should look like. Should it be a black 
>> background with white text, or the reverse? Any pointers are greatly 
>> appreciated.
>>
>>
>> On Friday, 28 March 2025 at 03:28:12 UTC+7, Mitya wrote:
>>
>>> I am working with Tesseract OCR and want to experiment with different 
>>> binarization methods, such as Otsu's thresholding and other custom filters, 
>>> to improve text recognition accuracy.
>>>
>>> However, I am concerned that training with these different preprocessing 
>>> techniques might modify or overwrite eng.traineddata, which I want to keep 
>>> intact.
>>>
>>> *My questions are:*
>>> Does training a new model affect the existing eng.traineddata file?
>>>
>>> How can I safely train Tesseract with new filters without modifying the 
>>> default English model?
>>>
>>> Is there a recommended approach to train Tesseract on preprocessed images 
>>> while keeping eng.traineddata unchanged?
>>>
>>> *What I've tried:*
>>>
>>> I updated my current eng_new.traineddata with three samples, each with a 
>>> different filter applied: Otsu, Otsu_Tresh_Binary and 
>>> Otsu_Tresh_Binary_Inv. After the first 1000 iterations there was a 
>>> difference between the initial and the target traineddata, but the target 
>>> traineddata gave slightly worse results.
>>>
>>> lstmtraining \
>>>   --continue_from /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
>>>   --traineddata /home/j/trainingCurrentEng/data/eng.traineddata \
>>>   --train_listfile /home/j/trainingCurrentEng/data/list.train \
>>>   --eval_listfile /home/j/trainingCurrentEng/data/list.eval \
>>>   --model_output /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
>>>   --learning_rate 0.0001 \
>>>   --debug_interval 10 \
>>>   --max_iterations 600
>>>
>>> tesseract otsu_tresh_binary_inv.tiff output_text -l eng \
>>>   --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
>>>
>>> cat output_text.txt
>>>
>>> Abcd123
>>>
>>> tesseract otsu_tresh_binary_inv.tiff output_text_1 -l eng_trained \
>>>   --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
>>>
>>> cat output_text_1.txt
>>>
>>> Abc
>>>
>>> I would appreciate any guidance or best practices for training custom 
>>> models without interfering with existing ones.
>>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion visit 
https://groups.google.com/d/msgid/tesseract-ocr/79e81512-e7cd-4cc6-830f-a41cd32d0a5bn%40googlegroups.com.
