Re: [tesseract-ocr] Re: Would training Tesseract with different binarization filters affect eng.traineddata?

Fish Money Sat, 29 Mar 2025 02:05:07 -0700

Yes, please message me on WhatsApp:
+66820510893

On Sat, Mar 29, 2025, 12:15 محمود محمد <mahmoudmm55...@gmail.com> wrote:


> Can we hold an online meeting with a general invitation to those
> interested to discuss how to do this?
>
> On Friday, 28 March 2025, 7:07 PM Lorenzo Bolzani <l.bolz...@gmail.com> wrote:
>
>> Hi Mitya,
>> Tesseract is trained on black text on a white background, so I think it is
>> not a good idea to use inverted samples (it is usually quite simple to
>> invert the source image if it is a negative).
>>
>> All the tesseract models, the .traineddata files, are independent from
>> each other so when you train a new model the base model is not affected.
>>
>> Otsu may be a good pre-processing step; just check visually that it is
>> working as expected. A simple fixed threshold might be better; it really
>> depends on the input.
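
A minimal sketch of the two preprocessing ideas discussed above (Otsu thresholding, and making sure the result is black text on a white background), written in plain Python over a flat list of 8-bit grayscale values. The function names `otsu_threshold` and `binarize_black_on_white` are illustrative and not part of Tesseract. The polarity check assumes that text covers less area than the background, which may not hold for every input.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold t for 8-bit grayscale values (0..255).

    Pixels <= t form one class and pixels > t the other; t maximizes the
    between-class variance.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break     # all pixels are in the lower class
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant factor dropped).
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize_black_on_white(pixels):
    """Binarize with Otsu, then invert if the result looks like a negative."""
    t = otsu_threshold(pixels)
    binary = [255 if p > t else 0 for p in pixels]
    # Assumption: background dominates, so mostly-black means inverted polarity.
    if binary.count(0) > len(binary) / 2:
        binary = [255 - b for b in binary]
    return binary
```

Whatever variant is chosen, the same function would be applied to both the training images and the images passed to Tesseract at recognition time.
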
>>
>> The important thing is to use training samples that are as similar as
>> possible to the real text that you will process, and to apply exactly the
>> same preprocessing to both. This applies to the images and to the text
>> content alike, i.e. do not train only on upper case or random text if your
>> real text is lowercase in a specific language.
>>
>> If I understand correctly, you are using only three samples just for
>> testing the workflow. In this case I would use exactly the same samples for
>> training and evaluation. If you use 3 samples for training and three
>> different ones for eval the model will focus too much on the three training
>> samples (overfitting badly) and the eval result will get worse than the
>> original model.
>>
>> For real training use as many samples as possible (1000? 10000?) and
>> randomly sample from these a subset to use for eval.
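
A minimal sketch of the random train/eval split described above, using only the Python standard library. The helper names `split_train_eval` and `write_list`, the 10% eval fraction and the fixed seed are assumptions for illustration. The output files would correspond to the list.train and list.eval files passed to lstmtraining in the original post.

```python
import random


def split_train_eval(samples, eval_fraction=0.1, seed=0):
    """Shuffle the samples and hold out a random fraction for evaluation.

    Returns (train, eval) as two disjoint lists. A fixed seed keeps the
    split reproducible between runs.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    # Hold out at least one sample so the eval list is not empty.
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


def write_list(path, items):
    """Write one sample path per line."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(item + "\n")
```

With, say, 1000 sample paths this yields 900 training and 100 evaluation entries, which can then be written out with `write_list("list.train", train)` and `write_list("list.eval", eval_items)`.
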
>>
>>
>> Bye
>>
>> Lorenzo
>>
>>
>> On Fri, 28 Mar 2025 at 08:13 Mitya <mityaholi...@gmail.com> wrote:
>>
>>> To summarize: I'm asking whether it is a good idea to train an LSTM
>>> model on preprocessed images with filters such as Otsu, binary
>>> thresholding and others applied. I also could not find a guideline for
>>> what a sample and its corresponding image should look like: should it be
>>> a black background with white text, or the reverse? Any pointers are
>>> much appreciated.
>>>
>>>
>>> On Friday, 28 March 2025 at 03:28:12 UTC+7, Mitya wrote:
>>>
>>>> I am working with Tesseract OCR and want to experiment with different
>>>> binarization methods, such as Otsu's thresholding and other custom filters,
>>>> to improve text recognition accuracy.
>>>>
>>>> However, I am concerned that training with these different
>>>> preprocessing techniques might modify or overwrite eng.traineddata, which I
>>>> want to keep intact.
>>>>
>>>> *My questions are:*
>>>> Does training a new model affect the existing eng.traineddata file?
>>>>
>>>> How can I safely train Tesseract with new filters without modifying the
>>>> default English model?
>>>>
>>>> Is there a recommended approach to train Tesseract on preprocessed
>>>> images while keeping eng.traineddata unchanged?
>>>>
>>>> *What I've tried:*
>>>>
>>>> I updated my current eng_new.traineddata with three samples, each with a
>>>> different filter applied: Otsu, Otsu_Tresh_Binary and
>>>> Otsu_Tresh_Binary_Inv. After the first 1000 iterations the resulting
>>>> traineddata differed from the initial one, but it gave slightly worse
>>>> results.
>>>>
>>>> Training command:
>>>>
>>>> lstmtraining \
>>>>   --continue_from /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
>>>>   --traineddata /home/j/trainingCurrentEng/data/eng.traineddata \
>>>>   --train_listfile /home/j/trainingCurrentEng/data/list.train \
>>>>   --eval_listfile /home/j/trainingCurrentEng/data/list.eval \
>>>>   --model_output /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
>>>>   --learning_rate 0.0001 --debug_interval 10 --max_iterations 600
>>>>
>>>> Recognition with the original model:
>>>>
>>>> tesseract otsu_tresh_binary_inv.tiff output_text -l eng \
>>>>   --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
>>>>
>>>> cat output_text.txt
>>>>
>>>> Abcd123
>>>>
>>>> Recognition with the newly trained model:
>>>>
>>>> tesseract otsu_tresh_binary_inv.tiff output_text_1 -l eng_trained \
>>>>   --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
>>>>
>>>> cat output_text_1.txt
>>>>
>>>> Abc
>>>>
>>>> I would appreciate any guidance or best practices for training custom
>>>> models without interfering with existing ones.
>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>> To view this discussion visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/2661c455-a141-4398-9542-10321a319510n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/2661c455-a141-4398-9542-10321a319510n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>

