Re: [tesseract-ocr] Re: Would training Tesseract with different binarization filters affect eng.traineddata?

محمود محمد Fri, 28 Mar 2025 23:54:18 -0700

Can we hold an online meeting with a general invitation to those interested
to discuss how to do this?


On Fri, 28 Mar 2025 at 7:07 PM, Lorenzo Bolzani <l.bolz...@gmail.com> wrote:

> Hi Mitya,
> Tesseract is trained on black text on a white background, so I don't think
> it is a good idea to use inverted samples (if the source image is a
> negative, it is usually quite simple to invert it).
>
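For illustration, a minimal sketch of the inversion step in plain Python. It assumes an 8-bit grayscale image given as a list of rows; the function names and the "mostly dark means negative" heuristic are assumptions made for this example, not part of Tesseract.

```python
def invert(image):
    """Return the negative of an 8-bit grayscale image (list of rows)."""
    return [[255 - p for p in row] for row in image]


def ensure_dark_on_light(image, mid=128):
    """Invert the image if more than half of its pixels are dark.

    Assumed heuristic: in a line of text the background covers more
    area than the glyphs, so a mostly dark image is likely a negative.
    """
    pixels = [p for row in image for p in row]
    dark = sum(1 for p in pixels if p < mid)
    return invert(image) if dark > len(pixels) / 2 else image


# White "text" on a black background gets flipped to black on white.
negative = [
    [0, 0, 0, 0],
    [0, 255, 255, 0],
    [0, 0, 0, 0],
]
positive = ensure_dark_on_light(negative)
```

The heuristic can fail on very dense or very bold text, so a visual check of a few results is still worthwhile.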
> All the tesseract models, the .traineddata files, are independent from
> each other so when you train a new model the base model is not affected.
>
> Otsu may be a good preprocessing step; just check visually that it is
> working as expected. Simple thresholding might be better; it really
> depends on the input.
>
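As a rough sketch of what Otsu's method does, here is a plain-Python version. It assumes 8-bit grayscale values in a flat list; the function names are hypothetical, and in practice an image library's implementation would normally be used instead.

```python
def otsu_threshold(pixels):
    """Return the threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels with value <= t
    sum_bg = 0    # sum of those pixel values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map values <= threshold to 0 (black) and the rest to 255 (white)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Two clearly separated gray levels: dark ink and light paper.
sample = [10] * 50 + [200] * 50
t = otsu_threshold(sample)
binary = binarize(sample, t)
```

A fixed threshold is just `binarize(pixels, 128)` (or any other constant), which makes it easy to compare the two on a few real inputs.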
> The important thing is to use training samples that are as similar as
> possible to the real text you will process, and to apply exactly the same
> preprocessing to both. This applies to the text content as well as to the
> images: for example, do not train only on upper case or random text if your
> real text is lowercase in a specific language.
>
> If I understand correctly, you are using only three samples just for
> testing the workflow. In this case I would use exactly the same samples for
> training and evaluation. If you use 3 samples for training and three
> different ones for eval the model will focus too much on the three training
> samples (overfitting badly) and the eval result will get worse than the
> original model.
>
> For real training use as many samples as possible (1000? 10000?) and
> randomly sample from these a subset to use for eval.
>
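A minimal sketch of such a random split, using only the Python standard library. The function name, the 10% eval fraction, and the sample file names are assumptions chosen for the example.

```python
import random


def split_samples(samples, eval_fraction=0.1, seed=0):
    """Shuffle the samples and hold out a fraction of them for evaluation.

    Returns (train, eval). A fixed seed keeps the split reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Hypothetical sample names, standing in for the real training files.
samples = [f"sample_{i:04d}.lstmf" for i in range(1000)]
train, held_out = split_samples(samples)
```

Writing the two lists to files, one path per line, should give something usable as the list.train and list.eval files passed to lstmtraining in the command quoted below, assuming the entries are the .lstmf files generated for each sample.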
>
> Bye
>
> Lorenzo
>
>
> On Fri, 28 Mar 2025 at 08:13, Mitya <mityaholi...@gmail.com> wrote:
>
>> To summarize: I'm asking whether it is a good idea to train an LSTM model
>> on preprocessed images with filters such as Otsu, binary thresholding, and
>> others applied. I also could not find a guideline for what a sample and its
>> corresponding image should look like. Should it be a black background with
>> white text, or the reverse? Any pointers are much appreciated.
>>
>>
>> On Friday, 28 March 2025 at 03:28:12 UTC+7, Mitya wrote:
>>
>>> I am working with Tesseract OCR and want to experiment with different
>>> binarization methods, such as Otsu's thresholding and other custom filters,
>>> to improve text recognition accuracy.
>>>
>>> However, I am concerned that training with these different preprocessing
>>> techniques might modify or overwrite eng.traineddata, which I want to keep
>>> intact.
>>>
>>> *My questions are:*
>>> 1. Does training a new model affect the existing eng.traineddata file?
>>> 2. How can I safely train Tesseract with new filters without modifying the
>>>    default English model?
>>> 3. Is there a recommended approach to train Tesseract on preprocessed
>>>    images while keeping eng.traineddata unchanged?
>>>
>>> *What I've tried:*
>>>
>>> I updated my current eng_new.traineddata with three samples, each with a
>>> different filter applied: Otsu, Otsu_Tresh_Binary, and
>>> Otsu_Tresh_Binary_Inv. After the first 1000 iterations there was a
>>> difference between the initial and the target traineddata, but the target
>>> traineddata gave slightly worse results.
>>>
>>> lstmtraining \
>>>   --continue_from /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
>>>   --traineddata /home/j/trainingCurrentEng/data/eng.traineddata \
>>>   --train_listfile /home/j/trainingCurrentEng/data/list.train \
>>>   --eval_listfile /home/j/trainingCurrentEng/data/list.eval \
>>>   --model_output /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
>>>   --learning_rate 0.0001 --debug_interval 10 --max_iterations 600
>>>
>>> tesseract otsu_tresh_binary_inv.tiff output_text -l eng \
>>>   --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
>>>
>>> cat output_text.txt
>>> Abcd123
>>>
>>> tesseract otsu_tresh_binary_inv.tiff output_text_1 -l eng_trained \
>>>   --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
>>>
>>> cat output_text_1.txt
>>> Abc
>>>
>>> I would appreciate any guidance or best practices for training custom
>>> models without interfering with existing ones.
>>>
>>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To view this discussion visit
>> https://groups.google.com/d/msgid/tesseract-ocr/2661c455-a141-4398-9542-10321a319510n%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/2661c455-a141-4398-9542-10321a319510n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
