Can we hold an online meeting with a general invitation to those interested to discuss how to do this?
On Friday, 28 March 2025 at 7:07 PM, Lorenzo Bolzani <l.bolz...@gmail.com> wrote:

> Hi Mitya,
> tesseract is trained on black text on a white background, so I think it
> is not a good idea to use inverted samples (it is usually quite simple to
> invert the source image in case it is negative).
>
> All the tesseract models, the .traineddata files, are independent from
> each other, so when you train a new model the base model is not affected.
>
> Otsu may be a good pre-processing step, just check visually that it is
> working as expected. A simple thresholding might be better, it really
> depends on the input.
>
> The important thing is to use training samples that are as similar as
> possible to the real text that you will process, and to apply exactly the
> same preprocessing. This holds both for the images and for the text
> content, i.e. do not train only on upper case or random text if your real
> text is lowercase in a specific language.
>
> If I understand correctly, you are using only three samples just for
> testing the workflow. In this case I would use exactly the same samples
> for training and evaluation. If you use three samples for training and
> three different ones for eval, the model will focus too much on the three
> training samples (overfitting badly) and the eval result will get worse
> than with the original model.
>
> For real training use as many samples as possible (1000? 10000?) and
> randomly sample from these a subset to use for eval.
>
>
> Bye
>
> Lorenzo
>
>
> On Fri, 28 Mar 2025 at 08:13, Mitya <mityaholi...@gmail.com> wrote:
>
>> To summarize: I'm asking whether it's a good idea to train an LSTM
>> model on preprocessed images with filters such as Otsu, binary, and
>> others applied. I also couldn't find a guideline for what a sample and
>> its corresponding image should look like. Should it be a black
>> background with white text, or the reverse? Any pointers are much
>> appreciated.
>>
>>
>> On Friday, 28 March 2025 at 03:28:12 UTC+7, Mitya wrote:
>>
>>> I am working with Tesseract OCR and want to experiment with different
>>> binarization methods, such as Otsu's thresholding and other custom
>>> filters, to improve text recognition accuracy.
>>>
>>> However, I am concerned that training with these different
>>> preprocessing techniques might modify or overwrite eng.traineddata,
>>> which I want to keep intact.
>>>
>>> *My questions are:*
>>> Does training a new model affect the existing eng.traineddata file?
>>> How can I safely train Tesseract with new filters without modifying
>>> the default English model?
>>> Is there a recommended approach to train Tesseract on preprocessed
>>> images while keeping eng.traineddata unchanged?
>>>
>>> *What I've tried:*
>>>
>>> I updated my current eng_new.traineddata with three samples, each with
>>> one filter applied: Otsu, Otsu_Tresh_Binary, Otsu_Tresh_Binary_Inv.
>>> After the first 1000 iterations there was a difference between the
>>> initial and target trained.data, but the target trained.data gave
>>> slightly worse results.
>>>
>>> lstmtraining \
>>>   --continue_from /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
>>>   --traineddata /home/j/trainingCurrentEng/data/eng.traineddata \
>>>   --train_listfile /home/j/trainingCurrentEng/data/list.train \
>>>   --eval_listfile /home/j/trainingCurrentEng/data/list.eval \
>>>   --model_output /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
>>>   --learning_rate 0.0001 --debug_interval 10 --max_iterations 600
>>>
>>> tesseract otsu_tresh_binary_inv.tiff output_text -l eng \
>>>   --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
>>>
>>> cat output_text.txt
>>>
>>> Abcd123
>>>
>>> tesseract otsu_tresh_binary_inv.tiff output_text_1 -l eng_trained \
>>>   --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
>>>
>>> cat output_text_1.txt
>>>
>>> Abc
>>>
>>> I would appreciate any guidance or best practices for training custom
>>> models without interfering with existing ones.
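
Lorenzo's suggestion to collect many samples and randomly hold out a subset for eval could be sketched roughly as below. This is only an illustration, not part of tesseract: the function name split_samples, the 10% eval fraction, and the fixed seed are all assumptions, and it presumes the list files passed to --train_listfile and --eval_listfile hold one sample path per line.

```python
import random


def split_samples(paths, eval_fraction=0.1, seed=0):
    """Shuffle sample paths reproducibly and return (train, eval) lists.

    Hypothetical helper: the name, the default fraction and the seed are
    illustrative choices, not anything defined by tesseract.
    """
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be strictly between 0 and 1")
    if len(paths) < 2:
        raise ValueError("need at least two samples to split")

    # Sort first so the result depends only on the seed, not on input order.
    shuffled = sorted(paths)
    random.Random(seed).shuffle(shuffled)

    # Keep at least one sample on each side of the split.
    n_eval = round(len(shuffled) * eval_fraction)
    n_eval = min(max(1, n_eval), len(shuffled) - 1)

    return shuffled[n_eval:], shuffled[:n_eval]
```

The two returned lists can then be written out, one path per line, as list.train and list.eval. For the three-sample workflow test described above this split is not useful; there the same samples would go into both files.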

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/tesseract-ocr/CAB5aXsnWZ75YFHP7Upa9GoM8LMrXo18bEm9p95wX9rZLcfgRoA%40mail.gmail.com.