Yes, please message me on WhatsApp: +66820510893

On Sat, Mar 29, 2025, 12:15 محمود محمد <mahmoudmm55...@gmail.com> wrote:
> Can we hold an online meeting, with a general invitation to anyone
> interested, to discuss how to do this?
>
> On Friday, March 28, 2025, 7:07 PM Lorenzo Bolzani <l.bolz...@gmail.com> wrote:
>
>> Hi Mitya,
>> Tesseract is trained on black text on a white background, so I think it
>> is not a good idea to use inverted samples (it is usually quite simple to
>> invert the source image if it is a negative).
>>
>> All the Tesseract models, the .traineddata files, are independent of
>> each other, so when you train a new model the base model is not affected.
>>
>> Otsu may be a good pre-processing step; just check visually that it is
>> working as expected. Simple thresholding might be better; it really
>> depends on the input.
>>
>> The important thing is to use training samples that are as similar as
>> possible to the real text that you will process, and to apply exactly the
>> same preprocessing. This applies both to the images and to the text
>> content, i.e. do not train only on upper case or random text if your real
>> text is lowercase in a specific language.
>>
>> If I understand correctly, you are using only three samples just to
>> test the workflow. In this case I would use exactly the same samples for
>> training and evaluation. If you use three samples for training and three
>> different ones for eval, the model will focus too much on the three
>> training samples (overfitting badly) and the eval result will be worse
>> than that of the original model.
>>
>> For real training, use as many samples as possible (1000? 10000?) and
>> randomly sample a subset of them to use for eval.
>>
>>
>> Bye
>>
>> Lorenzo
>>
>>
>> On Fri, 28 Mar 2025 at 08:13, Mitya <mityaholi...@gmail.com> wrote:
>>
>>> To summarize: I'm asking whether it's a good idea to train an LSTM
>>> model on preprocessed images with filters such as Otsu, binary
>>> thresholding, and others applied. I also couldn't find a guideline on
>>> what exactly a sample and its corresponding image should look like.
>>> Should it be a black background with white text, or the reverse?
>>> Also, any pointers would be much appreciated.
>>>
>>>
>>> On Friday, 28 March 2025 at 03:28:12 UTC+7, Mitya wrote:
>>>
>>>> I am working with Tesseract OCR and want to experiment with different
>>>> binarization methods, such as Otsu's thresholding and other custom
>>>> filters, to improve text recognition accuracy.
>>>>
>>>> However, I am concerned that training with these different
>>>> preprocessing techniques might modify or overwrite eng.traineddata,
>>>> which I want to keep intact.
>>>>
>>>> *My questions are:*
>>>> Does training a new model affect the existing eng.traineddata file?
>>>> How can I safely train Tesseract with new filters without modifying the
>>>> default English model?
>>>> Is there a recommended approach to train Tesseract on preprocessed
>>>> images while keeping eng.traineddata unchanged?
>>>>
>>>> *What I've tried:*
>>>>
>>>> I updated my current eng_new.traineddata with three samples, each with
>>>> a different filter applied: Otsu, Otsu_Tresh_Binary, and
>>>> Otsu_Tresh_Binary_Inv. After the first 1000 iterations there was a
>>>> difference between the initial and the target traineddata, but the
>>>> target traineddata gave slightly worse results.
>>>> lstmtraining \
>>>>   --continue_from /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
>>>>   --traineddata /home/j/trainingCurrentEng/data/eng.traineddata \
>>>>   --train_listfile /home/j/trainingCurrentEng/data/list.train \
>>>>   --eval_listfile /home/j/trainingCurrentEng/data/list.eval \
>>>>   --model_output /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
>>>>   --learning_rate 0.0001 \
>>>>   --debug_interval 10 \
>>>>   --max_iterations 600
>>>>
>>>> tesseract otsu_tresh_binary_inv.tiff output_text -l eng \
>>>>   --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
>>>>
>>>> cat output_text.txt
>>>>
>>>> Abcd123
>>>>
>>>> tesseract otsu_tresh_binary_inv.tiff output_text_1 -l eng_trained \
>>>>   --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
>>>>
>>>> cat output_text_1.txt
>>>>
>>>> Abc
>>>>
>>>> I would appreciate any guidance or best practices for training custom
>>>> models without interfering with existing ones.
>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>> To view this discussion visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/2661c455-a141-4398-9542-10321a319510n%40googlegroups.com
>>> .
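
A note on the advice above to randomly sample an eval subset from the full set of training samples. Here is a minimal Python sketch of that split, assuming the samples are .lstmf files listed one path per line in list.train and list.eval (the list file names used in the lstmtraining command in this thread). The function names, the sample paths, the 10% eval fraction, and the seed are illustrative assumptions, not anything Tesseract prescribes.

```python
import random
from pathlib import Path


def split_samples(paths, eval_fraction=0.1, seed=0):
    """Shuffle sample paths and return (train, eval) lists with no overlap."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be strictly between 0 and 1")
    if len(paths) < 2:
        raise ValueError("need at least two samples to split")
    # Sort first so the result depends only on the seed, not on input order.
    shuffled = sorted(paths)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one sample on each side of the split.
    n_eval = min(len(shuffled) - 1, max(1, round(len(shuffled) * eval_fraction)))
    return shuffled[n_eval:], shuffled[:n_eval]


def write_list(list_path, samples):
    """Write one sample path per line, the layout assumed for the list files."""
    Path(list_path).write_text("".join(f"{s}\n" for s in samples), encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical sample paths; replace with your real .lstmf files.
    samples = [f"data/sample_{i:04d}.lstmf" for i in range(1000)]
    train, eval_ = split_samples(samples, eval_fraction=0.1, seed=42)
    print(len(train), "training samples,", len(eval_), "eval samples")
```

To produce the actual list files you would then call write_list("list.train", train) and write_list("list.eval", eval_). Fixing the seed keeps the split reproducible, so results from different training runs stay comparable.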