Let me try to summarize: I'm asking whether it is a good idea to train an LSTM model on preprocessed images, i.e. images with filters such as Otsu, binary thresholding, and others applied. I also couldn't find a guideline that shows an exact training sample together with its corresponding image. Should the image have a black background with white text, or the reverse? Any pointers would be much appreciated.
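For readers following along, here is a minimal sketch of the kind of preprocessing in question: Otsu thresholding followed by binarization in either polarity. It is pure Python over a flat list of 8-bit gray values, and the function names `otsu_threshold` and `binarize` are hypothetical, not taken from the thread or from Tesseract. The `invert` flag is assumed to mirror OpenCV's `THRESH_BINARY` / `THRESH_BINARY_INV` behaviour; the thread does not say which library produced the filtered images.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold (0-255) for a flat list of 8-bit gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0    # number of pixels with value <= t
    sum_low = 0  # sum of gray values of those pixels
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue          # no pixels in the lower class yet
        w_high = total - w_low
        if w_high == 0:
            break             # all pixels are in the lower class
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        # Between-class variance (up to a constant factor of 1/total^2).
        between_var = w_low * w_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold, invert=False):
    """Map each pixel to 0 or 255.

    invert=False: values above the threshold become 255 (keeps polarity).
    invert=True:  values above the threshold become 0 (flips polarity).
    """
    above, below = (0, 255) if invert else (255, 0)
    return [above if p > threshold else below for p in pixels]


if __name__ == "__main__":
    # Toy "image": four dark (text) pixels and four light (background) pixels.
    gray = [10, 12, 11, 200, 210, 205, 208, 10]
    t = otsu_threshold(gray)
    print(t, binarize(gray, t), binarize(gray, t, invert=True))
```

On polarity, the Tesseract documentation on improving output quality recommends dark text on a light background for 4.x and later, so keeping `invert=False` on a scan that is already dark-on-light is probably the safer default. Whether the same holds for LSTM training images is an assumption here, not something confirmed in this thread.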
On Friday, March 28, 2025 at 03:28:12 UTC+7, Mitya wrote:
> I am working with Tesseract OCR and want to experiment with different
> binarization methods, such as Otsu's thresholding and other custom filters,
> to improve text recognition accuracy.
>
> However, I am concerned that training with these different preprocessing
> techniques might modify or overwrite eng.traineddata, which I want to keep
> intact.
>
> *My questions are:*
> Does training a new model affect the existing eng.traineddata file?
> How can I safely train Tesseract with new filters without modifying the
> default English model?
> Is there a recommended approach to train Tesseract on preprocessed images
> while keeping eng.traineddata unchanged?
>
> *What I've tried:*
>
> I updated my current eng_new.traineddata with three samples, each with a
> different filter applied: Otsu, Otsu_Tresh_Binary, and Otsu_Tresh_Binary_Inv.
> After the first 1000 iterations, the resulting traineddata differed from the
> initial one, but it gave slightly worse results.
>
> lstmtraining \
>   --continue_from /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
>   --traineddata /home/j/trainingCurrentEng/data/eng.traineddata \
>   --train_listfile /home/j/trainingCurrentEng/data/list.train \
>   --eval_listfile /home/j/trainingCurrentEng/data/list.eval \
>   --model_output /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
>   --learning_rate 0.0001 --debug_interval 10 --max_iterations 600
>
> tesseract otsu_tresh_binary_inv.tiff output_text -l eng \
>   --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
>
> cat output_text.txt
>
> Abcd123
>
> tesseract otsu_tresh_binary_inv.tiff output_text_1 -l eng_trained \
>   --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
>
> cat output_text_1.txt
>
> Abc
>
> I would appreciate any guidance or best practices for training custom
> models without interfering with existing ones.