I have exactly the same problem as you, and I am not a Tesseract specialist either. I have been experimenting with various setups. Training from a layer seems to be the best option for introducing a missing character, but I am still struggling because I am not getting the same accuracy as the default Best model. I have been training on 400,000 text lines. The resulting model gives good accuracy on the synthetic data but terrible output on scanned documents. Training Tesseract is a very daunting task: I spent many weeks on it and did not get satisfactory results. You need to experiment with various setups and see the outcomes.
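In case it helps, here is roughly what I mean by training from a layer, written as a small Python helper that only builds the `lstmtraining` command and does not run it. Treat it as a sketch under assumptions: the file paths and the class count are hypothetical placeholders, and the `--append_index 5` and `Lfx256` values are the ones I remember from the example in the Tesseract training documentation, not settings I can vouch for with Armenian. The base `.lstm` file is assumed to have been extracted beforehand with `combine_tessdata -e`.

```python
import shlex


def layer_training_command(base_lstm, traineddata, train_listfile,
                           model_output, num_classes,
                           append_index=5, max_iterations=3000):
    """Build (but do not run) an lstmtraining command that cuts the network
    at append_index and trains new top layers on the new data."""
    # The new layers that replace everything above append_index.
    # The O1c<N> output size has to match the unicharset in `traineddata`;
    # num_classes is a value the caller must look up, not something fixed.
    net_spec = f"[Lfx256 O1c{num_classes}]"
    return [
        "lstmtraining",
        "--continue_from", base_lstm,        # model extracted from the Best traineddata
        "--traineddata", traineddata,        # starter traineddata with the new unicharset
        "--append_index", str(append_index),
        "--net_spec", net_spec,
        "--model_output", model_output,
        "--train_listfile", train_listfile,
        "--max_iterations", str(max_iterations),
    ]


# Hypothetical paths and class count, for illustration only.
cmd = layer_training_command(
    base_lstm="hye.lstm",
    traineddata="train/hye/hye.traineddata",
    train_listfile="train/hye.training_files.txt",
    model_output="output/hye_layer",
    num_classes=120,
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell once the placeholders are replaced with real files. I kept the command as a list so it can also be passed to `subprocess.run` without quoting problems.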
On Friday, October 20, 2023 at 3:43:04 PM UTC+3 Des Bw wrote:

> - Fine tune. Starting with an existing trained language, train on your specific additional data. This may work for problems that are close to the existing training data, but different in some subtle way, like a particularly unusual font. May work with even a small amount of training data.
> - Cut off the top layer (or some arbitrary number of layers) from the network and retrain a new top layer using the new data. If fine tuning doesn’t work, this is most likely the next best option. Cutting off the top layer could still work for training a completely new language or script, if you start with the most similar looking script.
> - Retrain from scratch. This is a daunting task, unless you have a very representative and sufficiently large training set for your problem. If not, you are likely to end up with an over-fitted network that does really well on the training data, but not on the actual data.
>
> https://tesseract-ocr.github.io/tessdoc/tess5/TrainingTesseract-5.html
>
> On Friday, October 20, 2023 at 1:44:40 PM UTC+3 renec...@gmail.com wrote:
>
>> I have no idea what do you mean with 'cut off the top layer' ?
>> Can I find a documentation about this process somewhere ?
>> I am a tesseract user not (yet) a tesseract specialist.
>>
>> On Sun, Oct 15, 2023 at 08:39, Des Bw <desal...@gmail.com> wrote:
>>
>>> Check the conversation in this forum where Schree trained the Norwegian data to include the missing letter Æ. I used this method to train for Amharic; and worked for me.
>>> Basically, the method is to cut off the top layer of the network and train from there.
>>> Fine tuning doesn't work for adding missing letters.
>>>
>>> On Sunday, October 8, 2023 at 9:38:57 PM UTC+3 renec...@gmail.com wrote:
>>>
>>>> I experienced that the official hye.traineddata does not include the և letter.
>>>> Does someone experience the same problem if yes, what is the turnaround ?
>>>>
>>>> Thanks for an answer
>>>>
>>>> --
>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/8b4a3db2-ef4b-4323-95a7-c62feb92937an%40googlegroups.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d14ae4ff-81bb-4596-b442-02f2cab982e4n%40googlegroups.com.