Hello everyone, I'm planning to create a language pack for a language that isn't covered by the existing language packs. I don't want a new font, since my language is Latin-based. Is there a way to train a new model from just a plain training text (a language corpus), using the existing fonts of other Latin-based languages? Which steps would I need to follow for this project?
I have already found this <https://tesseract-ocr.github.io/tessdoc/tess5/TrainingTesseract-5.html> and this <https://github.com/tesseract-ocr/tesstrain>, but I'm not sure whether they are what I need (or which parts of the descriptions apply to my case). For example, the instructions say I should provide ground truth consisting of single-line images and their transcriptions. Is this really necessary for a language that doesn't introduce a new script? Or can I somehow generate "fake" training images? I also found a list of langdata folders <https://github.com/tesseract-ocr/langdata>. How do I write one for my language, and is there anything I should pay attention to while doing so? I'm sorry that this question is rather unspecific; I'm still a newbie when it comes to Tesseract training. I hope you can help me anyway, or that you know of some useful links!
Tim