[tesseract-ocr] Creating a new language pack

TiMauzi Wed, 22 Jun 2022 10:26:36 -0700

Hello everyone,

I am planning to create a language pack for a language that is not covered
by the existing language packs. I don't need a new font, since my language
uses the Latin script. Is there a way to train a new model from just a plain
training text (a language corpus), reusing the fonts of other Latin-script
languages? What steps would I need to follow for this project?


I have already found this
<https://tesseract-ocr.github.io/tessdoc/tess5/TrainingTesseract-5.html> and
this <https://github.com/tesseract-ocr/tesstrain>, but I'm not sure whether
they are what I need (or which parts of these descriptions I need). For
example, they say I should provide ground truth consisting of single-line
images and their transcriptions. Is this really necessary for a language
that doesn't use a new script? Or can I somehow generate "fake" training
images?
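
To make the question concrete, here is a minimal sketch of how I understand
the transcription half of that ground truth: one ".gt.txt" file per corpus
line, placed in a "<model>-ground-truth" folder as the tesstrain README
describes. The function name, the "line_00000" file naming and the model
name "foo" are my own placeholders, not anything prescribed by tesstrain.

```python
import tempfile
from pathlib import Path


def write_ground_truth(corpus_text, out_dir, model_name="foo"):
    """Write one .gt.txt transcription file per non-empty corpus line.

    Assumption: tesstrain looks for <model_name>-ground-truth/ containing
    pairs of <name>.gt.txt and <name>.png (or .tif). Only the text half
    is produced here; the images are NOT created.
    """
    gt_dir = Path(out_dir) / f"{model_name}-ground-truth"
    gt_dir.mkdir(parents=True, exist_ok=True)

    # Drop blank lines and surrounding whitespace from the corpus.
    lines = [ln.strip() for ln in corpus_text.splitlines() if ln.strip()]

    written = []
    for i, line in enumerate(lines):
        # Hypothetical naming scheme; any unique base name should do.
        path = gt_dir / f"line_{i:05d}.gt.txt"
        path.write_text(line + "\n", encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = "First sample line.\nSecond sample line.\n"
        for p in write_ground_truth(sample, tmp):
            print(p.name)
```

Each of these files would still need a matching single-line image with the
same base name, and producing those images is exactly the part I am unsure
about.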

I also found a list of langdata folders 
<https://github.com/tesseract-ocr/langdata> -- how do I write one for my 
language and is there anything I should pay attention to while doing so?

I'm sorry that this question is rather unspecific; I am still a newbie
when it comes to Tesseract training. I hope you can help me anyway, or
point me to some useful links!

Tim

