On Friday, September 20, 2024 at 12:03:47 PM UTC-4 cdok...@gmail.com wrote:
I'm looking into Filipino support in Tesseract OCR. It appears that at least Ñ/ñ is not supported. It should be, as you can see here <https://en.wikipedia.org/wiki/Filipino_alphabet#Alphabet>. I'm being told that other Latin characters are also used, like those in Spanish.

Is this true? The Filipino support definitely looks incomplete. Neither fil.unicharset [1] nor the training text [2] includes Ñ/ñ. Since it sounds like these characters are principally used for Spanish loan words, one solution might be to use both languages (i.e. fil+spa). You could also try the generic Latin script data.

Tom

[1] https://github.com/tesseract-ocr/langdata_lstm/blob/main/fil/fil.unicharset
[2] https://github.com/tesseract-ocr/langdata_lstm/blob/main/fil/fil.training_text
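
For anyone who wants to repeat the check, here is a rough Python sketch, not an official tool. It assumes the usual unicharset layout (first line is the entry count, and each following line starts with the symbol), and the helper names and the `SAMPLE` text are made up for illustration rather than copied from the real fil.unicharset. The second helper only builds the argument list for the standard `tesseract <image> <outputbase> -l <langs>` invocation; note that Tesseract's code for Spanish is `spa`.

```python
def unicharset_symbols(text):
    """Return the set of symbols listed in unicharset-style text.

    Assumes line 1 is the entry count and every later line begins
    with the symbol, followed by whitespace-separated properties.
    """
    lines = text.splitlines()
    return {line.split()[0] for line in lines[1:] if line.strip()}


def missing_chars(unicharset_text, required):
    """List the characters in `required` that the unicharset lacks."""
    symbols = unicharset_symbols(unicharset_text)
    return [ch for ch in required if ch not in symbols]


def tesseract_command(image, output_base, languages):
    """Build a tesseract CLI argument list combining several languages."""
    return ["tesseract", image, output_base, "-l", "+".join(languages)]


# Made-up, simplified sample; the real file has many more entries and fields.
SAMPLE = "4\nNULL 0 Common 0\na 3 Latin\nn 3 Latin\nN 5 Latin\n"

if __name__ == "__main__":
    print(missing_chars(SAMPLE, "Ññ"))
    print(" ".join(tesseract_command("page.png", "out", ["fil", "spa"])))
```

To test against the real data, download fil.unicharset from [1], read it with `open(path, encoding="utf-8").read()`, and pass the text to `missing_chars`. The generated command can be run with `subprocess.run` once both the `fil` and `spa` traineddata files are installed.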