Thanks for the feedback. I've already tried "fil+spa", with no success :(
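For anyone who wants to repeat the combined-language attempt, here is a minimal sketch. It assumes the `tesseract` command is on the PATH with the `fil` and `spa` traineddata installed. The helper names and the image name `sample.png` are hypothetical, made up for illustration.

```python
import subprocess


def build_cmd(image_path, langs, output_base="stdout"):
    """Build a tesseract command line; 'stdout' as output base prints the text."""
    return ["tesseract", image_path, output_base, "-l", "+".join(langs)]


def count_enye(text):
    """Count occurrences of Ñ and ñ in the recognized text."""
    return sum(text.count(ch) for ch in "Ññ")


def ocr(image_path, langs):
    """Run tesseract and return the recognized text (requires tesseract installed)."""
    result = subprocess.run(
        build_cmd(image_path, langs),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


# Hypothetical usage, not run here:
#   text = ocr("sample.png", ["fil", "spa"])
#   print(count_enye(text))
```

Comparing the count for `["fil"]` alone against `["fil", "spa"]` on the same image would show whether adding Spanish changes anything for these characters.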
One thing that worries me is that I cannot find *one* sample Filipino text image with Ñ/ñ in it, just to have an independently produced sample. All I have is a couple of small snippets of text, which produce only the plain characters.

C.D.

On Sunday, September 22, 2024 at 9:29:13 AM UTC+3 tfmo...@gmail.com wrote:

> On Friday, September 20, 2024 at 12:03:47 PM UTC-4 cdok...@gmail.com wrote:
>
> > I'm looking into Filipino support in Tesseract OCR. It appears that at least Ñ/ñ is not supported. It should be, as you can see here <https://en.wikipedia.org/wiki/Filipino_alphabet#Alphabet>.
> >
> > I'm being told that other Latin characters are also used, like those in Spanish. Is this true?
>
> The Filipino support definitely looks incomplete. Neither fil.unicharset [1] nor the training text [2] includes Ñ/ñ. Since it sounds like they are principally used for Spanish loan words, one solution might be to use both languages (i.e. fil+esp). You could also try the generic Latin script data.
>
> Tom
>
> [1] https://github.com/tesseract-ocr/langdata_lstm/blob/main/fil/fil.unicharset
> [2] https://github.com/tesseract-ocr/langdata_lstm/blob/main/fil/fil.training_text
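The unicharset check mentioned in the quoted message can be sketched roughly as follows. This is a minimal sketch under an assumption: that a unicharset is a text file whose first line is an entry count and whose remaining lines each start with the symbol followed by whitespace-separated properties. The function names and the tiny sample data are hypothetical and are not taken from the real fil.unicharset.

```python
def unicharset_symbols(text):
    """Return the set of symbols listed in unicharset-style text.

    Assumes line 1 is the entry count and every later line begins
    with the symbol itself, followed by its properties.
    """
    symbols = set()
    for line in text.splitlines()[1:]:
        parts = line.split()
        if parts:
            symbols.add(parts[0])
    return symbols


def missing_chars(text, wanted):
    """List the characters from `wanted` that the unicharset does not contain."""
    present = unicharset_symbols(text)
    return [ch for ch in wanted if ch not in present]


# Made-up sample in the assumed format, for illustration only.
sample = "3\nNULL 0 Common 0\na 3 Latin 1\nn 3 Latin 2\n"
print(missing_chars(sample, "Ññn"))
```

Running the same function on the contents of a downloaded unicharset file would show which of the wanted characters a given language's data lacks.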