[ https://issues.apache.org/jira/browse/TIKA-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311795#comment-17311795 ]
Tim Allison commented on TIKA-3340:
-----------------------------------

Ah, interesting. Thank you for the background. Are there other languages available in https://wortschatz.uni-leipzig.de/en/download that I should add while I'm adding Burmese? It would be easiest to stick with that corpus because that's what we've been using.

We currently cover these languages: https://github.com/apache/tika/tree/main/tika-eval/tika-eval-core/src/main/resources/common_tokens

However, I recently came across this: https://oscar-corpus.com/, whose quality was recently assessed here: https://arxiv.org/abs/2103.12028

> LanguageProfile for Myanmar
> ---------------------------
>
>                 Key: TIKA-3340
>                 URL: https://issues.apache.org/jira/browse/TIKA-3340
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>            Reporter: Arky
>            Priority: Major
>
> A language profile for detecting Myanmar/Burmese (my).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)