[ 
https://issues.apache.org/jira/browse/TIKA-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311795#comment-17311795
 ] 

Tim Allison commented on TIKA-3340:
-----------------------------------

Ah, interesting.  Thank you for the background.  Are there other languages 
available in https://wortschatz.uni-leipzig.de/en/download that I should add 
while I'm adding Burmese?

It would be easiest to stick with that corpus because that's what we've been 
using.  We currently cover these languages: 
https://github.com/apache/tika/tree/main/tika-eval/tika-eval-core/src/main/resources/common_tokens

However, I recently came across this: https://oscar-corpus.com/, whose quality 
was assessed here: https://arxiv.org/abs/2103.12028

> LanguageProfile for Myanmar
> ---------------------------
>
>                 Key: TIKA-3340
>                 URL: https://issues.apache.org/jira/browse/TIKA-3340
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>            Reporter: Arky
>            Priority: Major
>
> A language profile for detecting Myanmar/Burmese (my).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
