RE: Adding another language detector into the toolset of Solr Langid module

Alex Z. Fri, 29 Nov 2024 12:18:11 -0800

Hello Solr Community,

I would like to propose adding another language detector to the Solr
Langid module, specifically one based on the Lingua library
<https://github.com/pemistahl/lingua> by Peter M. Stahl.


Please have a look at its README; Peter has documented the library in
detail. One aspect that sets Lingua apart is that it uses n-grams of
sizes 1 to 5
<https://github.com/pemistahl/lingua?tab=readme-ov-file#5-why-is-it-better-than-other-libraries>,
which, according to the README, results in more accurate language
predictions than the current default language detection method in Solr
(which takes a 3-gram-only approach based on
https://github.com/shuyo/language-detection).
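
To make the n-gram point concrete, here is a minimal, self-contained
sketch of character n-gram extraction. It is not Lingua's or Solr's
actual code; the class name NGramSketch and the method charNGrams are
hypothetical and exist only for illustration. It shows how many more
features a 1-to-5 range yields from a short input than 3-grams alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from Lingua or Solr.
public class NGramSketch {

    // Collect every character n-gram of length minN..maxN from the text,
    // shortest lengths first, left to right within each length.
    static List<String> charNGrams(String text, int minN, int maxN) {
        List<String> grams = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                grams.add(text.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        String word = "solr";
        System.out.println("3-grams only:  " + charNGrams(word, 3, 3));
        System.out.println("1- to 5-grams: " + charNGrams(word, 1, 5));
    }
}
```

For the four-letter input "solr", the 3-gram-only view has just two
features to work with, while the 1-to-5 range has ten. Short inputs such
as queries or titles are where that extra signal should matter most.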

To be clear, I am not suggesting replacing the current default language
detector, but rather adding a Lingua detector as an alternative, alongside
the existing Apache Tika and OpenNLP detectors.

If there is interest in this proposal, I would be happy to create a ticket
and submit a pull request.

Thank you,
Alex
