Hello Solr Community,

I would like to propose the addition of another language detector to the
Solr Langid module, specifically one that utilizes the Lingua library
<https://github.com/pemistahl/lingua> by Peter M. Stahl.

Please have a look at its README; Peter has provided a lot of information
about the library. One aspect that sets it apart is that Lingua uses
n-grams of sizes 1 to 5
<https://github.com/pemistahl/lingua?tab=readme-ov-file#5-why-is-it-better-than-other-libraries>,
which results in more accurate language predictions than the current
default language detector in Solr (which uses the 3-gram-only approach of
https://github.com/shuyo/language-detection).
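
To illustrate what "n-grams of sizes 1 to 5" means, here is a minimal
sketch that collects the character n-grams of a string. It is not Lingua's
implementation; the class name NGrams and the method extract are made up for
this example, and a real detector would score such n-grams against
per-language models.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lingua or Solr.
class NGrams {
    // Collects every character n-gram of length minSize..maxSize from the text.
    static List<String> extract(String text, int minSize, int maxSize) {
        List<String> grams = new ArrayList<>();
        for (int n = minSize; n <= maxSize; n++) {
            // Slide a window of width n over the text.
            for (int i = 0; i + n <= text.length(); i++) {
                grams.add(text.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(extract("solr", 1, 5));
    }
}
```

For the four-letter input "solr" this yields ten n-grams (four 1-grams,
three 2-grams, two 3-grams, one 4-gram and no 5-grams), whereas a 3-gram-only
approach would see just "sol" and "olr".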

To be clear, I am not suggesting replacing the current default language
detector, but rather adding a Lingua detector as an alternative alongside
the existing Apache Tika and OpenNLP detectors.

If there is interest in this proposal, I would be happy to create a ticket
and submit a pull request.

Thank you,
Alex
