Hello Solr Community,

I would like to propose adding another language detector to the Solr Langid module, specifically one that uses the Lingua library <https://github.com/pemistahl/lingua> by Peter M. Stahl.

Have a look at its README; Peter has provided a lot of information about his library. One aspect that sets it apart is that Lingua uses n-grams of sizes 1 to 5 <https://github.com/pemistahl/lingua?tab=readme-ov-file#5-why-is-it-better-than-other-libraries>, which results in more accurate language predictions than Solr's current default language detector, which takes a 3-gram-only approach based on the language-detection library <https://github.com/shuyo/language-detection>.

To be clear, I am not suggesting replacing the current default language detector, but rather adding a Lingua detector as an alternative, alongside the existing Apache Tika and OpenNLP detectors.

If there is interest in this proposal, I would be happy to create a ticket and submit a pull request.

Thank you,
Alex