Hi,

Thanks for finding this. Although I have not checked the code paths you mention, I think this warrants a JIRA issue and a bug fix. Would you like to file a JIRA issue for us, and perhaps also attempt a GitHub Pull Request with a fix? Ideally the PR would add a unit test that fails due to the bug but passes after the fix. If you're not able to contribute a PR, that's OK as well.
Jan

> On 26 Nov 2024, at 21:57, Alex Z. <azagnio...@gmail.com> wrote:
>
> Hello Solr Community,
>
> I’m seeking your feedback regarding an issue I’ve encountered when
> configuring the Solr Langid module, specifically when using the deprecated
> langid.whitelist property instead of Solr’s newer langid.allowlist property
> to define allowed language codes.
>
> As you are likely aware, the langid.whitelist property has been deprecated
> since Solr 9.0.0, and the recommended approach is to use langid.allowlist
> instead. I am indeed using the langid.allowlist property, but I would like
> to bring attention to an issue I’ve observed with the legacy support for
> langid.whitelist. I believe there is a bug in the backward compatibility
> code that could cause unintended behavior when the langid.whitelist
> property is configured.
>
> To illustrate the problem, I’ll provide a detailed example based on the
> code:
>
> 1. *The check for legacyAllowList*: In the Solr code, specifically at
>    https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L123-L127,
>    there is a check for the length of the legacyAllowList string. However,
>    the legacyAllowList is never actually used after the length check.
>    Instead, an empty string ("") is used as the default value when
>    fetching the LANG_ALLOWLIST parameter.
>
> 2. *Resulting issue with the langAllowlist set*: As a result, the
>    Set<String> langAllowlist is populated with a single element: an empty
>    string (""). This causes an issue when the code later checks whether
>    langAllowlist is empty (
>    https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L385-L405).
>    The check langAllowlist.isEmpty() incorrectly returns false because the
>    set does contain an element - the empty string.
>
> 3. *Unexpected fallback behavior*: Consequently, even though the language
>    of the document might be correctly detected (for instance, if the
>    document is identified as being in German), the flow incorrectly enters
>    the "else" clause. This results in the log message: *"Detected a
>    language not in allowlist (de), using fallback en"* and the fallback
>    language is set to English (en), even though the document language was
>    correctly identified as German.
>
> I believe this behavior stems from a bug in the backwards compatibility
> handling for the deprecated langid.whitelist property. If the
> legacyAllowList value is not being properly used or passed to the
> langAllowlist set, it leads to incorrect fallback behavior.
>
> I’d appreciate any insights or thoughts from the community on this issue.
> Thank you in advance for your time!
>
> Alex
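
For anyone picking this up, the three steps in the report can be sketched in isolation. This is only a minimal illustration of the behavior as described in the message, not the actual Solr code: the class AllowlistSketch, the Map-based get helper, and the methods buggy, fixed and resolve are all hypothetical stand-ins, and the proposed fix is just one possible approach. The key detail is that in Java "".split(",") returns an array holding one empty string, so the set ends up non-empty.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AllowlistSketch {
  // Hypothetical stand-in for a parameter lookup with a default value.
  static String get(Map<String, String> params, String key, String def) {
    return params.getOrDefault(key, def);
  }

  // Behavior as described in the report: the legacy value passes the length
  // check, but the set is then filled from the new parameter with "" as default.
  static Set<String> buggy(Map<String, String> params) {
    Set<String> allow = new HashSet<>();
    String legacy = get(params, "langid.whitelist", "");
    if (get(params, "langid.allowlist", legacy).length() > 0) {
      // "".split(",") yields [""], so a single empty string is added here.
      for (String lang : get(params, "langid.allowlist", "").split(",")) {
        allow.add(lang);
      }
    }
    return allow;
  }

  // One possible fix (an assumption): reuse the effective value for the split.
  static Set<String> fixed(Map<String, String> params) {
    Set<String> allow = new HashSet<>();
    String legacy = get(params, "langid.whitelist", "");
    String effective = get(params, "langid.allowlist", legacy);
    if (!effective.isEmpty()) {
      for (String lang : effective.split(",")) {
        allow.add(lang.trim());
      }
    }
    return allow;
  }

  // Simplified allowlist check: an empty allowlist accepts every language.
  static String resolve(Set<String> allow, String detected, String fallback) {
    return (allow.isEmpty() || allow.contains(detected)) ? detected : fallback;
  }

  public static void main(String[] args) {
    Map<String, String> params = Map.of("langid.whitelist", "de,en");
    System.out.println(resolve(buggy(params), "de", "en")); // prints "en"
    System.out.println(resolve(fixed(params), "de", "en")); // prints "de"
  }
}
```

A unit test in the PR could follow the same shape: configure only langid.whitelist, feed in a German document, and assert that the detected language is kept rather than replaced by the fallback.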