Hi Jan, Thank you. I just applied for the ASF JIRA account. I will raise a ticket once my account is approved. I can try attempt a PR as well.
Regards On Tue, Nov 26, 2024 at 1:43 PM Jan Høydahl <jan....@cominvent.com> wrote: > Hi, > > Thanks for finding this. Although I have not checked the code paths you > mention, I think this warrants a JIRA issue and a bug fix. Would you lke to > file a JIRA issue for us, and perhaps also attempt a GitHub Pull Request > with a fix. Ideally the PR would add a unit test that fails due to the bug > but passes after the fix. If you're not able to contribute a PR that's ok > as well. > > Jan > > > 26. nov. 2024 kl. 21:57 skrev Alex Z. <azagnio...@gmail.com>: > > > > Hello Solr Community, > > > > I’m seeking your feedback regarding an issue I’ve encountered when > > configuring the Solr Langid module, specifically when using the > deprecated > > langid.whitelist property instead of Solr’s newer langid.allowlist > property > > to define allowed language codes. > > > > As you are likely aware, the langid.whitelist property has been > deprecated > > since Solr 9.0.0, and the recommended approach is to use langid.allowlist > > instead. I am indeed using the langid.allowlist property, but I would > like > > to bring attention to an issue I’ve observed with the legacy support for > > langid.whitelist. I believe there is a bug in the backward compatibility > > code that could cause unintended behavior when the langid.whitelist > > property is configured. > > > > To illustrate the problem, I’ll provide a detailed example based on the > > code: > > > > 1. > > > > *The check for legacyAllowList*: In the Solr code, specifically in the > > > https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L123-L127 > , > > there is a check for the length of the legacyAllowList string. However, > > the legacyAllowList is never actually used after the length check in > the > > code. Instead, an empty string ("") is used as the default value when > > fetching the LANG_ALLOWLIST parameter. > > 2. > > > > *Resulting issue with the langAllowlist set*: As a result, the > Set<String> > > langAllowlist is populated with a single element: an empty string (""). > > This causes an issue when the code checks if the langAllowlist is empty > > in the later part of the code ( > > > https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L385-L405 > ) > > , specifically in this section. The check langAllowlist.isEmpty() > > incorrectly returns false because the set does contain an element - the > > empty string. > > 3. > > > > *Unexpected fallback behavior*: Consequently, even though the language > > of the document might be correctly detected (for instance, if the > document > > is identified as being in German), the flow incorrectly enters the > "else" > > clause. This results in the log message: *"Detected a language not in > > allowlist (de), using fallback en"* and the fallback language is set to > > English (en), even though the document language was correctly > identified > > as German. > > > > I believe this behavior stems from a bug in the backwards compatibility > > handling for the deprecated langid.whitelist property. If the > > legacyAllowList value is not being properly used or passed to the > > langAllowlist set, it leads to incorrect fallback behavior. > > > > I’d appreciate any insights or thoughts from the community on this issue. > > Thank you in advance for your time! > > > > Alex > >