I have a PR ready with changes (locally). I am just waiting for my JIRA account to arrive.
On Tue, Nov 26, 2024 at 2:37 PM Alex Z. <azagnio...@gmail.com> wrote: > Hi Jan, > > Thank you. I just applied for the ASF JIRA account. I will raise a ticket > once my account is approved. I can try attempt a PR as well. > > Regards > > On Tue, Nov 26, 2024 at 1:43 PM Jan Høydahl <jan....@cominvent.com> wrote: > >> Hi, >> >> Thanks for finding this. Although I have not checked the code paths you >> mention, I think this warrants a JIRA issue and a bug fix. Would you lke to >> file a JIRA issue for us, and perhaps also attempt a GitHub Pull Request >> with a fix. Ideally the PR would add a unit test that fails due to the bug >> but passes after the fix. If you're not able to contribute a PR that's ok >> as well. >> >> Jan >> >> > 26. nov. 2024 kl. 21:57 skrev Alex Z. <azagnio...@gmail.com>: >> > >> > Hello Solr Community, >> > >> > I’m seeking your feedback regarding an issue I’ve encountered when >> > configuring the Solr Langid module, specifically when using the >> deprecated >> > langid.whitelist property instead of Solr’s newer langid.allowlist >> property >> > to define allowed language codes. >> > >> > As you are likely aware, the langid.whitelist property has been >> deprecated >> > since Solr 9.0.0, and the recommended approach is to use >> langid.allowlist >> > instead. I am indeed using the langid.allowlist property, but I would >> like >> > to bring attention to an issue I’ve observed with the legacy support for >> > langid.whitelist. I believe there is a bug in the backward compatibility >> > code that could cause unintended behavior when the langid.whitelist >> > property is configured. >> > >> > To illustrate the problem, I’ll provide a detailed example based on the >> > code: >> > >> > 1. >> > >> > *The check for legacyAllowList*: In the Solr code, specifically in the >> > >> https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L123-L127 >> , >> > there is a check for the length of the legacyAllowList string. >> However, >> > the legacyAllowList is never actually used after the length check in >> the >> > code. Instead, an empty string ("") is used as the default value when >> > fetching the LANG_ALLOWLIST parameter. >> > 2. >> > >> > *Resulting issue with the langAllowlist set*: As a result, the >> Set<String> >> > langAllowlist is populated with a single element: an empty string >> (""). >> > This causes an issue when the code checks if the langAllowlist is >> empty >> > in the later part of the code ( >> > >> https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L385-L405 >> ) >> > , specifically in this section. The check langAllowlist.isEmpty() >> > incorrectly returns false because the set does contain an element - >> the >> > empty string. >> > 3. >> > >> > *Unexpected fallback behavior*: Consequently, even though the language >> > of the document might be correctly detected (for instance, if the >> document >> > is identified as being in German), the flow incorrectly enters the >> "else" >> > clause. This results in the log message: *"Detected a language not in >> > allowlist (de), using fallback en"* and the fallback language is set >> to >> > English (en), even though the document language was correctly >> identified >> > as German. >> > >> > I believe this behavior stems from a bug in the backwards compatibility >> > handling for the deprecated langid.whitelist property. If the >> > legacyAllowList value is not being properly used or passed to the >> > langAllowlist set, it leads to incorrect fallback behavior. >> > >> > I’d appreciate any insights or thoughts from the community on this >> issue. >> > Thank you in advance for your time! >> > >> > Alex >> >>