Hi,

Thanks for finding this. Although I have not checked the code paths you 
mention, I think this warrants a JIRA issue and a bug fix. Would you lke to 
file a JIRA issue for us, and perhaps also attempt a GitHub Pull Request with a 
fix. Ideally the PR would add a unit test that fails due to the bug but passes 
after the fix. If you're not able to contribute a PR that's ok as well.

Jan

> 26. nov. 2024 kl. 21:57 skrev Alex Z. <azagnio...@gmail.com>:
> 
> Hello Solr Community,
> 
> I’m seeking your feedback regarding an issue I’ve encountered when
> configuring the Solr Langid module, specifically when using the deprecated
> langid.whitelist property instead of Solr’s newer langid.allowlist property
> to define allowed language codes.
> 
> As you are likely aware, the langid.whitelist property has been deprecated
> since Solr 9.0.0, and the recommended approach is to use langid.allowlist
> instead. I am indeed using the langid.allowlist property, but I would like
> to bring attention to an issue I’ve observed with the legacy support for
> langid.whitelist. I believe there is a bug in the backward compatibility
> code that could cause unintended behavior when the langid.whitelist
> property is configured.
> 
> To illustrate the problem, I’ll provide a detailed example based on the
> code:
> 
>   1.
> 
>   *The check for legacyAllowList*: In the Solr code, specifically in the
>   
> https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L123-L127,
>   there is a check for the length of the legacyAllowList string. However,
>   the legacyAllowList is never actually used after the length check in the
>   code. Instead, an empty string ("") is used as the default value when
>   fetching the LANG_ALLOWLIST parameter.
>   2.
> 
>   *Resulting issue with the langAllowlist set*: As a result, the Set<String>
>   langAllowlist is populated with a single element: an empty string ("").
>   This causes an issue when the code checks if the langAllowlist is empty
>   in the later part of the code (
>   
> https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L385-L405)
>   , specifically in this section. The check langAllowlist.isEmpty()
>   incorrectly returns false because the set does contain an element - the
>   empty string.
>   3.
> 
>   *Unexpected fallback behavior*: Consequently, even though the language
>   of the document might be correctly detected (for instance, if the document
>   is identified as being in German), the flow incorrectly enters the "else"
>   clause. This results in the log message: *"Detected a language not in
>   allowlist (de), using fallback en"* and the fallback language is set to
>   English (en), even though the document language was correctly identified
>   as German.
> 
> I believe this behavior stems from a bug in the backwards compatibility
> handling for the deprecated langid.whitelist property. If the
> legacyAllowList value is not being properly used or passed to the
> langAllowlist set, it leads to incorrect fallback behavior.
> 
> I’d appreciate any insights or thoughts from the community on this issue.
> Thank you in advance for your time!
> 
> Alex

Reply via email to