I have a PR ready with changes (locally). I am just waiting for my JIRA
account to arrive.

On Tue, Nov 26, 2024 at 2:37 PM Alex Z. <azagnio...@gmail.com> wrote:

> Hi Jan,
>
> Thank you. I just applied for the ASF JIRA account. I will raise a ticket
> once my account is approved. I can try attempt a PR as well.
>
> Regards
>
> On Tue, Nov 26, 2024 at 1:43 PM Jan Høydahl <jan....@cominvent.com> wrote:
>
>> Hi,
>>
>> Thanks for finding this. Although I have not checked the code paths you
>> mention, I think this warrants a JIRA issue and a bug fix. Would you lke to
>> file a JIRA issue for us, and perhaps also attempt a GitHub Pull Request
>> with a fix. Ideally the PR would add a unit test that fails due to the bug
>> but passes after the fix. If you're not able to contribute a PR that's ok
>> as well.
>>
>> Jan
>>
>> > 26. nov. 2024 kl. 21:57 skrev Alex Z. <azagnio...@gmail.com>:
>> >
>> > Hello Solr Community,
>> >
>> > I’m seeking your feedback regarding an issue I’ve encountered when
>> > configuring the Solr Langid module, specifically when using the
>> deprecated
>> > langid.whitelist property instead of Solr’s newer langid.allowlist
>> property
>> > to define allowed language codes.
>> >
>> > As you are likely aware, the langid.whitelist property has been
>> deprecated
>> > since Solr 9.0.0, and the recommended approach is to use
>> langid.allowlist
>> > instead. I am indeed using the langid.allowlist property, but I would
>> like
>> > to bring attention to an issue I’ve observed with the legacy support for
>> > langid.whitelist. I believe there is a bug in the backward compatibility
>> > code that could cause unintended behavior when the langid.whitelist
>> > property is configured.
>> >
>> > To illustrate the problem, I’ll provide a detailed example based on the
>> > code:
>> >
>> >   1.
>> >
>> >   *The check for legacyAllowList*: In the Solr code, specifically in the
>> >
>> https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L123-L127
>> ,
>> >   there is a check for the length of the legacyAllowList string.
>> However,
>> >   the legacyAllowList is never actually used after the length check in
>> the
>> >   code. Instead, an empty string ("") is used as the default value when
>> >   fetching the LANG_ALLOWLIST parameter.
>> >   2.
>> >
>> >   *Resulting issue with the langAllowlist set*: As a result, the
>> Set<String>
>> >   langAllowlist is populated with a single element: an empty string
>> ("").
>> >   This causes an issue when the code checks if the langAllowlist is
>> empty
>> >   in the later part of the code (
>> >
>> https://github.com/apache/solr/blob/main/solr/modules/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java#L385-L405
>> )
>> >   , specifically in this section. The check langAllowlist.isEmpty()
>> >   incorrectly returns false because the set does contain an element -
>> the
>> >   empty string.
>> >   3.
>> >
>> >   *Unexpected fallback behavior*: Consequently, even though the language
>> >   of the document might be correctly detected (for instance, if the
>> document
>> >   is identified as being in German), the flow incorrectly enters the
>> "else"
>> >   clause. This results in the log message: *"Detected a language not in
>> >   allowlist (de), using fallback en"* and the fallback language is set
>> to
>> >   English (en), even though the document language was correctly
>> identified
>> >   as German.
>> >
>> > I believe this behavior stems from a bug in the backwards compatibility
>> > handling for the deprecated langid.whitelist property. If the
>> > legacyAllowList value is not being properly used or passed to the
>> > langAllowlist set, it leads to incorrect fallback behavior.
>> >
>> > I’d appreciate any insights or thoughts from the community on this
>> issue.
>> > Thank you in advance for your time!
>> >
>> > Alex
>>
>>

Reply via email to