Hi,

I am writing to confirm a possible bug in Solr 9.8.0 (Lucene 9.11.1) related to SynonymGraphFilterFactory. When I configure SynonymGraphFilterFactory with JapaneseTokenizerFactory as its tokenizerFactory, the filter does not appear to use the tokenizerFactory.userDictionary that I specified in its arguments. As far as I can tell, this is similar to the issue reported in https://issues.apache.org/jira/browse/SOLR-13861, which involves SimplePatternTokenizerFactory instead. The fieldType definition is included at the bottom of this email.

When I attached a debugger to my local Solr instance and set a breakpoint in SynonymGraphFilterFactory's inform() method (https://github.com/apache/lucene/blob/releases/lucene/9.11.1/lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymGraphFilterFactory.java#L135), the breakpoint was hit twice. On the first hit, the JapaneseTokenizerFactory looked as expected, with the userDictionary and mode matching my config. On the second hit, however, the arguments were all gone and the JapaneseTokenizerFactory was using its default values (userDictionary was null and mode was set to "SEARCH").
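
To make the symptom concrete, here is a minimal stand-alone Java sketch. It does not use any Lucene or Solr classes, and the class and method names (TokenizerArgsSketch, extractTokenizerArgs) are hypothetical. It only illustrates one way such a symptom could arise, under the assumption that the "tokenizerFactory.*" arguments are removed from a shared argument map the first time they are read. I have not confirmed that this is what actually happens in SynonymGraphFilterFactory.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class TokenizerArgsSketch {
    static final String PREFIX = "tokenizerFactory.";

    // Hypothetical helper (not a Lucene API): moves every "tokenizerFactory.*"
    // entry out of args into a new map with the prefix stripped.
    // Note that it mutates the map it is given.
    public static Map<String, String> extractTokenizerArgs(Map<String, String> args) {
        Map<String, String> tokArgs = new HashMap<>();
        Iterator<Map.Entry<String, String>> it = args.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> e = it.next();
            if (e.getKey().startsWith(PREFIX)) {
                tokArgs.put(e.getKey().substring(PREFIX.length()), e.getValue());
                it.remove(); // the prefixed entry is consumed here
            }
        }
        return tokArgs;
    }

    public static void main(String[] unused) {
        // Same filter arguments as in the fieldType definition in this email.
        Map<String, String> filterArgs = new HashMap<>();
        filterArgs.put("synonyms", "lang/synonyms_ja_1.txt");
        filterArgs.put("tokenizerFactory.mode", "normal");
        filterArgs.put("tokenizerFactory.userDictionary", "lang/userdict_ja_2.txt");
        filterArgs.put("tokenizerFactory.userDictionaryEncoding", "UTF-8");

        // Assumption: both passes read from the same, already-consumed map.
        Map<String, String> first = extractTokenizerArgs(filterArgs);
        Map<String, String> second = extractTokenizerArgs(filterArgs);

        System.out.println("first pass userDictionary:  " + first.get("userDictionary"));
        System.out.println("second pass userDictionary: " + second.get("userDictionary"));
    }
}
```

In this sketch the first pass sees userDictionary=lang/userdict_ja_2.txt and the second pass sees null, so a tokenizer factory built from the second map would fall back to its defaults. That matches what I observed at the breakpoint, but whether the real cause is the same would need to be checked against the Lucene and Solr source.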

Please let me know if you would like more details and whether I should create a new ticket for this issue.

Thank you,

Chunyoku Takahashi

P.S.

Here is the fieldType definition:

```
<fieldType name="text_ja_ma" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="query">
      <tokenizer class="solr.JapaneseTokenizerFactory"
        mode="normal"
        discardPunctuation="true"
        userDictionary="lang/userdict_ja_1.txt"/>

      <filter class="solr.SynonymGraphFilterFactory"
        synonyms="lang/synonyms_ja_1.txt"
        expand="true"
        ignoreCase="true"
        tokenizerFactory="solr.JapaneseTokenizerFactory"
        tokenizerFactory.mode="normal"
        tokenizerFactory.userDictionary="lang/userdict_ja_2.txt"
        tokenizerFactory.userDictionaryEncoding="UTF-8"
        />
      <filter class="solr.JapaneseBaseFormFilterFactory"/>
      <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
      <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
      <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
</fieldType>
```

The userdict_ja_2.txt contains:

```
コールセンター,コールセンター,コールセンター,カスタム名詞
予約センター,予約センター,予約センター,カスタム名詞
スカイスイート767,スカイスイート767,スカイスイート767,カスタム名詞

シーマン,シーマン,シーマン,カスタム名詞
```

The synonyms_ja_1.txt contains:

```
コールセンター,予約センター
コルセンタ,予約センタ
```

