I have a field named title in my Solr schema:
<field name="title" type="text_en" termVectors="true" indexed="true"
required="true" stored="true" />
text_en is defined as follows:
<fieldType name="text_en" class="solr.TextField"
positionIncrementGap="100" docValues="false" multiValued="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords_en.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory"
preserveOriginal="true" />
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.SynonymGraphFilterFactory"
synonyms="synonyms_en.txt" ignoreCase="true" expand="true" />
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords_en.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
</fieldType>
I'm encountering strange behaviour when using multi-word synonyms which
contain stopwords.
If the stopwords appear in the middle, it works fine. For example, if I
have the following in my synonyms file (where i is a stopword):
iphone, apple i phone
and I query: /select?q=iphone&qf=title&defType=edismax
then the parsed query is: +DisjunctionMaxQuery(((((+title:appl +title:phone)
title:iphon))))
I get the same parsed query for: /select?q=apple i phone&qf=title&defType=edismax
But if a stopword appears at the start or end of a synonym, the behaviour is
unpredictable.
In most cases, the entire synonym is dropped. For example, if I
change my synonyms file to:
iphone, i phone
and do the same query again (with iphone), I get:
+DisjunctionMaxQuery(((title:iphon)))
I was expecting both iphon and phone in my dismax query, since only i
should be dropped as a stopword.
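To make that expectation concrete, here is a minimal Python sketch of how I assumed the query-side chain would behave: expand the synonym first, then drop stopwords from each alternative. This is only my mental model, not how Solr actually implements it. The function name expected_alternatives is made up for illustration, and stemming is left out, so it shows iphone rather than iphon.

```python
def expected_alternatives(query, synonym_line, stopwords):
    """Return the token lists I expected for each synonym alternative.

    Hypothetical model of my expectation, not Solr's real algorithm:
    expand the query to every entry on the synonym line, then drop
    stopwords and lower-case. Stemming is deliberately left out.
    """
    entries = [e.strip().lower() for e in synonym_line.split(",")]
    q = query.strip().lower()
    # expand="true": a query matching any entry expands to all entries
    alternatives = entries if q in entries else [q]
    stop = {s.lower() for s in stopwords}
    result = []
    for alt in alternatives:
        tokens = [t for t in alt.split() if t not in stop]
        if tokens:  # an entry made only of stopwords would vanish
            result.append(tokens)
    return result


print(expected_alternatives("iphone", "iphone, i phone", {"i"}))
# [['iphone'], ['phone']]
```

In other words, I expected the stopword to be removed from inside the synonym, leaving phone as a second alternative, rather than the whole synonym disappearing.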
In some cases, the behaviour is even stranger.
For example, if my synonyms file is:
between two ferns,netflix comedy,zach galifianakis show,netflix 2019 best
and I have ferns and best as stopwords, and I run the following query:
/select?q=netflix comedy&qf=title&defType=edismax
I get this:
+DisjunctionMaxQuery((((+title:between +title:two +title:galifianaki
+title:show) (+title:netflix +title:2019 +title:comedi))))
which is a very strange combination: each clause mixes terms from
different synonym entries.
I'm not able to understand this behaviour and have not found anything
related to it in the documentation or elsewhere online. Maybe I'm missing
something.
Any help or pointers would be highly appreciated.
Solr version: 8.4.1