Stopwords removal strange behavior

Ricardo Soto Estévez Thu, 03 Mar 2022 03:27:23 -0800

Hi,
at my workplace we have been facing a curious dilemma. There is a very
problematic word that we want to remove from queries. We decided to add
it to our stopwords list, but when the query consists of that word alone,
the word is not removed, because the query would otherwise be empty. This
is understandable behavior: if a query contains only stopwords, they carry
its whole meaning, so we should search against them. However, we really
need to get rid of this particular word.


We found that the stop filter itself does remove a lone token, but the
token survives when the query goes through eDisMax. According to the
documentation of this query parser, it effectively overrides the stop
filter, which gives the behavior described above. We tried using the
stopwords parameter to turn that override off, but it doesn't work.
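
As an illustration only, this is roughly how the eDisMax stopwords
parameter can be set as a handler default in solrconfig.xml. The handler
name "/select", the field name "text" and the value shown are assumptions
made for the sketch, not our actual configuration.

```xml
<!-- Hypothetical handler; name, qf field and flag value are placeholders -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Parse the q parameter with the eDisMax query parser -->
    <str name="defType">edismax</str>
    <!-- Assumed field whose query analyzer contains the stop filter -->
    <str name="qf">text</str>
    <!-- eDisMax stopwords parameter (defaults to true) -->
    <bool name="stopwords">false</bool>
  </lst>
</requestHandler>
```

The same parameter can also be passed per request, for example as
stopwords=false in the query string.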

So we set out to write a custom stopwords filter, and while doing so we
found that chaining two consecutive stop filters does remove the word. We
can even give those filters different word lists, so that only the
problematic word is always deleted and the other stopwords are left alone
when queried on their own. *Why does this work like this?*

Here is our query pipeline:

<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="dropwords.txt"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>

stopwords.txt contains the common English stopwords and dropwords.txt
contains just "the" (for example). So a query consisting only of "of" or
"a" keeps the token, but a query consisting only of "the" does not. We are
using Solr 7.7, by the way.

Thank you so much. I would appreciate your input on this.

-- 
*Ricardo Soto Estévez* <ricar...@empathy.co>
Backend Engineer
