Solr 8.6 & autoGeneratePhraseQueries

Vincent Brillault Thu, 05 Aug 2021 01:33:18 -0700

Hi!

TL;DR: I'm trying to understand exactly what `autoGeneratePhraseQueries`
does and why it's triggering in my context.



I'm running a Solr 8.6 cluster, used as a Full Text Search backend for
Dovecot. Queries are generated from IMAP searches as per
https://github.com/dovecot/core/blob/master/src/plugins/fts-solr/fts-backend-solr.c
and seem to use only a very small subset of Solr's query capabilities.
For example, searching for `covid19 update` triggers the following
query/log on one of the servers:
```
webapp=/solr path=/select
params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=16388&start=0&fsv=true&sort=uid+asc&fq=%2Bbox:<ID>+%2Buser:<USER>&shard.url=<SHARD_URL>&rows=5763&version=2&q={!lucene+q.op%3DAND}body:covid19+AND+body:update&omitHeader=false&NOW=1628151432177&isShard=true&wt=javabin}
hits=21 status=0 QTime=62
```
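
For readability, the `q` and `fq` parameters from that log line can be
URL-decoded. A minimal sketch using Python's standard library (the
variable names are mine; the encoded values are copied from the log
above, placeholders included):

```python
from urllib.parse import unquote_plus

# Encoded values copied verbatim from the log line (placeholders kept as-is).
raw_q = "{!lucene+q.op%3DAND}body:covid19+AND+body:update"
raw_fq = "%2Bbox:<ID>+%2Buser:<USER>"

# unquote_plus turns '+' into a space and resolves %XX escapes.
decoded_q = unquote_plus(raw_q)
decoded_fq = unquote_plus(raw_fq)

print(decoded_q)   # {!lucene q.op=AND}body:covid19 AND body:update
print(decoded_fq)  # +box:<ID> +user:<USER>
```

So the user-visible part of the query is just two `body:` terms joined
with AND; there is no explicit phrase in what Dovecot sends.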

What is weird/interesting to me is that this query, if
`autoGeneratePhraseQueries` is enabled, seems to generate a phrase query
for `(phrase=body:"covid 19")`, as I learned when setting
omitTermFreqAndPositions and omitPositions. I'm guessing this is due to
the analyser I have configured (I've attached the schema I'm using), but
confirmation & recommendations would be appreciated!

If I understand things properly, searching for `covid19` with
autoGeneratePhraseQueries enabled turns it into the phrase "covid 19".
Without autoGeneratePhraseQueries, is it correct that my analyser
configuration splits it into the terms `covid` and `19` and matches them
independently? That's what I understood from searching for `covid23`,
which seems to match text containing `covid` and `23` in any order and
at any distance from each other... Any recommendations to improve the
analyser without enabling position data?
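
To make my mental model concrete, here is a rough sketch of the two
behaviours I think I'm seeing. It is not Solr code: the function names
are hypothetical, the splitting rule is a simplified stand-in for what
I assume `splitOnNumerics` does, and stemming, stop words and synonyms
are ignored.

```python
import re

def split_on_numerics(token):
    """Split a token at letter/digit boundaries: 'covid19' -> ['covid', '19']."""
    return re.findall(r"[a-z]+|[0-9]+", token.lower())

def matches_all_terms(doc_tokens, parts):
    """Position-less match: every part occurs somewhere in the document."""
    return all(part in doc_tokens for part in parts)

def matches_phrase(doc_tokens, parts):
    """Phrase match: the parts occur adjacent and in order (needs positions)."""
    n = len(parts)
    return any(doc_tokens[i:i + n] == parts
               for i in range(len(doc_tokens) - n + 1))

parts = split_on_numerics("covid23")

# 'covid' and '23' both appear, but far apart and in the wrong order.
scattered = "on the 23 of march the covid numbers rose".split()
# 'covid' immediately followed by '23'.
adjacent = "the new covid 23 variant".split()

print(parts)                                 # ['covid', '23']
print(matches_all_terms(scattered, parts))   # True
print(matches_phrase(scattered, parts))      # False
print(matches_phrase(adjacent, parts))       # True
```

If this model is right, the first behaviour is what I get without
autoGeneratePhraseQueries, and the second is what the phrase query
would require, which is why it cannot work on a field indexed without
positions.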


A bit of context: I'm reviewing the disk usage of this Solr
installation before fully scaling it up to our requirements, and I
discovered that the majority of the disk (12G/22G for a core taken at
random) was used by `.pos` files. As I understand it (from outdated
documentation), these contain position data, which doesn't seem that
important/critical for our use given the limited subset of query
options used by Dovecot.

Let me know if you need more information about the configuration of the
cluster or the collection (e.g. solrconfig.xml, stopwords.txt or
synonyms.txt).

Thanks in advance,
Vincent

<?xml version="1.0" encoding="UTF-8"?>

<schema name="dovecot" version="2.0">
  <fieldType name="string" class="solr.StrField" omitNorms="true" sortMissingLast="true"/>
  <fieldType name="long" class="solr.LongPointField" positionIncrementGap="0"/>

  <fieldType name="text_basic" class="solr.TextField" omitTermFreqAndPositions="true" omitPositions="true">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" splitOnNumerics="1" catenateAll="1" catenateWords="1"/>
      <filter class="solr.FlattenGraphFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text" class="solr.TextField" omitTermFreqAndPositions="true" omitPositions="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" splitOnNumerics="1" catenateAll="1" catenateWords="1"/>
      <filter class="solr.FlattenGraphFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.FlattenGraphFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" splitOnNumerics="1" catenateAll="1" catenateWords="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="id" type="string" indexed="true" required="true" stored="true"/>
  <field name="uid" type="long" indexed="true" required="true" stored="true"/>
  <field name="box" type="string" indexed="true" required="true" stored="true"/>
  <field name="user" type="string" indexed="true" required="true" stored="true"/>

  <field name="hdr" type="text_basic" indexed="true" stored="false"/>
  <field name="body" type="text" indexed="true" stored="false"/>

  <field name="from" type="text_basic" indexed="true" stored="false"/>
  <field name="to" type="text_basic" indexed="true" stored="false"/>
  <field name="cc" type="text_basic" indexed="true" stored="false"/>
  <field name="bcc" type="text_basic" indexed="true" stored="false"/>
  <field name="subject" type="text" indexed="true" stored="false"/>

  <!-- Used by Solr internally: -->
  <field name="_version_" type="long" indexed="true" stored="true"/>

  <uniqueKey>id</uniqueKey>
</schema>

