Hi! TL;DR: I'm trying to understand exactly what `autoGeneratePhraseQueries` does and why it is triggering in my context.
I'm running a Solr 8.6 cluster, used as a Full Text Search backend for Dovecot. Queries are generated from IMAP searches as per https://github.com/dovecot/core/blob/master/src/plugins/fts-solr/fts-backend-solr.c and seem to use a very small subset of Solr's query capabilities. For example, searching for `covid19 update` triggers the following query/log on one of the servers:

```
webapp=/solr path=/select params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=16388&start=0&fsv=true&sort=uid+asc&fq=%2Bbox:<ID>+%2Buser:<USER>&shard.url=<SHARD_URL>&rows=5763&version=2&q={!lucene+q.op%3DAND}body:covid19+AND+body:update&omitHeader=false&NOW=1628151432177&isShard=true&wt=javabin} hits=21 status=0 QTime=62
```

What is weird/interesting to me is that this query, if `autoGeneratePhraseQueries` is enabled, seems to generate a phrase query for `(phrase=body:"covid 19")`, as I learned when setting `omitTermFreqAndPositions` and `omitPositions`. I'm guessing this is due to the analyser I have configured (I've attached the schema I'm using), but confirmation and recommendations would be appreciated!

If I understand things properly, searching for `covid19` with `autoGeneratePhraseQueries` enabled splits it into the phrase "covid 19". Without `autoGeneratePhraseQueries`, is it correct that my analyser configuration splits it into the terms `covid` and `19` and matches them independently? That's what I understood from searching for `covid23`, which seems to match text containing `covid` and `23` in any order and at any place...

Any recommendations to improve the analyser without enabling position data?
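To make the two behaviours I'm asking about concrete, here is a small Python sketch of my understanding. It is not Solr or Lucene code: the function names are hypothetical, the splitting is only a rough stand-in for `splitOnNumerics`, and it ignores the `catenate*` options, stemming, stop words and synonyms from my schema.

```python
import re


def split_on_numerics(token):
    """Rough stand-in for splitOnNumerics: split at letter/digit boundaries, lowercased."""
    return [part.lower() for part in re.findall(r"[A-Za-z]+|[0-9]+", token)]


def matches_as_phrase(doc_tokens, parts):
    """True if the parts occur consecutively and in order (this needs position data)."""
    n = len(parts)
    return any(doc_tokens[i:i + n] == parts for i in range(len(doc_tokens) - n + 1))


def matches_as_terms(doc_tokens, parts):
    """True if every part occurs somewhere in the document, in any order."""
    return all(part in doc_tokens for part in parts)


# Hypothetical documents, already tokenized and lowercased.
parts = split_on_numerics("covid23")
adjacent = ["the", "covid", "23", "report"]
scattered = ["23", "people", "discussed", "covid"]

phrase_hits = [matches_as_phrase(doc, parts) for doc in (adjacent, scattered)]
term_hits = [matches_as_terms(doc, parts) for doc in (adjacent, scattered)]
```

If this model is right, the phrase interpretation only matches the first document, while the independent-terms interpretation matches both, which is the behaviour I believe I'm seeing with `covid23`.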
A bit of context: I'm reviewing the disk usage of this Solr installation before fully scaling it up to our requirements, and I discovered that the majority of the disk space (12G/22G for a core taken at random) is used by `.pos` files. As I understand it (from outdated documentation), these contain position data, which doesn't seem critical to our use given the limited subset of the query options used by Dovecot.

Let me know if you need more information about the configuration of the cluster or the collection (e.g. solrconfig.xml, stopwords.txt or synonyms.txt).

Thanks in advance,
Vincent
<?xml version="1.0" encoding="UTF-8"?>
<schema name="dovecot" version="2.0">
  <fieldType name="string" class="solr.StrField" omitNorms="true" sortMissingLast="true"/>
  <fieldType name="long" class="solr.LongPointField" positionIncrementGap="0"/>
  <fieldType name="text_basic" class="solr.TextField" omitTermFreqAndPositions="true" omitPositions="true">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" splitOnNumerics="1" catenateAll="1" catenateWords="1"/>
      <filter class="solr.FlattenGraphFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text" class="solr.TextField" omitTermFreqAndPositions="true" omitPositions="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" splitOnNumerics="1" catenateAll="1" catenateWords="1"/>
      <filter class="solr.FlattenGraphFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.FlattenGraphFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" splitOnNumerics="1" catenateAll="1" catenateWords="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="id" type="string" indexed="true" required="true" stored="true"/>
  <field name="uid" type="long" indexed="true" required="true" stored="true"/>
  <field name="box" type="string" indexed="true" required="true" stored="true"/>
  <field name="user" type="string" indexed="true" required="true" stored="true"/>
  <field name="hdr" type="text_basic" indexed="true" stored="false"/>
  <field name="body" type="text" indexed="true" stored="false"/>
  <field name="from" type="text_basic" indexed="true" stored="false"/>
  <field name="to" type="text_basic" indexed="true" stored="false"/>
  <field name="cc" type="text_basic" indexed="true" stored="false"/>
  <field name="bcc" type="text_basic" indexed="true" stored="false"/>
  <field name="subject" type="text" indexed="true" stored="false"/>
  <!-- Used by Solr internally: -->
  <field name="_version_" type="long" indexed="true" stored="true"/>
  <uniqueKey>id</uniqueKey>
</schema>