The `body` field that you’re using here has the WordDelimiterGraphFilterFactory enabled, which is what’s splitting the term “covid19” into “covid 19”. This filter splits terms on various compound word delimiters, and one delimiter it uses is a transition from alpha to numeric characters (this is configurable: https://solr.apache.org/guide/8_6/filter-descriptions.html#word-delimiter-graph-filter).
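As a rough illustration of that one rule, here is a minimal Python sketch. It is not Solr's implementation: the function name and the `split_on_numerics` flag are made up for this example (the flag is only loosely modelled on the filter's splitOnNumerics option), and it ignores the filter's other behaviours such as case-change splitting, catenation and preserving the original token.

```python
import re
from typing import List


def split_token(token: str, split_on_numerics: bool = True) -> List[str]:
    """Hypothetical model of a single word-delimiter rule (not Solr code).

    Non-alphanumeric characters always act as delimiters. When
    split_on_numerics is True, a letter/digit transition also ends a part.
    """
    if split_on_numerics:
        # Runs of letters and runs of digits become separate parts.
        return re.findall(r"[A-Za-z]+|[0-9]+", token)
    # Letters and digits stay together; only other characters split.
    return re.findall(r"[A-Za-z0-9]+", token)


print(split_token("covid19"))
print(split_token("covid19", split_on_numerics=False))
```

With the flag on, “covid19” comes out as two parts (“covid” and “19”); with it off, it stays a single term.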
(Side note: the Analysis screen in the Admin UI is really good at showing what happens at every step of a field type’s analysis chain, so it will show you exactly where “covid19” becomes “covid 19”.)

The autoGeneratePhraseQueries parameter simply tells Solr to turn multi-term queries into a phrase query. I believe in this case it would not kick in, because the input term did not include spaces, but I’m not sure what happens with the 2nd term in the query (which looks like it is being defined as a separate fielded term query). The debug output for a query (add &debug=true) would show you the parsed query, and that might help. However, if you don’t want it to split on word delimiters, remove that filter from the query analysis chain or disable the splitting on numeric characters.

Positional data does consume a lot of space in an index, particularly for large fields, as “body” fields usually are. Positions should only be enabled on fields that need them to support features that require knowing the location of the terms in the document in order to work properly (highlighting comes to mind). Positions are indexed by default for text fields unless the schema sets omitPositions or omitTermFreqAndPositions, and it’s not clear whether they were enabled before and what you’ve sent us is an edited schema. If they were previously enabled and you now want to get rid of them, you need to reindex your data. Simply modifying the schema does nothing to change the data in the actual index (see also https://solr.apache.org/guide/8_6/reindexing.html).

Cassandra

On Aug 5, 2021, 3:33 AM -0500, Vincent Brillault <vincent.brilla...@cern.ch>, wrote:
> Hi!
>
> TL;DR: I'm trying to understand what `autoGeneratePhraseQueries` exactly
> does and why it's triggering in my context.
>
> I'm running a Solr 8.6 cluster, used as a Full Text Search backend for
> Dovecot.
> Queries are generated based on IMAP search as per
> https://github.com/dovecot/core/blob/master/src/plugins/fts-solr/fts-backend-solr.c
> and seems to be using a very small subset of the query capabilities of
> Solr. For example searching for `covid19 update` trigger the following
> query/log in one of the servers:
> ```
> webapp=/solr path=/select
> params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=16388&start=0&fsv=true&sort=uid+asc&fq=%2Bbox:<ID>+%2Buser:<USER>&shard.url=<SHARD_URL>&rows=5763&version=2&q={!lucene+q.op%3DAND}body:covid19+AND+body:update&omitHeader=false&NOW=1628151432177&isShard=true&wt=javabin}
> hits=21 status=0 QTime=62
> ```
>
> What is weird/interesting to me is that this query, if
> `autoGeneratePhraseQueries` is enabled, seems to generate a phrase query
> for `(phrase=body:"covid 19")`, as I learned when setting
> omitTermFreqAndPositions and omitPositions. I'm guessing this is due to
> the analyser I have configured (I've attached the schema I'm using), but
> confirmation & recommendations would be appreciated!
>
> If I understand things properly, search for `covid19` with
> autoGeneratePhraseQueries split that into the phrase "covid 19". Now
> without autoGeneratePhraseQueries, is it correct that my analyser
> configuration split it into the terms `covid` and `19` and match them
> independently? That's what I understood from looking for `covid23`,
> which seems to match text with `covid` and `23` in random
> order/places... Any recommendations to improve the analyser, without
> enabling position data?
>
> A bit of context: I'm reviewing the disk usage of this Solr
> installation, before fully scaling it up to the requirements and I
> discovered that the majority of the disk (12G/22G for a core taken at
> random) was used by `.pos` files, which I understand (from outdated
> documentation) contain position data that doesn't seem that
> important/critical to our use given the limited subject of the query
> options used by Dovecot.
>
> Let me know if you need more information about the configuration of the
> cluster or the collection (e.g. solrconfig.xml, stopwords.txt or
> synonyms.txt).
>
> Thanks in advance,
> Vincent