The `body` field that you’re using here has the WordDelimiterGraphFilterFactory 
enabled, which is what’s splitting the term “covid19” into “covid 19”. This 
filter splits terms on various compound word delimiters, and one delimiter it 
uses is a transition from alpha to numeric characters (this is configurable: 
https://solr.apache.org/guide/8_6/filter-descriptions.html#word-delimiter-graph-filter).
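
As a rough illustration of that one rule, here is a minimal Python sketch that splits a token wherever letters meet digits. It is not Solr's implementation: the function name `split_on_numerics` is made up for this example, and the real filter applies many more rules and options.

```python
import re
from typing import List

# Illustration only: mimics ONE rule of the word delimiter filter (splitting
# where letters meet digits). The real filter has many more rules and options.
_LETTER_OR_DIGIT_RUN = re.compile(r"[^\W\d_]+|\d+")


def split_on_numerics(token: str) -> List[str]:
    """Return the letter-only and digit-only runs of `token`, in order."""
    return _LETTER_OR_DIGIT_RUN.findall(token)


print(split_on_numerics("covid19"))  # ['covid', '19']
print(split_on_numerics("update"))   # ['update']
```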

(Side note: the Analysis screen in the Admin UI is really good at showing what 
happens at every step of a field type’s analysis chain, so it will show you 
exactly where “covid19” becomes “covid 19”.)

The autoGeneratePhraseQueries parameter tells Solr to build a phrase query 
whenever analysis turns a single whitespace-delimited piece of query text into 
more than one term. That fits what you’re seeing: “covid19” contains no spaces, 
but the word delimiter filter splits it into “covid” and “19”, and with the 
parameter enabled those become the phrase “covid 19”. I’m not sure what happens 
with the 2nd term in the query (which it looks like is being defined as a 
separate fielded term query). The debug output for a query (add &debug=true) 
would show you the parsed query and that might help.
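
A minimal Python sketch of requesting that debug output is below. The base URL and the collection name `dovecot` are assumptions for illustration, not taken from your setup, and the helper names are made up. The query string is the one from your log.

```python
from urllib.parse import urlencode


def build_debug_url(base_url, collection, query):
    """Build a /select URL that asks Solr to include debug output."""
    params = {"q": query, "fl": "id,score", "debug": "true", "wt": "json"}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def parsed_query(response):
    """Pull the parsed query out of a decoded JSON response, if present."""
    return response.get("debug", {}).get("parsedquery")


# Assumed host and collection name; replace with your own.
url = build_debug_url(
    "http://localhost:8983/solr",
    "dovecot",
    "{!lucene q.op=AND}body:covid19 AND body:update",
)
print(url)
```

Fetching that URL and passing the decoded JSON to `parsed_query` should show whether a phrase query was generated for `body`.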

However, if you don’t want it to split on word delimiters, remove that filter 
from the query analysis chain or disable the splitting on numeric characters.
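
To make the second option concrete, here is a small Python sketch that sets `splitOnNumerics="0"` on the word delimiter filter of a field type's query analyzer. The field type below is a hypothetical stand-in, not your attached schema, and the helper name is made up. `splitOnNumerics` is a documented option of the filter.

```python
import xml.etree.ElementTree as ET

# Hypothetical field type for illustration; your real schema will differ.
FIELD_TYPE = """
<fieldType name="text_example" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
"""


def disable_numeric_split(field_type_xml, analyzer_type="query"):
    """Return the field type XML with splitOnNumerics="0" on the word
    delimiter filter of the chosen analyzer ("query" or "index")."""
    root = ET.fromstring(field_type_xml.strip())
    for analyzer in root.findall("analyzer"):
        if analyzer.get("type") != analyzer_type:
            continue
        for flt in analyzer.findall("filter"):
            if flt.get("class", "").endswith("WordDelimiterGraphFilterFactory"):
                flt.set("splitOnNumerics", "0")
    return ET.tostring(root, encoding="unicode")


print(disable_numeric_split(FIELD_TYPE))
```

If the index analyzer keeps splitting, the indexed tokens are still “covid” and “19”, so the same change is probably needed there too, followed by a reindex.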

Positional data does consume a lot of space in an index, particularly with 
large fields, as “body” fields usually are. Positions should only be enabled 
on fields that need them to support features that depend on knowing the 
location of the terms in the document in order to work properly (highlighting 
comes to mind).

By default positions are enabled for text fields (omitPositions and 
omitTermFreqAndPositions default to false for them), so unless an earlier 
version of your schema already disabled them, your existing index contains 
them; it’s not clear whether what you’ve sent us is an edited schema. If they 
were previously enabled and you now want to get rid of them, you need to 
reindex your data. Simply modifying the schema does nothing to change the data 
in the actual index (see also 
https://solr.apache.org/guide/8_6/reindexing.html).

Cassandra
On Aug 5, 2021, 3:33 AM -0500, Vincent Brillault <vincent.brilla...@cern.ch>, 
wrote:
> Hi!
>
> TL;DR: I'm trying to understand what `autoGeneratePhraseQueries` exactly
> does and why it's triggering in my context.
>
>
> I'm running a Solr 8.6 cluster, used as a Full Text Search backend for
> Dovecot. Queries are generated based on IMAP search as per
> https://github.com/dovecot/core/blob/master/src/plugins/fts-solr/fts-backend-solr.c
> and seem to use a very small subset of the query capabilities of
> Solr. For example, searching for `covid19 update` triggers the following
> query/log in one of the servers:
> ```
> webapp=/solr path=/select
> params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=16388&start=0&fsv=true&sort=uid+asc&fq=%2Bbox:<ID>+%2Buser:<USER>&shard.url=<SHARD_URL>&rows=5763&version=2&q={!lucene+q.op%3DAND}body:covid19+AND+body:update&omitHeader=false&NOW=1628151432177&isShard=true&wt=javabin}
> hits=21 status=0 QTime=62
> ```
>
> What is weird/interesting to me is that this query, if
> `autoGeneratePhraseQueries` is enabled, seems to generate a phrase query
> for `(phrase=body:"covid 19")`, as I learned when setting
> omitTermFreqAndPositions and omitPositions. I'm guessing this is due to
> the analyser I have configured (I've attached the schema I'm using), but
> confirmation & recommendations would be appreciated!
>
> If I understand things properly, a search for `covid19` with
> autoGeneratePhraseQueries splits that into the phrase "covid 19". Now
> without autoGeneratePhraseQueries, is it correct that my analyser
> configuration splits it into the terms `covid` and `19` and matches them
> independently? That's what I understood from looking for `covid23`,
> which seems to match text with `covid` and `23` in random
> order/places... Any recommendations to improve the analyser, without
> enabling position data?
>
>
> A bit of context: I'm reviewing the disk usage of this Solr
> installation, before fully scaling it up to the requirements and I
> discovered that the majority of the disk (12G/22G for a core taken at
> random) was used by `.pos` files, which I understand (from outdated
> documentation) contain position data that doesn't seem that
> important/critical to our use given the limited subset of the query
> options used by Dovecot.
>
> Let me know if you need more information about the configuration of the
> cluster or the collection (e.g. solrconfig.xml, stopwords.txt or
> synonyms.txt).
>
> Thanks in advance,
> Vincent
