Dear Cassandra,

Thanks for the detailed explanations!

> The `body` field that you’re using here has the
> WordDelimiterGraphFilterFactory enabled, which is what’s splitting
> the term “covid19” into “covid 19”. This filter splits terms on
> various compound word delimiters, and one delimiter it uses is a
> transition from alpha to numeric characters (this is configurable:
> https://solr.apache.org/guide/8_6/filter-descriptions.html#word-delimiter-graph-filter).

Thanks, that makes sense.

>  (Side note, the Analysis screen in the Admin UI is really good at
> showing what happens on every step of a field type’s analysis chain,
> so will show you exactly where “covid19” becomes “covid 19”.)

Indeed! That's quite helpful for testing both the analysis and the
querying. I remember using it in the past, but I had forgotten about it...

> The autoGeneratePhraseQueries parameter simply tells Solr to turn
> multi-term queries in to a phrase query. I believe in this case it
> would not kick in, because the input term did not include spaces, but
> I’m not sure what happens with the 2nd term in the query (which it
> looks like is being defined as a separate fielded term query). The
> debug output for a query (add &debug=true) would show you the parsed
> query and that might help.

Yes, the 2nd term is already split by the underlying system (not sure if
it's the IMAP client or dovecot itself) as far as I understand. I'm more
and more surprised by the transformations of that system itself...

I ran
`/select?wt=xml&fl=uid,score&rows=9572&sort=uid+asc&q=%7b!lucene+q.op%3dAND%7dbody:covid19&fq=%2Bbox:XXXX+%2Buser:vbrillau&debug=true`
on the cluster with the autoGeneratePhraseQueries set to yes and the
answer contained `debug={rawquerystring={!lucene
q.op=AND}body:covid19,querystring={!lucene
q.op=AND}body:covid19,parsedquery=+(+(body:covid19
PhraseQuery(body:"covid 19"))),parsedquery_toString=+(+(body:covid19
body:"covid 19")),explain={......`. So something clearly triggered a
phrase query, which I understand as WordDelimiterGraphFilterFactory
producing both covid19 (pos 1) and "covid 19" (covid pos 1 & 19 pos 2),
with the latter transformed into a phrase query by
autoGeneratePhraseQueries?
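
To spell out my understanding of the token positions, here is a minimal
Python sketch. It is only a toy model of the alpha-to-numeric split, not
the actual Lucene implementation; `split_alnum`, `analyze` and
`keep_original` are names made up for this illustration.

```python
import re

def split_alnum(term):
    """Split a term at letter/digit transitions, e.g. 'covid19' -> ['covid', '19']."""
    return re.findall(r"[A-Za-z]+|[0-9]+", term)

def analyze(term, keep_original=True):
    """Return (token, position) pairs, with positions starting at 1.

    Toy model only: the original term (if it was split) shares position 1
    with the first sub-token, as described in the paragraph above.
    """
    parts = split_alnum(term)
    tokens = [(part, pos) for pos, part in enumerate(parts, start=1)]
    if keep_original and len(parts) > 1:
        tokens.insert(0, (term, 1))
    return tokens

print(analyze("covid19"))
# [('covid19', 1), ('covid', 1), ('19', 2)]
```

With two sub-tokens at consecutive positions, the query side has something
it can turn into the phrase `"covid 19"`, which would match the
parsedquery shown in the debug output.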

> However, if you don’t want it to split on word delimiters, remove
> that filter from the query analysis chain or disable the splitting on
> numeric characters.

Indeed. And now that I think about it, if I disable it from the query
analysis but keep it enabled in the index analysis:
- `covid19` will be indexed as `covid19`, `covid`, `19`.
- When searching, both `covid` and `covid19` will match it.

That sounds like the best option :)
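
A rough Python sketch of that asymmetric setup, assuming the index side
keeps the original term next to its parts and the query side does no
splitting at all. `index_terms` and `matches` are hypothetical helpers, and
real analysis chains do more (stemming, stop words, and so on).

```python
import re

def index_terms(text):
    """Index-time analysis (toy): keep each word and its letter/digit parts."""
    terms = set()
    for word in text.lower().split():
        terms.add(word)
        terms.update(re.findall(r"[a-z]+|[0-9]+", word))
    return terms

def matches(query_term, doc_terms):
    """Query-time analysis (toy): lowercase only, no splitting."""
    return query_term.lower() in doc_terms

doc = index_terms("New covid19 guidelines")
print(matches("covid19", doc))  # True
print(matches("covid", doc))    # True
print(matches("vaccine", doc))  # False
```

Because the query term is looked up as a single term, no multi-token
output is produced at query time, so there is nothing to turn into a
phrase query.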

> Positional data does consume a lot of space in an index, particularly
> with large fields like “body” fields usually are. They should only be
> used when necessary on fields that need them to support features that
> require knowing the location of the terms in the document in order to
> work properly (highlighting comes to mind).

Thanks for this confirmation. My main question is now if that's a
feature that's required for me or not, given that users don't interface
directly with Solr but through IMAP clients & dovecot, which include
their own transformations...

> By default positions are not enabled for text fields, but it’s not
> clear if they were enabled before and what you’ve sent us is an
> edited schema. 

Are you sure about this? The documentation
(https://solr.apache.org/guide/8_6/field-type-definitions-and-properties.html#field-default-properties)
says for omitTermFreqAndPositions: `This property defaults to true for
all field types that are not text fields.`

I initially only had `autoGeneratePhraseQueries="true"
positionIncrementGap="100"` for text_basic & text on our two clusters
(test & prod). I have now replaced it with `omitTermFreqAndPositions="true"
omitPositions="true"` in the test cluster where I'm testing it.

> If they were previously enabled and you now want to
> get rid of them, you need to reindex your data - simply modifying the
> schema does nothing to change the data in the actual index (see also
> https://solr.apache.org/guide/8_6/reindexing.html).

Yes, I saw. That's quite a painful operation. I ended up deleting &
re-creating the collection as even after deleting all documents as per
the documentation I was still getting `possible analysis error: cannot
change field "body" from index options=DOCS_AND_FREQS_AND_POSITIONS to
inconsistent index options=DOCS` somehow.


I'm now trying to understand the impact of IMAP clients and dovecot on
the queries, and it seems to be rather messy :/
- Searching for body set to `"covid 19"` became:
  - through Open-Xchange: `body:covid\+19`
  - through Thunderbird: `body:\"covid\+19\"`
- Searching for body set to `"covid vaccine"~20` became:
  - through Open-Xchange: `body:covid\+vaccine\~20`
  - through Thunderbird: `body:\"covid\+vaccine\"\~20`

Trying to reproduce the queries manually, I understand that none of them
manage to trigger the actual phrase query, which would need to have the
`"` without a `\` before... Well, that sounds wrong but that's helping
me get rid of positional data: nothing can use it if I disable
autoGeneratePhraseQueries :)
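
The check I did by hand can be sketched in a few lines of Python. This is
only a scan for a double quote that is not preceded by an escaping
backslash, not the Lucene query parser; `has_unescaped_quote` is a name
made up for this illustration.

```python
def has_unescaped_quote(query):
    """Return True if the query contains a '"' not escaped by a backslash."""
    i = 0
    while i < len(query):
        if query[i] == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if query[i] == '"':
            return True
        i += 1
    return False

# Queries as observed after the IMAP client and dovecot transformations.
observed = [
    r'body:covid\+19',
    r'body:\"covid\+19\"',
    r'body:covid\+vaccine\~20',
    r'body:\"covid\+vaccine\"\~20',
]
for q in observed:
    print(q, has_unescaped_quote(q))  # False for all four

print(has_unescaped_quote('body:"covid 19"'))  # True
```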

Thanks again for your help,
Vincent Brillault
