Dear Cassandra,

Thanks for the detailed explanations!
> The `body` field that you’re using here has the
> WordDelimiterGraphFilterFactory enabled, which is what’s splitting
> the term “covid19” into “covid 19”. This filter splits terms on
> various compound word delimiters, and one delimiter it uses is a
> transition from alpha to numeric characters (this is configurable:
> https://solr.apache.org/guide/8_6/filter-descriptions.html#word-delimiter-graph-filter).

Thanks, that makes sense.

> (Side note, the Analysis screen in the Admin UI is really good at
> showing what happens on every step of a field type’s analysis chain,
> so will show you exactly where “covid19” becomes “covid 19”.)

Indeed! That's quite helpful for testing both the analysis and the querying. I remember using it in the past, but I had forgotten about it...

> The autoGeneratePhraseQueries parameter simply tells Solr to turn
> multi-term queries in to a phrase query. I believe in this case it
> would not kick in, because the input term did not include spaces, but
> I’m not sure what happens with the 2nd term in the query (which it
> looks like is being defined as a separate fielded term query). The
> debug output for a query (add &debug=true) would show you the parsed
> query and that might help.

Yes, the 2nd term is already split by the underlying system (not sure if it's the IMAP client or dovecot itself) as far as I understand. I'm more and more surprised by the transformations of that system itself...

I ran `/select?wt=xml&fl=uid,score&rows=9572&sort=uid+asc&q=%7b!lucene+q.op%3dAND%7dbody:covid19&fq=%2Bbox:XXXX+%2Buser:vbrillau&debug=true` on the cluster with autoGeneratePhraseQueries set to yes, and the answer contained `debug={rawquerystring={!lucene q.op=AND}body:covid19,querystring={!lucene q.op=AND}body:covid19,parsedquery=+(+(body:covid19 PhraseQuery(body:"covid 19"))),parsedquery_toString=+(+(body:covid19 body:"covid 19")),explain={......`.
So something clearly triggered a phrase query, which I understand as WordDelimiterGraphFilterFactory producing both covid19 (pos 1) and "covid 19" (covid pos 1 & 19 pos 2), with the latter then transformed into a phrase query by autoGeneratePhraseQueries?

> However, if you don’t want it to split on word delimiters, remove
> that filter from the query analysis chain or disable the splitting on
> numeric characters.

Indeed. And now that I think about it, if I disable it in the query analysis but keep it enabled in the index analysis:
- `covid19` will be indexed as `covid19`, `covid`, `19`.
- When searching, both `covid` and `covid19` will match it.

That sounds like the best option :)

> Positional data does consume a lot of space in an index, particularly
> with large fields like “body” fields usually are. They should only be
> used when necessary on fields that need them to support features that
> require knowing the location of the terms in the document in order to
> work properly (highlighting comes to mind).

Thanks for this confirmation. My main question is now whether that's a feature I need or not, given that users don't interface directly with Solr but through IMAP clients & dovecot, which include their own transformations...

> By default positions are not enabled for text fields, but it’s not
> clear if they were enabled before and what you’ve sent us is an
> edited schema.

Are you sure about this? The documentation (https://solr.apache.org/guide/8_6/field-type-definitions-and-properties.html#field-default-properties) says for omitTermFreqAndPositions: `This property defaults to true for all field types that are not text fields.`

I initially only had `autoGeneratePhraseQueries="true" positionIncrementGap="100"` for text_basic & text on our two clusters (test & prod). I have now replaced it with `omitTermFreqAndPositions="true" omitPositions="true"` in the test cluster where I'm testing it.
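To convince myself of the index-only splitting idea above, here is a toy Python sketch. It does not call Solr at all: the function names (`split_alnum`, `index_terms`, `query_terms`, `matches`) are made up for illustration, the regex only imitates the alpha/numeric transition rule of the word delimiter filter, and it assumes the original token is kept alongside its parts and that the rest of the chain is just lowercasing and whitespace splitting.

```python
import re


def split_alnum(token):
    """Split a token into runs of letters and runs of digits ('covid19' -> ['covid', '19'])."""
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


def index_terms(text):
    """Index-time analysis (assumed): keep each original token and add its parts."""
    terms = set()
    for token in text.lower().split():
        terms.add(token)
        terms.update(split_alnum(token))
    return terms


def query_terms(query):
    """Query-time analysis without the delimiter filter: lowercase, whitespace split only."""
    return query.lower().split()


def matches(doc_text, query):
    """q.op=AND semantics without positions: every query term must be in the index."""
    indexed = index_terms(doc_text)
    return all(term in indexed for term in query_terms(query))


doc = "New covid19 guidelines"
print(sorted(index_terms(doc)))
print(matches(doc, "covid19"), matches(doc, "covid"), matches(doc, "covid 19"))
```

In this simplified model, a document that only contains `covid` would not match a search for `covid19`, since the query term is no longer split; whether that is acceptable depends on what the clients actually send.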

> If they were previously enabled and you now want to
> get rid of them, you need to reindex your data - simply modifying the
> schema does nothing to change the data in the actual index (see also
> https://solr.apache.org/guide/8_6/reindexing.html).

Yes, I saw. That's quite a painful operation. I ended up deleting & re-creating the collection, as even after deleting all documents as per the documentation I was still getting `possible analysis error: cannot change field "body" from index options=DOCS_AND_FREQS_AND_POSITIONS to inconsistent index options=DOCS` somehow.

I'm now trying to understand the impact of IMAP clients and dovecot on the queries, and it seems to be rather messy :/
- Searching for body set to `"covid 19"` became:
  - through Open-Xchange: `body:covid\+19`
  - through Thunderbird: `body:\"covid\+19\"`
- Searching for body set to `"covid vaccine"~20` became:
  - through Open-Xchange: `body:covid\+vaccine\~20`
  - through Thunderbird: `body:\"covid\+vaccine\"\~20`

Trying to reproduce the queries manually, I understand that none of them manages to trigger an actual phrase query, which would need the `"` without a `\` before it... Well, that sounds wrong, but it's helping me get rid of positional data: nothing can use it if I disable autoGeneratePhraseQueries :)

Thanks again for your help,
Vincent Brillault