Hi Michael:

Thanks for your detailed response, which is much appreciated. We’ll take a look 
at some of your suggestions.

We know that if we drop back to using the old WordDelimiterFilter we do not see 
this behaviour, but we would prefer not to use a deprecated class if possible.

It would be great if there were some parameter or configuration that would allow 
some kind of limit or threshold to be set to avoid this behaviour. We’re 
concerned that specifying a low “maxBooleanClauses” setting may adversely 
affect other parts of the query processing (although we’re not sure if that 
would be the case).

Cheers,

Francis.

--
Francis Crimmins | Senior Software Engineer | National Library of Australia
M: +61 0433 545 884 | E: fcrimm...@nla.gov.au | nla.gov.au<http://nla.gov.au/>

The National Library of Australia (NLA) acknowledges Australia’s First Nations 
Peoples – the First Australians – as the Traditional Owners and Custodians of 
this land and gives respect to the Elders – past and present – and through them 
to all Australian Aboriginal and Torres Strait Islander people.


From: Michael Gibney <mich...@michaelgibney.net>
Date: Tuesday, 22 June 2021 at 12:50 pm
To: users@solr.apache.org <users@solr.apache.org>
Subject: Re: Excessive query expansion when using WordDelimiterGraphFilter
Hi Francis,

I have indeed encountered this problem -- though as you've discovered
it's dependent on specific types of analysis chain config.

You should be able to avoid this behavior by constructing your
analysis chains such that they don't in practice create graph
TokenStreams. This advice applies especially at index time, where the
graph structure of the TokenStream is discarded anyway. But in a case
like yours (a relatively large
index, and queries that have proven to be problematic) I think your
intuition is correct that you could both:
1. lower the maxBooleanClauses limit as a failsafe, and also
2. mitigate negative functional consequences by modifying your
analysis chains to reduce or eliminate this exponential expansion in
the first place
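
For reference, step 1 is a one-line change in solrconfig.xml. The value
below is purely illustrative (the default is 1024) and would need tuning
against your real query load:

```xml
<!-- solrconfig.xml: failsafe cap on the number of clauses allowed in a
     single BooleanQuery. 256 is an example value, not a recommendation. -->
<query>
  <maxBooleanClauses>256</maxBooleanClauses>
</query>
```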

My recommendation would be to index into two separate fulltext fields:
one with wdgf configured to only split (i.e.,
"generate*Parts"/"splitOn", etc.), one with wdgf configured to only
concatenate (i.e., "preserveOriginal", "catenate*"). This should
prevent graph token streams from being created, and should thus
prevent the kind of exponential expansion you're seeing here.
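
A rough sketch of what those two field types might look like in the
schema -- the type names, tokenizer, and surrounding filters are
placeholders, and only the wdgf flags are the point here:

```xml
<!-- Hypothetical schema.xml fragment; adapt names/analyzers to your setup. -->
<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- split-only: generate parts, no catenation -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            preserveOriginal="0"/>
    <!-- required at index time whenever a graph filter is in the chain -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_catenate" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- catenate-only: join parts, emit no split sub-tokens -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1"
            preserveOriginal="1"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```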

If you really want to have your cake and eat it too (to the extent
that's possible at the moment), you could use two fields configured as
mentioned above for _phrase_ searching (i.e.,
`pf=fulltext_split fulltext_catenate`), and have a third field with
wdgf configured to split _and_ catenate (which I infer is the current
configuration?) and use that as the query field (i.e.,
`qf=fulltext_split_and_catenate`). The problem is really the implicit
phrase queries (`pf`) in this case; so in fact you might get passable
results by simply disabling `pf` altogether (setting it to empty) --
though that would be a very blunt instrument, so I wouldn't recommend
disabling `pf` unless absolutely necessary.
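
Wired into an edismax handler, that "have your cake" setup might look
roughly like this (the handler name and field names are illustrative):

```xml
<!-- solrconfig.xml: illustrative edismax defaults -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- query against the combined split-and-catenate field -->
    <str name="qf">fulltext_split_and_catenate</str>
    <!-- implicit phrase queries against the non-graph fields only -->
    <str name="pf">fulltext_split fulltext_catenate</str>
  </lst>
</requestHandler>
```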

The bigger picture here is also interesting (to me, at least!): there
were some initial steps towards more general application of true
"positional graph queries" (e.g., LUCENE-7638 [1]). But due to
problems fundamental to positional graph queries (well described in
LUCENE-7398 [2], though not unique to span queries), much of this
functionality has been effectively removed from the most common query
parsers (e.g., LUCENE-8477 [3], LUCENE-9207 [4]). The timing of your
question resonates with me, as I've recently been working to add
benchmarks that illustrate the performance impact of this "exponential
expansion" behavior (see the comments tacked onto the end of
LUCENE-9204 [5]).

Michael

[1] https://issues.apache.org/jira/browse/LUCENE-7638
[2] https://issues.apache.org/jira/browse/LUCENE-7398
[3] https://issues.apache.org/jira/browse/LUCENE-8477
[4] https://issues.apache.org/jira/browse/LUCENE-9207
[5] https://issues.apache.org/jira/browse/LUCENE-9204

ps- Some further relevant issues/blog posts:
https://issues.apache.org/jira/browse/LUCENE-4312

https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/
https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
https://michaelgibney.net/lucene/graph/


On Mon, Jun 21, 2021 at 8:50 PM Francis Crimmins <fcrimm...@nla.gov.au> wrote:
>
> Hi:
>
> We are currently upgrading our Solr instance from 5.1.0 to 8.7.0, with an 
> index of over 250 million documents. This powers the search at:
>
>     https://trove.nla.gov.au/
>
> We are using the WordDelimiterGraphFilter in the filter chain for queries:
>
>     
> https://solr.apache.org/guide/8_7/filter-descriptions.html#word-delimiter-graph-filter
>
> and have found that for some queries it generates a very large number of 
> clauses, which causes excessive CPU load on our cluster. In some cases we 
> have had to restart the cluster.
>
> For example, when we look at the “parsedquery” for the original user query:
>
> (McGillan OR McGillon OR McGillion OR McGillian OR McGillin OR M'Gillin OR 
> M'Gillan OR M'Gillon)
>
> We see it contains:
>
> (+((fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin mgillin mgillan 
> mgillon"~5 fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin mgillin 
> mgillan m gillon"~5 fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin 
> mgillin m gillan mgillon"~5 fulltext:"mcgillan mcgillon mcgillion mcgillian 
> mcgillin mgillin m gillan m gillon"~5 fulltext:"mcgillan mcgillon mcgillion 
> mcgillian mcgillin m gillin mgillan mgillon"~5 ….
>
> The above is an excerpt, with the actual query containing 512 of these 
> clauses for this field. Other fields also have the same expansion happening, 
> so the overall query may have a very large number of clauses being specified.
>
> From running through a debugger we can see that 
> QueryBuilder.analyzeGraphPhrase() is the method generating all the various 
> permutations. It has the comment:
>
> “Creates a boolean query from the graph token stream by extracting all the 
> finite strings from the graph and using them to create phrase queries with 
> the appropriate slop.”
>
> If we switch back to the WordDelimiterFilter (which is now deprecated), the 
> parsed query only gets the following added in:
>
> fulltext:"(mc mcgillan) gillan (mc mcgillon) gillon (mc mcgillion) gillion 
> (mc mcgillian) gillian (mc mcgillin) gillin (m mgillin) gillin (m mgillan) 
> gillan (m mgillon) gillon"~5)
>
> This does not generate the load seen in the other configuration.
>
> From what we can tell, the WordDelimiterGraphFilter, and the way its output 
> gets parsed, are working as expected/documented. We have come across the 
> following issue:
>
>     https://issues.apache.org/jira/browse/SOLR-13336
>
> but we’re not sure that setting:
>
>     
> https://solr.apache.org/guide/8_7/query-settings-in-solrconfig.html#maxbooleanclauses
>
> to a much lower number is the best solution here.
>
> We would like to find a way to avoid the excessive query expansion we are 
> seeing, and are wondering if anyone else has encountered this problem?
>
> Any advice or suggestions gratefully received.
>
> Thanks,
>
> Francis.
>
