Excessive query expansion when using WordDelimiterGraphFilter

Francis Crimmins Mon, 21 Jun 2021 17:50:43 -0700

Hi:

We are currently upgrading our Solr instance from 5.1.0 to 8.7.0, with an 
index of over 250 million documents. This powers the search at:


    https://trove.nla.gov.au/

We are using the WordDelimiterGraphFilter in the filter chain for queries:

    https://solr.apache.org/guide/8_7/filter-descriptions.html#word-delimiter-graph-filter

and have found that for some queries it generates a very large number of 
clauses, which causes excessive CPU load on our cluster. In some cases we have 
had to restart the cluster.

For example, when we look at the “parsedquery” for the original user query:

(McGillan OR McGillon OR McGillion OR McGillian OR McGillin OR M'Gillin OR 
M'Gillan OR M'Gillon)

We see it contains:

(+((fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin mgillin mgillan 
mgillon"~5 fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin mgillin 
mgillan m gillon"~5 fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin 
mgillin m gillan mgillon"~5 fulltext:"mcgillan mcgillon mcgillion mcgillian 
mcgillin mgillin m gillan m gillon"~5 fulltext:"mcgillan mcgillon mcgillion 
mcgillian mcgillin m gillin mgillan mgillon"~5 ….

The above is an excerpt, with the actual query containing 512 of these clauses 
for this field. Other fields also have the same expansion happening, so the 
overall query may have a very large number of clauses being specified.

From running through a debugger we can see that 
QueryBuilder.analyzeGraphPhrase() is the method generating all the various 
permutations. It has the comment:

“Creates a boolean query from the graph token stream by extracting all the 
finite strings from the graph and using them to create phrase queries with the 
appropriate slop.”
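
To illustrate why this grows so quickly, here is a minimal sketch in Python (not Lucene's actual implementation) of what "extracting all the finite strings" amounts to. It assumes a simplified graph in which each original word offers a small set of alternative token sequences, for example catenated versus split. The names finite_strings and phrase_clauses and the sample alternatives are hypothetical, chosen only for illustration.

```python
from itertools import product


def finite_strings(alternatives):
    """Enumerate every path through a simplified token graph.

    `alternatives` holds one entry per original word; each entry is a
    list of alternative token sequences for that word (an assumption
    of this sketch, not Lucene's real graph representation).
    """
    for choice in product(*alternatives):
        # Flatten the sequence chosen for each word into one token list.
        yield [token for seq in choice for token in seq]


def phrase_clauses(field, alternatives, slop):
    """Build one sloppy phrase clause per path, as plain strings."""
    clauses = []
    for tokens in finite_strings(alternatives):
        phrase = " ".join(tokens)
        clauses.append(f'{field}:"{phrase}"~{slop}')
    return clauses


# Hypothetical analysis output: one word with a single form and two
# words that can each be emitted catenated or split.
alternatives = [
    [["mcgillan"]],
    [["mgillan"], ["m", "gillan"]],
    [["mgillon"], ["m", "gillon"]],
]

clauses = phrase_clauses("fulltext", alternatives, 5)
for clause in clauses:
    print(clause)
print(len(clauses))  # 2 * 2 = 4 clauses
```

In this simplified model the number of clauses is the product of the number of alternatives per word, so k words with two alternatives each give 2^k phrase queries per field.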

If we switch back to the WordDelimiterFilter (which is now deprecated), the 
parsed query only gets the following added in:

fulltext:"(mc mcgillan) gillan (mc mcgillon) gillon (mc mcgillion) gillion (mc 
mcgillian) gillian (mc mcgillin) gillin (m mgillin) gillin (m mgillan) gillan 
(m mgillon) gillon"~5)

This does not generate the load seen in the other configuration.

From what we can tell, the WordDelimiterGraphFilter and the way its output 
gets parsed are working as expected/documented. We have come across the 
following issue:

    https://issues.apache.org/jira/browse/SOLR-13336

but we’re not sure that setting:

    https://solr.apache.org/guide/8_7/query-settings-in-solrconfig.html#maxbooleanclauses

to a much lower number is the best solution here.

We would like to find a way to avoid the excessive query expansion we are 
seeing, and are wondering whether anyone else has encountered this problem.

Any advice or suggestions gratefully received.

Thanks,

Francis.

--
Francis Crimmins | Senior Software Engineer | National Library of Australia
M: +61 0433 545 884 | E: fcrimm...@nla.gov.au | nla.gov.au<http://nla.gov.au/>

The National Library of Australia (NLA) acknowledges Australia’s First Nations 
Peoples – the First Australians – as the Traditional Owners and Custodians of 
this land and gives respect to the Elders – past and present – and through them 
to all Australian Aboriginal and Torres Strait Islander people.
