Hi:
We are currently upgrading our Solr instance from 5.1.0 to 8.7.0, with an
index of over 250 million documents. This powers the search at:
https://trove.nla.gov.au/
We are using the WordDelimiterGraphFilter in the filter chain for queries:
https://solr.apache.org/guide/8_7/filter-descriptions.html#word-delimiter-graph-filter
and have found that for some queries it generates a very large number of
clauses, which causes excessive CPU load on our cluster. In some cases we have
had to restart the cluster.
For example, when we look at the “parsedquery” for the original user query:
(McGillan OR McGillon OR McGillion OR McGillian OR McGillin OR M'Gillin OR
M'Gillan OR M'Gillon)
We see it contains:
(+((fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin mgillin mgillan
mgillon"~5 fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin mgillin
mgillan m gillon"~5 fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin
mgillin m gillan mgillon"~5 fulltext:"mcgillan mcgillon mcgillion mcgillian
mcgillin mgillin m gillan m gillon"~5 fulltext:"mcgillan mcgillon mcgillion
mcgillian mcgillin m gillin mgillan mgillon"~5 ….
The above is an excerpt; the actual query contains 512 of these clauses for
this field. The same expansion happens for other fields, so the overall query
can end up with a very large number of clauses.
From running through a debugger we can see that
QueryBuilder.analyzeGraphPhrase() is the method generating all the various
permutations. It has the comment:
“Creates a boolean query from the graph token stream by extracting all the
finite strings from the graph and using them to create phrase queries with the
appropriate slop.”
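
To illustrate why the clause count grows so quickly, here is a minimal Python
sketch of the "extract all finite strings" idea. It is not the Lucene
implementation: the finite_strings function and the simplified graph (one list
of alternative token sequences per original term) are assumptions made for
illustration only.

```python
from itertools import product

def finite_strings(alternatives):
    """Return every token path through a simplified token graph.

    `alternatives` has one entry per original term; each entry lists the
    token sequences the analyzer could emit for that term (hypothetical model).
    """
    paths = []
    for choice in product(*alternatives):
        # Flatten the sequence chosen for each term into one phrase string.
        paths.append(" ".join(tok for seq in choice for tok in seq))
    return paths

# Assumed analysis of three of the terms from the query above:
# each term is either kept whole or split in two.
graph = [
    [("mgillin",), ("m", "gillin")],
    [("mgillan",), ("m", "gillan")],
    [("mgillon",), ("m", "gillon")],
]
phrases = finite_strings(graph)
print(len(phrases))  # 8 phrases for 3 two-way choices
```

Under this model every two-way choice doubles the number of phrases, so n such
choices give 2**n phrase queries. The 512 clauses reported above would be
consistent with nine two-way choices (2**9 = 512), although that is an
inference and not something we have confirmed from the token graph itself.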
If we switch back to the WordDelimiterFilter (which is now deprecated), the
parsed query only gets the following added in:
fulltext:"(mc mcgillan) gillan (mc mcgillon) gillon (mc mcgillion) gillion (mc
mcgillian) gillian (mc mcgillin) gillin (m mgillin) gillin (m mgillan) gillan
(m mgillon) gillon"~5)
This does not generate the load seen in the other configuration.
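
For comparison, the shape of that single phrase can be reproduced with a short
Python sketch. The stacked_phrase helper and the (prefix, rest) pairs are
hypothetical and only mimic the parsed query text shown above; they do not
call any Solr or Lucene API.

```python
def stacked_phrase(parts):
    """Build one phrase where each term contributes stacked tokens.

    Each (prefix, rest) pair yields two tokens at the first position
    (the prefix and the catenated word) followed by the remainder.
    """
    groups = []
    for prefix, rest in parts:
        groups.append(f"({prefix} {prefix}{rest}) {rest}")
    return " ".join(groups)

# The eight terms from the original query, split as in the parsed query above.
parts = [
    ("mc", "gillan"), ("mc", "gillon"), ("mc", "gillion"),
    ("mc", "gillian"), ("mc", "gillin"),
    ("m", "gillin"), ("m", "gillan"), ("m", "gillon"),
]
phrase = stacked_phrase(parts)
token_count = 3 * len(parts)  # three tokens per term, all in one clause
print(phrase)
print(token_count)
```

In this model the work grows linearly with the number of terms (three tokens
per term in a single clause), whereas enumerating every path through the graph
grows multiplicatively, which matches the difference in load we observe.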
From what we can tell the WordDelimiterGraphFilter and how its output gets
parsed is working as expected/documented. We have come across the following
issue:
https://issues.apache.org/jira/browse/SOLR-13336
but we’re not sure that setting:
https://solr.apache.org/guide/8_7/query-settings-in-solrconfig.html#maxbooleanclauses
to a much lower number is the best solution here.
We would like to find a way to avoid the excessive query expansion we are
seeing. Has anyone else encountered this problem?
Any advice or suggestions gratefully received.
Thanks,
Francis.
--
Francis Crimmins | Senior Software Engineer | National Library of Australia
M: +61 0433 545 884 | E: [email protected] | nla.gov.au