Hi,

We are currently upgrading our Solr instance from 5.1.0 to 8.7.0, with an index of over 250 million documents. This powers the search at:

https://trove.nla.gov.au/

We are using the WordDelimiterGraphFilter in the filter chain for queries:

https://solr.apache.org/guide/8_7/filter-descriptions.html#word-delimiter-graph-filter

and have found that for some queries it generates a very large number of clauses, which causes excessive CPU load on our cluster. In some cases we have had to restart the cluster.

For example, when we look at the “parsedquery” for the original user query:

    (McGillan OR McGillon OR McGillion OR McGillian OR McGillin OR M'Gillin OR M'Gillan OR M'Gillon)

we see it contains:

    (+((fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin mgillin mgillan mgillon"~5
    fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin mgillin mgillan m gillon"~5
    fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin mgillin m gillan mgillon"~5
    fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin mgillin m gillan m gillon"~5
    fulltext:"mcgillan mcgillon mcgillion mcgillian mcgillin m gillin mgillan mgillon"~5
    ….

The above is an excerpt; the actual query contains 512 of these clauses for this field. Other fields undergo the same expansion, so the overall query may contain a very large number of clauses.

From running through a debugger we can see that QueryBuilder.analyzeGraphPhrase() is the method generating all the various permutations. It has the comment:

    “Creates a boolean query from the graph token stream by extracting all the finite strings from the graph and using them to create phrase queries with the appropriate slop.”

If we switch back to the WordDelimiterFilter (which is now deprecated), the parsed query only gets the following added in:

    fulltext:"(mc mcgillan) gillan (mc mcgillon) gillon (mc mcgillion) gillion (mc mcgillian) gillian (mc mcgillin) gillin (m mgillin) gillin (m mgillan) gillan (m mgillon) gillon"~5)

This does not generate the load seen in the other configuration.

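As a rough illustration of why extracting "all the finite strings" grows so quickly, here is a minimal Python sketch. It is a simplified model, not Lucene's actual implementation: the helper names (`finite_strings`, `clause_count`) and the per-term alternatives are assumptions made for the example, and only the three M' terms are given two tokenizations, mirroring the excerpt above. How the real count of 512 arises has not been confirmed; the sketch only shows that nine independent two-way choices would be enough to reach it.

```python
from itertools import product
from math import prod

def finite_strings(alternatives):
    """Enumerate every path through a list of per-term alternatives.

    Each element of `alternatives` lists the tokenizations one query term
    may take. One phrase is produced per combination, so the total is the
    product of the list lengths.
    """
    return [" ".join(path) for path in product(*alternatives)]

def clause_count(choices_per_term):
    """Number of phrases produced without enumerating them."""
    return prod(choices_per_term)

# Hypothetical alternatives: only the M' terms are split two ways here.
terms = [
    ["mcgillan"], ["mcgillon"], ["mcgillion"], ["mcgillian"], ["mcgillin"],
    ["mgillin", "m gillin"],
    ["mgillan", "m gillan"],
    ["mgillon", "m gillon"],
]

phrases = finite_strings(terms)
for phrase in phrases[:3]:
    print(phrase)

print(len(phrases))             # three two-way choices
print(clause_count([2] * 9))    # nine two-way choices
```

Every additional term with two tokenizations doubles the number of phrase clauses, whereas the WordDelimiterFilter output shown above stays a single phrase with stacked tokens.
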
From what we can tell, the WordDelimiterGraphFilter and the way its output gets parsed are working as expected/documented. We have come across the following issue:

https://issues.apache.org/jira/browse/SOLR-13336

but we’re not sure that setting:

https://solr.apache.org/guide/8_7/query-settings-in-solrconfig.html#maxbooleanclauses

to a much lower number is the best solution here.

We would like to find a way to avoid the excessive query expansion we are seeing, and are wondering if anyone else has encountered this problem.

Any advice or suggestions gratefully received.

Thanks,

Francis.

--
Francis Crimmins | Senior Software Engineer | National Library of Australia
M: +61 0433 545 884 | E: fcrimm...@nla.gov.au | nla.gov.au

The National Library of Australia (NLA) acknowledges Australia’s First Nations Peoples – the First Australians – as the Traditional Owners and Custodians of this land and gives respect to the Elders – past and present – and through them to all Australian Aboriginal and Torres Strait Islander people.