[
https://issues.apache.org/jira/browse/SOLR-11769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley updated SOLR-11769:
--------------------------------
Attachment: SOLR-11769_Optimize_MatchAllDocsQuery_more.patch
I couldn't help but dig deeper here today to figure out why there would
be a slow-down even without the above change. The result is a broader
improvement for match-all-docs (*:*) scenarios that is not specific to the
particular useFilterForSortedQuery situation above. Essentially,
SolrIndexSearcher.getProcessedFilter would return an empty pf if there are no
queries or filters (semantically, that means match-all-docs). But there is the
"answer" field that could be populated with getLiveDocs, and pf.answer is
examined by getDocSet so it can return early. I also optimized
DocSetUtil.createDocSet to check that the query arg is a MatchAllDocsQuery and
that the live docs are "instantiated", allowing us to return them directly.
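To illustrate the early-return idea, here is a minimal sketch in plain Java.
It is not the patch and uses none of Solr's real classes: BitSet stands in
for DocSet, and ProcessedFilterSketch, ProcessedFilter and collectCalls are
hypothetical names made up for this example. It assumes an index with no
deleted documents.

```java
import java.util.BitSet;
import java.util.List;

/** Simplified stand-in model; these are NOT Solr's actual classes or signatures. */
class ProcessedFilterSketch {

    /** Hypothetical analogue of "pf": answer is set when the result is already known. */
    static final class ProcessedFilter {
        BitSet answer;         // non-null means the caller can return it directly
        List<String> filters;  // otherwise, the filters that still need evaluating
    }

    private final int maxDoc;
    private BitSet liveDocs;   // lazily instantiated, then reused
    int collectCalls = 0;      // counts full collection passes (illustration only)

    ProcessedFilterSketch(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    BitSet getLiveDocs() {
        if (liveDocs == null) {
            liveDocs = new BitSet(maxDoc);
            liveDocs.set(0, maxDoc); // assumption: no deletions, so every doc is live
        }
        return liveDocs;
    }

    ProcessedFilter getProcessedFilter(List<String> queries) {
        ProcessedFilter pf = new ProcessedFilter();
        if (queries == null || queries.isEmpty()) {
            // No queries or filters means match-all-docs: the answer is the live docs.
            pf.answer = getLiveDocs();
            return pf;
        }
        pf.filters = queries;
        return pf;
    }

    BitSet getDocSet(List<String> queries) {
        ProcessedFilter pf = getProcessedFilter(queries);
        if (pf.answer != null) {
            return pf.answer; // early return: nothing is collected
        }
        collectCalls++;
        BitSet result = new BitSet(maxDoc);
        // Placeholder for real collection: pretend the filters match even doc ids.
        for (int doc = 0; doc < maxDoc; doc += 2) {
            result.set(doc);
        }
        return result;
    }

    public static void main(String[] args) {
        ProcessedFilterSketch searcher = new ProcessedFilterSketch(8);
        BitSet all = searcher.getDocSet(null);
        System.out.println(all.cardinality() + " docs, collection passes: "
                + searcher.collectCalls);
    }
}
```

In this model a null or empty list never triggers a collection pass, which
is the behaviour the "answer" field is meant to enable.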
The only quirky thing about this was a test failure I fixed in
TestSolrQueryParser, which checked the filter cache insert delta after
executing a query. The additional call to getLiveDocs made by
getProcessedFilter in this patch put an entry in the cache and incremented the
counter an additional time. An assumption I make in the getProcessedFilter
change is that returning getLiveDocs is either cheap or a foregone conclusion:
it will ultimately be instantiated at some point anyway, so we might as well
get on with it. Alternatively, its caller could instead check for this case
(e.g. if filter == null && postFilter == null then return getLiveDocs)? But
that seems less clean since "answer" is there for a reason, so why avoid it?
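The cache side effect behind that test failure can be modelled roughly as
follows. This is only a sketch under assumptions: LiveDocsCacheSketch, its
"inserts" counter, the "patched" flag and the HashMap standing in for the
filter cache are all hypothetical and do not reflect how Solr's cache or
TestSolrQueryParser are actually implemented.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Toy model of a filter cache with an insert counter; NOT Solr's real cache. */
class LiveDocsCacheSketch {
    private static final String MATCH_ALL_KEY = "*:*";

    private final Map<String, BitSet> filterCache = new HashMap<>();
    private final int maxDoc;
    long inserts = 0; // hypothetical analogue of the cache's "inserts" statistic

    LiveDocsCacheSketch(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    private BitSet cachedOrCompute(String key, boolean matchAll) {
        BitSet docs = filterCache.get(key);
        if (docs == null) {
            docs = new BitSet(maxDoc);
            if (matchAll) {
                docs.set(0, maxDoc); // assumption: no deletions, all docs live
            } else {
                docs.set(0);         // placeholder result: a single matching doc
            }
            filterCache.put(key, docs);
            inserts++;               // every first-time computation is one insert
        }
        return docs;
    }

    BitSet getLiveDocs() {
        return cachedOrCompute(MATCH_ALL_KEY, true);
    }

    /** patched == true models the extra getLiveDocs call described above. */
    BitSet executeQuery(String query, boolean patched) {
        if (patched) {
            getLiveDocs();
        }
        return cachedOrCompute(query, false);
    }

    public static void main(String[] args) {
        LiveDocsCacheSketch unpatched = new LiveDocsCacheSketch(8);
        unpatched.executeQuery("foo:1", false);
        LiveDocsCacheSketch patched = new LiveDocsCacheSketch(8);
        patched.executeQuery("foo:1", true);
        System.out.println("insert delta unpatched=" + unpatched.inserts
                + ", patched=" + patched.inserts);
    }
}
```

In this model the extra insert happens only the first time; once the live
docs are cached, later queries see the same insert delta as before.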
[[email protected]] can you please review this? It intersects with
modifications you've done in the past.
As an aside, I think it would be good if more DocSet methods in
SolrIndexSearcher moved over to DocSetUtil so that we can keep the unwieldy
SolrIndexSearcher tamed. Essentially, I suggest a change similar to the one I
made for the SolrDocumentFetcher refactoring, but for DocSets. With such a
change, DocSetUtil would have access to the searcher and would not need it
passed into its methods.
> Sorting performance degrades when useFilterForSortedQuery is enabled and
> there is no filter query specified
> -----------------------------------------------------------------------------------------------------------
>
> Key: SOLR-11769
> URL: https://issues.apache.org/jira/browse/SOLR-11769
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: search
> Affects Versions: 4.10.4
> Environment: OS: macOS Sierra (version 10.12.4)
> Memory: 16GB
> CPU: 2.9 GHz Intel Core i7
> Java Version: 1.8
> Reporter: Betim Deva
> Assignee: David Smiley
> Labels: performance
> Attachments: SOLR-11769_Optimize_MatchAllDocsQuery_more.patch
>
>
> The performance of sorting degrades significantly when
> {{useFilterForSortedQuery}} is enabled and there's no filter query specified.
> *Steps to Reproduce:*
> 1. Set {{useFilterForSortedQuery=true}} in {{solrconfig.xml}}
> 2. Run a query to match and return a single document. Also add sorting
> - Example {{/select?q=foo:123&sort=bar+desc}}
> With a large index (> 10 million documents), this yields a slow response
> (a few hundred milliseconds on average) even when the result set
> consists of a single document.
> *Observation 1:*
> - Disabling {{useFilterForSortedQuery}} improves the performance to < 1ms
> *Observation 2:*
> - Removing the {{sort}} improves the performance to < 1ms
> *Observation 3:*
> - Keeping the {{sort}}, and adding any filter query (such as {{fq=\*:\*}})
> improves the performance to < 1 ms.
> After profiling
> [SolrIndexSearcher.java|https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=blob;f=solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java;h=9ee5199bdf7511c70f2cc616c123292c97d36b5b;hb=HEAD#l1400]
> I found that the bottleneck is at
> {{DocSet bigFilt = getDocSet(cmd.getFilterList());}}
> when {{cmd.getFilterList()}} is passed in as {{null}}. This makes the
> {{getDocSet()}} function collect document ids every single time it is called
> without any caching.
> {code:java}
> 1394 if (useFilterCache) {
> 1395 // now actually use the filter cache.
> 1396 // for large filters that match few documents, this may be
> 1397 // slower than simply re-executing the query.
> 1398 if (out.docSet == null) {
> 1399 out.docSet = getDocSet(cmd.getQuery(), cmd.getFilter());
> 1400 DocSet bigFilt = getDocSet(cmd.getFilterList());
> 1401 if (bigFilt != null) out.docSet = out.docSet.intersection(bigFilt);
> 1402 }
> 1403 // todo: there could be a sortDocSet that could take a list of
> 1404 // the filters instead of anding them first...
> 1405 // perhaps there should be a multi-docset-iterator
> 1406 sortDocSet(qr, cmd);
> 1407 }
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)