[
https://issues.apache.org/jira/browse/SOLR-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258554#comment-14258554
]
Shalin Shekhar Mangar commented on SOLR-6810:
---------------------------------------------
bq. The main idea seems to be: you don't need IDs to merge the top docs from
each shard. Correct?
Yes, exactly.
bq. I'm still not quite grokking it though... do you understand it well enough
to give a high level description for those who know Solr but who haven't looked
at the patch?
The idea is to:
# Get the scores of the top N docs from each shard in the first pass (say
rows=3, and shard1 returns scores 0.8, 0.5, 0.3 while shard2 returns 0.9, 0.6,
0.1).
# Merge them to find the top N scores (0.9, 0.8, 0.6) and track how many
results each shard contributes to the top N (shard1 has 1 doc in the top 3 and
shard2 has 2 docs in the top 3).
# Get the corresponding docs (id and all return fields) from each shard in the
second pass (retrieve the top 1 doc from shard1 and the top 2 docs from shard2).
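The merge in step 2 can be sketched roughly as follows. This is only an illustration in Python, not code from the patch: the function name plan_second_pass and the dict-of-lists input are made up for the example, and ties between equal scores are simply kept in input order, which may differ from what the patch does.

```python
import heapq
from collections import Counter


def plan_second_pass(shard_scores, rows):
    """Merge per-shard top scores and count how many docs each shard contributes.

    shard_scores: dict mapping a shard name to the scores returned by that
    shard in the first pass, sorted in descending order (hypothetical input
    shape, chosen for this sketch only).
    rows: the number of results requested by the client.
    Returns (merged top scores, {shard: number of docs to fetch in pass two}).
    """
    # Tag every score with the shard it came from; no document ids are needed.
    tagged = (
        (score, shard)
        for shard, scores in shard_scores.items()
        for score in scores[:rows]
    )
    # Keep only the global top `rows` entries, highest score first.
    top = heapq.nlargest(rows, tagged, key=lambda entry: entry[0])
    # Count how many of the global top entries belong to each shard.
    counts = Counter(shard for _, shard in top)
    per_shard = {shard: counts.get(shard, 0) for shard in shard_scores}
    return [score for score, _ in top], per_shard


scores, per_shard = plan_second_pass(
    {"shard1": [0.8, 0.5, 0.3], "shard2": [0.9, 0.6, 0.1]}, rows=3
)
print(scores)     # [0.9, 0.8, 0.6]
print(per_shard)  # {'shard1': 1, 'shard2': 2}
```

With the counts in hand, the second pass only has to ask shard1 for its top 1 doc and shard2 for its top 2 docs, so stored fields are read for at most rows documents overall.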
bq. As in... what's the high level description of what this patch implements?
The patch implements this algorithm and makes it configurable through a new
'dqa' parameter. There are some refactorings in ShardParams and ResponseBuilder
to make this work. The tests are randomized so that all Solr tests switch
between the new and old algorithms. The patch also adds wrapper classes for
SolrCore, SolrIndexSearcher and LeafReader, which are used only during tests to
assert things like the number of shard requests, the number of stored field
accesses, etc.
bq. Also, does this patch also improve things if docValues are used for the ID
field?
No.
> Faster searching limited but high rows across many shards all with many hits
> ----------------------------------------------------------------------------
>
> Key: SOLR-6810
> URL: https://issues.apache.org/jira/browse/SOLR-6810
> Project: Solr
> Issue Type: Improvement
> Components: search
> Reporter: Per Steffensen
> Assignee: Shalin Shekhar Mangar
> Labels: distributed_search, performance
> Attachments: branch_5x_rev1642874.patch, branch_5x_rev1642874.patch,
> branch_5x_rev1645549.patch
>
>
> Searching "limited but high rows across many shards all with many hits" is
> slow
> E.g.
> * Query from outside client: q=something&rows=1000
> * Resulting in sub-requests to each shard, something like this
> ** 1) q=something&rows=1000&fl=id,score
> ** 2) Request the full documents with ids in the global-top-1000 found among
> the top-1000 from each shard
> What the subject means:
> * "limited but high rows" means 1000 in the example above
> * "many shards" means 200-1000 in our case
> * "all with many hits" means that each of the shards has a significant
> number of hits on the query
> The problem grows with all three factors above.
> Doing such a query on our system takes between 5 minutes and 1 hour,
> depending on a lot of things. It ought to be much faster, so let's make it
> faster.
> Profiling shows that the problem is the time it takes to access the store to
> get ids for (up to) 1000 docs (the value of the rows parameter) per shard.
> With 1000 shards, that is up to 1 million ids that have to be fetched. There
> is really no good reason to ever read information from the store for more
> than the overall top-1000 documents that have to be returned to the client.
> For further detail see mail-thread "Slow searching limited but high rows
> across many shards all with high hits" started 13/11-2014 on
> [email protected]