[
https://issues.apache.org/jira/browse/SOLR-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259447#comment-14259447
]
Per Steffensen commented on SOLR-6810:
--------------------------------------
TestDistributedQueryAlgorithm.testDocReads shows exactly how the number of
store accesses is reduced:
{code}
// Test the number of documents read from store using FIND_RELEVANCE_FIND_IDS_LIMITED_ROWS_FETCH_BY_IDS
// vs FIND_ID_RELEVANCE_FETCH_BY_IDS. This demonstrates the advantage of
// FIND_RELEVANCE_FIND_IDS_LIMITED_ROWS_FETCH_BY_IDS over FIND_ID_RELEVANCE_FETCH_BY_IDS (and vice versa)
private void testDocReads() throws Exception {
  for (int startValue = 0; startValue <= MAX_START; startValue++) {
    // FIND_RELEVANCE_FIND_IDS_LIMITED_ROWS_FETCH_BY_IDS (assuming skipGetIds used - default)
    // Only reads data (required fields) from store for "rows + (#shards * start)" documents across all shards
    // This can be optimized to become only "rows"
    // Only reads the data once
    testDQADocReads(ShardParams.DQA.FIND_RELEVANCE_FIND_IDS_LIMITED_ROWS_FETCH_BY_IDS, startValue, ROWS,
        ROWS + (startValue * jettys.size()), ROWS + (startValue * jettys.size()));

    // DQA.FIND_ID_RELEVANCE_FETCH_BY_IDS (assuming skipGetIds not used - default)
    // Reads data (ids only) from store for "(rows + startValue) * #shards" documents for each shard
    // Besides that reads data (required fields) for "rows" documents across all shards
    testDQADocReads(ShardParams.DQA.FIND_ID_RELEVANCE_FETCH_BY_IDS, startValue, ROWS,
        (ROWS + startValue) * jettys.size(), ROWS + ((ROWS + startValue) * jettys.size()));
  }
}
{code}
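To put numbers on the expressions in the test above, here is a minimal standalone sketch. It is not part of the patch: the class and method names are hypothetical, and it assumes the last argument of each testDQADocReads call is the total number of documents read from store.

```java
// Hypothetical helper, not part of the SOLR-6810 patch. It only evaluates
// the doc-read expressions that appear in testDocReads.
class DocReadEstimate {

    // FIND_RELEVANCE_FIND_IDS_LIMITED_ROWS_FETCH_BY_IDS:
    // "rows + (#shards * start)" documents, read once.
    static long limitedRowsReads(int rows, int start, int shards) {
        return (long) rows + (long) start * shards;
    }

    // FIND_ID_RELEVANCE_FETCH_BY_IDS:
    // ids for "(rows + start) * #shards" documents,
    // plus required fields for "rows" documents.
    static long fetchByIdsReads(int rows, int start, int shards) {
        return (long) rows + ((long) rows + start) * shards;
    }

    public static void main(String[] args) {
        // Numbers from the issue description: rows=1000, start=0, 1000 shards.
        int rows = 1000, start = 0, shards = 1000;
        System.out.println("limited rows: " + limitedRowsReads(rows, start, shards));
        System.out.println("fetch by ids: " + fetchByIdsReads(rows, start, shards));
    }
}
```

Under that assumption, rows=1000 and start=0 on 1000 shards gives 1000 reads for the new algorithm against 1,001,000 for the existing one, which matches the "up to 1 mio ids" figure in the issue description.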
> Faster searching limited but high rows across many shards all with many hits
> ----------------------------------------------------------------------------
>
> Key: SOLR-6810
> URL: https://issues.apache.org/jira/browse/SOLR-6810
> Project: Solr
> Issue Type: Improvement
> Components: search
> Reporter: Per Steffensen
> Assignee: Shalin Shekhar Mangar
> Labels: distributed_search, performance
> Attachments: branch_5x_rev1642874.patch, branch_5x_rev1642874.patch,
> branch_5x_rev1645549.patch
>
>
> Searching "limited but high rows across many shards all with many hits" is
> slow
> E.g.
> * Query from outside client: q=something&rows=1000
> * Resulting in sub-requests to each shard, something like this:
> ** 1) q=something&rows=1000&fl=id,score
> ** 2) Request the full documents with ids in the global-top-1000 found among
> the top-1000 from each shard
> What does the subject mean?
> * "limited but high rows" means 1000 in the example above
> * "many shards" means 200-1000 in our case
> * "all with many hits" means that each of the shards has a significant
> number of hits on the query
> The problem grows with all three factors above.
> Doing such a query on our system takes between 5 minutes and 1 hour, depending
> on a lot of things. It ought to be much faster, so let's make it so.
> Profiling shows that the problem is the time it takes to access the
> store to get ids for (up to) 1000 docs (value of the rows parameter) per shard.
> With 1000 shards, that is up to 1 million ids that have to be fetched. There is
> really no good reason to ever read information from the store for more than the
> overall top-1000 documents that have to be returned to the client.
> For further detail see mail-thread "Slow searching limited but high rows
> across many shards all with high hits" started 13/11-2014 on
> [email protected]