[
https://issues.apache.org/jira/browse/SOLR-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048223#comment-16048223
]
Justin Sarma edited comment on SOLR-5611 at 6/13/17 6:38 PM:
-------------------------------------------------------------
I've created a tool which you can use to approximate appropriate shards.rows
values through a Monte Carlo simulation. You can specify what percentage of the
time you want the results to be 100% accurate, for different search depths, and
different shard counts. We used this tool at Shutterstock to improve Solr
latency and reduce heap pressure:
https://github.com/jsarma/solr-shards-rows-optimizer
There's also a blog entry here which explains it in more detail, along with
some perf tests of the results:
https://tech.shutterstock.com/2017/05/09/efficiently-handling-deep-pagination-in-a-distributed-search-engine/
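As a rough illustration of the idea behind the tool (this is my own sketch, not code from the repository linked above; the function name and defaults are hypothetical): if documents are hashed uniformly across shards, each of the global top-`rows` documents lands on a uniformly random shard, so you can simulate many queries and pick the per-shard rows value that covers the desired fraction of them.

```python
import random

def per_shard_rows(rows, num_shards, target_accuracy=0.95, trials=10_000):
    """Estimate the smallest shards.rows value such that, in at least
    target_accuracy of simulated queries, no single shard holds more
    than that many of the global top `rows` documents.

    Assumes uniform hash-based document routing (Solr's default), so
    each top document lands on a uniformly random shard.
    """
    maxima = []
    for _ in range(trials):
        counts = [0] * num_shards
        for _ in range(rows):
            counts[random.randrange(num_shards)] += 1
        # The shard holding the most top documents sets the floor for
        # shards.rows in this simulated query.
        maxima.append(max(counts))
    maxima.sort()
    # Take the value that covers target_accuracy of the trials.
    return maxima[min(int(target_accuracy * trials), trials - 1)]

print(per_shard_rows(1000, 100))  # well below the naive 1000 per shard
```

For rows=1000 across 100 shards the mean is 10 documents per shard, and the simulated 95th-percentile maximum comes out far below the 1000 that each shard would fetch today.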
> When documents are uniformly distributed over shards, enable returning
> approximated results in distributed query
> ----------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-5611
> URL: https://issues.apache.org/jira/browse/SOLR-5611
> Project: Solr
> Issue Type: Improvement
> Components: SolrCloud
> Reporter: Isaac Hebsh
> Priority: Minor
> Labels: distributed_search, shard, solrcloud
> Fix For: 4.9, 6.0
>
> Attachments: lec5-distributedIndexing.pdf
>
>
> A query with rows=1000, sent to a collection of 100 shards (default shard key
> behaviour, based on a hash of the unique key), will generate 100 requests with
> rows=1000, one per shard.
> This results in a total of rows*numShards unique keys being retrieved.
> This behaviour gets worse as numShards grows.
> If the documents are uniformly distributed over the shards, the expected
> number of documents needed from each shard is ~ rows/numShards. Obviously,
> there may be extreme cases where all of the top X documents sit on a single
> shard.
> I suggest adding an optional parameter, say approxResults=true, which decides
> whether the rows in each shard request should be limited to rows/numShards or
> not. Moreover, we can add a numeric parameter which raises the limit, to be
> more accurate.
> For example, the query {{approxResults=true&approxResults.factor=1.5}} will
> retrieve 1.5*rows/numShards from each shard. In the case of 100 shards and
> rows=1000, each shard will return 15 documents.
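The per-shard request size in this proposal is just a one-line computation; a minimal sketch (the function name is mine, not part of any patch), rounding up so a fractional result still fetches enough documents:

```python
import math

def approx_shard_rows(rows, num_shards, factor=1.5):
    # Each shard request asks for factor * rows / numShards documents
    # instead of the full `rows`, assuming documents are spread
    # uniformly over the shards by the default hash routing.
    return math.ceil(factor * rows / num_shards)

print(approx_shard_rows(1000, 100, 1.5))  # 15, matching the example above
```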
> Furthermore, this can reduce the deep-paging problem, because the same idea
> applies there. When start=100000 is requested, Solr currently creates shard
> requests with start=0 and rows=start+rows. In the approximated approach, the
> start parameter (in the shard requests) can be set to 100000/numShards. The
> idea of the approxResults.factor creates some difficulties here, though.
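Combining the two ideas, the approximated per-shard window for a deep-paging request might look like the sketch below (again my own illustration under the uniform-distribution assumption, not code from a patch):

```python
import math

def approx_shard_window(start, rows, num_shards, factor=1.5):
    # Conventional distributed paging: every shard receives start=0 and
    # rows=start+rows, so start=100000 fetches 100000+rows docs per shard.
    # The approximated approach divides the offset among the shards too.
    shard_start = start // num_shards
    shard_rows = math.ceil(factor * rows / num_shards)
    return shard_start, shard_rows

print(approx_shard_window(100_000, 1000, 100))  # (1000, 15)
```

Here each of the 100 shards skips only 1000 documents and returns 15, instead of skipping 0 and returning 101000; how the factor should also pad shard_start is the open difficulty the comment mentions.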
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)