[
https://issues.apache.org/jira/browse/SOLR-12178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joel Bernstein updated SOLR-12178:
----------------------------------
Description:
Currently the *random* Streaming Expression performs distributed random
sampling using *CloudSolrClient*. A random sample of *N* docs is read from
each shard into memory on the aggregator node, and a page of *N* docs is then
built from those per-shard samples. Because every shard's sample is held in
memory on the aggregator node, memory consumption for random sampling grows
in proportion to N*numshards, which limits both N and numshards.
This ticket will change random sampling to an approach similar to the one
used in *CloudSolrStream*, where a stream is generated from the shards
without reading all the documents into memory.
When combined with SOLR-12159, this will allow for much larger random samples.
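
To illustrate the difference in memory behaviour described above, here is a
minimal Python sketch. It is not Solr code and does not use any Solr API: the
names shard_stream, buffered_sample and streamed_sample are hypothetical, and
it assumes each shard can return its docs already ordered by a per-doc random
key. The buffered version holds N*numshards entries on the aggregator, while
the streamed version merges lazily and keeps only one pending entry per shard.

```python
import heapq
import itertools
import random


def shard_stream(shard_id, num_docs, seed):
    """Hypothetical shard: yields (random_key, doc_id) pairs in key order.

    The sort happens on the shard side, not on the aggregator.
    """
    rng = random.Random(seed * 1_000_003 + shard_id)
    keyed = sorted((rng.random(), f"shard{shard_id}-doc{i}") for i in range(num_docs))
    yield from keyed


def buffered_sample(shards, n):
    """Read n docs from every shard into memory, then build one page of n docs.

    Returns the page and the number of entries held at once (n * numshards).
    """
    buffer = []
    for shard in shards:
        buffer.extend(itertools.islice(shard, n))
    buffer.sort()
    return [doc for _, doc in buffer[:n]], len(buffer)


def streamed_sample(shards, n):
    """Lazily merge the sorted shard streams and emit the first n docs.

    heapq.merge keeps only one pending entry per shard at any time.
    """
    merged = heapq.merge(*shards)
    for _, doc in itertools.islice(merged, n):
        yield doc
```

Both functions select the same docs for the same shard contents; only the
number of entries held on the aggregator at one time differs.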
was:
Currently the *random* Streaming Expression performs distributed random
sampling using *CloudSolrClient*. A random sample of *N* docs is read from
each shard into memory on the aggregator node, and a page of *N* docs is then
built from those per-shard samples. Because every shard's sample is held in
memory on the aggregator node, memory consumption for random sampling grows
in proportion to N*numshards, which limits both N and numshards.
This ticket will change random sampling to an approach similar to the one
used in *CloudSolrStream*, where a stream is generated from the shards
without reading all the documents into memory.
> Improve efficiency of distributed random sampling
> -------------------------------------------------
>
> Key: SOLR-12178
> URL: https://issues.apache.org/jira/browse/SOLR-12178
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Joel Bernstein
> Assignee: Joel Bernstein
> Priority: Major
> Fix For: 7.4
>
>