It is nondeterministic because tasks are not always scheduled to the same
nodes, so I don't think you can make this deterministic.
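
For reference, the atomic-boolean approach quoted below could look roughly
like this. It is only a sketch: the FirstTaken object and the onePerExecutor
helper are hypothetical names, and it assumes the code is compiled into the
application jar so that the object is initialized once per executor JVM.

```scala
import java.util.concurrent.atomic.AtomicBoolean

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD

// Hypothetical holder: a Scala object is initialized once per JVM,
// so every task running in the same executor shares this flag.
object FirstTaken {
  val flag = new AtomicBoolean(false)
}

// Keeps the first element of whichever non-empty partition reaches the flag
// first in each executor JVM; every other partition returns nothing.
def onePerExecutor[T: ClassTag](rdd: RDD[T]): RDD[T] =
  rdd.mapPartitions { iter =>
    if (iter.hasNext && FirstTaken.flag.compareAndSet(false, true)) iter.take(1)
    else Iterator.empty
  }
```

Which partition wins the compareAndSet depends on the order in which tasks
are scheduled, and that is where the nondeterminism comes from. The flag is
also per executor JVM rather than per physical node, and it is never reset,
so a retried task or a second job on the same executors gets nothing.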

If you assume no failures and that tasks take a while to run (so they run
slower than the scheduler can schedule them), then I think you can make it
deterministic by setting spark.locality.wait to a really high number and
coalescing everything into just N partitions, where N is the number of
machines.
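
A rough sketch of that idea, under the same assumptions (no failures, tasks
slower than the scheduler). The helper name onePerMachine, the app name and
the "1h" wait value are placeholders, and nothing here guarantees that the
coalesced partitions really land one per machine:

```scala
import scala.reflect.ClassTag

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD

// Make the scheduler wait a long time for a data-local slot instead of
// falling back to a less local node. "1h" is an arbitrary large value.
// Pass this conf to the SparkContext; the master is expected to come
// from spark-submit.
val conf = new SparkConf()
  .setAppName("one-element-per-node") // placeholder name
  .set("spark.locality.wait", "1h")

// Hypothetical helper: shrink the RDD to numMachines partitions without a
// shuffle, then keep at most the first element of each partition.
def onePerMachine[T: ClassTag](rdd: RDD[T], numMachines: Int): RDD[T] =
  rdd.coalesce(numMachines).mapPartitions(_.take(1))
```

Using take(1) instead of head means an empty partition simply contributes
no element rather than throwing.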




On Fri, Sep 18, 2015 at 5:53 PM, Ulanov, Alexander <alexander.ula...@hpe.com
> wrote:

> Sounds interesting! Is it possible to make it deterministic by using a
> global long value and taking the element from a partition only if
> someFunction(partitionId, globalLong) == true? Or by using a specific
> partitioner that creates partitionIds that can be decomposed into a
> nodeId and the number of partitions per node?
>
>
>
> *From:* Reynold Xin [mailto:r...@databricks.com]
> *Sent:* Friday, September 18, 2015 4:37 PM
> *To:* Ulanov, Alexander
> *Cc:* Feynman Liang; dev@spark.apache.org
> *Subject:* Re: One element per node
>
>
>
> Use a global atomic boolean and return nothing from that partition if the
> boolean is true.
>
>
>
> Note that your result won't be deterministic.
>
>
> On Sep 18, 2015, at 4:11 PM, Ulanov, Alexander <alexander.ula...@hpe.com>
> wrote:
>
> Thank you! How can I guarantee that I have only one element per executor
> (per worker, or per physical node)?
>
>
>
> *From:* Feynman Liang [mailto:fli...@databricks.com
> <fli...@databricks.com>]
> *Sent:* Friday, September 18, 2015 4:06 PM
> *To:* Ulanov, Alexander
> *Cc:* dev@spark.apache.org
> *Subject:* Re: One element per node
>
>
>
> rdd.mapPartitions(x => x.take(1))
>
>
>
> On Fri, Sep 18, 2015 at 3:57 PM, Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
> Dear Spark developers,
>
>
>
> Is it possible (and if so, how) to pick one element per physical node
> from an RDD? Let’s say the first element of any partition on that node.
> The result would be an RDD[element] with as many elements as there are
> nodes that hold partitions of the initial RDD.
>
>
>
> Best regards, Alexander
>
>
>
>
