Since an RDD doesn't have any ordering guarantee to begin with, I
don't think there is any guarantee about the order in which data is
encountered. The order can even change when the same RDD is
reevaluated.

As you say, your first scenario is about the best you can do. You can
achieve it if you can define some function of your data that maps each
item to 0...N in the desired order. Then repartitionAndSortWithinPartitions,
with a Partitioner that sends each item to partition (value % numPartitions),
gives roughly the desired effect, as in the sketch below.

On Fri, Oct 10, 2014 at 2:29 AM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Hi,
>
> I am planning an application where the order of items is somehow important.
> In particular it is an online machine learning application where learning in
> a different order will lead to a different model.
>
> I was wondering about ordering guarantees for Spark applications. So if I
> say myRdd.map(someFun), then someFun will be executed on many cluster nodes,
> but do I know anything about the order of execution?
>
> Say, for example, if data is distributed like
>
>  node1 | node2 | node3 | node4
>    1   |   2   |   3   |   4
>    5   |   6   |   7   |   8
>    9   |  10   |  11   |  12
>   13   |  14   |  15   |  16
>
> Then I guess that - more or less - first, items 1-4 will be processed, then
> 5-8, then 9-12; about the best I could hope for in a distributed context.
> However, if the distribution is like
>
>  node1 | node2 | node3 | node4
>    1   |   5   |   9   |  13
>    2   |   6   |  10   |  14
>    3   |   7   |  11   |  15
>    4   |   8   |  12   |  16
>
> then items are processed in an order that is completely unrelated to the
> original order of items in my dataset. So is there some way to ensure/steer
> the order in which someFun processes items, from a global point of view?
>
> Thanks
> Tobias
>
