Since an RDD doesn't have any ordering guarantee to begin with, I don't think there is any guarantee about the order in which the data is encountered. It can even change when the same RDD is re-evaluated.
As you say, your first scenario is about the best you can do. You can achieve it if you can define some function of your data that maps each item to 0...N in the desired order. Then repartitionAndSortWithinPartitions with a Partitioner that maps each item to value % numPartitions gets about the desired effect. (A rough sketch is below the quoted message.)

On Fri, Oct 10, 2014 at 2:29 AM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Hi,
>
> I am planning an application where the order of items is somehow important.
> In particular, it is an online machine learning application where learning in
> a different order will lead to a different model.
>
> I was wondering about ordering guarantees for Spark applications. So if I
> say myRdd.map(someFun), then someFun will be executed on many cluster nodes,
> but do I know anything about the order of the execution?
>
> Say, for example, if data is distributed like
>
> node1 | node2 | node3 | node4
>     1 |     2 |     3 |     4
>     5 |     6 |     7 |     8
>     9 |    10 |    11 |    12
>    13 |    14 |    15 |    16
>
> then I guess that - more or less - first, items 1-4 will be processed, then
> 5-8, then 9-12; about the best I could hope for in a distributed context.
> However, if the distribution is like
>
> node1 | node2 | node3 | node4
>     1 |     5 |     9 |    13
>     2 |     6 |    10 |    14
>     3 |     7 |    11 |    15
>     4 |     8 |    12 |    16
>
> then items are processed in an order that is completely unrelated to the
> original order of items in my dataset. So is there some way to ensure/steer
> the order in which someFun is processed from a global point of view?
>
> Thanks
> Tobias
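For concreteness, here is a minimal sketch of the approach above, not from the original thread: it assumes each item can be tagged with a Long index that encodes the desired global order (zipWithIndex stands in for that here), and IndexPartitioner, OrderingSketch and the local[4] master are made-up names for illustration. Partition p then holds indices p, p + numPartitions, p + 2*numPartitions, ... in ascending order, which is the round-robin layout of the first scenario.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical partitioner: routes an item's order index to index % numPartitions.
class IndexPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case idx: Long => (idx % numPartitions).toInt
    case _         => 0
  }
}

object OrderingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ordering-sketch").setMaster("local[4]"))

    // Key each item by its position in the desired global order.
    // (zipWithIndex stands in for whatever function maps your data to 0...N.)
    val indexed = sc.parallelize(1 to 16, 4)
      .zipWithIndex()
      .map { case (item, idx) => (idx, item) }

    // Send index idx to partition idx % 4 and sort each partition by idx,
    // so partition p holds indices p, p + 4, p + 8, ... in ascending order.
    val arranged = indexed.repartitionAndSortWithinPartitions(new IndexPartitioner(4))

    // Inspect the layout: each partition sees its items in the desired order.
    arranged
      .mapPartitionsWithIndex((pid, iter) => Iterator((pid, iter.map(_._2).toList)))
      .collect()
      .foreach { case (pid, items) => println(s"partition $pid -> ${items.mkString(", ")}") }

    sc.stop()
  }
}
```

Note that this only controls which items are co-located and the order within each partition; tasks for different partitions still run concurrently, so there is no global sequential order, which is about the best you can hope for in a distributed context.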