Hi,

I am planning an application in which the order of items matters. Specifically, it is an online machine learning application, where learning from the items in a different order leads to a different model.

I was wondering about ordering guarantees for Spark applications. If I call myRdd.map(someFun), then someFun will be executed on many cluster nodes, but do I know anything about the order of execution?

Say, for example, the data is distributed like this:

  node1 | node2 | node3 | node4
    1   |   2   |   3   |   4
    5   |   6   |   7   |   8
    9   |  10   |  11   |  12
   13   |  14   |  15   |  16

Then I guess that, more or less, items 1-4 will be processed first, then 5-8, then 9-12, and finally 13-16, which is about the best I could hope for in a distributed context. However, if the distribution is like this:

  node1 | node2 | node3 | node4
    1   |   5   |   9   |  13
    2   |   6   |  10   |  14
    3   |   7   |  11   |  15
    4   |   8   |  12   |  16

then items are processed in an order that is completely unrelated to the original order of items in my dataset.

So is there some way to ensure/steer the order in which someFun is applied from a global point of view?

Thanks
Tobias
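
P.S. To make the two layouts concrete, here is a toy sketch in plain Python (no Spark involved). The function names are made up for illustration, and the "lock-step" model, in which every node handles exactly one item per round, is only my simplifying assumption and not a description of how Spark actually schedules work.

```python
def round_robin(items, num_nodes):
    """Deal items out one at a time across nodes (first layout)."""
    return [items[i::num_nodes] for i in range(num_nodes)]


def contiguous_blocks(items, num_nodes):
    """Give each node one consecutive slice of the data (second layout)."""
    size = -(-len(items) // num_nodes)  # ceiling division
    return [items[i * size:(i + 1) * size] for i in range(num_nodes)]


def lockstep_rounds(partitions):
    """Items touched per round, ASSUMING every node handles one item per round."""
    longest = max(len(p) for p in partitions)
    return [[p[r] for p in partitions if r < len(p)] for r in range(longest)]


items = list(range(1, 17))
for name, layout in [("round robin", round_robin),
                     ("contiguous blocks", contiguous_blocks)]:
    print(name)
    for r, batch in enumerate(lockstep_rounds(layout(items, 4)), start=1):
        print(f"  round {r}: {batch}")
```

Under that assumption, the first layout touches items 1-4 in the first round, while the second touches 1, 5, 9 and 13, which is the situation I would like to avoid or at least control.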