Hi,

I am planning an application in which the order of items matters.
Specifically, it is an online machine learning application, where learning
from the items in a different order leads to a different model.

I was wondering about ordering guarantees for Spark applications. If I
call myRdd.map(someFun), then someFun will be executed on many cluster
nodes, but is anything guaranteed about the order of execution?

Say, for example, if data is distributed like

 node1 | node2 | node3 | node4
   1   |   2   |   3   |   4
   5   |   6   |   7   |   8
   9   |  10   |  11   |  12
  13   |  14   |  15   |  16

Then I would guess that, roughly, items 1-4 are processed first, then 5-8,
then 9-12, and finally 13-16, which is about the best I could hope for in a
distributed context.
However, if the distribution is like

 node1 | node2 | node3 | node4
   1   |   5   |   9   |  13
   2   |   6   |  10   |  14
   3   |   7   |  11   |  15
   4   |   8   |  12   |  16

then items are processed in an order that is completely unrelated to the
original order of items in my dataset. Is there some way to control, or at
least influence, the global order in which someFun is applied to the items?
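
To make the two layouts concrete, here is a small plain-Python sketch. It
does not use Spark and says nothing about how Spark actually schedules work.
It only assumes, as an idealisation, that all four nodes advance in lockstep,
one item per step. The function names are made up for illustration.

```python
from itertools import zip_longest


def round_robin_layout(items, num_nodes):
    """First table above: item at index i is placed on node i % num_nodes."""
    return [items[n::num_nodes] for n in range(num_nodes)]


def contiguous_layout(items, num_nodes):
    """Second table above: each node holds one contiguous block of the dataset."""
    size = -(-len(items) // num_nodes)  # ceiling division
    return [items[n * size:(n + 1) * size] for n in range(num_nodes)]


def lockstep_order(layout):
    """Idealised global order: at step k every node processes its k-th item.

    Assumption for illustration only; real nodes do not run in lockstep.
    """
    order = []
    for step in zip_longest(*layout):
        order.extend(x for x in step if x is not None)
    return order


items = list(range(1, 17))
first = lockstep_order(round_robin_layout(items, 4))
second = lockstep_order(contiguous_layout(items, 4))

print(first)   # [1, 2, 3, ..., 16]: matches the original order
print(second)  # [1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15, 4, 8, 12, 16]
```

Under that lockstep assumption, the first layout reproduces the original
order, while the second interleaves items from four distant parts of the
dataset.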

Thanks
Tobias
