I doubt Spark has the ability to control the order of task execution.
You could try the following approaches (a rough sketch combining them
follows the list).

1. Write your own partitioner to group your data into partitions.
2. Sort the elements within each partition, e.g. by a row index.
3. Control the order in which your application consumes the results it
gets back from Spark.
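
Very roughly, something like this (a minimal, untested sketch; the
RangePartitionerByIndex class is something you would write yourself,
not a Spark API, and the partition count of 4 is arbitrary):

import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Keeps contiguous ranges of row indices in the same partition:
// indices 0..3 -> partition 0, 4..7 -> partition 1, and so on.
class RangePartitionerByIndex(parts: Int, total: Long) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = {
    val idx = key.asInstanceOf[Long]
    math.min((idx * parts / total).toInt, parts - 1)
  }
}

val sc = new SparkContext(new SparkConf().setAppName("ordered").setMaster("local[4]"))
val data = sc.parallelize(1 to 16)
val total = data.count()

// 1. Attach a global row index and use it as the key.
val indexed = data.zipWithIndex().map { case (v, i) => (i, v) }

// 2. Group contiguous index ranges together, then sort inside each partition.
val ordered = indexed
  .partitionBy(new RangePartitionerByIndex(4, total))
  .mapPartitions(_.toArray.sortBy(_._1).iterator, preservesPartitioning = true)

// 3. toLocalIterator walks the partitions in order, so the driver sees
//    the elements in their original global order.
ordered.toLocalIterator.foreach { case (i, v) => println(s"$i -> $v") }

Note this only controls the order in which the driver consumes the
results; the tasks themselves still run in parallel, so someFun would
still not be invoked in global order across partitions.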

xj @ Tokyo life

On Fri, Oct 10, 2014 at 10:29 AM, Tobias Pfeiffer <[email protected]> wrote:

> Hi,
>
> I am planning an application where the order of items is somehow
> important. In particular it is an online machine learning application where
> learning in a different order will lead to a different model.
>
> I was wondering about ordering guarantees for Spark applications. So if I
> say myRdd.map(someFun), then someFun will be executed on many cluster
> nodes, but do I know anything about the order of the execution?
>
> Say, for example, if data is distributed like
>
>  node1 | node2 | node3 | node4
>    1   |   2   |   3   |   4
>    5   |   6   |   7   |   8
>    9   |  10   |  11   |  12
>   13   |  14   |  15   |  16
>
> Then I guess that - more or less - first, items 1-4 will be processed,
> then 5-8, then 9-12; about the best I could hope for in a distributed
> context. However, if the distribution is like
>
>  node1 | node2 | node3 | node4
>    1   |   5   |   9   |  13
>    2   |   6   |  10   |  14
>    3   |   7   |  11   |  15
>    4   |   8   |  12   |  16
>
> then items are processed in an order that is completely unrelated to the
> original order of items in my dataset. So is there some way to ensure/steer
> the order in which someFun is processed from a global point of view?
>
> Thanks
> Tobias
>
>
