Sean,
thanks, I didn't know about repartitionAndSortWithinPartitions, that seems
very helpful!
Tobias
Since an RDD doesn't have any ordering guarantee to begin with, I
don't think there is any guarantee about the order in which data is
encountered. It can even change when the same RDD is reevaluated.
As you say, your scenario 1 is about the best you can do. You can
achieve this if you can define s
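
To make the idea concrete, here is a minimal local sketch of what Spark's repartitionAndSortWithinPartitions gives you, written in plain Python with no Spark dependency (in PySpark the real call is rdd.repartitionAndSortWithinPartitions(numPartitions)). The hash-based partition assignment and the sample data are illustrative, not Spark's exact internals:

```python
def repartition_and_sort(records, num_partitions):
    """Shuffle (key, value) pairs into partitions, then sort each partition by key.

    Mimics repartitionAndSortWithinPartitions: ordering is guaranteed only
    WITHIN a partition, never across partitions.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # Toy stand-in for Spark's HashPartitioner.
        partitions[hash(key) % num_partitions].append((key, value))
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
    return partitions

data = [(4, "d"), (1, "a"), (3, "c"), (2, "b"), (5, "e")]
for i, part in enumerate(repartition_and_sort(data, 2)):
    print(i, part)
```

Note that each printed partition is sorted by key, but reading partition 0 then partition 1 does not reproduce a global sort order; that is exactly the "within partitions" caveat.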
I doubt Spark has any facility for controlling the order of task
execution.
You could try these approaches:
1. Write your own partitioner to group your data.
2. Sort the elements within each partition, e.g. by a row index.
3. Control the order of the results returned from Spark in your
application.
xj
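
The three suggestions above can be sketched together in plain Python (no Spark dependency); the partitioning rule, the row index, and all function names here are illustrative assumptions, not a Spark API:

```python
def custom_partitioner(key, num_partitions):
    # Step 1: a custom partitioner groups related keys (toy rule: modulo).
    return key % num_partitions

def add_row_index(records):
    # Step 2: attach a row index up front so order can be reconstructed later.
    return list(enumerate(records))

def collect_in_order(partitions):
    # Step 3: the application merges partition outputs back by row index.
    merged = [item for part in partitions for item in part]
    return [rec for _, rec in sorted(merged)]

records = ["a", "b", "c", "d", "e"]
num_partitions = 2
parts = [[] for _ in range(num_partitions)]
for idx, rec in add_row_index(records):
    parts[custom_partitioner(idx, num_partitions)].append((idx, rec))

print(collect_in_order(parts))
```

The point of the sketch: even though the partitions scatter the elements, the row index carried through steps 1 and 2 lets the application restore the original order in step 3.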