I think it's a bit more narrow: the ordering is not deterministic on a
partition reevaluation. So operations like zipWithIndex or even
zipWithUniqueId do not pair the same values with the same IDs each
time. If this matters at all, then a sort is needed beforehand. It has
been documented (recently) in the scaladoc.
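As a plain-Scala sketch of why this matters (the object and the sequences below are illustrative stand-ins, not Spark code): zipWithIndex pairs each value with its position, so if a recomputed partition yields its values in a different order, the value-to-index pairing changes, while sorting first makes the pairing stable.

```scala
// Plain-Scala stand-in for the Spark behavior described above: the index
// that zipWithIndex assigns depends entirely on element order.
object ZipOrderDemo extends App {
  val firstRun  = Seq("A", "B", "C") // order seen on the first evaluation
  val recompute = Seq("B", "A", "C") // hypothetical order after recomputation

  // The pairings differ: "A" gets index 0 in one run and 1 in the other.
  assert(firstRun.zipWithIndex != recompute.zipWithIndex)

  // Sorting first makes the pairing deterministic regardless of input order.
  assert(firstRun.sorted.zipWithIndex == recompute.sorted.zipWithIndex)
}
```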
+1
Eric Friedman
> On Oct 9, 2014, at 12:11 AM, Sung Hwan Chung wrote:
Are there a large number of non-deterministic lineage operators?
This seems like a pretty big caveat, particularly for casual programmers
who expect consistent semantics between Spark and Scala.
E.g., making sure that there's no randomness whatsoever in RDD
transformations seems critical.
Yes, I think this is another operation that is not deterministic, even
for the same RDD. If a partition is lost and recomputed, the ordering
within the partition can be different. Sorting the RDD makes the
ordering deterministic.
On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung wrote:
Let's say you have some rows in a dataset (say X partitions initially):

A
B
C
D
E
...

You repartition to Y > X; then it seems that any of the following could be
valid:

partition 1    partition 2
A              B
...            ...
IIRC, the random number generator is seeded with the partition index, so
it will always produce the same result for the same index. Maybe I don't
totally follow, though. Could you give a small example of how this might
change the RDD ordering in a way that you don't expect? In general,
repartition() will not preserve the ordering.
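To sketch the point about the seed (a hedged plain-Scala illustration, not Spark's actual implementation; scatterTargets is a hypothetical helper, not a Spark API): if the RNG used to scatter rows during repartition() is seeded with the partition index, each partition scatters its rows the same way on every recomputation, so the randomness is reproducible per index.

```scala
import scala.util.Random

object SeededScatterDemo extends App {
  // Hypothetical helper: pick a target partition for each of numRows rows,
  // using an RNG seeded with the source partition's index.
  def scatterTargets(partitionIndex: Int, numRows: Int, numPartitions: Int): Seq[Int] = {
    val rng = new Random(partitionIndex) // seed = partition index
    Seq.fill(numRows)(rng.nextInt(numPartitions))
  }

  // Same partition index => identical scatter pattern across runs,
  // which is why the result is repeatable for the same index.
  assert(scatterTargets(0, 5, 3) == scatterTargets(0, 5, 3))
}
```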