Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Sean Owen
I think it's a bit narrower than that: the order of elements within a partition is not deterministic when the partition is re-evaluated. So things like zipWithIndex or even zipWithUniqueId do not pair the same values with the same IDs each time. If this matters at all, then a sort is needed beforehand. It has been documented (recently) in the scalado
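
To make the point above concrete, here is a minimal sketch in plain Python (not Spark code). It assumes a recomputed partition holds the same rows but may receive them in a different order; the two arrival orders and the helper `zip_with_index` are hypothetical stand-ins for illustration, not Spark APIs.

```python
def zip_with_index(rows):
    """Pair each row with its position, like an index assigned in arrival order."""
    return {row: i for i, row in enumerate(rows)}

# Hypothetical arrival orders for the same partition on two evaluations.
first_evaluation = ["A", "B", "C", "D", "E"]
second_evaluation = ["C", "A", "E", "B", "D"]

# Without a sort, the same row can receive a different index each time.
unsorted_stable = zip_with_index(first_evaluation) == zip_with_index(second_evaluation)

# Sorting first fixes the order, so the indices agree across evaluations.
sorted_stable = (
    zip_with_index(sorted(first_evaluation))
    == zip_with_index(sorted(second_evaluation))
)

print(unsorted_stable, sorted_stable)
```

The sort only helps if it is a total order over the rows; rows that compare equal could still swap places.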

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Eric Friedman
+1

Eric Friedman

> On Oct 9, 2014, at 12:11 AM, Sung Hwan Chung wrote:
>
> Are there a large number of non-deterministic lineage operators?
>
> This seems like a pretty big caveat, particularly for casual programmers who expect consistent semantics between Spark and Scala.
>
> E.g., m

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Sung Hwan Chung
Are there a large number of non-deterministic lineage operators? This seems like a pretty big caveat, particularly for casual programmers who expect consistent semantics between Spark and Scala. E.g., making sure that there's no randomness whatsoever in RDD transformations seems critical. Addit

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-08 Thread Sean Owen
Yes, I think this is another operation that is not deterministic even for the same RDD. If a partition is lost and recalculated, the ordering within the partition can be different. Sorting the RDD makes the ordering deterministic.

On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung wrote:
> Let's say you hav

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-08 Thread Sung Hwan Chung
Let's say you have some rows in a dataset (say X partitions initially). A B C D E . . . . You repartition to Y > X, then it seems that any of the following could be valid: partition 1 partition 2 A B .

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-08 Thread Patrick Wendell
IIRC - the random is seeded with the index, so it will always produce the same result for the same index. Maybe I don't totally follow though. Could you give a small example of how this might change the RDD ordering in a way that you don't expect? In general repartition() will not preserve the orde
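
A rough sketch of why seeding with the index is not enough on its own, written in plain Python rather than Spark. It assumes rows are dealt round-robin to target partitions starting from a position drawn from a generator seeded with the source partition index. The function `assign_targets` and its parameters are hypothetical and only approximate the behaviour described in this thread.

```python
import random

def assign_targets(rows, partition_index, num_partitions):
    """Deal rows round-robin, starting at a position seeded by the partition index."""
    start = random.Random(partition_index).randrange(num_partitions)
    return {row: (start + i) % num_partitions for i, row in enumerate(rows)}

# Same index and same input order: the assignment is reproducible.
original = assign_targets(["A", "B", "C", "D", "E"], partition_index=0, num_partitions=3)
same_order = assign_targets(["A", "B", "C", "D", "E"], partition_index=0, num_partitions=3)

# Same index, but the upstream partition was recomputed in another order:
# "A" and "B" now land in different target partitions than before.
reordered = assign_targets(["B", "A", "C", "D", "E"], partition_index=0, num_partitions=3)

print(original == same_order, original == reordered)
```

Under these assumptions the seed makes the starting position repeatable, but which row lands where still depends on the order in which the rows arrive.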