Yes, I think this is another operation that is not deterministic, even for the same RDD. If a partition is lost and recomputed, the ordering of rows within the partition can be different. Sorting the RDD makes the ordering deterministic.
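To make the point concrete, here is a rough plain-Scala sketch with no Spark dependency. It assumes a simplified distribution scheme like the one described in the thread below: a generator seeded with the partition index, with each row's target depending on its position in the iteration order. The names `RepartitionOrderSketch` and `distribute` are made up for illustration, and this is not Spark's actual code.

```scala
import scala.util.Random

object RepartitionOrderSketch {
  // Hypothetical model of an order-dependent distribution step (an assumption,
  // not Spark's implementation): the generator is seeded with the source
  // partition index, so the starting target is fixed for a given index, but
  // each row's target still depends on where it appears in the iteration order.
  def distribute(index: Int, rows: Seq[String], numPartitions: Int): Map[Int, Seq[String]] = {
    val start = new Random(index).nextInt(numPartitions)
    rows.zipWithIndex
      .groupBy { case (_, i) => (start + i + 1) % numPartitions }
      .map { case (target, pairs) => target -> pairs.map(_._1) }
  }

  def main(args: Array[String]): Unit = {
    val firstRun   = Seq("A", "B", "C", "D", "E")
    val recomputed = Seq("C", "A", "D", "B", "E") // same rows, different order

    // Same seed and same rows, but the order differs, so the assignment differs.
    println(distribute(0, firstRun, 2))
    println(distribute(0, recomputed, 2))

    // Sorting first fixes the iteration order, so both runs agree.
    println(distribute(0, firstRun.sorted, 2) == distribute(0, recomputed.sorted, 2))
  }
}
```

Under this model the fixed seed only pins down the starting target. The assignment is reproducible only if the rows also arrive in the same order, which is what sorting guarantees when the sort key gives a total order over the rows.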
On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:
> Let's say you have some rows in a dataset (say X partitions initially).
>
> A
> B
> C
> D
> E
> .
> .
> .
> .
>
>
> You repartition to Y > X, then it seems that any of the following could be
> valid:
>
> partition 1    partition 2 ........................
> A              B
> ........................
> C              E
> D              .
> ........................
> --------------------------
> C              E
> A              B
> D              .
> --------------------------
> D              B
> C              E
> A
>
> etc. etc.
>
> I.e., although each partition will have the same unordered set, the rows'
> orders will change from call to call.
>
> Now, because row ordering can change from call to call, if you do any
> operation that depends on the order of items you saw, then lineage is no
> longer deterministic. For example, it seems that the repartition call itself
> is a row-order dependent call, because it creates a random number generator
> with the partition index as the seed, and then call nextInt as you go
> through the rows.
>
>
> On Wed, Oct 8, 2014 at 10:14 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>
>> IIRC - the random is seeded with the index, so it will always produce
>> the same result for the same index. Maybe I don't totally follow
>> though. Could you give a small example of how this might change the
>> RDD ordering in a way that you don't expect? In general repartition()
>> will not preserve the ordering of an RDD.
>>
>> On Wed, Oct 8, 2014 at 3:42 PM, Sung Hwan Chung
>> <coded...@cs.stanford.edu> wrote:
>> > I noticed that repartition will result in non-deterministic lineage
>> > because it'll result in changed orders for rows.
>> >
>> > So for instance, if you do things like:
>> >
>> > val data = read(...)
>> > val k = data.repartition(5)
>> > val h = k.repartition(5)
>> >
>> > It seems that this results in different ordering of rows for 'k' each
>> > time you call it.
>> > And because of this different ordering, 'h' will result in different
>> > partitions even, because 'repartition' distributes through a random
>> > number generator with the 'index' as the key.