+1

----
Eric Friedman
> On Oct 9, 2014, at 12:11 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:
>
> Are there a large number of non-deterministic lineage operators?
>
> This seems like a pretty big caveat, particularly for casual programmers who
> expect consistent semantics between Spark and Scala.
>
> E.g., making sure that there's no randomness whatsoever in RDD
> transformations seems critical. Additionally, shuffling operators would
> usually result in changed orders, etc.
>
> These are very easy errors to make, and if you tend to cache things, some
> errors won't be detected until fault-tolerance is triggered. It would be very
> helpful for programmers to have a big warning list of not-to-dos within RDD
> transformations.
>
>> On Wed, Oct 8, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com> wrote:
>> Yes, I think this is another operation that is not deterministic even for
>> the same RDD. If a partition is lost and recalculated, the ordering can
>> be different in the partition. Sorting the RDD makes the ordering
>> deterministic.
>>
>> On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung
>> <coded...@cs.stanford.edu> wrote:
>> > Let's say you have some rows in a dataset (say X partitions initially).
>> >
>> > A
>> > B
>> > C
>> > D
>> > E
>> > .
>> > .
>> > .
>> > .
>> >
>> > You repartition to Y > X, then it seems that any of the following could be
>> > valid:
>> >
>> > partition 1    partition 2
>> > ........................
>> > A              B
>> > C              E
>> > D              .
>> > --------------------------
>> > C              E
>> > A              B
>> > D              .
>> > --------------------------
>> > D              B
>> > C              E
>> > A
>> >
>> > etc. etc.
>> >
>> > I.e., although each partition will have the same unordered set, the rows'
>> > orders will change from call to call.
>> >
>> > Now, because row ordering can change from call to call, if you do any
>> > operation that depends on the order of items you saw, then lineage is no
>> > longer deterministic. For example, it seems that the repartition call
>> > itself is a row-order dependent call, because it creates a random number
>> > generator with the partition index as the seed, and then calls nextInt as
>> > you go through the rows.
>> >
>> > On Wed, Oct 8, 2014 at 10:14 PM, Patrick Wendell <pwend...@gmail.com>
>> > wrote:
>> >>
>> >> IIRC - the random is seeded with the index, so it will always produce
>> >> the same result for the same index. Maybe I don't totally follow
>> >> though. Could you give a small example of how this might change the
>> >> RDD ordering in a way that you don't expect? In general repartition()
>> >> will not preserve the ordering of an RDD.
>> >>
>> >> On Wed, Oct 8, 2014 at 3:42 PM, Sung Hwan Chung
>> >> <coded...@cs.stanford.edu> wrote:
>> >> > I noticed that repartition will result in non-deterministic lineage
>> >> > because it'll result in changed orders for rows.
>> >> >
>> >> > So for instance, if you do things like:
>> >> >
>> >> > val data = read(...)
>> >> > val k = data.repartition(5)
>> >> > val h = k.repartition(5)
>> >> >
>> >> > It seems that this results in different ordering of rows for 'k' each
>> >> > time you call it.
>> >> > And because of this different ordering, 'h' will result in different
>> >> > partitions even, because 'repartition' distributes through a random
>> >> > number generator with the 'index' as the key.
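
To make the point of the thread concrete, here is a minimal sketch in plain Scala (no Spark dependency). It only models the behaviour as described above: one generator seeded with the partition index, and one nextInt draw per row. It is not Spark's actual implementation, and the names RepartitionOrderSketch and distribute are hypothetical. The idea it illustrates is that a fixed seed pins down the sequence of draws but not which row receives which draw, so a recomputed partition that yields its rows in a different order can be split differently, while sorting first removes that dependence.

```scala
import scala.util.Random

object RepartitionOrderSketch {
  // Hypothetical model (assumption, not Spark code): assign each row of one
  // input partition to an output partition by drawing from a generator that
  // is seeded with the input partition's index.
  def distribute(index: Int, rows: Seq[String], numPartitions: Int): Map[Int, Seq[String]] = {
    val rng = new Random(index)
    rows
      .map(row => (rng.nextInt(numPartitions), row)) // one draw per row, in arrival order
      .groupBy(_._1)
      .map { case (target, pairs) => target -> pairs.map(_._2) }
  }

  def main(args: Array[String]): Unit = {
    val firstRun   = Seq("A", "B", "C", "D", "E")
    val recomputed = Seq("C", "A", "D", "E", "B") // same rows, different arrival order

    // Same seed, same draws, but the draws land on different rows,
    // so these two assignments are not guaranteed to match.
    println(distribute(0, firstRun, 2))
    println(distribute(0, recomputed, 2))

    // Sorting before distributing fixes the arrival order, so the
    // assignment no longer depends on how the partition was produced.
    println(distribute(0, firstRun.sorted, 2) == distribute(0, recomputed.sorted, 2)) // prints true
  }
}
```

This matches both observations in the thread: the generator is deterministic for a given index, and the result can still vary when the upstream row order varies. Sorting, as suggested above, is what makes the input order (and hence the assignment) reproducible.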