Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

Eric Friedman Thu, 09 Oct 2014 06:26:17 -0700

+1

----
Eric Friedman


> On Oct 9, 2014, at 12:11 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:
> 
> Are there a large number of non-deterministic lineage operators?
> 
> This seems like a pretty big caveat, particularly for casual programmers who 
> expect consistent semantics between Spark and Scala.
> 
> E.g., making sure that there's no randomness whatsoever in RDD 
> transformations seems critical. Additionally, shuffling operators would 
> usually result in changed orders, etc.
> 
> These are very easy errors to make, and if you tend to cache things, some 
> errors won't be detected until fault-tolerance is triggered. It would be very 
> helpful for programmers to have a big warning list of not-to-dos within RDD 
> transformations.
> 
>> On Wed, Oct 8, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com> wrote:
>> Yes, I think this is another operation that is not deterministic even for
>> the same RDD. If a partition is lost and recalculated the ordering can
>> be different in the partition. Sorting the RDD makes the ordering
>> deterministic.
>> 
>> On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung
>> <coded...@cs.stanford.edu> wrote:
>> > Let's say you have some rows in a dataset (say X partitions initially).
>> >
>> > A
>> > B
>> > C
>> > D
>> > E
>> > .
>> > .
>> > .
>> > .
>> >
>> >
>> > You repartition to Y > X, then it seems that any of the following could be
>> > valid:
>> >
>> > partition 1             partition 2             ...
>> > A                       B
>> > C                       E
>> > D                       .
>> > --------------------------
>> > C                       E
>> > A                       B
>> > D                       .
>> > --------------------------
>> > D                       B
>> > C                       E
>> > A
>> >
>> > etc. etc.
>> >
>> > I.e., although each partition will have the same unordered set, the rows'
>> > orders will change from call to call.
>> >
>> > Now, because row ordering can change from call to call, if you do any
>> > operation that depends on the order of items you saw, then lineage is no
>> > longer deterministic. For example, it seems that the repartition call itself
>> > is a row-order dependent call, because it creates a random number generator
>> > with the partition index as the seed, and then calls nextInt as you go
>> > through the rows.
>> >
>> >
>> > On Wed, Oct 8, 2014 at 10:14 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>> >>
>> >> IIRC - the random is seeded with the index, so it will always produce
>> >> the same result for the same index. Maybe I don't totally follow
>> >> though. Could you give a small example of how this might change the
>> >> RDD ordering in a way that you don't expect? In general repartition()
>> >> will not preserve the ordering of an RDD.
>> >>
>> >> On Wed, Oct 8, 2014 at 3:42 PM, Sung Hwan Chung
>> >> <coded...@cs.stanford.edu> wrote:
>> >> > I noticed that repartition will result in non-deterministic lineage
>> >> > because it'll result in changed orders for rows.
>> >> >
>> >> > So for instance, if you do things like:
>> >> >
>> >> > val data = read(...)
>> >> > val k = data.repartition(5)
>> >> > val h = k.repartition(5)
>> >> >
>> >> > It seems that this results in a different ordering of rows for 'k' each
>> >> > time you call it.
>> >> > And because of this different ordering, 'h' will even end up with
>> >> > different partitions, because 'repartition' distributes rows through a
>> >> > random number generator with the 'index' as the seed.
>> >
>> >
>
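
For readers of the archive: the order dependence discussed in this thread can be reproduced without Spark. The sketch below is a simplified model, not Spark's actual implementation. `RepartitionOrderSketch` and `assign` are hypothetical names, and the round-robin rule is an assumption standing in for whatever the real shuffle does. It seeds a generator with the source partition index, as described in the thread, and then hands rows to target partitions in the order they are iterated.

```scala
import scala.util.Random

object RepartitionOrderSketch {
  // Hypothetical model of an order-dependent redistribution step:
  // seed a generator with the source partition index, pick a starting
  // offset, then assign rows round-robin in iteration order.
  // Rows are assumed to be distinct so they can serve as map keys.
  def assign(partitionIndex: Int, rows: Seq[String], numPartitions: Int): Map[String, Int] = {
    val start = new Random(partitionIndex).nextInt(numPartitions)
    rows.zipWithIndex.map { case (row, i) =>
      row -> ((start + 1 + i) % numPartitions)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 3
    // Same rows in the same source partition, but a recomputation
    // happens to yield them in a different order.
    val firstRun   = Seq("A", "B", "C", "D", "E")
    val recomputed = Seq("B", "A", "C", "D", "E")

    // Same seed, different row order: the row-to-partition mapping differs.
    println(assign(0, firstRun, numPartitions) == assign(0, recomputed, numPartitions)) // false

    // Sorting first fixes the iteration order, so the mapping is stable.
    println(assign(0, firstRun.sorted, numPartitions) == assign(0, recomputed.sorted, numPartitions)) // true
  }
}
```

The seed is identical in both runs, so the generator is not the source of the difference; the iteration order is. In this model, sorting before redistributing removes the dependence, which matches the suggestion made earlier in the thread.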
