Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

Sean Owen Wed, 08 Oct 2014 23:58:14 -0700

Yes, I think this is another operation that is not deterministic even for
the same RDD. If a partition is lost and recomputed, the order of rows
within it can be different. Sorting the RDD makes the ordering
deterministic.
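
To make that concrete, here is a rough sketch in plain Scala (no Spark
needed). It is only a model of the behaviour described in this thread, not
Spark's actual code: the object name and the distribute helper are made up
for illustration. It assumes rows are dealt out round-robin, starting from
a position drawn from a generator seeded with the partition index.

```scala
import scala.util.Random

// Hypothetical stand-in for the shuffle step discussed in this thread.
object RepartitionOrderSketch {

  // Assign the rows of one input partition to target partitions.
  // The start position is seeded by the partition index, so it is the same
  // on every run; which row lands where still depends on arrival order.
  def distribute[T](index: Int, rows: Seq[T], numPartitions: Int): Map[Int, Seq[T]] = {
    val start = new Random(index).nextInt(numPartitions)
    rows.zipWithIndex
      .groupBy { case (_, i) => (start + 1 + i) % numPartitions }
      .map { case (target, pairs) => target -> pairs.map(_._1) }
  }

  def main(args: Array[String]): Unit = {
    val firstRun   = Seq("A", "B", "C", "D", "E")
    val recomputed = Seq("C", "A", "E", "B", "D") // same rows, different order

    // Same seed, different arrival order: the target partitions differ.
    println(distribute(0, firstRun, 2) == distribute(0, recomputed, 2))

    // Sorting first fixes the arrival order, so the assignment matches.
    println(distribute(0, firstRun.sorted, 2) == distribute(0, recomputed.sorted, 2))
  }
}
```

The first comparison comes out false and the second true. The seed fixes
only the starting position, so the input order has to be pinned down (for
example by sorting) before the result is reproducible.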


On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung
<coded...@cs.stanford.edu> wrote:
> Let's say you have some rows in a dataset (say X partitions initially).
>
> A
> B
> C
> D
> E
> .
> .
> .
> .
>
>
> If you repartition to Y > X partitions, it seems that any of the following
> could be valid:
>
> partition 1    partition 2    ...
> A              B              ...
> C              E
> D              .              ...
> --------------------------
> C              E
> A              B
> D              .
> --------------------------
> D              B
> C              E
> A
>
> etc. etc.
>
> I.e., although each partition will have the same unordered set, the rows'
> orders will change from call to call.
>
> Now, because row ordering can change from call to call, if you do any
> operation that depends on the order of items you saw, then lineage is no
> longer deterministic. For example, it seems that the repartition call itself
> is a row-order dependent call, because it creates a random number generator
> with the partition index as the seed, and then calls nextInt as you go
> through the rows.
>
>
> On Wed, Oct 8, 2014 at 10:14 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>
>> IIRC - the random is seeded with the index, so it will always produce
>> the same result for the same index. Maybe I don't totally follow
>> though. Could you give a small example of how this might change the
>> RDD ordering in a way that you don't expect? In general repartition()
>> will not preserve the ordering of an RDD.
>>
>> On Wed, Oct 8, 2014 at 3:42 PM, Sung Hwan Chung
>> <coded...@cs.stanford.edu> wrote:
>> > I noticed that repartition will result in non-deterministic lineage
>> > because it'll result in changed orders for rows.
>> >
>> > So for instance, if you do things like:
>> >
>> > val data = read(...)
>> > val k = data.repartition(5)
>> > val h = k.repartition(5)
>> >
>> > It seems that this results in different ordering of rows for 'k' each
>> > time you call it.
>> > And because of this different ordering, 'h' will result in different
>> > partitions even, because 'repartition' distributes through a random
>> > number generator with the 'index' as the key.
>
>
