Re: Confused by groupByKey() and the default partitioner

2014-07-13 Thread Aaron Davidson
> From: Aaron Davidson > Date: Sat, 12 Jul 2014 16:32:22 -0700 > Subject: Re: Confused by groupByKey() and the default partitioner > Yes, groupByKey() does partition by the hash of the key unless you specify a custom Partitioner. (1) If ...

Re: Confused by groupByKey() and the default partitioner

2014-07-13 Thread Guanhua Yan
Subject: Re: Confused by groupByKey() and the default partitioner. Yes, groupByKey() does partition by the hash of the key unless you specify a custom Partitioner. (1) If you were to use groupByKey() when the data was already partitioned correctly, the data would indeed not be shuffled. Here ...
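To make the "custom Partitioner" part concrete, here is a minimal PySpark sketch (not from the thread itself). In the Python RDD API the closest equivalent is the partitionFunc argument to partitionBy; first_letter_partitioner below is a made-up example:

from pyspark import SparkContext

sc = SparkContext("local[2]", "custom-partitioner-demo")

pairs = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2)

# Default behaviour: partitionBy places each key by the hash of the key.
by_hash = pairs.partitionBy(2)

# A toy stand-in for a custom Partitioner: send keys starting with "a"
# to partition 0 and everything else to partition 1.
def first_letter_partitioner(key):
    return 0 if key.startswith("a") else 1

by_letter = pairs.partitionBy(2, first_letter_partitioner)

print(by_hash.glom().collect())    # elements grouped by hash partition
print(by_letter.glom().collect())  # elements grouped by the custom rule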

Re: Confused by groupByKey() and the default partitioner

2014-07-12 Thread Aaron Davidson
Yes, groupByKey() does partition by the hash of the key unless you specify a custom Partitioner. (1) If you were to use groupByKey() when the data was already partitioned correctly, the data would indeed not be shuffled. Here is the associated code; you'll see that it simply checks that the Partitioner ...
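A small sketch of point (1), assuming a local PySpark shell; the claim from the thread is that when the existing partitioner already matches the requested one, the groupByKey step does not need another shuffle:

from pyspark import SparkContext

sc = SparkContext("local[2]", "prepartitioned-groupByKey-demo")

pairs = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2)

# One explicit shuffle to hash-partition the data by key.
prepartitioned = pairs.partitionBy(2)

# groupByKey over the same number of partitions; per the thread, the
# internal check sees a matching Partitioner and skips a second shuffle.
grouped = prepartitioned.groupByKey(2)

print(grouped.mapValues(list).collect())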

Confused by groupByKey() and the default partitioner

2014-07-12 Thread Guanhua Yan
Hi: I have trouble understanding the default partitioner (hash) in Spark. Suppose that an RDD with two partitions is created as follows: x = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2) Does Spark partition x based on the hash of the key (e.g., "a", "b", "c") by default? (1) Assuming ...
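For anyone who wants to poke at this question directly, here is a rough sketch (not part of the original mail) that builds the same RDD and prints where each element lands before and after groupByKey; glom() exposes the per-partition contents:

from pyspark import SparkContext

sc = SparkContext("local[2]", "default-partitioner-question")

# The RDD from the question: two partitions of key-value pairs.
x = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2)

# parallelize splits the list by position, so no key-based partitioner yet.
print(x.partitioner)       # None
print(x.glom().collect())  # elements as they were split positionally

# groupByKey then places keys with the default hash partitioner.
grouped = x.groupByKey()
print(grouped.mapValues(list).glom().collect())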