Re: Shuffle on joining two RDDs

2015-02-16 Thread Davies Liu
This will be fixed by https://github.com/apache/spark/pull/4629

On Fri, Feb 13, 2015 at 10:43 AM, Imran Rashid wrote:
> Yeah, I thought the same thing at first too -- I suggested something
> equivalent w/ preservesPartitioning = true, but that isn't enough. The join
> is done by union-ing the two transformed rdds …

Re: Shuffle on joining two RDDs

2015-02-13 Thread Imran Rashid
Yeah, I thought the same thing at first too -- I suggested something equivalent w/ preservesPartitioning = true, but that isn't enough. The join is done by union-ing the two transformed rdds, which is very different from the way it works under the hood in Scala to enable narrow dependencies. It really …
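[For readers following along: below is a minimal Python sketch of the union-based pattern described above, modeled on the join.py code linked later in this thread. The function name and the dispatch helper are illustrative, and the Python 2 tuple-unpacking lambdas from the real source are rewritten so the sketch runs on Python 3.]

    # Sketch of the Python-side join: tag each side, union, group by key.
    # `rdd` and `other` are assumed to be pair RDDs of (key, value).
    def python_join_sketch(rdd, other, num_partitions=None):
        vs = rdd.map(lambda kv: (kv[0], (1, kv[1])))    # tag left values
        ws = other.map(lambda kv: (kv[0], (2, kv[1])))  # tag right values

        def dispatch(seq):
            seq = list(seq)  # materialize so we can scan it twice
            # Split the grouped (tag, value) pairs back into the two
            # sides and emit the per-key cross product, as join() must.
            left = [v for tag, v in seq if tag == 1]
            right = [v for tag, v in seq if tag == 2]
            return ((lv, rv) for lv in left for rv in right)

        # The union of the two tagged RDDs carries no single partitioner,
        # so groupByKey shuffles everything -- even when both inputs were
        # already co-partitioned. That is the narrow-dependency gap being
        # discussed here.
        return vs.union(ws).groupByKey(num_partitions).flatMapValues(dispatch)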

Re: Shuffle on joining two RDDs

2015-02-13 Thread Karlson
In https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38, wouldn't it help to change the lines

    vs = rdd.map(lambda (k, v): (k, (1, v)))
    ws = other.map(lambda (k, v): (k, (2, v)))

to

    vs = rdd.mapValues(lambda v: (1, v))
    ws = other.mapValues(lambda v: (2, v))

?
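[A quick way to see what mapValues buys here -- a minimal sketch, assuming an active SparkContext named sc: mapValues marks its output as preserving partitioning, while a plain map does not, and the difference shows up in the RDD's partitioner attribute.]

    pairs = sc.parallelize([(1, "a"), (2, "b"), (3, "c")]).partitionBy(4)

    kept = pairs.mapValues(lambda v: (1, v))             # partitioner preserved
    dropped = pairs.map(lambda kv: (kv[0], (1, kv[1])))  # partitioner lost

    print(kept.partitioner is not None)     # True
    print(dropped.partitioner is not None)  # False

[As Imran's reply above points out, though, this alone isn't enough: the union that follows in join.py still loses the partitioner.]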

Re: Shuffle on joining two RDDs

2015-02-13 Thread Karlson
Does that mean partitioning does not work in Python? Or does this only affect joining?

On 2015-02-12 19:27, Davies Liu wrote:
> The feature works as expected in Scala/Java, but is not implemented in
> Python.
>
> On Thu, Feb 12, 2015 at 9:24 AM, Imran Rashid wrote:
>> I wonder if the issue is that these lines just need to add
>> preservesPartitioning = true? …

Re: Shuffle on joining two RDDs

2015-02-12 Thread Davies Liu
The feature works as expected in Scala/Java, but is not implemented in Python.

On Thu, Feb 12, 2015 at 9:24 AM, Imran Rashid wrote:
> I wonder if the issue is that these lines just need to add
> preservesPartitioning = true?
> https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38 …

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
I wonder if the issue is that these lines just need to add preservesPartitioning = true?

https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38

I am getting the feeling this is an issue w/ pyspark.

On Thu, Feb 12, 2015 at 10:43 AM, Imran Rashid wrote:
> Ah, sorry, I am not too familiar w/ pyspark -- sorry I missed that part. …
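[For concreteness: PySpark's RDD.map does accept a preservesPartitioning flag, so the suggested change would look roughly like the sketch below -- an illustration of the idea, not an actual patch, and, as the replies above note, it turned out not to be sufficient on its own.]

    vs = rdd.map(lambda kv: (kv[0], (1, kv[1])), preservesPartitioning=True)
    ws = other.map(lambda kv: (kv[0], (2, kv[1])), preservesPartitioning=True)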

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
Ah, sorry, I am not too familiar w/ pyspark -- sorry I missed that part. It could be that pyspark doesn't properly support narrow dependencies, or maybe you need to be more explicit about the partitioner. I am looking into the pyspark api, but you might have better guesses here than I do.

Re: Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi, I believe that partitionBy will use the same (default) partitioner on both RDDs.

On 2015-02-12 17:12, Sean Owen wrote:
> Doesn't this require that both RDDs have the same partitioner?
>
> On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid wrote:
>> Hi Karlson,
>>
>> I think your assumptions are correct -- that join alone shouldn't
>> require any shuffling. …
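[In other words, calling partitionBy with the same number of partitions, and the default hash partition function, on both sides should co-partition them. A minimal sketch of that setup, with rddA, rddB and n as placeholders:]

    rddA = rddA.partitionBy(n)  # default partitionFunc (portable hash)
    rddB = rddB.partitionBy(n)  # same count, same function

    # Both sides now use the same partitioning scheme, so a given key
    # lands in the same partition index on each side -- the precondition
    # a narrow-dependency join needs.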

Re: Shuffle on joining two RDDs

2015-02-12 Thread Sean Owen
Doesn't this require that both RDDs have the same partitioner?

On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid wrote:
> Hi Karlson,
>
> I think your assumptions are correct -- that join alone shouldn't require
> any shuffling. But it's possible you are getting tripped up by lazy
> evaluation of RDDs. …

Re: Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi Imran,

thanks for your quick reply. Actually I am doing this:

    rddA = rddA.partitionBy(n).cache()
    rddB = rddB.partitionBy(n).cache()

followed by

    rddA.count()
    rddB.count()

then

    joinedRDD = rddA.join(rddB)

I thought that the count() would force the evaluation, so any subsequent …
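[One way to check where the shuffle actually happens -- a short sketch: toDebugString shows the RDD's lineage, and any shuffle the join still performs shows up there as an extra stage boundary.]

    joined = rddA.join(rddB)
    # toDebugString returns bytes in recent PySpark; decode for readability.
    print(joined.toDebugString().decode())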

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
Hi Karlson,

I think your assumptions are correct -- that join alone shouldn't require any shuffling. But it's possible you are getting tripped up by lazy evaluation of RDDs. After you do your partitionBy, are you sure those RDDs are actually materialized & cached somewhere? E.g., if you just did …
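[To make that concrete, here is a minimal sketch of the materialize-then-join sequence being described, with rawA, rawB and n as placeholders; Karlson's follow-up above shows essentially the same steps.]

    rddA = rawA.partitionBy(n).cache()
    rddB = rawB.partitionBy(n).cache()

    # Force evaluation so the shuffled, partitioned data really is cached.
    # Without this, the partitionBy shuffle runs lazily inside the first
    # job that needs it -- which can make the join look like the shuffle.
    rddA.count()
    rddB.count()

    joined = rddA.join(rddB)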