Re: Shuffle on joining two RDDs

2015-02-16 Thread Davies Liu
This will be fixed by https://github.com/apache/spark/pull/4629

On Fri, Feb 13, 2015 at 10:43 AM, Imran Rashid wrote:
> Yeah, I thought the same thing at first too -- I suggested something
> equivalent w/ preservesPartitioning = true, but that isn't enough. The join
> is done by union-ing the two transformed rdds …

Re: Shuffle on joining two RDDs

2015-02-13 Thread Imran Rashid
Yeah, I thought the same thing at first too -- I suggested something equivalent w/ preservesPartitioning = true, but that isn't enough. The join is done by union-ing the two transformed rdds, which is very different from the way it works under the hood in Scala to enable narrow dependencies. It really …
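[For readers following along: below is a minimal Python sketch of the union-based pattern described above, modeled on the join.py code linked later in this thread. The function name and the dispatch helper are illustrative, and the Python 2 tuple-unpacking lambdas from the real source are rewritten so the sketch runs on Python 3.]

    # Sketch of the Python-side join: tag each side, union, group by key.
    # `rdd` and `other` are assumed to be pair RDDs of (key, value).
    def python_join_sketch(rdd, other, num_partitions=None):
        vs = rdd.map(lambda kv: (kv[0], (1, kv[1])))    # tag left values
        ws = other.map(lambda kv: (kv[0], (2, kv[1])))  # tag right values

        def dispatch(seq):
            seq = list(seq)  # materialize so we can scan it twice
            # Split the grouped (tag, value) pairs back into the two
            # sides and emit the per-key cross product, as join() must.
            left = [v for tag, v in seq if tag == 1]
            right = [v for tag, v in seq if tag == 2]
            return ((lv, rv) for lv in left for rv in right)

        # The union of the two tagged RDDs carries no single partitioner,
        # so groupByKey shuffles everything -- even when both inputs were
        # already co-partitioned. That is the narrow-dependency gap being
        # discussed here.
        return vs.union(ws).groupByKey(num_partitions).flatMapValues(dispatch)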

Re: Shuffle on joining two RDDs

2015-02-13 Thread Karlson
In https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38, wouldn't it help to change the lines

    vs = rdd.map(lambda (k, v): (k, (1, v)))
    ws = other.map(lambda (k, v): (k, (2, v)))

to

    vs = rdd.mapValues(lambda v: (1, v))
    ws = other.mapValues(lambda v: (2, v))

?
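[A quick way to see what mapValues buys here -- a minimal sketch, assuming an active SparkContext named sc: mapValues marks its output as preserving partitioning, while a plain map does not, and the difference shows up in the RDD's partitioner attribute.]

    pairs = sc.parallelize([(1, "a"), (2, "b"), (3, "c")]).partitionBy(4)

    kept = pairs.mapValues(lambda v: (1, v))             # partitioner preserved
    dropped = pairs.map(lambda kv: (kv[0], (1, kv[1])))  # partitioner lost

    print(kept.partitioner is not None)     # True
    print(dropped.partitioner is not None)  # False

[As Imran's reply above points out, though, this alone isn't enough: the union that follows in join.py still loses the partitioner.]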

Re: Shuffle on joining two RDDs

2015-02-13 Thread Karlson
Does that mean partitioning does not work in Python? Or does this only affect joining?

On 2015-02-12 19:27, Davies Liu wrote:
> The feature works as expected in Scala/Java, but is not implemented in
> Python.
>
> On Thu, Feb 12, 2015 at 9:24 AM, Imran Rashid wrote:
>> I wonder if the issue is that these lines just need to add
>> preservesPartitioning = true? …

Re: Shuffle on joining two RDDs

2015-02-12 Thread Davies Liu
The feature works as expected in Scala/Java, but is not implemented in Python.

On Thu, Feb 12, 2015 at 9:24 AM, Imran Rashid wrote:
> I wonder if the issue is that these lines just need to add
> preservesPartitioning = true?
> https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38 …

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
I wonder if the issue is that these lines just need to add preservesPartitioning = true?

https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38

I am getting the feeling this is an issue w/ pyspark.

On Thu, Feb 12, 2015 at 10:43 AM, Imran Rashid wrote:
> Ah, sorry, I am not too familiar w/ pyspark -- sorry I missed that part. …
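[For concreteness: PySpark's RDD.map does accept a preservesPartitioning flag, so the suggested change would look roughly like the sketch below -- an illustration of the idea, not an actual patch, and, as the replies above note, it turned out not to be sufficient on its own.]

    vs = rdd.map(lambda kv: (kv[0], (1, kv[1])), preservesPartitioning=True)
    ws = other.map(lambda kv: (kv[0], (2, kv[1])), preservesPartitioning=True)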

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
Ah, sorry, I am not too familiar w/ pyspark -- sorry I missed that part. It could be that pyspark doesn't properly support narrow dependencies, or maybe you need to be more explicit about the partitioner. I am looking into the pyspark api, but you might have better guesses here than I do.

Re: Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi, I believe that partitionBy will use the same (default) partitioner on both RDDs.

On 2015-02-12 17:12, Sean Owen wrote:
> Doesn't this require that both RDDs have the same partitioner?
>
> On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid wrote:
>> Hi Karlson,
>>
>> I think your assumptions are correct -- that join alone shouldn't
>> require any shuffling. …
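[In other words, calling partitionBy with the same number of partitions, and the default hash partition function, on both sides should co-partition them. A minimal sketch of that setup, with rddA, rddB and n as placeholders:]

    rddA = rddA.partitionBy(n)  # default partitionFunc (portable hash)
    rddB = rddB.partitionBy(n)  # same count, same function

    # Both sides now use the same partitioning scheme, so a given key
    # lands in the same partition index on each side -- the precondition
    # a narrow-dependency join needs.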

Re: Shuffle on joining two RDDs

2015-02-12 Thread Sean Owen
Doesn't this require that both RDDs have the same partitioner?

On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid wrote:
> Hi Karlson,
>
> I think your assumptions are correct -- that join alone shouldn't require
> any shuffling. But it's possible you are getting tripped up by lazy
> evaluation of RDDs. …

Re: Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi Imran,

thanks for your quick reply. Actually I am doing this:

    rddA = rddA.partitionBy(n).cache()
    rddB = rddB.partitionBy(n).cache()

followed by

    rddA.count()
    rddB.count()

then

    joinedRDD = rddA.join(rddB)

I thought that the count() would force the evaluation, so any subsequent …
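[One way to check where the shuffle actually happens -- a short sketch: toDebugString shows the RDD's lineage, and any shuffle the join still performs shows up there as an extra stage boundary.]

    joined = rddA.join(rddB)
    # toDebugString returns bytes in recent PySpark; decode for readability.
    print(joined.toDebugString().decode())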

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
Hi Karlson,

I think your assumptions are correct -- that join alone shouldn't require any shuffling. But it's possible you are getting tripped up by lazy evaluation of RDDs. After you do your partitionBy, are you sure those RDDs are actually materialized & cached somewhere? E.g., if you just did …
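[To make that concrete, here is a minimal sketch of the materialize-then-join sequence being described, with rawA, rawB and n as placeholders; Karlson's follow-up above shows essentially the same steps.]

    rddA = rawA.partitionBy(n).cache()
    rddB = rawB.partitionBy(n).cache()

    # Force evaluation so the shuffled, partitioned data really is cached.
    # Without this, the partitionBy shuffle runs lazily inside the first
    # job that needs it -- which can make the join look like the shuffle.
    rddA.count()
    rddB.count()

    joined = rddA.join(rddB)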