This will be fixed by https://github.com/apache/spark/pull/4629
On Fri, Feb 13, 2015 at 10:43 AM, Imran Rashid wrote:
> yeah I thought the same thing at first too, I suggested something equivalent
> w/ preservesPartitioning = true, but that isn't enough. the join is done by
> union-ing the two transformed rdds, which is very different from the way it
> works under the hood in scala to enable narrow dependencies.
Yeah, I thought the same thing at first too -- I suggested something
equivalent w/ preservesPartitioning = true, but that isn't enough. The
join is done by union-ing the two transformed RDDs, which is very different
from the way it works under the hood in Scala to enable narrow
dependencies.
In
https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38,
wouldn't it help to change the lines
vs = rdd.map(lambda (k, v): (k, (1, v)))
ws = other.map(lambda (k, v): (k, (2, v)))
to
vs = rdd.mapValues(lambda v: (1, v))
ws = other.mapValues(lambda v: (2, v))
?
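For what it's worth, mapValues is guaranteed not to touch the keys, so Spark is allowed to keep the existing partitioner, while a plain map() could rewrite keys and therefore drops it. A minimal check, assuming a local PySpark install and a version that actually propagates the partitioner in Python (which, as this thread notes, was only being fixed at the time); all names here are illustrative:

from pyspark import SparkContext

sc = SparkContext("local[2]", "mapvalues-partitioner-check")
pairs = sc.parallelize([(1, "a"), (2, "b"), (3, "c")]).partitionBy(4)

kept = pairs.mapValues(lambda v: (1, v))             # keys untouched -> partitioner can be kept
dropped = pairs.map(lambda kv: (kv[0], (1, kv[1])))  # keys may change -> partitioner is dropped

print(pairs.partitioner is not None)    # True
print(kept.partitioner is not None)     # True on versions that propagate it
print(dropped.partitioner)              # None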
Does that mean partitioning does not work in Python? Or does this only
affect joining?
On 2015-02-12 19:27, Davies Liu wrote:
The feature works as expected in Scala/Java, but not implemented in
Python.
On Thu, Feb 12, 2015 at 9:24 AM, Imran Rashid wrote:
I wonder if the issue is that these lines just need to add
preservesPartitioning = true?
The feature works as expected in Scala/Java, but not implemented in Python.
On Thu, Feb 12, 2015 at 9:24 AM, Imran Rashid wrote:
> I wonder if the issue is that these lines just need to add
> preservesPartitioning = true
> ?
>
> https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38
I wonder if the issue is that these lines just need to add
preservesPartitioning = true
?
https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38
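For illustration, the suggestion applied to the two tagging lines might look like this (rdd and other stand for the pair RDDs being joined; this is a sketch of the idea, not the fix that was eventually merged):

vs = rdd.map(lambda kv: (kv[0], (1, kv[1])), preservesPartitioning=True)
ws = other.map(lambda kv: (kv[0], (2, kv[1])), preservesPartitioning=True)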
I am getting the feeling this is an issue w/ pyspark
On Thu, Feb 12, 2015 at 10:43 AM, Imran Rashid wrote:
> ah, sorry, I am not too familiar w/ pyspark and I missed that part.
Ah, sorry, I am not too familiar w/ pyspark and I missed that part. It
could be that pyspark doesn't properly support narrow dependencies, or
maybe you need to be more explicit about the partitioner. I am looking
into the pyspark api, but you might have some better guesses here than I
do.
Hi,
I believe that partitionBy will use the same (default) partitioner on
both RDDs.
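If the default is in doubt, the partition function can also be passed explicitly. A sketch reusing the rddA/rddB/n names from later in the thread, and assuming portable_hash can be imported from pyspark.rdd, where PySpark keeps its default partition function:

from pyspark.rdd import portable_hash

# partitionBy defaults to the same hash-based partition function on both
# sides, so two RDDs partitioned into the same number of partitions should
# end up co-partitioned; passing it explicitly just makes that visible.
rddA = rddA.partitionBy(n, partitionFunc=portable_hash).cache()
rddB = rddB.partitionBy(n, partitionFunc=portable_hash).cache()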
On 2015-02-12 17:12, Sean Owen wrote:
Doesn't this require that both RDDs have the same partitioner?
On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid wrote:
Hi Karlson,
I think your assumptions are correct -- that join alone shouldn't require
any shuffling.
Doesn't this require that both RDDs have the same partitioner?
On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid wrote:
> Hi Karlson,
>
> I think your assumptions are correct -- that join alone shouldn't require
> any shuffling. But it's possible you are getting tripped up by lazy
> evaluation of RDDs.
Hi Imran,
thanks for your quick reply.
Actually I am doing this:
rddA = rddA.partitionBy(n).cache()
rddB = rddB.partitionBy(n).cache()
followed by
rddA.count()
rddB.count()
then joinedRDD = rddA.join(rddB)
I thought that the count() would force the evaluation, so any subsequent
operations (like the join) would use the already partitioned, cached RDDs.
Hi Karlson,
I think your assumptions are correct -- that join alone shouldn't require
any shuffling. But it's possible you are getting tripped up by lazy
evaluation of RDDs. After you do your partitionBy, are you sure those RDDs
are actually materialized & cached somewhere?
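One way to check both points (that the RDDs were materialized and whether the join added a shuffle) is to force evaluation and then look at the lineage. A minimal sketch reusing the rddA/rddB names from the previous message; the exact debug output varies by Spark version:

# Force materialization of the partitioned RDDs, then confirm they are cached.
rddA.count()
rddB.count()
print(rddA.is_cached, rddB.is_cached)   # both should be True after the counts

# Inspect the join's lineage; an unexpected extra shuffle stage shows up here.
joinedRDD = rddA.join(rddB)
print(joinedRDD.toDebugString())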