Doesn't this require that both RDDs have the same partitioner?
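For anyone following along: "same partitioner" here boils down to the same hash function plus the same partition count. A rough sketch in plain Python (no Spark needed; this only approximates PySpark's default portable_hash-based partitioner with the builtin hash):

```python
# Plain-Python approximation of PySpark's default partitioner, which
# routes a key to partition portable_hash(key) % numPartitions.
# (portable_hash is approximated here by the builtin hash.)
def partition_of(key, num_partitions):
    return hash(key) % num_partitions

N = 8
# Same hash function and same partition count: equal keys land in the
# same partition index in both RDDs, so the join can stay narrow.
for key in [0, 1, 42, 1000]:
    assert partition_of(key, N) == partition_of(key, N)

# A different partition count is effectively a different partitioner,
# so most keys would move between partitions:
print(partition_of(42, 8), partition_of(42, 16))  # prints: 2 10
```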

On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid <iras...@cloudera.com> wrote:
> Hi Karlson,
>
> I think your assumptions are correct -- the join alone shouldn't require
> any shuffling.  But it's possible you are getting tripped up by lazy
> evaluation of RDDs.  After you do your partitionBy, are you sure those RDDs
> are actually materialized & cached somewhere?  E.g., if you just did this:
>
> val rddA = someData.partitionBy(new HashPartitioner(N))
> val rddB = someOtherData.partitionBy(new HashPartitioner(N))
> val joinedRdd = rddA.join(rddB)
> joinedRdd.count() // or any other action
>
> then the partitioning doesn't actually run until you do the join.  So
> although the join itself can happen without further shuffling,
> joinedRdd.count() will trigger the evaluation of rddA & rddB, which
> requires shuffles.  Note that even if some intervening action on rddA &
> rddB has already shuffled them, unless you persist the result you will
> need to reshuffle them for the join.
>
> If this doesn't help explain things, then for debugging you can try:
>
> joinedRdd.getPartitions.foreach(println)
>
> This is getting into the weeds, but at least it will tell us whether or
> not you are getting narrow dependencies, which would avoid the shuffle.
> (Does anyone know of a simpler way to check this?)
>
> hope this helps,
> Imran
>
> On Thu, Feb 12, 2015 at 9:25 AM, Karlson <ksonsp...@siberie.de> wrote:
>>
>> Hi All,
>>
>> Using PySpark, I create two RDDs (one with about 2M records (~200MB), the
>> other with about 8M records (~2GB)) of the format (key, value).
>>
>> I've done a partitionBy(num_partitions) on both RDDs and verified that
>> both RDDs have the same number of partitions and that equal keys reside on
>> the same partition (via mapPartitionsWithIndex).
>>
>> Now I'd expect that a join on the two RDDs requires no shuffling.
>> However, looking at the Web UI under http://driver:4040 reveals that this
>> assumption is false.
>>
>> In fact, I am seeing shuffle writes of about 200MB and reads of about 50MB.
>>
>> What's the explanation for that behaviour? Where am I wrong with my
>> assumption?
>>
>> Thanks in advance,
>>
>> Karlson
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

