Hi All,

Using PySpark, I create two RDDs of (key, value) pairs: one with about 2M records (~200MB), the other with about 8M records (~2GB).
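Roughly, the setup looks like this (a simplified sketch; the file paths and the parsing are placeholders for my actual code):

    from pyspark import SparkContext

    sc = SparkContext()

    # Placeholder inputs; the real records are (key, value) pairs.
    rdd_small = sc.textFile("small_input.txt") \
                  .map(lambda line: (line.split(",")[0], line))  # ~2M records, ~200MB
    rdd_large = sc.textFile("large_input.txt") \
                  .map(lambda line: (line.split(",")[0], line))  # ~8M records, ~2GB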

I've called partitionBy(num_partitions) on both RDDs and verified (via mapPartitionsWithIndex) that both RDDs have the same number of partitions and that equal keys reside in the same partition.
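Continuing the sketch above, the partitioning and the check look roughly like this (num_partitions is a placeholder value, and in practice I ran the key check on a sample rather than collecting everything):

    num_partitions = 20  # placeholder

    rdd_small = rdd_small.partitionBy(num_partitions)
    rdd_large = rdd_large.partitionBy(num_partitions)

    assert rdd_small.getNumPartitions() == rdd_large.getNumPartitions()

    # Record the partition index each key lands in, then compare the two RDDs.
    def tag_with_partition(index, iterator):
        for key, _ in iterator:
            yield (key, index)

    small_placement = dict(rdd_small.mapPartitionsWithIndex(tag_with_partition).collect())
    large_placement = dict(rdd_large.mapPartitionsWithIndex(tag_with_partition).collect())

    # For every key present in both RDDs, the partition indices match.
    assert all(small_placement[k] == large_placement[k]
               for k in small_placement if k in large_placement)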

Now I'd expect that a join of the two RDDs requires no shuffling. The Web UI at http://driver:4040, however, shows otherwise.
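The join itself is just:

    joined = rdd_small.join(rdd_large)
    joined.count()  # any action; the resulting stage is where the UI reports the shuffle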

In fact, I am seeing shuffle writes of about 200MB and shuffle reads of about 50MB.

What explains this behaviour? Where does my assumption go wrong?

Thanks in advance,

Karlson
