Hi All,

Using PySpark, I create two RDDs of (key, value) pairs: one with about 2M records (~200MB), the other with about 8M records (~2GB).
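Roughly, the setup looks like this (a simplified sketch; the file paths and the parsing are placeholders for my actual code):

    from pyspark import SparkContext

    sc = SparkContext()

    # Placeholder inputs; the real records are (key, value) pairs.
    rdd_small = sc.textFile("small_input.txt") \
                  .map(lambda line: (line.split(",")[0], line))  # ~2M records, ~200MB
    rdd_large = sc.textFile("large_input.txt") \
                  .map(lambda line: (line.split(",")[0], line))  # ~8M records, ~2GB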

I've called partitionBy(num_partitions) on both RDDs and verified (via mapPartitionsWithIndex) that both RDDs have the same number of partitions and that equal keys reside in the same partition.
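Continuing the sketch above, the partitioning and the check look roughly like this (num_partitions is a placeholder value, and in practice I ran the key check on a sample rather than collecting everything):

    num_partitions = 20  # placeholder

    rdd_small = rdd_small.partitionBy(num_partitions)
    rdd_large = rdd_large.partitionBy(num_partitions)

    assert rdd_small.getNumPartitions() == rdd_large.getNumPartitions()

    # Record the partition index each key lands in, then compare the two RDDs.
    def tag_with_partition(index, iterator):
        for key, _ in iterator:
            yield (key, index)

    small_placement = dict(rdd_small.mapPartitionsWithIndex(tag_with_partition).collect())
    large_placement = dict(rdd_large.mapPartitionsWithIndex(tag_with_partition).collect())

    # For every key present in both RDDs, the partition indices match.
    assert all(small_placement[k] == large_placement[k]
               for k in small_placement if k in large_placement)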

Now I'd expect that a join of the two RDDs requires no shuffling. The Web UI at http://driver:4040, however, shows otherwise.
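The join itself is just:

    joined = rdd_small.join(rdd_large)
    joined.count()  # any action; the resulting stage is where the UI reports the shuffle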

In fact, I am seeing shuffle writes of about 200MB and shuffle reads of about 50MB.

What explains this behaviour? Where does my assumption go wrong?

Thanks in advance,

Karlson
