Hi All,
Using PySpark, I create two RDDs of (key, value) pairs, one with about
2M records (~200MB), the other with about 8M records (~2GB).
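Roughly, this is what I'm doing (simplified; paths and the tab-separated
layout are just placeholders, sc is the SparkContext as in the pyspark
shell):

    # ~2M records, ~200MB
    rdd_a = sc.textFile("hdfs:///data/a").map(
        lambda line: tuple(line.split("\t", 1)))
    # ~8M records, ~2GB
    rdd_b = sc.textFile("hdfs:///data/b").map(
        lambda line: tuple(line.split("\t", 1)))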
I've done a partitionBy(num_partitions) on both RDDs and verified (via
mapPartitionsWithIndex) that both RDDs have the same number of
partitions and that equal keys reside in the same partition.
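Concretely, something like this (num_partitions is a placeholder; the
check collects every key to the driver, which is fine for a one-off
test at this data size):

    num_partitions = 100  # placeholder
    part_a = rdd_a.partitionBy(num_partitions).cache()
    part_b = rdd_b.partitionBy(num_partitions).cache()

    assert part_a.getNumPartitions() == part_b.getNumPartitions()

    # map every key to the index of the partition it lives in
    def key_to_index(index, iterator):
        return ((k, index) for k, _ in iterator)

    idx_a = dict(part_a.mapPartitionsWithIndex(key_to_index).collect())
    idx_b = dict(part_b.mapPartitionsWithIndex(key_to_index).collect())

    # equal keys must sit at the same partition index on both sides
    assert all(idx_a[k] == idx_b[k] for k in set(idx_a) & set(idx_b))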
Now I'd expect that a join on the two RDDs requires no shuffling. The
Web UI at http://driver:4040, however, shows that this assumption is
wrong: in fact, I'm seeing shuffle writes of about 200MB and shuffle
reads of about 50MB.
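The join itself is nothing special; the count() is only there to force
evaluation, and I can also compare the Python-side partitioners of the
two RDDs directly:

    joined = part_a.join(part_b)
    joined.count()  # force evaluation, then inspect the stages at :4040

    # both sides carry a partitioner after partitionBy; compare them
    print(part_a.partitioner == part_b.partitioner)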
What explains this behaviour? Where is my assumption wrong?
Thanks in advance,
Karlson