I have two RDDs I want to join.  We will call them RDD A and RDD B.  RDD A
has 1 billion rows; RDD B has 100k rows.  I want to join them on a single key.

95% of the rows in RDD A share a single join key with RDD B.  Before I can
join the two RDDs, I must map each of them to tuples where the first element
is the key and the second is the value.
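
For concreteness, the setup looks roughly like this (a sketch only; rddA,
rddB, and the extractKey* functions are hypothetical stand-ins for however
the key is pulled out of each record):

    // Key both RDDs so they can be joined; extractKeyA/extractKeyB are
    // hypothetical; substitute whatever extracts your join key.
    val pairsA = rddA.map(row => (extractKeyA(row), row))  // ~1 billion rows
    val pairsB = rddB.map(row => (extractKeyB(row), row))  // ~100k rows
    val joined = pairsA.join(pairsB)  // shuffles both sides by key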

Since 95% of the rows in RDD A have the same key, they all land in the same
partition after the shuffle.  When I perform the join, that one partition is
processed by a single task, which tries to load far too much data into
memory at once and dies a horrible death.
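
The skew is easy to confirm by counting records per post-shuffle partition
(a quick diagnostic sketch, reusing pairsA from above; 200 is an arbitrary
partition count):

    import org.apache.spark.HashPartitioner

    // Partition the keyed RDD the same way join() would, then count how
    // many records land in each partition; the hot key's partition will
    // dwarf all the others.
    val perPartition = pairsA
      .partitionBy(new HashPartitioner(200))
      .mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
      .collect()
      .sortBy(p => -p._2)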

I know that this is caused by the HashPartitioner that Spark uses by
default: everything with the same key hashes to the same partition.  I also
tried the RangePartitioner but still saw 95% of the data go into the same
partition, which makes sense in hindsight: RangePartitioner also keeps equal
keys together; it only changes how key *ranges* are assigned to partitions.
What I'd really like is a partitioner that puts everything with the same key
into the same partition *except* when the partition grows over a certain
size, at which point it would spill into the next partition.
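
Part of why that seems hard: a custom partitioner only ever sees the key.
A skeleton of the org.apache.spark.Partitioner contract (sketch only,
nothing implemented) shows the problem:

    import org.apache.spark.Partitioner

    class SpillingPartitioner(override val numPartitions: Int) extends Partitioner {
      // The desired behavior ("same key, same partition, unless that
      // partition is over a size cap") cannot be expressed here:
      // getPartition sees only the key, never partition sizes, and the
      // mapping must be deterministic so both sides of the join agree.
      def getPartition(key: Any): Int = {
        val h = key.hashCode % numPartitions
        if (h < 0) h + numPartitions else h  // same math as HashPartitioner
      }
    }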

Writing my own partitioner is a big job and requires a lot of testing to
make sure I get it right.  Is there a simpler way to solve this?


