Well, for what it's worth, I found the issue after spending the whole night running experiments ;)
Basically, I needed to pass a higher number of partitions to groupByKey. I was simply using the default, which produced only 4 partitions, so each task had to hold far too much grouped data in memory and the whole thing blew up.
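
For anyone who hits the same thing, here's a minimal sketch of the idea. The RDD contents and the partition count of 200 are just illustrative, not my actual job:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD implicits on older Spark

    object GroupByKeyPartitions {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("groupByKey-partitions"))

        // Toy pair RDD; the real data was of course much larger.
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

        // Before: the no-argument groupByKey() falls back to the default
        // parallelism, so with only 4 partitions each one holds a huge
        // chunk of the grouped data.
        val grouped = pairs.groupByKey()

        // After: an explicit partition count spreads the shuffle output
        // over more, smaller partitions, cutting per-task memory.
        val groupedFine = pairs.groupByKey(200)

        println(grouped.partitions.length)      // e.g. 4 on a small cluster
        println(groupedFine.partitions.length)  // 200

        sc.stop()
      }
    }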