When you do groupBy(), it wants to load all of the data into memory for best
performance, so you should specify the number of partitions carefully.
In Spark master, or the upcoming 1.1 release, PySpark can do an external groupBy(),
which means it will dump the data to disk if there is not enough memory.
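For example, something along these lines (just a rough sketch with made-up data;
the 200 is only an illustrative number, tune it to your dataset and cluster):

    from pyspark import SparkContext

    sc = SparkContext(appName="groupByKeySketch")

    # Hypothetical (key, value) pairs standing in for the real dataset.
    pairs = sc.parallelize([(i % 100, i) for i in range(1000000)])

    # Passing numPartitions explicitly keeps each partition small enough
    # to fit in memory, instead of relying on the (often tiny) default.
    grouped = pairs.groupByKey(numPartitions=200)

    print(grouped.mapValues(lambda vs: sum(vs)).take(5))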
Well, for what it's worth, I found the issue after spending the whole night
running experiments ;).
Basically, I needed to give a higher number of partitions to groupByKey.
I was simply using the default, which generated only 4 partitions and so the
whole thing blew up.
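In case it helps anyone else hitting this, here is a quick sketch of how you can
check what you are actually getting (the RDD here is made up, and getNumPartitions()
may require a recent PySpark version):

    from pyspark import SparkContext

    sc = SparkContext(appName="checkPartitions")

    pairs = sc.parallelize([(i % 100, i) for i in range(1000000)])

    # Default parallelism (what groupByKey falls back to when numPartitions
    # is not given) typically tracks the number of available cores.
    print(sc.defaultParallelism)

    # How many partitions this RDD actually has.
    print(pairs.getNumPartitions())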