Re: Last step of processing is using too much memory.

2014-07-29 Thread Davies Liu
When you do groupBy(), it tries to load all the data into memory for best performance, so you should specify the number of partitions carefully. In the Spark master branch or the upcoming 1.1 release, PySpark can do an external groupBy(), which means it will spill the data to disk if there is not enough memory.
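
As a minimal sketch of what "specify the number of partitions" means in PySpark (the app name and data are illustrative, not from this thread):

    from pyspark import SparkContext

    sc = SparkContext(appName="groupByKeyPartitions")

    # Hypothetical key-value RDD; substitute your real data source.
    pairs = sc.parallelize([(i % 100, i) for i in range(1000000)])

    # With the default partitioning, each partition may hold too much
    # data to group in memory; an explicit numPartitions spreads the
    # groups across more, smaller partitions.
    grouped = pairs.groupByKey(numPartitions=200)

    # Materialize a small sample to confirm the job runs.
    print(grouped.mapValues(list).take(2))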

Re: Last step of processing is using too much memory.

2014-07-18 Thread Roch Denis
Well, for what it's worth, I found the issue after spending the whole night running experiments ;). Basically, I needed to give a higher number of partitions to groupByKey. I was simply using the default, which generated only 4 partitions, and so the whole thing blew up.
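
For anyone hitting the same thing: you can check how many partitions you are actually getting before grouping. A minimal sketch, assuming a local SparkContext (getNumPartitions() is available on the RDD API in newer PySpark releases; names here are illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="checkPartitions")
    rdd = sc.parallelize(range(1000000)).map(lambda i: (i % 10, i))

    # On a local master the default can be as few partitions as you
    # have cores (e.g. 4), which is what blew up in this case.
    print(rdd.getNumPartitions())

    # Grouping with an explicit, larger count keeps each partition small.
    grouped = rdd.groupByKey(numPartitions=64)
    print(grouped.getNumPartitions())  # 64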