If you are on the 1.0.0 release, you can also try converting your RDD to a SchemaRDD and running the groupBy there. The Spark SQL optimizer may yield better results; it's worth a try at least.
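A minimal sketch of that idea against the Spark 1.0 API, assuming a case class `Event` with a `day` field and a `SQLContext` named `sqlContext` (both hypothetical names for illustration):

```scala
import org.apache.spark.sql.SQLContext

case class Event(day: Int, value: String)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion in 1.0

// Assume `events` is an RDD[Event] built elsewhere
val events = sc.parallelize(Seq(Event(1, "a"), Event(1, "b"), Event(2, "c")))

// Register the SchemaRDD as a table, then let Spark SQL plan the grouping
events.registerAsTable("events")
val grouped = sqlContext.sql("SELECT day, COUNT(*) FROM events GROUP BY day")
```

The potential win is that the Spark SQL planner can choose its own aggregation strategy rather than materializing full groups the way `groupByKey` does.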
On Fri, Jul 11, 2014 at 5:24 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:

>> Solution 2 is to map the objects into a pair RDD where the key is the
>> number of the day in the interval, then group by key, collect, and
>> parallelize the resulting grouped data. However, I worry collecting
>> large data sets is going to be a serious performance bottleneck.
>
> Why do you have to do a "collect"? You can do a groupBy and then write
> the grouped data to disk again.
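The collect-free variant suggested above can be sketched as follows; `dayOfInterval` is a hypothetical function that maps a record to its day number, and the output path is a placeholder:

```scala
// Assume `records` is an RDD of some record type T
val byDay = records
  .map(r => (dayOfInterval(r), r)) // key each record by its day in the interval
  .groupByKey()                    // group on the cluster, one entry per day

// Write the grouped data straight back to distributed storage instead of
// collecting it to the driver, so no single machine has to hold it all.
byDay.saveAsTextFile("hdfs:///path/to/grouped-output")
```

Skipping the `collect`/`parallelize` round trip keeps the data distributed end to end, avoiding both driver memory pressure and the network cost of shipping every group through one node.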