If you are on the 1.0.0 release you can also try converting your RDD to a
SchemaRDD and running the groupBy there. The Spark SQL optimizer may yield
better results; it's worth a try at least.
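
Roughly something like this (just a sketch against the 1.0.0 API; the Event
case class, column names, and sample data here are made up for illustration):

    import org.apache.spark.sql.SQLContext

    case class Event(day: Int, value: Double)   // placeholder schema

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD           // implicit RDD[Product] -> SchemaRDD

    val events = sc.parallelize(Seq(Event(1, 2.0), Event(1, 3.5), Event(2, 1.0)))
    events.registerAsTable("events")

    // Let the Catalyst optimizer plan the aggregation:
    val perDay = sqlContext.sql("SELECT day, SUM(value) FROM events GROUP BY day")
    perDay.collect().foreach(println)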


On Fri, Jul 11, 2014 at 5:24 PM, Soumya Simanta <soumya.sima...@gmail.com>
wrote:

>
>>
>> Solution 2 is to map the objects into a pair RDD where the
>> key is the number of the day in the interval, then group by
>> key, collect, and parallelize the resulting grouped data.
>> However, I worry collecting large data sets is going to be
>> a serious performance bottleneck.
>>
>>
> Why do you have to do a "collect"? You can do a groupBy and then write
> the grouped data back to disk.
>
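
On the collect question above: something along these lines should keep the
grouped data on the cluster instead of pulling it all to the driver (a
sketch; the Record type, sample data, and output path are hypothetical):

    import org.apache.spark.SparkContext._   // pair-RDD operations in 1.0.x

    // Hypothetical record type and data, just for illustration.
    case class Record(day: Int, payload: String)
    val records = sc.parallelize(Seq(Record(1, "a"), Record(1, "b"), Record(2, "c")))

    // Key each record by its day in the interval, then group on the cluster.
    val grouped = records.map(r => (r.day, r)).groupByKey()

    // Write the grouped data straight back out instead of collect()-ing it:
    grouped.saveAsObjectFile("hdfs:///tmp/grouped-by-day")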
