>
> Solution 2 is to map the objects into a pair RDD where the
> key is the number of the day in the interval, then group by
> key, collect, and parallelize the resulting grouped data.
> However, I worry collecting large data sets is going to be
> a serious performance bottleneck.
>
>
Why do you have to do a "collect"? You can do a groupBy and then write
the grouped data back to disk.
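
A rough sketch of what I mean, in PySpark. The record layout (tuples of
a date and a payload) and the names day_index, write_grouped_by_day,
interval_start and output_path are my own assumptions for illustration,
not anything from your code. The point is only that the grouping and the
write both run on the executors, so nothing is collected to the driver.

```python
from datetime import date


def day_index(event_date, interval_start):
    """Number of whole days from interval_start to event_date."""
    return (event_date - interval_start).days


def write_grouped_by_day(events, interval_start, output_path):
    """Group records by day index and save the groups without collecting.

    Assumes `events` is an RDD of (datetime.date, payload) tuples
    (a hypothetical layout chosen for this sketch).
    """
    (events
        # Key each record by its day number within the interval.
        .map(lambda rec: (day_index(rec[0], interval_start), rec[1]))
        # Shuffle so that all records for one day end up together.
        .groupByKey()
        # Turn the grouped iterable into a plain list so it prints readably.
        .mapValues(list)
        # Executors write the partitions directly; no driver-side collect.
        .saveAsTextFile(output_path))
```

You would call it with an RDD you already have, for example
write_grouped_by_day(events, date(2014, 1, 1), "hdfs:///some/output/dir"),
where the start date and the path are placeholders. If a whole day's
records do not fit comfortably in one executor's memory, groupByKey
itself can still be a problem, but that is a separate issue from collect.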
