>
> Solution 2 is to map the objects into a pair RDD where the
> key is the number of the day in the interval, then group by
> key, collect, and parallelize the resulting grouped data.
> However, I worry collecting large data sets is going to be
> a serious performance bottleneck.
>
> Why do you have to do a "collect"? You can do a groupBy and then write the grouped data to disk again.
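
To make the suggestion concrete, here is a minimal sketch of the "key by day number, group, write each group out" idea. It is plain Python rather than Spark, so it only illustrates the keying and grouping logic. The record shape (a `(timestamp, payload)` tuple), the function names `day_index`, `group_by_day`, and `write_groups`, and the `day-N.json` file naming are all assumptions made for illustration and do not come from the thread. In Spark the same shape would presumably be a map to a pair RDD, a `groupByKey`, and a save of the grouped output, with no `collect` in between.

```python
import json
import os
from collections import defaultdict
from datetime import datetime


def day_index(ts, start):
    """Number of whole days between the interval start and ts."""
    return (ts - start).days


def group_by_day(records, start, end):
    """Group (timestamp, payload) records by day index within [start, end).

    Records outside the interval are dropped. The record shape is an
    assumption for this sketch.
    """
    groups = defaultdict(list)
    for ts, payload in records:
        if start <= ts < end:
            groups[day_index(ts, start)].append(payload)
    return dict(groups)


def write_groups(groups, out_dir):
    """Write one JSON file per day and return a {day: path} mapping.

    This stands in for writing grouped data straight to disk instead of
    collecting it and parallelizing it again.
    """
    paths = {}
    for day, payloads in groups.items():
        path = os.path.join(out_dir, f"day-{day}.json")
        with open(path, "w") as f:
            json.dump(payloads, f)
        paths[day] = path
    return paths
```

Each day's records end up in their own file, so a later step can load only the days it needs rather than holding every group in one place.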
- How to separate a subset of an RDD by day? bdamos
- Re: How to separate a subset of an RDD by day? Soumya Simanta
- Re: How to separate a subset of an RDD by day? Soumya Simanta
- Re: How to separate a subset of an RDD by day? bdamos
- Re: How to separate a subset of an RDD by day? Soumya Simanta
- Re: How to separate a subset of an RDD by day? Sean Owen