Can you not just filter the range you want and then groupBy on timestamp / 86400? That sounds like your Solution 1, and it's about as fast as it gets, I think. Are you thinking you'd have to filter out each day individually from there, and that's why it would be slow? I don't think that's needed, and you also don't need to map to pairs first.
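
For example, something like this (a rough sketch against the Scala RDD API; the Event class, the timestamps, and the interval bounds are just made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical record type; `time` is a unix timestamp in seconds.
    case class Event(time: Long, payload: String)

    object DailyBuckets {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("daily-buckets").setMaster("local[*]"))

        val events = sc.parallelize(Seq(
          Event(1405036800L, "a"),  // 2014-07-11 00:00 UTC
          Event(1405123200L, "b"),  // 2014-07-12 00:00 UTC
          Event(1405209600L, "c")   // 2014-07-13 00:00 UTC (excluded: time == end)
        ))

        val start = 1405036800L  // inclusive start of the subinterval
        val end   = 1405209600L  // exclusive end of the subinterval

        // One filter pass for the subinterval, then one groupBy to bucket by day.
        val byDay = events
          .filter(e => e.time >= start && e.time < end)
          .groupBy(e => e.time / 86400)  // 86400 seconds per day

        byDay.collect().foreach { case (day, es) =>
          println(s"day $day: ${es.size} event(s)")
        }

        sc.stop()
      }
    }

The filter and groupBy each make a single pass over the data, so nothing in the subset gets scanned once per day, and there's no collect-and-reparallelize round trip.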
On Fri, Jul 11, 2014 at 10:19 PM, bdamos <a...@adobe.com> wrote:

> Hi, I have an RDD that represents data over a time interval, and I want
> to select some subinterval of my data and partition it by day
> based on a unix time field in the data.
> What is the best way to do this with Spark?
>
> I have currently implemented 2 solutions, both of which seem suboptimal.
> Solution 1 is to filter the subinterval from the overall data set,
> and then to filter each day out of this filtered data set.
> However, this causes the same data in the subset to be filtered many times.
>
> Solution 2 is to map the objects into a pair RDD where the
> key is the number of the day in the interval, then group by
> key, collect, and parallelize the resulting grouped data.
> However, I worry that collecting large data sets is going to be
> a serious performance bottleneck.
>
> A small query using Solution 1 takes 13 seconds to run, and the same
> query using Solution 2 takes 10 seconds to run,
> but I think this can be further improved.
> Does anybody have any suggestions on the best way to separate
> a subset of data by day?
>
> Thanks,
> Brandon.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-separate-a-subset-of-an-RDD-by-day-tp9454.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.