Can you not just filter the range you want, then groupBy
timestamp / 86400? That sounds like your Solution 1 and is about as
fast as it gets, I think. Are you thinking you would have to filter
each day out individually from there, and that that's why it would be
slow? I don't think that's needed. You also don't need to map to
pairs first; groupBy builds the (key, values) pairs for you.
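
Something like this, roughly. A minimal sketch in Scala; the Event
class, its fields, and splitByDay are placeholder names I made up for
illustration, not anything from your code:

    import org.apache.spark.rdd.RDD

    // Illustrative record type; substitute your actual data class.
    case class Event(unixTime: Long, payload: String)

    // Cut out the [start, end) subinterval in one pass, then group the
    // records by day number (unix time is in seconds, 86400 per day).
    def splitByDay(events: RDD[Event], start: Long, end: Long): RDD[(Long, Iterable[Event])] =
      events
        .filter(e => e.unixTime >= start && e.unixTime < end)
        .groupBy(e => e.unixTime / 86400)

Each element of the result is one day's worth of data, and everything
stays distributed, so there is no collect-and-parallelize round trip
through the driver like in your Solution 2.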

On Fri, Jul 11, 2014 at 10:19 PM, bdamos <a...@adobe.com> wrote:
> Hi, I have an RDD that represents data over a time interval and I want
> to select some subinterval of my data and partition it by day
> based on a unix time field in the data.
> What is the best way to do this with Spark?
>
> I have currently implemented two solutions, both of which seem suboptimal.
> Solution 1 is to filter the subinterval from the overall data set,
> and then to filter each day out of this filtered data set.
> However, this causes the same data in the subset to be filtered many times.
>
> Solution 2 is to map the objects into a pair RDD where the
> key is the number of the day in the interval, then group by
> key, collect, and parallelize the resulting grouped data.
> However, I worry collecting large data sets is going to be
> a serious performance bottleneck.
>
> A small query using Solution 1 takes 13 seconds to run, and the same
> query using Solution 2 takes 10 seconds to run,
> but I think this can be further improved.
> Does anybody have any suggestions on the best way to separate
> a subset of data by day?
>
> Thanks,
> Brandon.