Re: Folding an RDD in order

2014-10-17 Thread Michael Misiewicz
My goal is for rows to be partitioned according to timestamp bins (e.g. with each partition representing an even interval of time), and then ordered by timestamp *within* each partition. Ordering by user ID is not important. In my aggregate function, in the seqOp function, I am checking to verify t

Re: Folding an RDD in order

2014-10-17 Thread Cheng Lian
Hm, a little confused here. What exactly the ordering do you expect? It seems that you want all the elements in the RDD to be sorted first by timestamp and then by user_id. If this is true, then you can simply do this: |rawData.map {case (time, user, amount) => (time, user) -> amount }.sortBy

Re: Folding an RDD in order

2014-10-17 Thread Michael Misiewicz
Thank you for sharing this Cheng! This is fantastic. I was able to implement it and it seems like it's working quite well. I'm definitely on the right track now! I'm still having a small problem with the rows inside each partition being out of order - but I suspect this is because in the code curr

Re: Folding an RDD in order

2014-10-16 Thread Cheng Lian
RDD.aggregate doesn’t require the RDD elements to be pairs, so you don’t need to use user_id to be the key or the RDD. For example, you can use an empty Map as the zero value of the aggregation. The key of the Map is the user_id you extracted from each tuple, and the value is the aggregated val

Re: Folding an RDD in order

2014-10-16 Thread Michael Misiewicz
I note that one of the listing variants of aggregateByKey accepts a partitioner as an argument: def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)] Would it be possible to extract my sorted parent's partitio

Re: Folding an RDD in order

2014-10-16 Thread Michael Misiewicz
Thanks for the suggestion! That does look really helpful, I see what you mean about it being more general than fold. I think I will replace my fold with aggregate - it should give me more control over the process. I think the problem will still exist though - which is that I can't get the correct

Re: Folding an RDD in order

2014-10-16 Thread Cheng Lian
Hi Michael, I'm not sure I fully understood your question, but I think RDD.aggregate can be helpful in your case. You can see it as a more general version of fold. Cheng On 10/16/14 11:15 PM, Michael Misiewicz wrote: Hi, I'm working on a problem where I'd like to sum items in an RDD /in

Folding an RDD in order

2014-10-16 Thread Michael Misiewicz
Hi, I'm working on a problem where I'd like to sum items in an RDD *in order (* approximately*)*. I am currently trying to implement this using a fold, but I'm having some issues because the sorting key of my data is not the same as the folding key for my data. I have data that looks like this: u