My goal is for rows to be partitioned according to timestamp bins (e.g.
with each partition representing an even interval of time), and then
ordered by timestamp *within* each partition. Ordering by user ID is not
important. In my aggregate function, in the seqOp function, I am checking
to verify t
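
A rough sketch of that bin-then-sort idea, assuming the records are
(timestamp, userId, amount) tuples, a made-up TimeBinPartitioner whose
binWidth/minTime/numBins are placeholders, and a Spark version that has
repartitionAndSortWithinPartitions:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical partitioner: each partition covers one fixed-width time bin.
class TimeBinPartitioner(binWidth: Long, minTime: Long, numBins: Int) extends Partitioner {
  override def numPartitions: Int = numBins
  override def getPartition(key: Any): Int = {
    val time = key.asInstanceOf[Long]
    val bin = (time - minTime) / binWidth
    bin.max(0L).min(numBins - 1L).toInt
  }
}

// rawData: RDD[(Long, String, Double)] = (timestamp, userId, amount)
def binAndSort(rawData: RDD[(Long, String, Double)],
               partitioner: TimeBinPartitioner): RDD[(Long, (String, Double))] =
  rawData
    .map { case (time, user, amount) => time -> (user, amount) } // key by timestamp
    .repartitionAndSortWithinPartitions(partitioner)             // bin, then sort by timestamp within each bin
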
Hm, a little confused here. What ordering exactly do you expect? It seems
that you want all the elements in the RDD to be sorted first by timestamp
and then by user_id. If this is true, then you can simply do this:
rawData.map { case (time, user, amount) => (time, user) -> amount }.sortByKey()
Thank you for sharing this, Cheng! This is fantastic. I was able to
implement it and it seems to be working quite well. I'm definitely on
the right track now!
I'm still having a small problem with the rows inside each partition being
out of order - but I suspect this is because in the code curr
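
One way to sanity-check the ordering, as a sketch - assuming the RDD is
keyed by timestamp, as in the snippet above - is to count out-of-order
adjacent pairs in each partition:

import org.apache.spark.rdd.RDD

// Count adjacent out-of-order key pairs in each partition, keyed by partition index.
def countInversions[V](rdd: RDD[(Long, V)]): Map[Int, Long] =
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    val times = iter.map(_._1).toArray
    val inversions = times.sliding(2).count {
      case Array(a, b) => a > b
      case _           => false
    }
    Iterator(idx -> inversions.toLong)
  }.collect().toMap

Any partition with a non-zero count still has rows out of timestamp order.
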
RDD.aggregate doesn't require the RDD elements to be pairs, so you don't
need to use user_id as the key of the RDD. For example, you can use an
empty Map as the zero value of the aggregation. The key of the Map is the
user_id you extracted from each tuple, and the value is the aggregated
value.
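
In code, that might look something like the sketch below, assuming the
tuples are (timestamp, user_id, amount) (the exact schema isn't spelled out
in the truncated messages):

import org.apache.spark.rdd.RDD

// Zero value: an empty Map from user_id to its running sum.
// seqOp folds one record into a partition-local Map;
// combOp merges the Maps produced by different partitions.
def sumPerUser(rawData: RDD[(Long, String, Double)]): Map[String, Double] =
  rawData.aggregate(Map.empty[String, Double])(
    (acc, record) => {
      val (_, user, amount) = record
      acc + (user -> (acc.getOrElse(user, 0.0) + amount))
    },
    (m1, m2) => m2.foldLeft(m1) { case (acc, (user, amount)) =>
      acc + (user -> (acc.getOrElse(user, 0.0) + amount))
    }
  )
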
I note that one of the listed variants of aggregateByKey accepts a
partitioner as an argument:
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)
    (seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
Would it be possible to extract my sorted parent's partitioner and pass it
in here?
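
Something like the sketch below might work, assuming the parent and the RDD
being aggregated are keyed by the same key type (otherwise the parent's
partitioner wouldn't make sense for the new keys); the helper name and the
plain sum aggregation are placeholders:

import scala.reflect.ClassTag
import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.rdd.RDD

// RDD.partitioner is an Option; fall back to a HashPartitioner if it is unset.
def sumWithParentPartitioner[K: ClassTag](parent: RDD[(K, Double)],
                                          toAggregate: RDD[(K, Double)]): RDD[(K, Double)] = {
  val p: Partitioner =
    parent.partitioner.getOrElse(new HashPartitioner(parent.partitions.length))
  toAggregate.aggregateByKey(0.0, p)(_ + _, _ + _)
}
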
Thanks for the suggestion! That does look really helpful; I see what you
mean about it being more general than fold. I think I will replace my fold
with aggregate - it should give me more control over the process.
I think the problem will still exist though - which is that I can't get the
correct
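
For reference, the fold/aggregate distinction Cheng mentions, as a sketch
(again assuming (timestamp, user_id, amount) tuples): fold's zero value and
result share the RDD's element type, while aggregate can build a result of a
different type:

import org.apache.spark.rdd.RDD

// rawData: RDD[(Long, String, Double)] = (timestamp, user_id, amount)
def foldVsAggregate(rawData: RDD[(Long, String, Double)]): Unit = {
  val amounts: RDD[Double] = rawData.map(_._3)

  // fold: the zero value and the result have the element type (Double in, Double out)
  val total: Double = amounts.fold(0.0)(_ + _)

  // aggregate: the result type may differ from the element type -
  // here a (sum, count) pair is built from Double elements
  val sumAndCount: (Double, Long) = amounts.aggregate((0.0, 0L))(
    { case ((s, n), x) => (s + x, n + 1) },
    { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
  )

  println(s"total = $total, records = ${sumAndCount._2}")
}
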
Hi Michael,
I'm not sure I fully understood your question, but I think RDD.aggregate
can be helpful in your case. You can see it as a more general version of
fold.
Cheng
On 10/16/14 11:15 PM, Michael Misiewicz wrote:
Hi,
I'm working on a problem where I'd like to sum items in an RDD *in order*
(approximately). I am currently trying to implement this using a fold, but
I'm having some issues because the sorting key of my data is not the same
as the folding key for my data. I have data that looks like this:
u