GroupBy Key and then sort values with the group

abraham.jacob Wed, 17 Sep 2014 08:39:40 -0700

Hi Group,

I am quite fresh in the spark world. There is a particular use case that I just 
cannot understand how to accomplish in spark. I am using Cloudera 
CDH5/YARN/Java 7.


I have a dataset that has the following characteristics -

A JavaPairRDD that represents the following -

Key => {int ID}
Value => {date effectiveFrom, float value}

Let's say that the data I have is the following -


Partition - 1
[K=> 1, V=> {09-17-2014, 2.8}]
[K=> 1, V=> {09-11-2014, 3.9}]
[K=> 3, V=> {09-18-2014, 5.0}]
[K=> 3, V=> {09-10-2014, 7.4}]


Partition - 2
[K=> 2, V=> {09-13-2014, 2.5}]
[K=> 4, V=> {09-07-2014, 6.2}]
[K=> 2, V=> {09-12-2014, 1.8}]
[K=> 4, V=> {09-22-2014, 2.9}]


Grouping by key gives me the following RDD

Partition - 1
[K=> 1, V=> Iterable({09-17-2014, 2.8}, {09-11-2014, 3.9})]
[K=> 3, V=> Iterable({09-18-2014, 5.0}, {09-10-2014, 7.4})]

Partition - 2
[K=> 2, Iterable({09-13-2014, 2.5}, {09-12-2014, 1.8})]
[K=> 4, Iterable({09-07-2014, 6.2}, {09-22-2014, 2.9})]

Now I would like to sort by the values and the result should look like this -

Partition - 1
[K=> 1, V=> Iterable({09-11-2014, 3.9}, {09-17-2014, 2.8})]
[K=> 3, V=> Iterable({09-10-2014, 7.4}, {09-18-2014, 5.0})]

Partition - 2
[K=> 2, Iterable({09-12-2014, 1.8}, {09-13-2014, 2.5})]
[K=> 4, Iterable({09-07-2014, 6.2}, {09-22-2014, 2.9})]


What is the best way to do this in spark? If so desired, I can even move the 
"effectiveFrom" (the field that I want to sort on) into the key field.

A code snippet or some pointers on how to solve this would be very helpful.

Regards,
Abraham

GroupBy Key and then sort values with the group

Reply via email to