Hi Group,
I am quite new to Spark. There is a particular use case that I cannot work
out how to accomplish in Spark. I am using Cloudera CDH5/YARN/Java 7.
I have a dataset that has the following characteristics -
A JavaPairRDD that represents the following -
Key => {int ID}
Value => {date effectiveFrom, float value}
Let's say that the data I have is the following -
Partition - 1
[K=> 1, V=> {09-17-2014, 2.8}]
[K=> 1, V=> {09-11-2014, 3.9}]
[K=> 3, V=> {09-18-2014, 5.0}]
[K=> 3, V=> {09-10-2014, 7.4}]
Partition - 2
[K=> 2, V=> {09-13-2014, 2.5}]
[K=> 4, V=> {09-07-2014, 6.2}]
[K=> 2, V=> {09-12-2014, 1.8}]
[K=> 4, V=> {09-22-2014, 2.9}]
Grouping by key gives me the following RDD -
Partition - 1
[K=> 1, V=> Iterable({09-17-2014, 2.8}, {09-11-2014, 3.9})]
[K=> 3, V=> Iterable({09-18-2014, 5.0}, {09-10-2014, 7.4})]
Partition - 2
[K=> 2, V=> Iterable({09-13-2014, 2.5}, {09-12-2014, 1.8})]
[K=> 4, V=> Iterable({09-07-2014, 6.2}, {09-22-2014, 2.9})]
Now I would like to sort the values within each key by effectiveFrom
(ascending), so that the result looks like this -
Partition - 1
[K=> 1, V=> Iterable({09-11-2014, 3.9}, {09-17-2014, 2.8})]
[K=> 3, V=> Iterable({09-10-2014, 7.4}, {09-18-2014, 5.0})]
Partition - 2
[K=> 2, V=> Iterable({09-12-2014, 1.8}, {09-13-2014, 2.5})]
[K=> 4, V=> Iterable({09-07-2014, 6.2}, {09-22-2014, 2.9})]
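To make the desired per-key ordering concrete, here is a minimal plain-Java
sketch (Java 7 style, no Spark involved) of the sort I want applied to each
key's Iterable. The names GroupedValueSorter, Entry and sortByEffectiveFrom are
hypothetical, and I am assuming the dates are MM-dd-yyyy strings as in the
listings above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: sorts one key's grouped values by effectiveFrom.
final class GroupedValueSorter {

    // Hypothetical value type: {date effectiveFrom, float value}.
    // Assumption: effectiveFrom is kept as an MM-dd-yyyy string.
    static final class Entry {
        final String effectiveFrom;
        final float value;

        Entry(String effectiveFrom, float value) {
            this.effectiveFrom = effectiveFrom;
            this.value = value;
        }

        // Rearranged to yyyyMMdd so that plain string comparison
        // orders entries chronologically.
        String sortKey() {
            return effectiveFrom.substring(6)
                 + effectiveFrom.substring(0, 2)
                 + effectiveFrom.substring(3, 5);
        }
    }

    // Copies the values for a single key into a list and sorts that
    // list by effectiveFrom, oldest first. The input is not modified.
    static List<Entry> sortByEffectiveFrom(Iterable<Entry> values) {
        List<Entry> sorted = new ArrayList<Entry>();
        for (Entry e : values) {
            sorted.add(e);
        }
        Collections.sort(sorted, new Comparator<Entry>() {
            @Override
            public int compare(Entry a, Entry b) {
                return a.sortKey().compareTo(b.sortKey());
            }
        });
        return sorted;
    }
}
```

My guess is that something like this could be applied to every key through
mapValues on the grouped JavaPairRDD, but I do not know whether that is the
recommended approach, or how it behaves when a single key has a very large
number of values.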
What is the best way to do this in Spark? If it helps, I can even move
"effectiveFrom" (the field that I want to sort on) into the key.
A code snippet or some pointers on how to solve this would be very helpful.
Regards,
Abraham