Hi Group, I am quite new to the Spark world, and there is a particular use case that I cannot figure out how to accomplish in Spark. I am using Cloudera CDH5/YARN/Java 7.
I have a dataset with the following characteristics: a JavaPairRDD where

  Key   => {int ID}
  Value => {date effectiveFrom, float value}

Let's say the data I have is the following:

Partition 1
  [K=> 1, V=> {09-17-2014, 2.8}]
  [K=> 1, V=> {09-11-2014, 3.9}]
  [K=> 3, V=> {09-18-2014, 5.0}]
  [K=> 3, V=> {09-10-2014, 7.4}]

Partition 2
  [K=> 2, V=> {09-13-2014, 2.5}]
  [K=> 4, V=> {09-07-2014, 6.2}]
  [K=> 2, V=> {09-12-2014, 1.8}]
  [K=> 4, V=> {09-22-2014, 2.9}]

Grouping by key gives me the following RDD:

Partition 1
  [K=> 1, V=> Iterable({09-17-2014, 2.8}, {09-11-2014, 3.9})]
  [K=> 3, V=> Iterable({09-18-2014, 5.0}, {09-10-2014, 7.4})]

Partition 2
  [K=> 2, V=> Iterable({09-13-2014, 2.5}, {09-12-2014, 1.8})]
  [K=> 4, V=> Iterable({09-07-2014, 6.2}, {09-22-2014, 2.9})]

Now I would like to sort each key's values by effectiveFrom (ascending), so the result looks like this:

Partition 1
  [K=> 1, V=> Iterable({09-11-2014, 3.9}, {09-17-2014, 2.8})]
  [K=> 3, V=> Iterable({09-10-2014, 7.4}, {09-18-2014, 5.0})]

Partition 2
  [K=> 2, V=> Iterable({09-12-2014, 1.8}, {09-13-2014, 2.5})]
  [K=> 4, V=> Iterable({09-07-2014, 6.2}, {09-22-2014, 2.9})]

What is the best way to do this in Spark? If so desired, I can even move effectiveFrom (the field I want to sort on) into the key. A code snippet or some pointers on how to solve this would be very helpful; I have put a rough sketch of what I am imagining in the P.S. below.

Regards,
Abraham
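P.S. To make the question concrete, here is a minimal sketch of the kind of solution I am imagining: groupByKey followed by sorting each group in memory inside mapValues. The EffectiveValue class and the sortValuesByDate method are hypothetical names I made up for this sketch; my real value type is a small serializable POJO holding the {date, float} pair. I am not sure this is idiomatic, and I suspect it could run out of memory when a single key has a very large number of values.

import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Date;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;

public class SortByDateExample {

    // Hypothetical placeholder for my real value type {date effectiveFrom, float value}.
    public static class EffectiveValue implements Serializable {
        public final Date effectiveFrom;
        public final float value;

        public EffectiveValue(Date effectiveFrom, float value) {
            this.effectiveFrom = effectiveFrom;
            this.value = value;
        }
    }

    // Group by key, then sort each key's values by effectiveFrom ascending.
    public static JavaPairRDD<Integer, Iterable<EffectiveValue>> sortValuesByDate(
            JavaPairRDD<Integer, EffectiveValue> pairs) {
        return pairs.groupByKey().mapValues(
            new Function<Iterable<EffectiveValue>, Iterable<EffectiveValue>>() {
                @Override
                public Iterable<EffectiveValue> call(Iterable<EffectiveValue> values) {
                    // Materialize the group into a list so it can be sorted in place.
                    List<EffectiveValue> list = new ArrayList<EffectiveValue>();
                    for (EffectiveValue v : values) {
                        list.add(v);
                    }
                    Collections.sort(list, new Comparator<EffectiveValue>() {
                        @Override
                        public int compare(EffectiveValue a, EffectiveValue b) {
                            return a.effectiveFrom.compareTo(b.effectiveFrom);
                        }
                    });
                    return list;
                }
            });
    }
}

If materializing each group in memory is a bad idea, I am guessing the alternative is the secondary-sort pattern, i.e. moving effectiveFrom into a composite key so the sorting happens during the shuffle, which is why I mentioned that I can move that field into the key. Pointers on how to express that cleanly in the Java API would be equally welcome.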