Let me provide step-wise details:

1. I have an RDD = {
(ID2,18159) - *element 1*
(ID1,18159) - *element 2*
(ID3,18159) - *element 3*
(ID2,36318) - *element 4*
(ID1,36318) - *element 5*
(ID3,36318)
(ID2,54477)
(ID1,54477)
(ID3,54477)
}

2. RDD.groupByKey().mapValues(v => v.toArray)

Array(
(ID1,Array(*18159*, 308703, 72636, 64544, 39244, 107937, *54477*, 145272, 100079, *36318*, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076, 45431, 100136)),
(ID3,Array(100079, 19622, *18159*, 212064, 107937, 44683, 150022, 39244, 100136, 58866, 72636, 145272, 817, 89366, *54477*, *36318*, 308703, 160992, 45431, 162076)),
(ID2,Array(308703, *54477*, 89366, 39244, 150022, 72636, 817, 58866, 44683, 19622, 160992, 107937, 100079, 100136, 145272, 64544, *18159*, 45431, *36318*, 162076))
)

whereas in Step 2 I need the output as below:

Array(
(ID1,Array(*18159*,*36318*, *54477,...*)),
(ID3,Array(*18159*,*36318*, *54477, ...*)),
(ID2,Array(*18159*,*36318*, *54477, ...*))
)

Does this help?

On Tue, Jul 26, 2016 at 2:25 AM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Apologies janardhan, i always get confused on this
> Ok. so you have a (key, val) RDD (val is irrelevant here)
>
> then you can do this
> val reduced = myRDD.reduceByKey((first, second) => first ++ second)
>
> val sorted = reduced.sortBy(tpl => tpl._1)
>
> hth
>
> On Tue, Jul 26, 2016 at 3:31 AM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> groupBy is a shuffle operation and index is already lost in this process
>> if I am not wrong and don't see *sortWith* operation on RDD.
>>
>> Any suggestions or help ?
>>
>> On Mon, Jul 25, 2016 at 12:58 AM, Marco Mistroni <mmistr...@gmail.com>
>> wrote:
>>
>>> Hi
>>> after you do a groupBy you should use a sortWith.
>>> Basically , a groupBy reduces your structure to (anyone correct me if i
>>> m wrong) a RDD[(key,val)], which you can see as a tuple.....so you could
>>> use sortWith (or sortBy, cannot remember which one) (tpl=> tpl._1)
>>> hth
>>>
>>> On Mon, Jul 25, 2016 at 1:21 AM, janardhan shetty <
>>> janardhan...@gmail.com> wrote:
>>>
>>>> Thanks Marco. This solved the order problem. Had another question which
>>>> is prefix to this.
>>>>
>>>> As you can see below ID2,ID1 and ID3 are in order and I need to
>>>> maintain this index order as well. But when we do groupByKey
>>>> operation (*rdd.distinct.groupByKey().mapValues(v => v.toArray)*)
>>>> everything is *jumbled*.
>>>> Is there any way we can maintain this order as well ?
>>>>
>>>> scala> RDD.foreach(println)
>>>> (ID2,18159)
>>>> (ID1,18159)
>>>> (ID3,18159)
>>>>
>>>> (ID2,18159)
>>>> (ID1,18159)
>>>> (ID3,18159)
>>>>
>>>> (ID2,36318)
>>>> (ID1,36318)
>>>> (ID3,36318)
>>>>
>>>> (ID2,54477)
>>>> (ID1,54477)
>>>> (ID3,54477)
>>>>
>>>> *Jumbled version:*
>>>> Array(
>>>> (ID1,Array(*18159*, 308703, 72636, 64544, 39244, 107937, *54477*,
>>>> 145272, 100079, *36318*, 160992, 817, 89366, 150022, 19622, 44683,
>>>> 58866, 162076, 45431, 100136)),
>>>> (ID3,Array(100079, 19622, *18159*, 212064, 107937, 44683, 150022,
>>>> 39244, 100136, 58866, 72636, 145272, 817, 89366, *54477*, *36318*,
>>>> 308703, 160992, 45431, 162076)),
>>>> (ID2,Array(308703, *54477*, 89366, 39244, 150022, 72636, 817, 58866,
>>>> 44683, 19622, 160992, 107937, 100079, 100136, 145272, 64544, *18159*,
>>>> 45431, *36318*, 162076))
>>>> )
>>>>
>>>> *Expected output:*
>>>> Array(
>>>> (ID1,Array(*18159*,*36318*, *54477,...*)),
>>>> (ID3,Array(*18159*,*36318*, *54477, ...*)),
>>>> (ID2,Array(*18159*,*36318*, *54477, ...*))
>>>> )
>>>>
>>>> As you can see, after the *groupByKey* operation is complete item 18159 is
>>>> at index 0 for ID1, index 2 for ID3 and index 16 for ID2, whereas the
>>>> expected position is index 0.
>>>>
>>>>
>>>> On Sun, Jul 24, 2016 at 12:43 PM, Marco Mistroni <mmistr...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello
>>>>> Uhm you have an array containing 3 tuples?
>>>>> If all the arrays have same length, you can just zip all of them,
>>>>> creating a list of tuples
>>>>> then you can scan the list 5 by 5...?
>>>>>
>>>>> so something like
>>>>>
>>>>> (Array(0)._2, Array(1)._2, Array(2)._2).zipped.toList
>>>>>
>>>>> this will give you a list of tuples of 3 elements containing each
>>>>> items from ID1, ID2 and ID3 ... sample below
>>>>> res: List((18159,100079,308703), (308703, 19622, 54477), (72636,18159,
>>>>> 89366)..........)
>>>>>
>>>>> then you can use a recursive function to compare each element such as
>>>>>
>>>>> def iterate(lst: List[(Int, Int, Int)]): Unit =
>>>>>   if (lst.nonEmpty) {
>>>>>     // take the next block of five tuples, keep the remainder
>>>>>     val (block, rest) = lst.splitAt(5)
>>>>>     // do something about it using block (e.g. your comparison)
>>>>>     iterate(rest)
>>>>>   }
>>>>>
>>>>> will this help? or am i still missing something?
>>>>>
>>>>> kr
>>>>>
>>>>> On 24 Jul 2016 5:52 pm, "janardhan shetty" <janardhan...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Array(
>>>>>> (ID1,Array(18159, 308703, 72636, 64544, 39244, 107937, 54477, 145272,
>>>>>> 100079, 36318, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076,
>>>>>> 45431, 100136)),
>>>>>> (ID3,Array(100079, 19622, 18159, 212064, 107937, 44683, 150022,
>>>>>> 39244, 100136, 58866, 72636, 145272, 817, 89366, 54477, 36318, 308703,
>>>>>> 160992, 45431, 162076)),
>>>>>> (ID2,Array(308703, 54477, 89366, 39244, 150022, 72636, 817, 58866,
>>>>>> 44683, 19622, 160992, 107937, 100079, 100136, 145272, 64544, 18159,
>>>>>> 45431, 36318, 162076))
>>>>>> )
>>>>>>
>>>>>> I need to compare the first 5 elements of ID1 with the first 5 elements
>>>>>> of ID3, then the first 5 elements of ID1 with ID2. Similarly the next 5
>>>>>> elements in that order, until the end of the elements.
>>>>>> Let me know if this helps
>>>>>>
>>>>>>
>>>>>> On Sun, Jul 24, 2016 at 7:45 AM, Marco Mistroni <mmistr...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Apologies I misinterpreted.... could you post two use cases?
>>>>>>> Kr
>>>>>>>
>>>>>>> On 24 Jul 2016 3:41 pm, "janardhan shetty" <janardhan...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Marco,
>>>>>>>>
>>>>>>>> Thanks for the response. It is indexed order and not ascending or
>>>>>>>> descending order.
>>>>>>>> On Jul 24, 2016 7:37 AM, "Marco Mistroni" <mmistr...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Use map values to transform to an rdd where values are sorted?
>>>>>>>>> Hth
>>>>>>>>>
>>>>>>>>> On 24 Jul 2016 6:23 am, "janardhan shetty" <janardhan...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I have a key,value pair rdd where value is an array of Ints. I
>>>>>>>>>> need to maintain the order of the value in order to execute
>>>>>>>>>> downstream modifications. How do we maintain the order of values?
>>>>>>>>>> Ex:
>>>>>>>>>> rdd = (id1,[5,2,3,15],
>>>>>>>>>> Id2,[9,4,2,5]....)
>>>>>>>>>>
>>>>>>>>>> Followup question how do we compare between one element in rdd
>>>>>>>>>> with all other elements ?
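
For reference, here is one possible way to keep the per-key value order discussed in this thread. It is only a sketch and was not proposed by anyone above: tag every element with its position using zipWithIndex before any shuffle, carry that position through groupByKey, then sort each group by it. The names OrderedGroupByKey and groupPreservingOrder are made up for illustration, and the String/Int element types and the local[2] master are assumptions about the setup.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object OrderedGroupByKey {

  // Groups values by key and restores, within each key, the order the
  // values had in the input RDD at the time zipWithIndex was applied.
  def groupPreservingOrder(pairs: RDD[(String, Int)]): RDD[(String, Array[Int])] =
    pairs
      .zipWithIndex()                                    // ((id, value), position)
      .map { case ((id, value), pos) => (id, (pos, value)) }
      .groupByKey()                                      // shuffle: order inside a group is not guaranteed
      .mapValues(_.toArray.sortBy(_._1).map(_._2))       // sort by position, then drop it

  def main(args: Array[String]): Unit = {
    // Assumption: a local two-thread master, just for trying this out.
    val conf = new SparkConf().setAppName("ordered-groupByKey").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(
        ("ID2", 18159), ("ID1", 18159), ("ID3", 18159),
        ("ID2", 36318), ("ID1", 36318), ("ID3", 36318),
        ("ID2", 54477), ("ID1", 54477), ("ID3", 54477)), numSlices = 3)

      groupPreservingOrder(rdd).collect().foreach { case (id, values) =>
        println(s"$id -> ${values.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Two caveats. distinct is itself a shuffle, so the index has to be attached before it, and once the index is attached the rows are no longer duplicates of each other, so de-duplication would need separate handling. The order of the keys in the collected result is still not guaranteed. For the block-of-five comparison, values.grouped(5) on each resulting array may be a simpler alternative to a recursive splitAt.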