Re: Maintaining order of pair rdd

janardhan shetty Tue, 26 Jul 2016 16:36:40 -0700

Let me provide stepwise details:

1. I have an RDD = {
(ID2,18159) - *element 1  *
(ID1,18159) - *element 2*
(ID3,18159) - *element 3*
(ID2,36318) - *element 4 *
(ID1,36318) - *element 5*
(ID3,36318)
(ID2,54477)
(ID1,54477)
(ID3,54477)
}


2. RDD.groupByKey().mapValues(v => v.toArray) gives:

Array(
(ID1,Array(*18159*, 308703, 72636, 64544, 39244, 107937, *54477*, 145272,
100079, *36318*, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076,
45431, 100136)),
(ID3,Array(100079, 19622, *18159*, 212064, 107937, 44683, 150022, 39244,
100136, 58866, 72636, 145272, 817, 89366, * 54477*, *36318*, 308703,
160992, 45431, 162076)),
(ID2,Array(308703, * 54477*, 89366, 39244, 150022, 72636, 817, 58866,
44683, 19622, 160992, 107937, 100079, 100136, 145272, 64544, *18159*,
45431, *36318*, 162076))
)


whereas in step 2 I need the following:

Array(
(ID1,Array(*18159*,*36318*, *54477,...*)),
(ID3,Array(*18159*,*36318*, *54477, ...*)),
(ID2,Array(*18159*,*36318*, *54477, ...*))
)

Does this help?
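
A possible way to get the layout wanted in step 2, sketched here as an assumption rather than something confirmed in this thread: tag every element with its position using zipWithIndex before the shuffle, group by key, then sort each group by that position and drop it. The helper name groupPreservingOrder and the (String, Int) element type are hypothetical. The positions only reflect the original order if the RDD has not already been shuffled (for example by distinct) before zipWithIndex is called.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper (not from the thread): groups values by key while
// keeping each key's values in the order they had in the input RDD.
def groupPreservingOrder(rdd: RDD[(String, Int)]): RDD[(String, Array[Int])] =
  rdd
    .zipWithIndex()                                         // ((id, value), position)
    .map { case ((id, value), pos) => (id, (pos, value)) }  // re-key by id, carry the position
    .groupByKey()                                           // shuffle: order inside a group is not guaranteed
    .mapValues(vs => vs.toArray.sortBy(_._1).map(_._2))     // sort by position, then drop it
```

In spark-shell, where sc already exists, calling groupPreservingOrder(sc.parallelize(data)).collect() on the nine pairs from step 1 should give 18159, 36318, 54477 for every ID. The order of the keys themselves in the collected array is still not guaranteed.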

On Tue, Jul 26, 2016 at 2:25 AM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Apologies Janardhan, I always get confused about this.
> OK, so you have a (key, val) RDD (val is irrelevant here)
>
> then you can do this
> val reduced = myRDD.reduceByKey((first, second) => first  ++ second)
>
> val sorted = reduced.sortBy(tpl => tpl._1)
>
> hth
>
>
>
> On Tue, Jul 26, 2016 at 3:31 AM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> groupBy is a shuffle operation, and if I am not wrong the index is already
>> lost in that process. I also don't see a *sortWith* operation on RDD.
>>
>> Any suggestions or help?
>>
>> On Mon, Jul 25, 2016 at 12:58 AM, Marco Mistroni <mmistr...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> after you do a groupBy you should use a sortWith.
>>> Basically, a groupBy reduces your structure to (anyone correct me if I'm
>>> wrong) an RDD[(key,val)], which you can see as a tuple, so you could
>>> use sortWith (or sortBy, I cannot remember which one) (tpl => tpl._1)
>>> hth
>>>
>>> On Mon, Jul 25, 2016 at 1:21 AM, janardhan shetty <
>>> janardhan...@gmail.com> wrote:
>>>
>>>> Thanks Marco. This solved the order problem. I have another question,
>>>> which comes before this one.
>>>>
>>>> As you can see below, ID2, ID1 and ID3 are in order and I need to
>>>> maintain this index order as well. But when we do the groupByKey
>>>> operation (*rdd.distinct.groupByKey().mapValues(v => v.toArray)*)
>>>> everything is *jumbled*.
>>>> Is there any way we can maintain this order as well?
>>>>
>>>> scala> RDD.foreach(println)
>>>> (ID2,18159)
>>>> (ID1,18159)
>>>> (ID3,18159)
>>>>
>>>> (ID2,18159)
>>>> (ID1,18159)
>>>> (ID3,18159)
>>>>
>>>> (ID2,36318)
>>>> (ID1,36318)
>>>> (ID3,36318)
>>>>
>>>> (ID2,54477)
>>>> (ID1,54477)
>>>> (ID3,54477)
>>>>
>>>> *Jumbled version:*
>>>> Array(
>>>> (ID1,Array(*18159*, 308703, 72636, 64544, 39244, 107937, *54477*,
>>>> 145272, 100079, *36318*, 160992, 817, 89366, 150022, 19622, 44683,
>>>> 58866, 162076, 45431, 100136)),
>>>> (ID3,Array(100079, 19622, *18159*, 212064, 107937, 44683, 150022,
>>>> 39244, 100136, 58866, 72636, 145272, 817, 89366, * 54477*, *36318*,
>>>> 308703, 160992, 45431, 162076)),
>>>> (ID2,Array(308703, * 54477*, 89366, 39244, 150022, 72636, 817, 58866,
>>>> 44683, 19622, 160992, 107937, 100079, 100136, 145272, 64544, *18159*,
>>>> 45431, *36318*, 162076))
>>>> )
>>>>
>>>> *Expected output:*
>>>> Array(
>>>> (ID1,Array(*18159*,*36318*, *54477,...*)),
>>>> (ID3,Array(*18159*,*36318*, *54477, ...*)),
>>>> (ID2,Array(*18159*,*36318*, *54477, ...*))
>>>> )
>>>>
>>>> As you can see, after the *groupByKey* operation is complete, item 18159 is
>>>> at index 0 for ID1, index 2 for ID3 and index 16 for ID2, whereas the
>>>> expected index is 0.
>>>>
>>>>
>>>> On Sun, Jul 24, 2016 at 12:43 PM, Marco Mistroni <mmistr...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>> Uhm, you have an array containing 3 tuples?
>>>>> If all the arrays have the same length, you can just zip all of them,
>>>>> creating a list of tuples,
>>>>> then you can scan the list 5 by 5...?
>>>>>
>>>>> so something like
>>>>>
>>>>> (Array(0)._2,Array(1)._2,Array(2)._2).zipped.toList
>>>>>
>>>>> this will give you a list of tuples of 3 elements, each containing one
>>>>> item from each of the three arrays ... sample below
>>>>> res: List((18159,100079,308703), (308703, 19622, 54477), (72636,18159,
>>>>> 89366)..........)
>>>>>
>>>>> then you can use a recursive function to compare each element such as
>>>>>
>>>>> def iterate(lst: List[(Int, Int, Int)]): Unit = {
>>>>>   if (lst.nonEmpty) {
>>>>>     // take the next 5 tuples and keep the remainder for the next call
>>>>>     val (chunk, rest) = lst.splitAt(5)
>>>>>     // do something with chunk here, e.g. your comparison
>>>>>     iterate(rest)
>>>>>   }
>>>>> }
>>>>>
>>>>> Will this help? Or am I still missing something?
>>>>>
>>>>> kr
>>>>>
>>>>> On 24 Jul 2016 5:52 pm, "janardhan shetty" <janardhan...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Array(
>>>>>> (ID1,Array(18159, 308703, 72636, 64544, 39244, 107937, 54477, 145272,
>>>>>> 100079, 36318, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076,
>>>>>> 45431, 100136)),
>>>>>> (ID3,Array(100079, 19622, 18159, 212064, 107937, 44683, 150022,
>>>>>> 39244, 100136, 58866, 72636, 145272, 817, 89366, 54477, 36318, 308703,
>>>>>> 160992, 45431, 162076)),
>>>>>> (ID2,Array(308703, 54477, 89366, 39244, 150022, 72636, 817, 58866,
>>>>>> 44683, 19622, 160992, 107937, 100079, 100136, 145272, 64544, 18159,
>>>>>> 45431, 36318, 162076))
>>>>>> )
>>>>>>
>>>>>> I need to compare the first 5 elements of ID1 with the first 5 elements
>>>>>> of ID3, then the first 5 elements of ID1 with those of ID2, and similarly
>>>>>> the next 5 elements in that order until the end of the arrays.
>>>>>> Let me know if this helps.
>>>>>>
>>>>>>
>>>>>> On Sun, Jul 24, 2016 at 7:45 AM, Marco Mistroni <mmistr...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Apologies, I misinterpreted. Could you post two use cases?
>>>>>>> Kr
>>>>>>>
>>>>>>> On 24 Jul 2016 3:41 pm, "janardhan shetty" <janardhan...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Marco,
>>>>>>>>
>>>>>>>> Thanks for the response. It is indexed order and not ascending or
>>>>>>>> descending order.
>>>>>>>> On Jul 24, 2016 7:37 AM, "Marco Mistroni" <mmistr...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Use mapValues to transform to an RDD where the values are sorted?
>>>>>>>>> Hth
>>>>>>>>>
>>>>>>>>> On 24 Jul 2016 6:23 am, "janardhan shetty" <janardhan...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I have a key,value pair RDD where the value is an array of Ints. I
>>>>>>>>>> need to maintain the order of the values in order to execute
>>>>>>>>>> downstream modifications. How do we maintain the order of values?
>>>>>>>>>> Ex:
>>>>>>>>>> rdd = (id1,[5,2,3,15],
>>>>>>>>>> Id2,[9,4,2,5]....)
>>>>>>>>>>
>>>>>>>>>> Follow-up question: how do we compare one element in the RDD
>>>>>>>>>> with all other elements?
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>
>>>
>>
>
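
For the block-wise comparison asked about earlier in the thread (the first 5 values of one ID against the first 5 of another, then the next 5, and so on), a rough sketch on plain Scala arrays could look like the following. The helper name compareInBlocks and the caller-supplied compare function are hypothetical, since the thread never says what the comparison itself should compute. It assumes the value arrays are already in the intended order.

```scala
// Hypothetical helper (not from the thread): walks two value arrays in
// blocks of `size` elements and applies `compare` to each pair of blocks.
// If the arrays differ in length, zip stops at the shorter block sequence.
def compareInBlocks[T](a: Array[Int], b: Array[Int], size: Int = 5)(
    compare: (Seq[Int], Seq[Int]) => T): List[T] =
  a.toSeq.grouped(size)
    .zip(b.toSeq.grouped(size))
    .map { case (x, y) => compare(x, y) }
    .toList
```

For example, compareInBlocks(id1Values, id3Values)(_ == _) would report, block by block, whether the two IDs hold the same values in the same order; id1Values and id3Values stand for the arrays collected for ID1 and ID3.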
