Need to order iterator values in spark dataframe

Ranjan, Abhinav Thu, 26 Mar 2020 09:54:22 -0700

Hi,

I have a dataframe which has data like:


key  |  code  |  code_value
1    |  c1    |  11
1    |  c2    |  12
1    |  c2    |  9
1    |  c3    |  12
1    |  c2    |  13
1    |  c2    |  14
1    |  c4    |  12
1    |  c2    |  15
1    |  c1    |  12

I need to group the data by key and then apply some custom logic to every value in each group. So I did this:


Let's suppose the data is in a dataframe df.

case class key_class(key: String, code: String, code_value: String)


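import spark.implicits._  // assumes a SparkSession named `spark`; needed for .as[...] and .toDF

df
  .as[key_class]
  .groupByKey(_.key)
  .mapGroups { (key, groupedValues) =>
    // groupedValues is an Iterator[key_class] with no guaranteed order
    groupedValues.map { row =>
      // do some custom logic on row
      "SUCCESS"
    }.toList  // last expression, so this list is the value returned for the group
  }
  .toDF("status")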

The issue with the above approach is that the values I get after applying groupByKey are not sorted/ordered. I want the values within each group to be sorted by the column 'code'.


These are the options I have considered:

1. Get the values in a list and then sort it ==> this will result in OOM if the iterator is too big.

2. Apply a secondary sort somehow, but the problem with that approach is that I have to keep track of the key change.

3. sortWithinPartitions cannot be applied because groupBy will mess up the order.


4. Another approach is:

df
  .as[key_class]
  .sort("key", "code")  // one sort on both columns; a second chained .sort replaces the first ordering
  .map { row =>
    // do stuff here
    row
  }

but here too I have to keep track of the key change inside the map function, and sometimes this also overflows if the keys are skewed.
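
The key-change bookkeeping mentioned in options 2 and 4 can at least be isolated in one small helper. Below is a minimal sketch in plain Scala (no Spark dependency), assuming the rows arrive already ordered by key and then by code. `KeyRow`, `SortedRuns`, and `foldRuns` are hypothetical names used for illustration, not Spark APIs. The helper folds each run of equal keys into one result while holding only the accumulator in memory, not the whole group.

```scala
// Hypothetical row type mirroring the columns in the question.
final case class KeyRow(key: String, code: String, codeValue: String)

object SortedRuns {
  /** Folds every run of consecutive rows that share a key into one result.
    * Assumption: `rows` is already ordered by (key, code), so all rows of a
    * key are adjacent and `step` sees them in code order.
    */
  def foldRuns[A](rows: Iterator[KeyRow], zero: => A)(
      step: (A, KeyRow) => A): Iterator[(String, A)] = {
    val buffered = rows.buffered // lets us peek at the next row's key
    new Iterator[(String, A)] {
      def hasNext: Boolean = buffered.hasNext
      def next(): (String, A) = {
        val key = buffered.head.key
        var acc = zero // fresh accumulator for each key
        // Consume rows until the key changes or the input ends.
        while (buffered.hasNext && buffered.head.key == key) {
          acc = step(acc, buffered.next())
        }
        (key, acc)
      }
    }
  }
}

object SortedRunsDemo {
  def main(args: Array[String]): Unit = {
    val rows = Iterator(
      KeyRow("1", "c1", "11"),
      KeyRow("1", "c2", "9"),
      KeyRow("2", "c1", "5")
    )
    // Count the rows per key without materialising any group.
    SortedRuns.foldRuns(rows, 0)((n, _) => n + 1).foreach(println)
  }
}
```

If this fits, one place to call it would be inside mapPartitions, after repartition($"key") followed by sortWithinPartitions("key", "code"), so that every key lives in a single partition and its rows arrive in code order. That wiring is untested here, and a heavily skewed key would still land in one partition.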

So is there any way to get the values sorted after grouping them by a key?

Thanks,
Abhinav
