Hi Koert & Robin,

Thanks! But if you go through the blog
https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
and check the comments under it, the approach actually works, although I am
not sure how. And yes, I agree a custom aggregate UDAF is a good option.
Can anyone share the best way to implement this in Spark? (Two quick
sketches are appended after the quoted thread below.)

Regards,
Rabin Banerjee

On Thu, Nov 3, 2016 at 6:59 PM, Koert Kuipers <ko...@tresata.com> wrote:

> Just realized you only want to keep the first element. You can do this
> without sorting, by doing something similar to a min or max operation
> using a custom aggregator/UDAF or reduceGroups on Dataset. This is also
> more efficient.
>
> On Nov 3, 2016 7:53 AM, "Rabin Banerjee" <dev.rabin.baner...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I want to do a DataFrame operation to find the rows having the latest
>> timestamp in each group, using the operation below:
>>
>> df.orderBy(desc("transaction_date"))
>>   .groupBy("mobileno")
>>   .agg(first("customername").as("customername"),
>>        first("service_type").as("service_type"),
>>        first("cust_addr").as("cust_addr"))
>>   .select("customername", "service_type", "mobileno", "cust_addr")
>>
>> Spark version: 1.6.x
>>
>> My question is: will Spark guarantee the order during the groupBy if the
>> DataFrame was previously ordered using orderBy, in Spark 1.6.x?
>>
>> I referred to a blog here:
>> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
>> which claims it will work, except in Spark 1.5.1 and 1.5.2.
>>
>> I need a bit of elaboration on how Spark handles this internally. Also,
>> is it more efficient than using a window function?
>>
>> Thanks in advance,
>> Rabin Banerjee
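For reference, here is a minimal sketch of the window-function alternative
asked about above. It assumes a DataFrame `df` with the columns from the
question; note that in Spark 1.6, window functions require a HiveContext
rather than a plain SQLContext.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, desc, row_number}

    // Rank rows within each mobileno group by transaction_date, newest first.
    val byMobile = Window.partitionBy("mobileno").orderBy(desc("transaction_date"))

    val latest = df
      .withColumn("rn", row_number().over(byMobile))  // rn = 1 is the newest row per group
      .where(col("rn") === 1)
      .drop("rn")
      .select("customername", "service_type", "mobileno", "cust_addr")

Unlike orderBy-then-groupBy, this does not rely on any ordering guarantee
surviving the groupBy: the ordering is scoped to each partition of the window.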
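And a sketch of Koert's suggestion: reduce each group to its latest row with
a max-like comparison, without any global sort. This uses Spark 2.x Dataset
syntax (groupByKey/reduceGroups); in 1.6 the experimental Dataset API had a
similar groupBy(...).reduce(...) shape. The `Txn` case class is a hypothetical
stand-in for the actual schema, and `spark` is an assumed SparkSession.

    import java.sql.Timestamp

    // Hypothetical case class mirroring the DataFrame schema from the question.
    case class Txn(mobileno: String,
                   transaction_date: Timestamp,
                   customername: String,
                   service_type: String,
                   cust_addr: String)

    import spark.implicits._  // assumes a SparkSession named `spark`

    // Keep, per mobileno, the row with the latest transaction_date.
    val latest = df.as[Txn]
      .groupByKey(_.mobileno)
      .reduceGroups((a, b) => if (a.transaction_date.after(b.transaction_date)) a else b)
      .map { case (_, txn) => txn }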