I don't think the semantics of groupBy guarantee any ordering, whatever the implementation details or the observed behaviour happen to be. I would use a Window operation and order within each group instead.
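For example, picking the latest row per mobileno could look something like the sketch below (column and DataFrame names are taken from the question; I haven't run this against a cluster, so treat it as an outline rather than tested code):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// Rank rows within each mobileno group, newest transaction_date first.
val w = Window.partitionBy("mobileno").orderBy(desc("transaction_date"))

// Keep only the top-ranked (latest) row per group, then drop the helper column.
val latest = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")
  .select("customername", "service_type", "mobileno", "cust_addr")
```

Unlike orderBy-then-groupBy, the ordering here is part of the window specification itself, so it is explicit in the plan rather than relying on an ordering surviving the shuffle.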
> On 3 Nov 2016, at 11:53, Rabin Banerjee <dev.rabin.baner...@gmail.com> wrote:
>
> Hi All,
>
> I want to do a dataframe operation to find the rows having the latest
> timestamp in each group, using the below operation:
>
> df.orderBy(desc("transaction_date")).groupBy("mobileno")
>   .agg(first("customername").as("customername"),
>        first("service_type").as("service_type"),
>        first("cust_addr").as("cust_abbr"))
>   .select("customername", "service_type", "mobileno", "cust_addr")
>
> Spark Version :: 1.6.x
>
> My question is: "Will Spark guarantee the order while doing the groupBy,
> if the DF was previously ordered using orderBy, in Spark 1.6.x?"
>
> I referred to a blog here:
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
> which claims it will work, except in Spark 1.5.1 and 1.5.2.
>
> I need a bit of elaboration on how Spark handles this internally. Also,
> is it more efficient than using a Window function?
>
> Thanks in advance,
> Rabin Banerjee