I don't think the semantics of groupBy guarantee any ordering, whatever the implementation details or the observed behaviour happen to be. I would use a Window operation and order within each group instead.
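For example, picking the latest row per mobileno could look something like the sketch below (column and DataFrame names are taken from the question; I haven't run this against a cluster, so treat it as an outline rather than tested code):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// Rank rows within each mobileno group, newest transaction_date first.
val w = Window.partitionBy("mobileno").orderBy(desc("transaction_date"))

// Keep only the top-ranked (latest) row per group, then drop the helper column.
val latest = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")
  .select("customername", "service_type", "mobileno", "cust_addr")
```

Unlike orderBy-then-groupBy, the ordering here is part of the window specification itself, so it is explicit in the plan rather than relying on an ordering surviving the shuffle.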
> On 3 Nov 2016, at 11:53, Rabin Banerjee <dev.rabin.baner...@gmail.com> wrote:
>
> Hi All,
>
> I want to do a dataframe operation to find the rows having the latest
> timestamp in each group, using the below operation:
>
> df.orderBy(desc("transaction_date")).groupBy("mobileno")
>   .agg(first("customername").as("customername"),
>        first("service_type").as("service_type"),
>        first("cust_addr").as("cust_abbr"))
>   .select("customername", "service_type", "mobileno", "cust_addr")
>
> Spark Version :: 1.6.x
>
> My question is: "Will Spark guarantee the order while doing the groupBy,
> if the DF was previously ordered using orderBy, in Spark 1.6.x?"
>
> I referred to a blog here:
> https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
> which claims it will work, except in Spark 1.5.1 and 1.5.2.
>
> I need a bit of elaboration on how Spark handles this internally. Also,
> is it more efficient than using a Window function?
>
> Thanks in advance,
> Rabin Banerjee