Hi all,

I tried to do something like the following in Spark:

    df.orderBy('col1, 'col2).groupBy('col1).agg(first('col3))

I was hoping to get, for each col1 value, the col3 value that corresponds to the highest value of col2 within that col1 group. This only works if the ordering on col2 is preserved through the groupBy step.

https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/ suggests that it is: unlike RDD.groupBy, DataFrame.groupBy is described there as preserving the order. Yet in my experiments the order is not always preserved, and running the same code multiple times gives me different results.

If this is a bug, I'll happily work on a reproducible example and post it to JIRA, but I thought I'd check with the mailing list first in case this is, in fact, the expected behaviour.

Thanks,
Timothée
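
P.S. For comparison, here is a minimal sketch of one way to express the same query without relying on row order surviving the groupBy. It assumes a Spark version that provides SparkSession and supports ordering on struct columns; the sample data and the names TopPerGroup and top are made up for illustration. Taking max over struct(col2, col3) compares col2 first, so the col3 that comes out belongs to the row with the highest col2 (ties on col2 fall back to the larger col3).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, struct}

object TopPerGroup {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption: SparkSession is available).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("top-per-group")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with the three columns from the question.
    val df = Seq(
      ("a", 1, "x"),
      ("a", 3, "y"),
      ("b", 2, "z")
    ).toDF("col1", "col2", "col3")

    // Structs are compared field by field, so max(struct(col2, col3))
    // selects the row with the highest col2 within each col1 group,
    // independent of how the rows happen to be ordered or partitioned.
    val result = df
      .groupBy($"col1")
      .agg(max(struct($"col2", $"col3")).as("top"))
      .select($"col1", $"top.col3".as("col3"))

    result.show()
    spark.stop()
  }
}
```

Since this does not depend on any ordering guarantee, it should give the same answer on every run, which might also make it a useful baseline when checking the orderBy/groupBy/first results.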