Hi all,

I tried to do something like the following in Spark:

df.orderBy('col1, 'col2.desc).groupBy('col1).agg(first('col3))

I was hoping to get, within each col1 value, the value for col3 that
corresponds to the highest value for col2 within that col1 group. This only
works if the order on col2 is preserved after the groupBy step.

https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
suggests that it is (unlike RDD.groupBy, DataFrame.groupBy is described as
preserving the order).

Yet in my experiments, I find that in some cases the order is not
preserved. Running the same code multiple times gives me different results.

If this is a bug, I'll happily work on a reproducible example and post it to
JIRA, but I thought I'd check with the mailing list first in case this is,
in fact, the expected behaviour.
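
For reference, here is a rough sketch of a way to express the same intent
without relying on row order surviving the groupBy. It takes the max of a
(col2, col3) struct per col1 group and then pulls col3 back out. The object
name MaxByGroup, the helper topCol3PerGroup and the sample rows are made up
for illustration. It also assumes a Spark version that provides SparkSession
and supports ordering on struct columns in max, which may not hold on older
releases.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, struct}

// Hypothetical names throughout; a sketch, not a definitive implementation.
object MaxByGroup {

  // For each col1 group, return the col3 value from the row with the
  // highest col2. Structs compare field by field, so max picks the
  // largest col2 and only falls back to col3 to break ties.
  def topCol3PerGroup(df: DataFrame): DataFrame =
    df.groupBy(col("col1"))
      .agg(max(struct(col("col2"), col("col3"))).as("top"))
      .select(col("col1"), col("top").getField("col3").as("col3"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("max-by-group")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val df = Seq(
      ("a", 1, "x"),
      ("a", 3, "y"),
      ("b", 2, "z"),
      ("b", 1, "w")
    ).toDF("col1", "col2", "col3")

    topCol3PerGroup(df).show()
    spark.stop()
  }
}
```

Because the result comes from an aggregate over values rather than from row
position, it should not change between runs. One caveat: when two rows in a
group share the same highest col2, this picks the one with the larger col3.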

Thanks
Timothée
