Is DataFrame.groupBy supposed to preserve order within groups?

Timothée Carayol Thu, 17 Dec 2015 23:56:26 -0800

Hi all,

I tried to do something like the following in Spark:


df.orderBy('col1, 'col2.desc).groupBy('col1).agg(first('col3))

I was hoping to get, within each col1 value, the value for col3 that
corresponds to the highest value for col2 within that col1 group. This only
works if the order on col2 is preserved after the groupBy step.

https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/
suggests that it is (unlike RDD.groupBy, DataFrame.groupBy is described as
preserving the order).

Yet in my experiments, I find that in some cases the order is not
preserved. Running the same code multiple times gives me different results.
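
For reference, one way to state the intended result without relying on any row order surviving the groupBy is to put the ordering inside a window function. The following is only a minimal sketch, assuming Spark 2.x or later (SparkSession, row_number); the object name FirstPerGroup, the helper col3AtMaxCol2, the temporary column rn, and the sample rows are all made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical helper object, not part of Spark.
object FirstPerGroup {

  // For each col1, keep the col3 of the row with the highest col2.
  // The ordering is declared in the window spec, so the result does not
  // depend on whatever row order a shuffle happens to leave behind.
  def col3AtMaxCol2(df: DataFrame): DataFrame = {
    val w = Window.partitionBy(col("col1")).orderBy(col("col2").desc)
    df.withColumn("rn", row_number().over(w)) // rn = 1 is the highest col2
      .where(col("rn") === 1)
      .select("col1", "col3")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimenting; settings are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("first-per-group")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data, deliberately not sorted by col2.
    val df = Seq(
      ("a", 1, "x"),
      ("a", 3, "y"),
      ("b", 2, "z"),
      ("a", 2, "w")
    ).toDF("col1", "col2", "col3")

    col3AtMaxCol2(df).show()
    spark.stop()
  }
}
```

With the sample rows above this keeps y for group a and z for group b. If two rows in a group tie on col2, row_number still picks one of them arbitrarily, so a tie-breaking column would need to be added to the window ordering to make the result fully repeatable.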

If this is a bug, I'll happily work on a reproducible example and post to
JIRA but I thought I'd check with the mailing list first in case that is,
in fact, the expected behaviour?

Thanks
Timothée
