Seems like a bug.
On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xell...@gmail.com> wrote:
> Hi Devs,
>
> I am seeing some behavior with window functions that is a bit unintuitive
> and would like to get some clarification.
>
> When using an aggregation function with a window, the frame boundary seems
> to change depending on the ordering of the window.
>
> Example:
>
> (1)
>
> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
> w1 = Window.partitionBy('id')
> df.withColumn('v2', mean(df.v).over(w1)).show()
>
> +---+---+---+
> | id|  v| v2|
> +---+---+---+
> |  0|  1|2.0|
> |  0|  2|2.0|
> |  0|  3|2.0|
> +---+---+---+
>
> (2)
>
> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
> w2 = Window.partitionBy('id').orderBy('v')
> df.withColumn('v2', mean(df.v).over(w2)).show()
>
> +---+---+---+
> | id|  v| v2|
> +---+---+---+
> |  0|  1|1.0|
> |  0|  2|1.5|
> |  0|  3|2.0|
> +---+---+---+
>
> It seems that orderBy('v') in example (2) also changes the frame
> boundaries from (unboundedPreceding, unboundedFollowing) to
> (unboundedPreceding, currentRow).
>
> I find this behavior a bit unintuitive. I wonder if this behavior is by
> design and, if so, what is the specific rule by which orderBy() interacts
> with frame boundaries?
>
> Thanks,
> Li
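
For what it's worth, the two outputs above are consistent with the SQL-standard default frames: without orderBy the frame covers the whole partition, while with orderBy it defaults to (unboundedPreceding, currentRow), i.e. a running frame. A pure-Python sketch of the two semantics (the helper names are mine, not Spark API) reproduces both result columns:

```python
def partition_mean(vs):
    # No ORDER BY: frame is (unboundedPreceding, unboundedFollowing),
    # so every row in the partition sees the same overall mean.
    m = sum(vs) / len(vs)
    return [m] * len(vs)

def running_mean(vs):
    # With ORDER BY: default frame is (unboundedPreceding, currentRow),
    # so each row averages only itself and the rows ordered before it.
    out, total = [], 0
    for i, v in enumerate(sorted(vs), start=1):
        total += v
        out.append(total / i)
    return out

print(partition_mean([1, 2, 3]))  # matches example (1): [2.0, 2.0, 2.0]
print(running_mean([1, 2, 3]))    # matches example (2): [1.0, 1.5, 2.0]
```

If the whole-partition frame is wanted together with an ordering, it can be requested explicitly via Window.partitionBy('id').orderBy('v').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing), which gives the example (1) output even with orderBy present.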