Hi Devs,

I am seeing some window function behavior that strikes me as unintuitive and would like to get some clarification.
When using an aggregation function over a window, the frame boundaries seem to change depending on whether the window is ordered. Example:

(1)
from pyspark.sql import Window
from pyspark.sql.functions import mean

df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
w1 = Window.partitionBy('id')
df.withColumn('v2', mean(df.v).over(w1)).show()

+---+---+---+
| id|  v| v2|
+---+---+---+
|  0|  1|2.0|
|  0|  2|2.0|
|  0|  3|2.0|
+---+---+---+

(2)
df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
w2 = Window.partitionBy('id').orderBy('v')
df.withColumn('v2', mean(df.v).over(w2)).show()

+---+---+---+
| id|  v| v2|
+---+---+---+
|  0|  1|1.0|
|  0|  2|1.5|
|  0|  3|2.0|
+---+---+---+

It seems that orderBy('v') in example (2) also changes the frame boundaries from (unboundedPreceding, unboundedFollowing) to (unboundedPreceding, currentRow). I find this behavior a bit unintuitive. Is this behavior by design, and if so, what is the specific rule for how orderBy() interacts with frame boundaries?

Thanks,
Li
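P.S. For what it's worth, explicitly setting the frame on the ordered window via rowsBetween seems to restore the example (1) behavior, which makes me suspect orderBy() only changes the *default* frame rather than forcing a running aggregate:

(3)
# Same df as above; explicitly request the full partition as the frame,
# even though the window is ordered.
w3 = (Window.partitionBy('id').orderBy('v')
      .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df.withColumn('v2', mean(df.v).over(w3)).show()

This gives v2 = 2.0 for all three rows, same as example (1). So my question is really about where the (unboundedPreceding, currentRow) default for ordered windows comes from.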