Hi Li,

You might find my pending PR useful: https://github.com/apache/spark/pull/20045/files
It contains a large set of test cases covering the windowing functionality, showing and checking the behavior of a number of special cases.

On Wed, Apr 4, 2018 at 4:26 AM, Reynold Xin <r...@databricks.com> wrote:
> Thanks Li!
>
> On Tue, Apr 3, 2018 at 7:23 PM Li Jin <ice.xell...@gmail.com> wrote:
>
>> Thanks all for the explanation. I am happy to update the API doc.
>>
>> https://issues.apache.org/jira/browse/SPARK-23861
>>
>> On Tue, Apr 3, 2018 at 8:54 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Ah ok. Thanks for commenting. Every day I learn something new about SQL.
>>>
>>> For others to follow, SQL Server has a good explanation of the behavior:
>>> https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql
>>>
>>> Can somebody (Li?) update the API documentation to specify the gotchas,
>>> in case users are not familiar with SQL window function semantics?
>>>
>>> General Remarks
>>> <https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql#general-remarks>
>>>
>>> More than one window function can be used in a single query with a
>>> single FROM clause. The OVER clause for each function can differ in
>>> partitioning and ordering.
>>>
>>> If PARTITION BY is not specified, the function treats all rows of the
>>> query result set as a single group.
>>>
>>> Important!
>>> <https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql#important>
>>>
>>> If ROWS/RANGE is specified and <window frame preceding> is used for
>>> <window frame extent> (short syntax), then this specification is used for
>>> the window frame boundary starting point and CURRENT ROW is used for the
>>> boundary ending point. For example, “ROWS 5 PRECEDING” is equal to
>>> “ROWS BETWEEN 5 PRECEDING AND CURRENT ROW”.
>>>
>>> Note
>>>
>>> If ORDER BY is not specified, the entire partition is used for a window
>>> frame. This applies only to functions that do not require an ORDER BY clause.
>>> If ROWS/RANGE is not specified but ORDER BY is specified, RANGE BETWEEN
>>> UNBOUNDED PRECEDING AND CURRENT ROW is used as the default window frame.
>>> This applies only to functions that can accept an optional ROWS/RANGE
>>> specification. For example, ranking functions cannot accept ROWS/RANGE,
>>> therefore this window frame is not applied even though ORDER BY is
>>> present and ROWS/RANGE is not.
>>>
>>> On Tue, Apr 3, 2018 at 5:50 PM, Xingbo Jiang <jiangxb1...@gmail.com> wrote:
>>>
>>>> This is actually by design: without an `ORDER BY` clause, all rows are
>>>> considered peers of the current row, which means that the frame is
>>>> effectively the entire partition. This behavior follows the window
>>>> syntax of PostgreSQL.
>>>> You can refer to the comment by yhuai:
>>>> https://github.com/apache/spark/pull/5604#discussion_r157931911
>>>> :)
>>>>
>>>> 2018-04-04 6:27 GMT+08:00 Reynold Xin <r...@databricks.com>:
>>>>
>>>>> Do other (non-Hive) SQL systems do the same thing?
>>>>>
>>>>> On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier <
>>>>> her...@databricks.com> wrote:
>>>>>
>>>>>> This is something we inherited from Hive:
>>>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
>>>>>>
>>>>>>> When ORDER BY is specified with a missing WINDOW clause, the WINDOW
>>>>>>> specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND
>>>>>>> CURRENT ROW.
>>>>>>
>>>>>>> When both ORDER BY and WINDOW clauses are missing, the WINDOW
>>>>>>> specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND
>>>>>>> UNBOUNDED FOLLOWING.
>>>>>>
>>>>>> It sort of makes sense if you think about it. If there is no ordering,
>>>>>> there is no way to have a bounded frame. If there is an ordering, we
>>>>>> default to the most commonly used deterministic frame.
>>>>>>
>>>>>> On Tue, Apr 3, 2018 at 11:09 PM, Reynold Xin <r...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Seems like a bug.
>>>>>>>
>>>>>>> On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xell...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Devs,
>>>>>>>>
>>>>>>>> I am seeing some behavior with window functions that is a bit
>>>>>>>> unintuitive and would like to get some clarification.
>>>>>>>>
>>>>>>>> When using an aggregation function with a window, the frame boundary
>>>>>>>> seems to change depending on the ordering of the window.
>>>>>>>>
>>>>>>>> Example:
>>>>>>>>
>>>>>>>> (1)
>>>>>>>>
>>>>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>>>>> w1 = Window.partitionBy('id')
>>>>>>>> df.withColumn('v2', mean(df.v).over(w1)).show()
>>>>>>>>
>>>>>>>> +---+---+---+
>>>>>>>> | id|  v| v2|
>>>>>>>> +---+---+---+
>>>>>>>> |  0|  1|2.0|
>>>>>>>> |  0|  2|2.0|
>>>>>>>> |  0|  3|2.0|
>>>>>>>> +---+---+---+
>>>>>>>>
>>>>>>>> (2)
>>>>>>>>
>>>>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>>>>> w2 = Window.partitionBy('id').orderBy('v')
>>>>>>>> df.withColumn('v2', mean(df.v).over(w2)).show()
>>>>>>>>
>>>>>>>> +---+---+---+
>>>>>>>> | id|  v| v2|
>>>>>>>> +---+---+---+
>>>>>>>> |  0|  1|1.0|
>>>>>>>> |  0|  2|1.5|
>>>>>>>> |  0|  3|2.0|
>>>>>>>> +---+---+---+
>>>>>>>>
>>>>>>>> It seems like orderBy('v') in example (2) also changes the frame
>>>>>>>> boundaries from (unboundedPreceding, unboundedFollowing) to
>>>>>>>> (unboundedPreceding, currentRow).
>>>>>>>>
>>>>>>>> I find this behavior a bit unintuitive. I wonder if this behavior is
>>>>>>>> by design and, if so, what is the specific rule for how orderBy()
>>>>>>>> interacts with frame boundaries?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Li
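P.S. For anyone skimming the thread, the two default frames discussed above can be simulated in a few lines of plain Python. This is only an illustration of the semantics, not Spark code; the function name `frame_mean` is mine, and the simulation treats the RANGE running frame as a simple prefix over the sorted values, which matches the Spark output here because the values are distinct (rows with tied ORDER BY values would be peers and share a result).

```python
def frame_mean(values, use_order_by):
    """Simulate mean() over the default window frames used by Spark/Hive.

    - without ORDER BY: frame is the entire partition
      (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    - with ORDER BY: frame is RANGE BETWEEN UNBOUNDED PRECEDING AND
      CURRENT ROW, i.e. a running mean over the sorted partition
    """
    if not use_order_by:
        whole = sum(values) / len(values)
        return [whole] * len(values)
    ordered = sorted(values)
    # Prefix means: for each row, average everything up to and including it.
    return [sum(ordered[:i + 1]) / (i + 1) for i in range(len(ordered))]

v = [1, 2, 3]
print(frame_mean(v, use_order_by=False))  # [2.0, 2.0, 2.0] -- matches example (1)
print(frame_mean(v, use_order_by=True))   # [1.0, 1.5, 2.0] -- matches example (2)
```

In PySpark itself, the way to keep the whole-partition result while still ordering the window is to set the frame explicitly, e.g. `w2.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)`.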