Thanks Li! On Tue, Apr 3, 2018 at 7:23 PM Li Jin <ice.xell...@gmail.com> wrote:
> Thanks all for the explanation. I am happy to update the API doc. > > https://issues.apache.org/jira/browse/SPARK-23861 > > On Tue, Apr 3, 2018 at 8:54 PM, Reynold Xin <r...@databricks.com> wrote: > >> Ah ok. Thanks for commenting. Everyday I learn something new about SQL. >> >> For others to follow, SQL Server has a good explanation of the behavior: >> https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql >> >> >> Can somebody (Li?) update the API documentation to specify the gotchas, >> in case users are not familiar with SQL window function semantics? >> >> >> >> General Remarks >> <https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql#general-remarks> >> >> More than one window function can be used in a single query with a single >> FROM clause. The OVER clause for each function can differ in partitioning >> and ordering. >> >> If PARTITION BY is not specified, the function treats all rows of the >> query result set as a single group. >> Important! >> <https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql#important> >> >> If ROWS/RANGE is specified and <window frame preceding> is used for >> <window frame extent> (short syntax) then this specification is used for >> the window frame boundary starting point and CURRENT ROW is used for the >> boundary ending point. For example “ROWS 5 PRECEDING” is equal to “ROWS >> BETWEEN 5 PRECEDING AND CURRENT ROW”. >> >> Note+ >> >> If ORDER BY is not specified entire partition is used for a window frame. >> This applies only to functions that do not require ORDER BY clause. If >> ROWS/RANGE is not specified but ORDER BY is specified, RANGE UNBOUNDED >> PRECEDING AND CURRENT ROW is used as default for window frame. This applies >> only to functions that have can accept optional ROWS/RANGE specification. >> For example, ranking functions cannot accept ROWS/RANGE, therefore this >> window frame is not applied even though ORDER BY is present and ROWS/RANGE >> is not. >> >> >> >> >> >> On Tue, Apr 3, 2018 at 5:50 PM, Xingbo Jiang <jiangxb1...@gmail.com> >> wrote: >> >>> This is actually by design, without a `ORDER BY` clause, all rows are >>> considered as the peer row of the current row, which means that the frame >>> is effectively the entire partition. This behavior follows the window >>> syntax of PGSQL. >>> You can refer to the comment by yhuai: >>> https://github.com/apache/spark/pull/5604#discussion_r157931911 >>> :) >>> >>> 2018-04-04 6:27 GMT+08:00 Reynold Xin <r...@databricks.com>: >>> >>>> Do other (non-Hive) SQL systems do the same thing? >>>> >>>> On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier < >>>> her...@databricks.com> wrote: >>>> >>>>> This is something we inherited from Hive: >>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics >>>>> >>>>> When ORDER BY is specified with missing WINDOW clause, the WINDOW >>>>>> specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND >>>>>> CURRENT ROW. >>>>> >>>>> When both ORDER BY and WINDOW clauses are missing, the WINDOW >>>>>> specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND >>>>>> UNBOUNDED FOLLOWING. >>>>> >>>>> >>>>> It sort of makes sense if you think about it. If there is no ordering >>>>> there is no way to have a bound frame. If there is ordering we default to >>>>> the most commonly used deterministic frame. >>>>> >>>>> >>>>> On Tue, Apr 3, 2018 at 11:09 PM, Reynold Xin <r...@databricks.com> >>>>> wrote: >>>>> >>>>>> Seems like a bug. >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xell...@gmail.com> wrote: >>>>>> >>>>>>> Hi Devs, >>>>>>> >>>>>>> I am seeing some behavior with window functions that is a bit >>>>>>> unintuitive and would like to get some clarification. >>>>>>> >>>>>>> When using aggregation function with window, the frame boundary >>>>>>> seems to change depending on the order of the window. >>>>>>> >>>>>>> Example: >>>>>>> (1) >>>>>>> >>>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v') >>>>>>> >>>>>>> w1 = Window.partitionBy('id') >>>>>>> >>>>>>> df.withColumn('v2', mean(df.v).over(w1)).show() >>>>>>> >>>>>>> +---+---+---+ >>>>>>> >>>>>>> | id| v| v2| >>>>>>> >>>>>>> +---+---+---+ >>>>>>> >>>>>>> | 0| 1|2.0| >>>>>>> >>>>>>> | 0| 2|2.0| >>>>>>> >>>>>>> | 0| 3|2.0| >>>>>>> >>>>>>> +---+---+---+ >>>>>>> >>>>>>> (2) >>>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v') >>>>>>> >>>>>>> w2 = Window.partitionBy('id').orderBy('v') >>>>>>> >>>>>>> df.withColumn('v2', mean(df.v).over(w2)).show() >>>>>>> >>>>>>> +---+---+---+ >>>>>>> >>>>>>> | id| v| v2| >>>>>>> >>>>>>> +---+---+---+ >>>>>>> >>>>>>> | 0| 1|1.0| >>>>>>> >>>>>>> | 0| 2|1.5| >>>>>>> >>>>>>> | 0| 3|2.0| >>>>>>> >>>>>>> +---+---+---+ >>>>>>> >>>>>>> Seems like orderBy('v') in the example (2) also changes the frame >>>>>>> boundaries from ( >>>>>>> >>>>>>> unboundedPreceding, unboundedFollowing) to (unboundedPreceding, >>>>>>> currentRow). >>>>>>> >>>>>>> >>>>>>> I found this behavior a bit unintuitive. I wonder if this behavior >>>>>>> is by design and if so, what's the specific rule that orderBy() >>>>>>> interacts >>>>>>> with frame boundaries? >>>>>>> >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Li >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >