Re: Clarify window behavior in Spark SQL

Reynold Xin Tue, 03 Apr 2018 19:27:08 -0700

Thanks Li!

On Tue, Apr 3, 2018 at 7:23 PM Li Jin <ice.xell...@gmail.com> wrote:


> Thanks all for the explanation. I am happy to update the API doc.
>
> https://issues.apache.org/jira/browse/SPARK-23861
>
> On Tue, Apr 3, 2018 at 8:54 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Ah ok. Thanks for commenting. Everyday I learn something new about SQL.
>>
>> For others to follow, SQL Server has a good explanation of the behavior:
>> https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql
>>
>>
>> Can somebody (Li?) update the API documentation to specify the gotchas,
>> in case users are not familiar with SQL window function semantics?
>>
>>
>>
>> General Remarks
>> <https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql#general-remarks>
>>
>> More than one window function can be used in a single query with a single
>> FROM clause. The OVER clause for each function can differ in partitioning
>> and ordering.
>>
>> If PARTITION BY is not specified, the function treats all rows of the
>> query result set as a single group.
>> Important!
>> <https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql#important>
>>
>> If ROWS/RANGE is specified and <window frame preceding> is used for
>> <window frame extent> (short syntax) then this specification is used for
>> the window frame boundary starting point and CURRENT ROW is used for the
>> boundary ending point. For example “ROWS 5 PRECEDING” is equal to “ROWS
>> BETWEEN 5 PRECEDING AND CURRENT ROW”.
>>
>> Note+
>>
>> If ORDER BY is not specified entire partition is used for a window frame.
>> This applies only to functions that do not require ORDER BY clause. If
>> ROWS/RANGE is not specified but ORDER BY is specified, RANGE UNBOUNDED
>> PRECEDING AND CURRENT ROW is used as default for window frame. This applies
>> only to functions that have can accept optional ROWS/RANGE specification.
>> For example, ranking functions cannot accept ROWS/RANGE, therefore this
>> window frame is not applied even though ORDER BY is present and ROWS/RANGE
>> is not.
>>
>>
>>
>>
>>
>> On Tue, Apr 3, 2018 at 5:50 PM, Xingbo Jiang <jiangxb1...@gmail.com>
>> wrote:
>>
>>> This is actually by design, without a `ORDER BY` clause, all rows are
>>> considered as the peer row of the current row, which means that the frame
>>> is effectively the entire partition. This behavior follows the window
>>> syntax of PGSQL.
>>> You can refer to the comment by yhuai:
>>> https://github.com/apache/spark/pull/5604#discussion_r157931911
>>> :)
>>>
>>> 2018-04-04 6:27 GMT+08:00 Reynold Xin <r...@databricks.com>:
>>>
>>>> Do other (non-Hive) SQL systems do the same thing?
>>>>
>>>> On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier <
>>>> her...@databricks.com> wrote:
>>>>
>>>>> This is something we inherited from Hive:
>>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
>>>>>
>>>>> When ORDER BY is specified with missing WINDOW clause, the WINDOW
>>>>>> specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND
>>>>>> CURRENT ROW.
>>>>>
>>>>> When both ORDER BY and WINDOW clauses are missing, the WINDOW
>>>>>> specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND
>>>>>> UNBOUNDED FOLLOWING.
>>>>>
>>>>>
>>>>> It sort of makes sense if you think about it. If there is no ordering
>>>>> there is no way to have a bound frame. If there is ordering we default to
>>>>> the most commonly used deterministic frame.
>>>>>
>>>>>
>>>>> On Tue, Apr 3, 2018 at 11:09 PM, Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Seems like a bug.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xell...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Devs,
>>>>>>>
>>>>>>> I am seeing some behavior with window functions that is a bit
>>>>>>> unintuitive and would like to get some clarification.
>>>>>>>
>>>>>>> When using aggregation function with window, the frame boundary
>>>>>>> seems to change depending on the order of the window.
>>>>>>>
>>>>>>> Example:
>>>>>>> (1)
>>>>>>>
>>>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>>>>
>>>>>>> w1 = Window.partitionBy('id')
>>>>>>>
>>>>>>> df.withColumn('v2', mean(df.v).over(w1)).show()
>>>>>>>
>>>>>>> +---+---+---+
>>>>>>>
>>>>>>> | id|  v| v2|
>>>>>>>
>>>>>>> +---+---+---+
>>>>>>>
>>>>>>> |  0|  1|2.0|
>>>>>>>
>>>>>>> |  0|  2|2.0|
>>>>>>>
>>>>>>> |  0|  3|2.0|
>>>>>>>
>>>>>>> +---+---+---+
>>>>>>>
>>>>>>> (2)
>>>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>>>>
>>>>>>> w2 = Window.partitionBy('id').orderBy('v')
>>>>>>>
>>>>>>> df.withColumn('v2', mean(df.v).over(w2)).show()
>>>>>>>
>>>>>>> +---+---+---+
>>>>>>>
>>>>>>> | id|  v| v2|
>>>>>>>
>>>>>>> +---+---+---+
>>>>>>>
>>>>>>> |  0|  1|1.0|
>>>>>>>
>>>>>>> |  0|  2|1.5|
>>>>>>>
>>>>>>> |  0|  3|2.0|
>>>>>>>
>>>>>>> +---+---+---+
>>>>>>>
>>>>>>> Seems like orderBy('v') in the example (2) also changes the frame
>>>>>>> boundaries from (
>>>>>>>
>>>>>>> unboundedPreceding, unboundedFollowing) to (unboundedPreceding,
>>>>>>> currentRow).
>>>>>>>
>>>>>>>
>>>>>>> I found this behavior a bit unintuitive. I wonder if this behavior
>>>>>>> is by design and if so, what's the specific rule that orderBy() 
>>>>>>> interacts
>>>>>>> with frame boundaries?
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Li
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Clarify window behavior in Spark SQL

Reply via email to