Hi Li,

You might find my pending PR useful: https://github.com/apache/spark/pull/20045/files
It contains a large set of test cases covering the windowing functionality, showing and checking the behavior of a number of special cases.

On Wed, Apr 4, 2018 at 4:26 AM, Reynold Xin <r...@databricks.com> wrote:
> Thanks Li!
>
> On Tue, Apr 3, 2018 at 7:23 PM Li Jin <ice.xell...@gmail.com> wrote:
>
>> Thanks all for the explanation. I am happy to update the API doc.
>>
>> https://issues.apache.org/jira/browse/SPARK-23861
>>
>> On Tue, Apr 3, 2018 at 8:54 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Ah ok. Thanks for commenting. Every day I learn something new about SQL.
>>>
>>> For others to follow, SQL Server has a good explanation of the behavior:
>>> https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql
>>>
>>> Can somebody (Li?) update the API documentation to specify the gotchas,
>>> in case users are not familiar with SQL window function semantics?
>>>
>>> General Remarks
>>> <https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql#general-remarks>
>>>
>>> More than one window function can be used in a single query with a
>>> single FROM clause. The OVER clause for each function can differ in
>>> partitioning and ordering.
>>>
>>> If PARTITION BY is not specified, the function treats all rows of the
>>> query result set as a single group.
>>>
>>> Important!
>>> <https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql#important>
>>>
>>> If ROWS/RANGE is specified and <window frame preceding> is used for
>>> <window frame extent> (short syntax), then this specification is used for
>>> the window frame boundary starting point and CURRENT ROW is used for the
>>> boundary ending point. For example, “ROWS 5 PRECEDING” is equal to
>>> “ROWS BETWEEN 5 PRECEDING AND CURRENT ROW”.
>>>
>>> Note
>>>
>>> If ORDER BY is not specified, the entire partition is used for a window
>>> frame. This applies only to functions that do not require an ORDER BY clause.
>>> If ROWS/RANGE is not specified but ORDER BY is specified, RANGE BETWEEN
>>> UNBOUNDED PRECEDING AND CURRENT ROW is used as the default window frame.
>>> This applies only to functions that can accept an optional ROWS/RANGE
>>> specification. For example, ranking functions cannot accept ROWS/RANGE,
>>> therefore this window frame is not applied even though ORDER BY is
>>> present and ROWS/RANGE is not.
>>>
>>> On Tue, Apr 3, 2018 at 5:50 PM, Xingbo Jiang <jiangxb1...@gmail.com> wrote:
>>>
>>>> This is actually by design: without an `ORDER BY` clause, all rows are
>>>> considered peers of the current row, which means that the frame is
>>>> effectively the entire partition. This behavior follows the window
>>>> syntax of PostgreSQL.
>>>> You can refer to the comment by yhuai:
>>>> https://github.com/apache/spark/pull/5604#discussion_r157931911
>>>> :)
>>>>
>>>> 2018-04-04 6:27 GMT+08:00 Reynold Xin <r...@databricks.com>:
>>>>
>>>>> Do other (non-Hive) SQL systems do the same thing?
>>>>>
>>>>> On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier <
>>>>> her...@databricks.com> wrote:
>>>>>
>>>>>> This is something we inherited from Hive:
>>>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
>>>>>>
>>>>>>> When ORDER BY is specified with a missing WINDOW clause, the WINDOW
>>>>>>> specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND
>>>>>>> CURRENT ROW.
>>>>>>
>>>>>>> When both ORDER BY and WINDOW clauses are missing, the WINDOW
>>>>>>> specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND
>>>>>>> UNBOUNDED FOLLOWING.
>>>>>>
>>>>>> It sort of makes sense if you think about it. If there is no ordering,
>>>>>> there is no way to have a bounded frame. If there is an ordering, we
>>>>>> default to the most commonly used deterministic frame.
>>>>>>
>>>>>> On Tue, Apr 3, 2018 at 11:09 PM, Reynold Xin <r...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Seems like a bug.
>>>>>>>
>>>>>>> On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xell...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Devs,
>>>>>>>>
>>>>>>>> I am seeing some behavior with window functions that is a bit
>>>>>>>> unintuitive and would like to get some clarification.
>>>>>>>>
>>>>>>>> When using an aggregation function with a window, the frame boundary
>>>>>>>> seems to change depending on the ordering of the window.
>>>>>>>>
>>>>>>>> Example:
>>>>>>>>
>>>>>>>> (1)
>>>>>>>>
>>>>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>>>>> w1 = Window.partitionBy('id')
>>>>>>>> df.withColumn('v2', mean(df.v).over(w1)).show()
>>>>>>>>
>>>>>>>> +---+---+---+
>>>>>>>> | id|  v| v2|
>>>>>>>> +---+---+---+
>>>>>>>> |  0|  1|2.0|
>>>>>>>> |  0|  2|2.0|
>>>>>>>> |  0|  3|2.0|
>>>>>>>> +---+---+---+
>>>>>>>>
>>>>>>>> (2)
>>>>>>>>
>>>>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>>>>> w2 = Window.partitionBy('id').orderBy('v')
>>>>>>>> df.withColumn('v2', mean(df.v).over(w2)).show()
>>>>>>>>
>>>>>>>> +---+---+---+
>>>>>>>> | id|  v| v2|
>>>>>>>> +---+---+---+
>>>>>>>> |  0|  1|1.0|
>>>>>>>> |  0|  2|1.5|
>>>>>>>> |  0|  3|2.0|
>>>>>>>> +---+---+---+
>>>>>>>>
>>>>>>>> It seems like orderBy('v') in example (2) also changes the frame
>>>>>>>> boundaries from (unboundedPreceding, unboundedFollowing) to
>>>>>>>> (unboundedPreceding, currentRow).
>>>>>>>>
>>>>>>>> I find this behavior a bit unintuitive. I wonder if this behavior is
>>>>>>>> by design and, if so, what is the specific rule for how orderBy()
>>>>>>>> interacts with frame boundaries?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Li
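P.S. For anyone skimming the thread, the two default frames discussed above can be simulated in a few lines of plain Python. This is only an illustration of the semantics, not Spark code; the function name `frame_mean` is mine, and the simulation treats the RANGE running frame as a simple prefix over the sorted values, which matches the Spark output here because the values are distinct (rows with tied ORDER BY values would be peers and share a result).

```python
def frame_mean(values, use_order_by):
    """Simulate mean() over the default window frames used by Spark/Hive.

    - without ORDER BY: frame is the entire partition
      (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    - with ORDER BY: frame is RANGE BETWEEN UNBOUNDED PRECEDING AND
      CURRENT ROW, i.e. a running mean over the sorted partition
    """
    if not use_order_by:
        whole = sum(values) / len(values)
        return [whole] * len(values)
    ordered = sorted(values)
    # Prefix means: for each row, average everything up to and including it.
    return [sum(ordered[:i + 1]) / (i + 1) for i in range(len(ordered))]

v = [1, 2, 3]
print(frame_mean(v, use_order_by=False))  # [2.0, 2.0, 2.0] -- matches example (1)
print(frame_mean(v, use_order_by=True))   # [1.0, 1.5, 2.0] -- matches example (2)
```

In PySpark itself, the way to keep the whole-partition result while still ordering the window is to set the frame explicitly, e.g. `w2.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)`.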