Sure, here you are: https://issues.apache.org/jira/browse/SPARK-18690

To be fair, I am not fully convinced it is worth it.


On 12/02/2016 12:51 AM, Reynold Xin wrote:
> Can you submit a pull request with test cases based on that change?
>
>
> On Dec 1, 2016, 9:39 AM -0800, Maciej Szymkiewicz
> <mszymkiew...@gmail.com> wrote:
>>
>> This doesn't affect that. The only concern is what we consider to be
>> UNBOUNDED on the Python side.
>>
>>
>> On 12/01/2016 07:56 AM, assaf.mendelson wrote:
>>>
>>> I may be mistaken, but if I remember correctly Spark behaves
>>> differently when the frame is bounded in the past and when it is not.
>>> Specifically, I seem to recall a fix which made sure that when there
>>> is no lower bound, the aggregation is done incrementally, row by row,
>>> instead of recomputing the whole range for each window. So I believe
>>> the Python side should be configured exactly the same as in Scala/Java
>>> so that the optimization takes place.
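>>>
>>> As a rough sketch of the two cases (using the 2.1 constants; the
>>> column names are made up):
>>>
>>>     from pyspark.sql import Window
>>>
>>>     # No lower bound: Spark can maintain a running aggregate, adding
>>>     # one row at a time as the frame grows.
>>>     running = Window.partitionBy("foo").orderBy("bar") \
>>>         .rowsBetween(Window.unboundedPreceding, Window.currentRow)
>>>
>>>     # Bounded below: the aggregate is recomputed over the full frame
>>>     # for each row.
>>>     sliding = Window.partitionBy("foo").orderBy("bar").rowsBetween(-3, 0)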
>>>
>>> Assaf.
>>>
>>>  
>>>
>>> *From:* rxin [via Apache Spark Developers List]
>>> *Sent:* Wednesday, November 30, 2016 8:35 PM
>>> *To:* Mendelson, Assaf
>>> *Subject:* Re: [SPARK-17845] [SQL][PYTHON] More self-evident window
>>> function frame boundary API
>>>
>>>  
>>>
>>> Yes, I'd define unboundedPreceding as -sys.maxsize, but also any
>>> value less than min(-sys.maxsize, _JAVA_MIN_LONG) is considered
>>> unboundedPreceding too. We need to be careful with long overflow
>>> when transferring data over to Java.
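>>>
>>> Roughly like this on the Python side (just a sketch; the helper name
>>> is made up):
>>>
>>>     import sys
>>>
>>>     _JAVA_MIN_LONG = -(1 << 63)  # smallest Java long, -2**63
>>>
>>>     def _to_java_boundary(bound):
>>>         # -sys.maxsize and anything below it is treated as UNBOUNDED
>>>         # PRECEDING, so nothing smaller than a Java long ever reaches
>>>         # the JVM.
>>>         if bound <= -sys.maxsize:
>>>             return _JAVA_MIN_LONG
>>>         return bound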
>>>
>>>  
>>>
>>>  
>>>
>>> On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz <[hidden email]> wrote:
>>>
>>> It is platform specific, so theoretically it can be larger, but
>>> 2**63 - 1 is the standard on 64-bit platforms and 2**31 - 1 on
>>> 32-bit platforms. I can submit a patch, but I am not sure how to
>>> proceed. Personally, I would set
>>>
>>> unboundedPreceding = -sys.maxsize
>>> unboundedFollowing = sys.maxsize
>>>
>>> to keep backwards compatibility.
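>>>
>>> With those values, pre-2.1 code such as
>>>
>>>     Window.partitionBy("foo").orderBy("bar") \
>>>         .rowsBetween(-sys.maxsize, sys.maxsize)
>>>
>>> would hit the sentinels exactly and still mean ROWS BETWEEN UNBOUNDED
>>> PRECEDING AND UNBOUNDED FOLLOWING.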
>>>
>>> On 11/30/2016 06:52 PM, Reynold Xin wrote:
>>>
>>>     Ah, OK. For some reason, when I did the pull request, sys.maxsize
>>>     was much larger than 2^63. Do you want to submit a patch to fix
>>>     this?
>>>
>>>      
>>>
>>>      
>>>
>>>     On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz <[hidden email]> wrote:
>>>
>>>     The problem is that -(1 << 63) is -(sys.maxsize + 1), so code
>>>     that used to work before is now off by one.
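>>>
>>>     For example, on a 64-bit platform:
>>>
>>>         import sys
>>>         assert sys.maxsize == 2 ** 63 - 1        # largest Python int sentinel
>>>         assert -(1 << 63) == -(sys.maxsize + 1)  # one below -sys.maxsize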
>>>
>>>     On 11/30/2016 06:43 PM, Reynold Xin wrote:
>>>
>>>         Can you give a repro? Anything less than -(1 << 63) is
>>>         considered negative infinity (i.e. unbounded preceding).
>>>
>>>          
>>>
>>>         On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz <[hidden email]> wrote:
>>>
>>>         Hi,
>>>
>>>         I've been looking at SPARK-17845 and I am curious whether
>>>         there is any reason to make it a breaking change. In Spark
>>>         2.0 and below we could use:
>>>
>>>
>>>             Window.partitionBy("foo").orderBy("bar") \
>>>                 .rowsBetween(-sys.maxsize, sys.maxsize)
>>>
>>>         In 2.1.0 this code will silently produce incorrect results
>>>         (ROWS BETWEEN -1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't
>>>         we make Window.unboundedPreceding equal to -sys.maxsize to
>>>         ensure backward compatibility?
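>>>
>>>         A minimal repro sketch (the data is made up):
>>>
>>>             import sys
>>>             from pyspark.sql import SparkSession, Window, functions as F
>>>
>>>             spark = SparkSession.builder.getOrCreate()
>>>             df = spark.createDataFrame(
>>>                 [("a", 1), ("a", 2), ("a", 3)], ["foo", "bar"])
>>>
>>>             w = Window.partitionBy("foo").orderBy("bar") \
>>>                 .rowsBetween(-sys.maxsize, sys.maxsize)
>>>
>>>             # 2.0 and below: every row gets the whole-partition sum (6).
>>>             # 2.1.0: the frame is no longer recognized as unbounded and
>>>             # the results silently change.
>>>             df.select(F.sum("bar").over(w).alias("total")).show()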
>>>
>>>         --
>>>
>>>         Maciej Szymkiewicz
>>>
>>>
>>>
>>>          
>>>
>>>      
>>>
>>>     -- 
>>>
>>>     Maciej Szymkiewicz
>>>
>>>      
>>>
>>>  
>>>
>>> -- 
>>> Maciej Szymkiewicz
>>>
>>>  
>>>
>>>  
>>>
>>
>> --  
>> Maciej Szymkiewicz

-- 
Maciej Szymkiewicz
