Re: SPARK-17845 - window function frame boundary API

ayan guha Sun, 09 Oct 2016 22:49:49 -0700

Hi Reynold

Thanks for asking. I am from sql world and use sparl sql with analytical
functions prety heavily.


IMHO, Window.rowsBetween() as a function name looks fine. What i would
propose would be:

Window.rowsBetween(startFrom=UNBOUNDED,endTo=CURRENT_ROW,preceeding=0,following=0)


startFrom, endTo: Determining range

preceeding,following: Anchor of current row and thus altering the range.


Calls:


//ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING

Window.rowsBetween(preceeding=3,following=3)

//ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING

Window.rowsBetween(startFrom=UNBOUNDED,preceeding=3)

//ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Window.rowsBetween(startFrom=UNBOUNDED,endTo=CURRENT_ROW)

//ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING (Not PRECEDING)

Window.rowsBetween(startFrom=CURRENT_ROW,endTo=UNBOUNDED)

//ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

Window.rowsBetween(startFrom=UNBOUNDED,endTo=UNBOUNDED)


One missing scenario (I think that is a valid one)


//ROWS BETWEEN 3 FOLLOWING AND UNBOUNDED FOLLOWING (pair of (2))

Window.rowsBetween(endTo=UNBOUNDED,following=3)


This will be closer to SQL options, IMHO.


Thoughts?


On Mon, Oct 10, 2016 at 3:50 PM, Reynold Xin <r...@databricks.com> wrote:

> Hi all,
>
> I tried to use the window function DataFrame API this weekend and found it
> awkward to use, especially with respect to specifying frame boundaries. I
> wrote down some options here and am curious your thoughts. If you have
> suggestions on the API beyond what's already listed in the JIRA ticket, do
> bring them up too.
>
> Please comment on the JIRA ticket directly: https://issues.
> apache.org/jira/browse/SPARK-17845
>
>
> I've attached the content of the JIRA ticket here to save you a click:
>
>
> ANSI SQL uses the following to specify the frame boundaries for window
> functions:
>
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
>
> In Spark's DataFrame API, we use integer values to indicate relative
> position:
>
>    - 0 means "CURRENT ROW"
>    - -1 means "1 PRECEDING"
>    - Long.MinValue means "UNBOUNDED PRECEDING"
>    - Long.MaxValue to indicate "UNBOUNDED FOLLOWING"
>
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWINGWindow.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 
> PRECEDINGWindow.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT 
> ROWWindow.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDINGWindow.rowsBetween(0, 
> Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED 
> FOLLOWINGWindow.rowsBetween(Long.MinValue, Long.MaxValue)
>
> I think using numeric values to indicate relative positions is actually a
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate
> unbounded ends is pretty confusing:
>
> 1. The API is not self-evident. There is no way for a new user to figure
> out how to indicate an unbounded frame by looking at just the API. The user
> has to read the doc to figure this out.
> 2. It is weird Long.MinValue or Long.MaxValue has some special meaning.
> 3. Different languages have different min/max values, e.g. in Python we
> use -sys.maxsize and +sys.maxsize.
>
> To make this API less confusing, we have a few options:
>
> Option 1. Add the following (additional) methods:
>
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWINGWindow.rowsBetween(-3, +3)  // 
> this one exists already// ROWS BETWEEN UNBOUNDED PRECEDING AND 3 
> PRECEDINGWindow.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT 
> ROWWindow.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED 
> PRECEDINGWindow.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED 
> FOLLOWINGWindow.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
>
> This is obviously very verbose, but is very similar to how these functions
> are done in SQL, and is perhaps the most obvious to end users, especially
> if they come from SQL background.
>
> Option 2. Decouple the specification for frame begin and frame end into
> two functions. Assume the boundary is unlimited unless specified.
>
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWINGWindow.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDINGWindow.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWWindow.rowsToCurrent() or 
> Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDINGWindow.rowsFromCurrent() 
> or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING// no need to 
> specify
>
> If we go with option 2, we should throw exceptions if users specify
> multiple from's or to's. A variant of option 2 is to require explicitly
> specification of begin/end even in the case of unbounded boundary, e.g.:
>
> Window.rowsFromBeginning().rowsTo(-3)
> or
> Window.rowsFromUnboundedPreceding().rowsTo(-3)
>
>
>


-- 
Best Regards,
Ayan Guha

Re: SPARK-17845 - window function frame boundary API

Reply via email to