SPARK-17845 - window function frame boundary API

Reynold Xin Sun, 09 Oct 2016 21:51:06 -0700

Hi all,

I tried to use the window function DataFrame API this weekend and found it
awkward to use, especially with respect to specifying frame boundaries. I
wrote down some options here and am curious your thoughts. If you have
suggestions on the API beyond what's already listed in the JIRA ticket, do
bring them up too.


Please comment on the JIRA ticket directly:
https://issues.apache.org/jira/browse/SPARK-17845


I've attached the content of the JIRA ticket here to save you a click:


ANSI SQL uses the following to specify the frame boundaries for window
functions:

ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

In Spark's DataFrame API, we use integer values to indicate relative
position:

   - 0 means "CURRENT ROW"
   - -1 means "1 PRECEDING"
   - Long.MinValue means "UNBOUNDED PRECEDING"
   - Long.MaxValue to indicate "UNBOUNDED FOLLOWING"

// ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWINGWindow.rowsBetween(-3, +3)
// ROWS BETWEEN UNBOUNDED PRECEDING AND 3
PRECEDINGWindow.rowsBetween(Long.MinValue, -3)
// ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROWWindow.rowsBetween(Long.MinValue, 0)
// ROWS BETWEEN CURRENT ROW AND UNBOUNDED
PRECEDINGWindow.rowsBetween(0, Long.MaxValue)
// ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
FOLLOWINGWindow.rowsBetween(Long.MinValue, Long.MaxValue)

I think using numeric values to indicate relative positions is actually a
good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate
unbounded ends is pretty confusing:

1. The API is not self-evident. There is no way for a new user to figure
out how to indicate an unbounded frame by looking at just the API. The user
has to read the doc to figure this out.
2. It is weird Long.MinValue or Long.MaxValue has some special meaning.
3. Different languages have different min/max values, e.g. in Python we use
-sys.maxsize and +sys.maxsize.

To make this API less confusing, we have a few options:

Option 1. Add the following (additional) methods:

// ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWINGWindow.rowsBetween(-3, +3)
// this one exists already// ROWS BETWEEN UNBOUNDED PRECEDING AND 3
PRECEDINGWindow.rowsBetweenUnboundedPrecedingAnd(-3)
// ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROWWindow.rowsBetweenUnboundedPrecedingAndCurrentRow()
// ROWS BETWEEN CURRENT ROW AND UNBOUNDED
PRECEDINGWindow.rowsBetweenCurrentRowAndUnboundedFollowing()
// ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
FOLLOWINGWindow.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()

This is obviously very verbose, but is very similar to how these functions
are done in SQL, and is perhaps the most obvious to end users, especially
if they come from SQL background.

Option 2. Decouple the specification for frame begin and frame end into two
functions. Assume the boundary is unlimited unless specified.

// ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWINGWindow.rowsFrom(-3).rowsTo(3)
// ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDINGWindow.rowsTo(-3)
// ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROWWindow.rowsToCurrent() or Window.rowsTo(0)
// ROWS BETWEEN CURRENT ROW AND UNBOUNDED
PRECEDINGWindow.rowsFromCurrent() or Window.rowsFrom(0)
// ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING// no need to specify

If we go with option 2, we should throw exceptions if users specify
multiple from's or to's. A variant of option 2 is to require explicitly
specification of begin/end even in the case of unbounded boundary, e.g.:

Window.rowsFromBeginning().rowsTo(-3)
or
Window.rowsFromUnboundedPreceding().rowsTo(-3)

SPARK-17845 - window function frame boundary API

Reply via email to