Hi Reynold Thanks for asking. I am from sql world and use sparl sql with analytical functions prety heavily.
IMHO, Window.rowsBetween() as a function name looks fine. What i would propose would be: Window.rowsBetween(startFrom=UNBOUNDED,endTo=CURRENT_ROW,preceeding=0,following=0) startFrom, endTo: Determining range preceeding,following: Anchor of current row and thus altering the range. Calls: //ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING Window.rowsBetween(preceeding=3,following=3) //ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING Window.rowsBetween(startFrom=UNBOUNDED,preceeding=3) //ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW Window.rowsBetween(startFrom=UNBOUNDED,endTo=CURRENT_ROW) //ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING (Not PRECEDING) Window.rowsBetween(startFrom=CURRENT_ROW,endTo=UNBOUNDED) //ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING Window.rowsBetween(startFrom=UNBOUNDED,endTo=UNBOUNDED) One missing scenario (I think that is a valid one) //ROWS BETWEEN 3 FOLLOWING AND UNBOUNDED FOLLOWING (pair of (2)) Window.rowsBetween(endTo=UNBOUNDED,following=3) This will be closer to SQL options, IMHO. Thoughts? On Mon, Oct 10, 2016 at 3:50 PM, Reynold Xin <r...@databricks.com> wrote: > Hi all, > > I tried to use the window function DataFrame API this weekend and found it > awkward to use, especially with respect to specifying frame boundaries. I > wrote down some options here and am curious your thoughts. If you have > suggestions on the API beyond what's already listed in the JIRA ticket, do > bring them up too. > > Please comment on the JIRA ticket directly: https://issues. > apache.org/jira/browse/SPARK-17845 > > > I've attached the content of the JIRA ticket here to save you a click: > > > ANSI SQL uses the following to specify the frame boundaries for window > functions: > > ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING > ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > > In Spark's DataFrame API, we use integer values to indicate relative > position: > > - 0 means "CURRENT ROW" > - -1 means "1 PRECEDING" > - Long.MinValue means "UNBOUNDED PRECEDING" > - Long.MaxValue to indicate "UNBOUNDED FOLLOWING" > > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWINGWindow.rowsBetween(-3, +3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 > PRECEDINGWindow.rowsBetween(Long.MinValue, -3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT > ROWWindow.rowsBetween(Long.MinValue, 0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDINGWindow.rowsBetween(0, > Long.MaxValue) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED > FOLLOWINGWindow.rowsBetween(Long.MinValue, Long.MaxValue) > > I think using numeric values to indicate relative positions is actually a > good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate > unbounded ends is pretty confusing: > > 1. The API is not self-evident. There is no way for a new user to figure > out how to indicate an unbounded frame by looking at just the API. The user > has to read the doc to figure this out. > 2. It is weird Long.MinValue or Long.MaxValue has some special meaning. > 3. Different languages have different min/max values, e.g. in Python we > use -sys.maxsize and +sys.maxsize. > > To make this API less confusing, we have a few options: > > Option 1. Add the following (additional) methods: > > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWINGWindow.rowsBetween(-3, +3) // > this one exists already// ROWS BETWEEN UNBOUNDED PRECEDING AND 3 > PRECEDINGWindow.rowsBetweenUnboundedPrecedingAnd(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT > ROWWindow.rowsBetweenUnboundedPrecedingAndCurrentRow() > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED > PRECEDINGWindow.rowsBetweenCurrentRowAndUnboundedFollowing() > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED > FOLLOWINGWindow.rowsBetweenUnboundedPrecedingAndUnboundedFollowing() > > This is obviously very verbose, but is very similar to how these functions > are done in SQL, and is perhaps the most obvious to end users, especially > if they come from SQL background. > > Option 2. Decouple the specification for frame begin and frame end into > two functions. Assume the boundary is unlimited unless specified. > > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWINGWindow.rowsFrom(-3).rowsTo(3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDINGWindow.rowsTo(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWWindow.rowsToCurrent() or > Window.rowsTo(0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDINGWindow.rowsFromCurrent() > or Window.rowsFrom(0) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING// no need to > specify > > If we go with option 2, we should throw exceptions if users specify > multiple from's or to's. A variant of option 2 is to require explicitly > specification of begin/end even in the case of unbounded boundary, e.g.: > > Window.rowsFromBeginning().rowsTo(-3) > or > Window.rowsFromUnboundedPreceding().rowsTo(-3) > > > -- Best Regards, Ayan Guha