Hi Julian, This is a great improvement over the previous hopping window. Thanks for thinking through this. I also like if we can introduce a TUMBLE function with more control over how we define tumbling window size. With the current FLOOR based model we have to perform date/time arithmetic to have tumbling windows such as 5 minutes tumbling windows (May be there is a better way that I don't know). But TUMBLE function that can specify the parameters such as window size would be nice. I am +1 for other extensions as well.
Thanks Milinda On Wed, Jun 24, 2015 at 6:32 PM, Julian Hyde <jh...@apache.org> wrote: > Hi all, > > Forgive the cross-post. This is for Calcite devs interested in > streaming and Samza devs interested in SQL. > > I've been thinking some more about how to implement hopping and > tumbling windows in streaming SQL. I was previously at a loss to find > a concise syntax that is consistent with how SQL semantics, but I have > found a syntax that I think can please everyone. > > Recall that a hopping window emits a sub-total every X seconds of > records that have arrived over the last Y seconds. A tumbling window > is a hopping window where X and Y are equal. > > In https://calcite.incubator.apache.org/docs/stream.html#hopping-windows > I give an example, "emit, every hour, the number of each product > ordered over the past three hours". > > That example gives a query in terms of a GROUP BY (in the HourlyTotals > view) followed by a moving sum. I didn't think that it was possible to > express using just one GROUP BY, because that would violate one of the > principles of SQL: that each record entering a GROUP BY contributes to > precisely one output record. > > But I've just realized that the CUBE, ROLLUP and GROUPING SETS > operators (already in SQL) violate that principle. And if they can do > it, we can do the same. So we could add another grouping function, > HOP(t, emit, retain). > > The query would look like this: > > SELECT STREAM START(rowtime) AS rowtime, > productId, > SUM(units) AS sumUnits, > COUNT(*) AS c > FROM Orders > GROUP BY HOP(rowtime, INTERVAL '1' HOUR, INTERVAL '3' HOUR), > productId > > Much nicer than the one in stream.html! > > The "trick" is that the HOP function is returning a list of rowtime > values. For example, for row 1 {rowtime: '09:33', ...} it will return > ['09:00', '10:00', '11:00']; for row 2 {rowtime: '10:05', ...} it will > return ['10:00', '11:00', '12:00']. The system adds each row to > several sub-totals, and emits each sub-total when it is complete. The > sub-total for '09:00' will contain only row 1, and will be emitted at > '10:00'; the sub-total for '10:00' will contain row 1 and row 2, and > will be emitted at '11:00', and so forth. > > Returning multiple values is related to the flatMap function in Spark > (and earlier selectMany in LINQ) and makes HOP's semantics similar to > GROUPING SETS and therefore sound. > > START is a new aggregate function that returns the lower bound of the > current sub-total; END similarly. > > Note that the "retain" argument does not need to be a whole multiple > of the "emit" argument. This was a major limitation in the previous > proposal. > > There are some straightforward extensions: > * Define a TUMBLE function; > * Add an "align" argument to HOP, to allow windows to start at, say, 5 > minutes past each hour; > * Apply HOP to windows based on row-counts; > * Allow user-defined windowing functions that similarly return a list > of interval start-end points. > > Julian > -- Milinda Pathirage PhD Student | Research Assistant School of Informatics and Computing | Data to Insight Center Indiana University twitter: milindalakmal skype: milinda.pathirage blog: http://milinda.pathirage.org