Re: Window spec in SQL language vs Samza system details

Julian Hyde Tue, 10 Feb 2015 13:25:58 -0800

The answer depends on your design philosophy. We need to strike a balance 
between making it possible and making it easy. Because SQL is a powerful closed 
language, we can achieve a lot by combining the elements. For example, I think 
that your example can be solved by joining a "heartbeat" stream to the recent 
history of the trades stream:

 SELECT STREAM t.*
  FROM (
    SELECT ticker, count(*) 
    FROM Trades OVER (RANGE INTERVAL '1' HOUR PRECEDING)
    GROUP BY ticker) AS t
 JOIN (STREAM Heartbeat(INTERVAL '11' MINUTE))

Note: "Heartbeat" is a user-defined stream function; OVER in the FROM clause to 
convert a stream into a relation that contains its history over a particular 
period relative to the "current time" of the query.

I seems natural to use a heartbeat because of the "input - output cardinality 
principle":
* If it's one row in, one row out, use windowed aggregation (agg OVER window in 
the SELECT clause). 
* If it's several rows in, one row out, use streaming GROUP BY (on time plus 
possibly other keys)
* If it's possible that the system produces output when there are no rows in, 
you need to use a heartbeat stream

Now, if you take the "make it possible" philosophy to its limit, there is a 
danger that you end up with SQL that is so complicated that end users don't 
understand it, and/or is so complicated that the planner cannot recognize the 
pattern and convert it to your nice efficient physical operator.

That is the point where you very carefully introduce SQL extensions.

But I strongly suggest creating a "cookbook". Give each stream-processing 
pattern a name and illustrate using a simple example. Show the SQL that can 
achieve that pattern. It doesn't matter if the SQL is a bit gnarly. Users will 
grok it, and adapt the SQL for their applications. And they will compose it 
using union, join, group by... to create new patterns you hadn't thought of.

If you extend SQL without sufficient thought, you might break composability, 
and that is a huge problem. Or you end up writing the planner so that it 
produces the right plan when it sees the query in its sugared version but not 
when expressed using the fundamentals (case in point: if we had introduced a 
"tumbling window" syntax, someone could still express it using the JOIN query 
above); a lesser problem, but still undermines your users' trust in the system 
to do the right thing.

Julian

> On Feb 9, 2015, at 11:46 AM, Yi Pan <[email protected]> wrote:
> 
> Hi, Julian and all,
> 
> We had a discussion in LinkedIn last week regarding to the window spec in
> SQL language on top of Samza systems. There are some issues in the window
> spec that I want to discuss:
> 
> Consider that we want to have a count of stock trades (as a infinite
> stream) happened in the last hour, but only every 11min. It is easy to
> write the first part as:
>   SELECT STREAM rowtime, count(*) OVER (ORDER BY rowtime RANGE INTERVAL
> '1' HOUR PROCEDING)
>       FROM Trades
>   The above will create a stream of counts that happened every hour
> continuously as rows are scanned.
> 
> Now here is the question:
>   a. how do we have the count every 11min instead of as the row comes in?
> As we discussed before, there are examples that we can create by doing
> truncating / grouping on the rowtime to "sample" the continuous moving
> counting window to get a count every 11min. But that has two issues:
>      - From implementation point of view, there is no efficiency
> improvement since the system still computes the count for each and every
> row comes in
>      - If Samza implement a more efficient tumbling window operator, there
> is no easy way to identify the section of SQL statement that can map to the
> more efficient tumbling window operator, as the sampling is done via math /
> group-by aggregation instead of window spec
>   b. if there is no row in Trades between 12:00pm to 2:00pm, how do we
> tell the system to still generate 0 counts for the time moments: 12:11pm,
> 12:22pm, 12:33pm, etc.? Or, those rows are delayed in the delivery in the
> system and user wants to ignore late-arrival of messages after 5min timeout
> to close the counting window? How can we support that use case w/o breaking
> SQL grammar?
> 
> Both the above issues seem to require some extension to the window spec in
> SQL grammar. Julian, what do you think? Is it creating too many
> language/parser/planner problems in SQL?

Re: Window spec in SQL language vs Samza system details

Reply via email to