Hi, Julian and all,

We had a discussion in LinkedIn last week regarding to the window spec in
SQL language on top of Samza systems. There are some issues in the window
spec that I want to discuss:

Consider that we want to have a count of stock trades (as a infinite
stream) happened in the last hour, but only every 11min. It is easy to
write the first part as:
   SELECT STREAM rowtime, count(*) OVER (ORDER BY rowtime RANGE INTERVAL
'1' HOUR PROCEDING)
       FROM Trades
   The above will create a stream of counts that happened every hour
continuously as rows are scanned.

Now here is the question:
   a. how do we have the count every 11min instead of as the row comes in?
As we discussed before, there are examples that we can create by doing
truncating / grouping on the rowtime to "sample" the continuous moving
counting window to get a count every 11min. But that has two issues:
      - From implementation point of view, there is no efficiency
improvement since the system still computes the count for each and every
row comes in
      - If Samza implement a more efficient tumbling window operator, there
is no easy way to identify the section of SQL statement that can map to the
more efficient tumbling window operator, as the sampling is done via math /
group-by aggregation instead of window spec
   b. if there is no row in Trades between 12:00pm to 2:00pm, how do we
tell the system to still generate 0 counts for the time moments: 12:11pm,
12:22pm, 12:33pm, etc.? Or, those rows are delayed in the delivery in the
system and user wants to ignore late-arrival of messages after 5min timeout
to close the counting window? How can we support that use case w/o breaking
SQL grammar?

Both the above issues seem to require some extension to the window spec in
SQL grammar. Julian, what do you think? Is it creating too many
language/parser/planner problems in SQL?

Reply via email to