Hi, Julian and all, We had a discussion in LinkedIn last week regarding to the window spec in SQL language on top of Samza systems. There are some issues in the window spec that I want to discuss:
Consider that we want to have a count of stock trades (as a infinite stream) happened in the last hour, but only every 11min. It is easy to write the first part as: SELECT STREAM rowtime, count(*) OVER (ORDER BY rowtime RANGE INTERVAL '1' HOUR PROCEDING) FROM Trades The above will create a stream of counts that happened every hour continuously as rows are scanned. Now here is the question: a. how do we have the count every 11min instead of as the row comes in? As we discussed before, there are examples that we can create by doing truncating / grouping on the rowtime to "sample" the continuous moving counting window to get a count every 11min. But that has two issues: - From implementation point of view, there is no efficiency improvement since the system still computes the count for each and every row comes in - If Samza implement a more efficient tumbling window operator, there is no easy way to identify the section of SQL statement that can map to the more efficient tumbling window operator, as the sampling is done via math / group-by aggregation instead of window spec b. if there is no row in Trades between 12:00pm to 2:00pm, how do we tell the system to still generate 0 counts for the time moments: 12:11pm, 12:22pm, 12:33pm, etc.? Or, those rows are delayed in the delivery in the system and user wants to ignore late-arrival of messages after 5min timeout to close the counting window? How can we support that use case w/o breaking SQL grammar? Both the above issues seem to require some extension to the window spec in SQL grammar. Julian, what do you think? Is it creating too many language/parser/planner problems in SQL?