The validation logic is extensible within Calcite. For example, the validator has an interface SqlValidatorNamespace that represents anything that can be used as a table alias, such as table usages in the FROM clause or sub-queries. But I think it would be very complex to abstract that logic as a plug-in.

My plan would be to make streaming a core concept within Calcite (related concepts such as collation and monotonic operators are already present), and a particular schema (an implementation of the Schema SPI) might or might not have relations that are streamable.

It is interesting that some queries with NOT IN, ALL or EXCEPT would be valid, typically if there is some actual or implied upper bound on the rowtime. And if I'm not mistaken, each of these operators can be converted to a LEFT anti-join, so ideally the validator would allow the query iff it would allow its JOIN rewrite.
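For instance (inventing an Orders stream and a Recalled stream, and glossing over the NULL subtleties of NOT IN):

  select stream *
  from Orders
  where productId not in (
    select productId
    from Recalled
    where Recalled.rowtime <= Orders.rowtime + interval '1' hour)

can be planned as a LEFT anti-join of Orders to Recalled. The predicate bounds how late a matching Recalled row can arrive, so the anti-join only has to hold each order for an hour before emitting it; drop that bound and the anti-join would have to wait for the whole Recalled stream, so the validator should reject the query and its rewrite alike.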
Another way of approaching this is by defining a "timeliness" property: how long do I have to wait before a given query emits the last record with a given rowtime? If the answer is "infinity" then we presume that the end user would not want to execute the query... but we might presume wrong, because the query might produce some useful results in the meantime.
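For instance (a sketch, reusing the StockTrades stream from my earlier message below), a total aggregation such as

  select stream ticker, sum(amount)
  from StockTrades
  group by ticker

has infinite timeliness: the sum for a ticker is never final while the stream is live. Yet the running totals it could emit along the way might be exactly what the user wants.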
In that case, we wouldn't have to change the validator, but would add a metadata provider (similar to those that provide the cost model) to compute the timeliness of each relational expression. That is probably easier to do, because you're working on the algebra rather than SQL.

Julian

> On Jan 29, 2015, at 4:05 PM, Milinda Pathirage <mpath...@umail.iu.edu> wrote:
> 
> Hi Julian,
> 
> I like your proposal. I think it saves us a lot of time. But I have a
> question regarding SQL validation. I'm not an SQL expert, so I may be wrong.
> 
> As I understand it, some SQL constructs such as NOT IN, ALL and EXCEPT will
> not be valid in the context of some stream queries due to their blocking
> nature. Is it possible to do custom validations depending on the context?
> If I rephrase it: is the validation logic extensible?
> 
> Thanks
> Milinda
> 
> On Thu, Jan 29, 2015 at 6:32 PM, Julian Hyde <jul...@hydromatic.net> wrote:
> 
>> 
>>> On Jan 29, 2015, at 3:04 PM, Yi Pan <nickpa...@gmail.com> wrote:
>>> 
>>> Hi, Julian,
>>> 
>>> Thanks for sharing your idea! It is interesting and well organized. Let
>>> me try to summarize the main differences between yours and the current
>>> proposal:
>>> - removing the '[]' used to define the window specification, using OVER
>>> on the stream/table instead
>>> - join/select output can be either stream or table, since both become
>>> legitimate data sources in a query.
>> 
>> Yes.
>> 
>>> There are still two points that I want to comment on:
>>> a. in the current proposal, the window operator can specify a step size
>>> for window advancement. With your examples, it seems that all windows
>>> will be incremented by a step size of 1. Hence, the output of the query
>>> will be a "continuous" stream, which reports a moving average, instead
>>> of a fixed-size window average across all rows. Is it easy to extend
>>> your model for that use case?
>> 
>> I presume we are talking about windowed aggregation -- one row out for
>> each row in.
>> 
>> My previous example was a "sliding" window over a time interval.
>> 
>> You can also have a sliding window over a number of rows:
>> 
>>   select symbol, sum(quantity) over (order by rowtime rows 10 preceding)
>>   from StockTrades
>> 
>> The SQL syntax supports windows of various sizes, e.g.
>> 
>>   select name,
>>     sum(sal) over (order by hiredate
>>                    range between shoesize preceding and age following)
>>   from emp;
>> 
>> and thus each record defines its own window. You can also define a "paged"
>> window, for example the cumulative total of trades since the top of the
>> hour:
>> 
>>   select stream rowtime, ticker, amount,
>>     sum(amount) over (partition by ticker, trunc(rowtime to hour)
>>                       order by rowtime)
>>   from StockTrades;
>> 
>> For example, you might get:
>> 
>>   {10:01, ORCL, 10, 10}
>>   {10:02, MSFT, 30, 30}
>>   {10:03, ORCL, 100, 110}
>>   {10:17, MSFT, 45, 75}
>>   {10:59, ORCL, 20, 130}
>>   {11:02, ORCL, 50, 50}
>> 
>> There are more elaborate ways of defining windows, but these cover ~99% of
>> real-world cases, IMO. The remainder could be handled by a "user-defined
>> window" SPI, just as complex aggregate functions could be handled by a
>> "user-defined aggregate".
>> 
>>> b. for output streams, there is no definition of "partitions". I remember
>>> that we talked about it briefly in our early discussion and you commented
>>> that it should not be in SQL. I would argue that a specification of how
>>> to partition the output stream is needed in the current system, since we
>>> are lacking two things: i) automatic identification of the key that can
>>> be used for optimal partitioning of a query's output; ii) auto-scaling
>>> of the number of partitions of a stream if the initial partitioning is
>>> not enough. Hence, we have to give the user a way to tell the system how
>>> to partition. I am curious: what exactly are the reasons that you believe
>>> partitioning should not be added to the SQL syntax?
>> 
>> For the same reason that in an RDBMS you do not say
>> 
>>   INSERT INTO Emp:disk1 VALUES (10, 'Fred', 'Sales');
>>   INSERT INTO Emp:disk2 VALUES (20, 'Bill', 'Marketing');
>> 
>> Data independence. In an RDBMS you say
>> 
>>   CREATE TABLE Emp(empno INTEGER, name VARCHAR(20), department VARCHAR(20))
>>   PARTITION BY HASHCODE (department);
>> 
>> and let the system take care of placing the records. You could do the same
>> for a stream, partitioning based on a column, and could even partition
>> based on its actual value, not its hash code:
>> 
>>   CREATE STREAM Orders(..., p INTEGER) PARTITION BY VALUE (p);
>> 
>> If Orders has 10 partitions, then
>> 
>>   INSERT INTO Orders VALUES (..., 7);
>> 
>> will go to partition #7 and
>> 
>>   INSERT INTO Orders VALUES (..., 11);
>> 
>> will fail. By putting the decisions in the hands of the optimizer, guided
>> by metadata, the system can adapt without you having to change application
>> code. Moving the partitioning decision into a data value is another
>> database "trick" -- the context is all in the data.
>> 
>>> BTW, it seems that even with your proposed model, the current physical
>>> operators won't be affected, which sounds good.
>> 
>> Yes, that was my goal. Samza operators are fine as they are. You guys know
>> what you are doing (and what your users want).
>> 
>> Julian
>> 
>> 
> 
> 
> --
> Milinda Pathirage
> 
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
> 
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: http://milinda.pathirage.org