One more thing, Julian: do you mind if I post your proposed SQL model to SAMZA-390? That way more people can view it, and we can continue the discussion there.
Thanks!
-Yi

On Thu, Jan 29, 2015 at 4:27 PM, Julian Hyde <jul...@hydromatic.net> wrote:

> The validation logic is extensible within Calcite (for example, the
> validator has an interface SqlValidatorNamespace that represents anything
> that can be used as a table alias, such as table usages in the FROM
> clause or sub-queries), but I think it would be very complex to abstract
> that logic as a plug-in.
>
> My plan would be to make streaming a core concept within Calcite
> (related concepts such as collation and monotonic operators are already
> present), and a particular schema (an implementation of the Schema SPI)
> might or might not have relations that are streamable.
>
> It is interesting that some queries with NOT IN, ALL, EXCEPT would be
> valid (typically if there is some actual or implied upper bound on the
> rowtime). And if I'm not mistaken, each of these operators can be
> converted to a LEFT anti-JOIN, so ideally the validator would allow the
> query iff it allows its JOIN rewrite.
>
> Another way of approaching this is by defining a "timeliness" property:
> how long do I have to wait before a given query emits the last record
> with a given rowtime? If the answer is "infinity" then we presume that
> the end user would not want to execute the query... but we might presume
> wrong, because the query might produce some useful results in the
> meantime. In that case, we wouldn't have to change the validator, but
> would add a metadata provider (similar to those that provide the cost
> model) to compute the timeliness of each relational expression. That is
> probably easier to do, because you're working on the algebra rather than
> SQL.
>
> Julian
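
(For illustration of the rewrite Julian mentions: ignoring the
NULL-handling subtleties of NOT IN, a blocking query and its LEFT
anti-JOIN form might look like the following, where Orders and
CanceledOrders are hypothetical names.

  -- a query using NOT IN ...
  select stream o.*
  from Orders as o
  where o.orderId not in (select orderId from CanceledOrders);

  -- ... and its LEFT anti-JOIN form: keep only the rows with no match
  select stream o.*
  from Orders as o
  left join CanceledOrders as c on o.orderId = c.orderId
  where c.orderId is null;

The validator could then reason about the join form alone: the query is
answerable only if there is an upper bound on how late a matching
CanceledOrders row can arrive.)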
> On Jan 29, 2015, at 4:05 PM, Milinda Pathirage <mpath...@umail.iu.edu>
> wrote:
>
> > Hi Julian,
> >
> > I like your proposal. I think it saves us a lot of time. But I have a
> > question regarding SQL validation. I'm not an SQL expert, so I may be
> > wrong.
> >
> > As I understand it, some SQL constructs such as NOT IN, ALL, EXCEPT
> > will not be valid in the context of some stream queries due to their
> > blocking nature. Is it possible to do custom validations depending on
> > the context? If I rephrase it, is the validation logic extensible?
> >
> > Thanks
> > Milinda
> >
> > On Thu, Jan 29, 2015 at 6:32 PM, Julian Hyde <jul...@hydromatic.net>
> > wrote:
> >
> >>
> >>> On Jan 29, 2015, at 3:04 PM, Yi Pan <nickpa...@gmail.com> wrote:
> >>>
> >>> Hi, Julian,
> >>>
> >>> Thanks for sharing your idea! It is interesting and well organized.
> >>> Let me try to summarize the main differences between yours and the
> >>> current proposal:
> >>> - removing the '[]' used to define the window specification, using
> >>> OVER on the stream/table instead
> >>> - join/select output can be either a stream or a table, since both
> >>> become legitimate data sources in a query.
> >>
> >> Yes.
> >>
> >>> There are still two points that I want to comment on:
> >>> a. in the current proposal, the window operator can specify a step
> >>> size for window advancement. With your examples, it seems that all
> >>> windows will be advanced by a step size of 1. Hence, the output of
> >>> the query will be a "continuous" stream, which reports a moving
> >>> average, instead of a fixed-size window average across all rows. Is
> >>> it easy to extend your model for that use case?
> >>
> >> I presume we are talking about windowed aggregation -- one row out
> >> for each row in.
> >>
> >> My previous example was a "sliding" window over a time interval.
> >>
> >> You can also have a sliding window over a number of rows:
> >>
> >> select symbol, sum(quantity) over (order by rowtime rows 10 preceding)
> >> from StockTrades
> >>
> >> The SQL syntax supports windows of various sizes, e.g.
> >>
> >> select name,
> >>   sum(sal) over (order by hiredate
> >>     range between shoesize preceding and age following)
> >> from emp;
> >>
> >> and thus each record defines its own window. You can also define a
> >> "paged" window, for example the cumulative total of trades since the
> >> top of the hour:
> >>
> >> select stream rowtime, ticker, amount,
> >>   sum(amount) over (partition by ticker, trunc(rowtime to hour)
> >>     order by rowtime)
> >> from StockTrades;
> >>
> >> For example, you might get:
> >>
> >> {10:01, ORCL, 10, 10}
> >> {10:02, MSFT, 30, 30}
> >> {10:03, ORCL, 100, 110}
> >> {10:17, MSFT, 45, 75}
> >> {10:59, ORCL, 20, 130}
> >> {11:02, ORCL, 50, 50}
> >>
> >> There are more elaborate ways of defining windows, but these cover
> >> ~99% of real-world cases, IMO. The remainder could be handled by a
> >> "user-defined window" SPI, just as complex aggregate functions could
> >> be handled by a "user-defined aggregate".
> >>
> >>> b. for output streams, there is no definition of "partitions". I
> >>> remember that we talked about this briefly in our early discussion,
> >>> and you commented that it should not be in SQL. I would argue that a
> >>> way to specify how to partition the output stream is needed in the
> >>> current system, since we lack two things: i) automatically
> >>> identifying the key that gives optimal partitioning of a query's
> >>> output; ii) auto-scaling the number of partitions of a stream if the
> >>> initial partition count is not enough. Hence, we have to give the
> >>> user a way to tell the system how to partition. I am curious: what
> >>> exactly are the reasons you believe partitioning should not be added
> >>> to the SQL syntax?
> >>
> >> For the same reason that in an RDBMS you do not say
> >>
> >> INSERT INTO Emp:disk1 VALUES (10, 'Fred', 'Sales');
> >> INSERT INTO Emp:disk2 VALUES (20, 'Bill', 'Marketing');
> >>
> >> Data independence. In an RDBMS you say
> >>
> >> CREATE TABLE Emp(empno INTEGER, name VARCHAR(20),
> >>   department VARCHAR(20))
> >> PARTITION BY HASHCODE (department);
> >>
> >> and let the system take care of placing the records. You could do the
> >> same for a stream, partitioning based on a column, and could even
> >> partition based on its actual value, not its hash code:
> >>
> >> CREATE STREAM Orders(..., p INTEGER) PARTITION BY VALUE (p);
> >>
> >> If Orders has 10 partitions, then
> >>
> >> INSERT INTO Orders VALUES (..., 7);
> >>
> >> will go to partition #7, and
> >>
> >> INSERT INTO Orders VALUES (..., 11);
> >>
> >> will fail. By putting the decisions in the hands of the optimizer,
> >> guided by metadata, the system can adapt without you having to change
> >> application code. Moving the partitioning decision into a data value
> >> is another database "trick" -- the context is all in the data.
> >>
> >>> BTW, it seems that even with your proposed model, the current
> >>> physical operators won't be affected, which sounds good.
> >>
> >> Yes, that was my goal. Samza operators are fine as they are. You guys
> >> know what you are doing (and what your users want).
> >>
> >> Julian
> >>
> >
> >
> > --
> > Milinda Pathirage
> >
> > PhD Student | Research Assistant
> > School of Informatics and Computing | Data to Insight Center
> > Indiana University
> >
> > twitter: milindalakmal
> > skype: milinda.pathirage
> > blog: http://milinda.pathirage.org
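
(A possible answer to Yi's point (a) above, sketched in the spirit of
Julian's examples: a fixed-size "tumbling" window that advances by its
full width -- one output row per window rather than one per input row --
could be written as a streaming GROUP BY rather than an OVER clause. The
StockTrades stream and the trunc function are reused from the examples
above; the hourly_amount alias is hypothetical.

  -- one row per ticker per hour, rather than one row per trade
  select stream trunc(rowtime to hour) as rowtime,
    ticker,
    sum(amount) as hourly_amount
  from StockTrades
  group by trunc(rowtime to hour), ticker;

Such a query can emit each group's row as soon as its hour closes,
because rowtime is monotonic -- one of the concepts Julian notes is
already present in Calcite.)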