One more thing, Julian: do you mind if I post your proposed SQL model to SAMZA-390? That way more people can view it, and we can continue the discussion there.
Thanks!
-Yi

On Thu, Jan 29, 2015 at 4:27 PM, Julian Hyde <jul...@hydromatic.net> wrote:

> The validation logic is extensible within Calcite (for example, the
> validator has an interface SqlValidatorNamespace that represents anything
> that can be used as a table alias, such as table usages in the FROM
> clause or sub-queries), but I think it would be very complex to abstract
> that logic as a plug-in.
>
> My plan would be to make streaming a core concept within Calcite
> (related concepts such as collation and monotonic operators are already
> present), and a particular schema (an implementation of the Schema SPI)
> might or might not have relations that are streamable.
>
> It is interesting that some queries with NOT IN, ALL, EXCEPT would be
> valid (typically if there is some actual or implied upper bound on the
> rowtime). And if I'm not mistaken, each of these operators can be
> converted to a LEFT anti-JOIN, so ideally the validator would allow the
> query iff it allows its JOIN rewrite.
>
> Another way of approaching this is by defining a "timeliness" property:
> how long do I have to wait before a given query emits the last record
> with a given rowtime? If the answer is "infinity" then we presume that
> the end user would not want to execute the query... but we might presume
> wrong, because the query might produce some useful results in the
> meantime. In that case, we wouldn't have to change the validator, but
> would add a metadata provider (similar to those that provide the cost
> model) to compute the timeliness of each relational expression. That is
> probably easier to do, because you're working on the algebra rather than
> SQL.
>
> Julian
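
(For illustration of the rewrite Julian mentions: ignoring the
NULL-handling subtleties of NOT IN, a blocking query and its LEFT
anti-JOIN form might look like the following, where Orders and
CanceledOrders are hypothetical names.

  -- a query using NOT IN ...
  select stream o.*
  from Orders as o
  where o.orderId not in (select orderId from CanceledOrders);

  -- ... and its LEFT anti-JOIN form: keep only the rows with no match
  select stream o.*
  from Orders as o
  left join CanceledOrders as c on o.orderId = c.orderId
  where c.orderId is null;

The validator could then reason about the join form alone: the query is
answerable only if there is an upper bound on how late a matching
CanceledOrders row can arrive.)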
> On Jan 29, 2015, at 4:05 PM, Milinda Pathirage <mpath...@umail.iu.edu>
> wrote:
>
> > Hi Julian,
> >
> > I like your proposal. I think it saves us a lot of time. But I have a
> > question regarding SQL validation. I'm not an SQL expert, so I may be
> > wrong.
> >
> > As I understand it, some SQL constructs such as NOT IN, ALL, EXCEPT
> > will not be valid in the context of some stream queries due to their
> > blocking nature. Is it possible to do custom validations depending on
> > the context? If I rephrase it, is the validation logic extensible?
> >
> > Thanks
> > Milinda
> >
> > On Thu, Jan 29, 2015 at 6:32 PM, Julian Hyde <jul...@hydromatic.net>
> > wrote:
> >
> >>
> >>> On Jan 29, 2015, at 3:04 PM, Yi Pan <nickpa...@gmail.com> wrote:
> >>>
> >>> Hi, Julian,
> >>>
> >>> Thanks for sharing your idea! It is interesting and well organized.
> >>> Let me try to summarize the main differences between yours and the
> >>> current proposal:
> >>> - removing the '[]' used to define the window specification, using
> >>> OVER on the stream/table instead
> >>> - join/select output can be either a stream or a table, since both
> >>> become legitimate data sources in a query.
> >>
> >> Yes.
> >>
> >>> There are still two points that I want to comment on:
> >>> a. in the current proposal, the window operator can specify a step
> >>> size for window advancement. With your examples, it seems that all
> >>> windows will be advanced by a step size of 1. Hence, the output of
> >>> the query will be a "continuous" stream, which reports a moving
> >>> average, instead of a fixed-size window average across all rows. Is
> >>> it easy to extend your model for that use case?
> >>
> >> I presume we are talking about windowed aggregation -- one row out
> >> for each row in.
> >>
> >> My previous example was a "sliding" window over a time interval.
> >>
> >> You can also have a sliding window over a number of rows:
> >>
> >> select symbol, sum(quantity) over (order by rowtime rows 10 preceding)
> >> from StockTrades
> >>
> >> The SQL syntax supports windows of various sizes, e.g.
> >>
> >> select name,
> >>   sum(sal) over (order by hiredate
> >>     range between shoesize preceding and age following)
> >> from emp;
> >>
> >> and thus each record defines its own window. You can also define a
> >> "paged" window, for example the cumulative total of trades since the
> >> top of the hour:
> >>
> >> select stream rowtime, ticker, amount,
> >>   sum(amount) over (partition by ticker, trunc(rowtime to hour)
> >>     order by rowtime)
> >> from StockTrades;
> >>
> >> For example, you might get:
> >>
> >> {10:01, ORCL, 10, 10}
> >> {10:02, MSFT, 30, 30}
> >> {10:03, ORCL, 100, 110}
> >> {10:17, MSFT, 45, 75}
> >> {10:59, ORCL, 20, 130}
> >> {11:02, ORCL, 50, 50}
> >>
> >> There are more elaborate ways of defining windows, but these cover
> >> ~99% of real-world cases, IMO. The remainder could be handled by a
> >> "user-defined window" SPI, just as complex aggregate functions could
> >> be handled by a "user-defined aggregate".
> >>
> >>> b. for output streams, there is no definition of "partitions". I
> >>> remember that we talked about this briefly in our early discussion,
> >>> and you commented that it should not be in SQL. I would argue that a
> >>> way to specify how to partition the output stream is needed in the
> >>> current system, since we lack two things: i) automatically
> >>> identifying the key that gives optimal partitioning of a query's
> >>> output; ii) auto-scaling the number of partitions of a stream if the
> >>> initial partition count is not enough. Hence, we have to give the
> >>> user a way to tell the system how to partition. I am curious: what
> >>> exactly are the reasons you believe partitioning should not be added
> >>> to the SQL syntax?
> >>
> >> For the same reason that in an RDBMS you do not say
> >>
> >> INSERT INTO Emp:disk1 VALUES (10, 'Fred', 'Sales');
> >> INSERT INTO Emp:disk2 VALUES (20, 'Bill', 'Marketing');
> >>
> >> Data independence. In an RDBMS you say
> >>
> >> CREATE TABLE Emp(empno INTEGER, name VARCHAR(20),
> >>   department VARCHAR(20))
> >> PARTITION BY HASHCODE (department);
> >>
> >> and let the system take care of placing the records. You could do the
> >> same for a stream, partitioning based on a column, and could even
> >> partition based on its actual value, not its hash code:
> >>
> >> CREATE STREAM Orders(..., p INTEGER) PARTITION BY VALUE (p);
> >>
> >> If Orders has 10 partitions, then
> >>
> >> INSERT INTO Orders VALUES (..., 7);
> >>
> >> will go to partition #7, and
> >>
> >> INSERT INTO Orders VALUES (..., 11);
> >>
> >> will fail. By putting the decisions in the hands of the optimizer,
> >> guided by metadata, the system can adapt without you having to change
> >> application code. Moving the partitioning decision into a data value
> >> is another database "trick" -- the context is all in the data.
> >>
> >>> BTW, it seems that even with your proposed model, the current
> >>> physical operators won't be affected, which sounds good.
> >>
> >> Yes, that was my goal. Samza operators are fine as they are. You guys
> >> know what you are doing (and what your users want).
> >>
> >> Julian
> >>
> >
> >
> > --
> > Milinda Pathirage
> >
> > PhD Student | Research Assistant
> > School of Informatics and Computing | Data to Insight Center
> > Indiana University
> >
> > twitter: milindalakmal
> > skype: milinda.pathirage
> > blog: http://milinda.pathirage.org
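
(A possible answer to Yi's point (a) above, sketched in the spirit of
Julian's examples: a fixed-size "tumbling" window that advances by its
full width -- one output row per window rather than one per input row --
could be written as a streaming GROUP BY rather than an OVER clause. The
StockTrades stream and the trunc function are reused from the examples
above; the hourly_amount alias is hypothetical.

  -- one row per ticker per hour, rather than one row per trade
  select stream trunc(rowtime to hour) as rowtime,
    ticker,
    sum(amount) as hourly_amount
  from StockTrades
  group by trunc(rowtime to hour), ticker;

Such a query can emit each group's row as soon as its hour closes,
because rowtime is monotonic -- one of the concepts Julian notes is
already present in Calcite.)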