Hi Julian,

I like your proposal. I think it saves us a lot of time. But I have a question regarding SQL validation. I'm not an SQL expert, so I may be wrong.

As I understand it, some SQL constructs such as NOT IN, ALL, and EXCEPT will not be valid in the context of some stream queries due to their blocking nature. Is it possible to do custom validations depending on the context? To rephrase: is the validation logic extensible?
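For example (I'm guessing at the eventual syntax here, and Orders and Shipments are made-up streams), I'd expect the validator to have to reject something like this, because NOT IN over a stream can't emit a row until it has seen the whole -- potentially infinite -- subquery input:

  select stream *
  from Orders
  where productId not in (select productId from Shipments);

If the validation logic is pluggable, Samza could flag queries like this with a useful error instead of letting them block forever.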
Thanks,
Milinda

On Thu, Jan 29, 2015 at 6:32 PM, Julian Hyde <jul...@hydromatic.net> wrote:
>
> > On Jan 29, 2015, at 3:04 PM, Yi Pan <nickpa...@gmail.com> wrote:
> >
> > Hi, Julian,
> >
> > Thanks for sharing your idea! It is interesting and well organized. Let
> > me try to summarize the main differences between yours and the current
> > proposal:
> > - removing the '[]' used to define the window specification, using OVER
> > on the stream/table instead
> > - join/select output can be either stream or table, since both become
> > legitimate data sources in a query.
>
> Yes.
>
> > There are still two points that I want to comment on:
> > a. in the current proposal, the window operator can specify a step size
> > for window advancement. With your examples, it seems that all windows
> > will be incremented by step size 1. Hence, the output of the query will
> > be a "continuous" stream, which reports a moving average, instead of a
> > fixed-size window average across all rows. Is it easy to extend your
> > model for that use case?
>
> I presume we are talking about windowed aggregation -- one row out for
> each row in.
>
> My previous example was a "sliding" window over a time interval.
>
> You can also have a sliding window over a number of rows:
>
>   select symbol, sum(quantity) over (order by rowtime rows 10 preceding)
>   from StockTrades
>
> The SQL syntax supports various size windows, e.g.
>
>   select name,
>     sum(sal) over (order by hiredate
>                    range between shoesize preceding and age following)
>   from emp;
>
> and thus each record defines its own window. You can also define a
> "paged" window, for example the cumulative total trades since the top of
> the hour:
>
>   select stream rowtime, ticker, amount,
>     sum(amount) over (partition by ticker, trunc(rowtime to hour)
>                       order by rowtime)
>   from StockTrades;
>
> For example, you might get:
>
>   {10:01, ORCL, 10, 10}
>   {10:02, MSFT, 30, 30}
>   {10:03, ORCL, 100, 110}
>   {10:17, MSFT, 45, 75}
>   {10:59, ORCL, 20, 130}
>   {11:02, ORCL, 50, 50}
>
> There are more elaborate ways of defining windows but these cover ~99% of
> real-world cases, IMO. The remainder could be handled by a "user-defined
> window" SPI just as complex aggregate functions could be handled by a
> "user-defined aggregate".
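(A note on Yi's point a: if I read the model right, a fixed-size, non-overlapping window may not need OVER at all -- plain GROUP BY on the truncated timestamp might do. I'm guessing at the exact semantics, and assuming trunc() behaves as in Julian's example above:

  select stream trunc(rowtime to hour) as rowtime, ticker, sum(amount)
  from StockTrades
  group by trunc(rowtime to hour), ticker;

i.e. one output row per ticker per hour, emitted once the hour closes, rather than one row per input record. Whether a step size other than the window size fits this model, I'm not sure.)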
>
> > b. for output streams, there is no definition of "partitions". I
> > remember that we talked about it briefly in our early discussion and
> > you commented that it should not be there in SQL. I would argue that a
> > specification of how to partition the output stream is needed in the
> > current system, since we are lacking two things: i) automatically
> > identifying the key that can be used for optimal partitioning of a
> > query's output; ii) auto-scaling of the number of partitions of a
> > stream if the initial partitioning is not enough. Hence, we have to
> > give the user a way to tell the system how to partition. I am curious
> > about what exactly are the reasons that you believe partitioning should
> > not be added to the SQL syntax?
>
> For the same reason that in an RDBMS you do not say
>
>   INSERT INTO Emp:disk1 VALUES (10, 'Fred', 'Sales');
>   INSERT INTO Emp:disk2 VALUES (20, 'Bill', 'Marketing');
>
> Data independence. In an RDBMS you say
>
>   CREATE TABLE Emp(empno INTEGER, name VARCHAR(20), department VARCHAR(20))
>   PARTITION BY HASHCODE (department);
>
> and let the system take care of placing the records. You could do the
> same for a stream, partitioning based on a column, and could even
> partition based on its actual value, not its hash code:
>
>   CREATE STREAM Orders(..., p INTEGER) PARTITION BY VALUE (p);
>
> If Orders has 10 partitions, then
>
>   INSERT INTO Orders VALUES (..., 7);
>
> will go to partition #7 and
>
>   INSERT INTO Orders VALUES (..., 11);
>
> will fail. By putting the decisions in the hands of the optimizer, guided
> by metadata, the system can adapt without you having to change
> application code. Moving the partitioning decision into a data value is
> another database "trick" -- the context is all in the data.
>
> > BTW, it seems even with your proposed model, the current physical
> > operators won't be affected, which sounds good.
>
> Yes, that was my goal. Samza operators are fine as they are. You guys
> know what you are doing (and what your users want).
>
> Julian

--
Milinda Pathirage
PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org