Hi Julian,

Forgive me if I am slow in following your examples.
{quote}
You can also define a "paged" window, for example the cumulative total trades since the top of the hour:

select stream rowtime, ticker, amount,
  sum(amount) over (partition by ticker, trunc(rowtime to hour) order by rowtime)
from StockTrades;

For example, you might get:

{10:01, ORCL, 10, 10}
{10:02, MSFT, 30, 30}
{10:03, ORCL, 100, 110}
{10:17, MSFT, 45, 75}
{10:59, ORCL, 20, 130}
{11:02, ORCL, 50, 50}
{quote}

I am wondering whether I can get per-30-minute window averages. Say the following are the input events in a stream:

{10:01, ORCL, 10, 10}
{10:02, MSFT, 30, 30}
{10:03, ORCL, 100, 110}
{10:17, MSFT, 45, 75}
{10:59, ORCL, 20, 130}
{11:02, ORCL, 50, 50}

Can I get non-overlapping window averages for 10:00-10:29, 10:30-10:59, and so on? Could you give an example of how to define that window operation in your model? Note that in this use case, although I may receive one trading record per minute from the stream, I generate only two average records for 10:00-11:00. (I have put a speculative sketch of what I am hoping for after the quoted mail below.)

{quote}
CREATE TABLE Emp(empno INTEGER, name VARCHAR(20), department VARCHAR(20))
PARTITION BY HASHCODE (department);
{quote}

That's good! At least it resolves my question of "which field is the partition key". However, I still have a question about the number of partitions. As I stated, since the system currently does not have an auto-scaling feature, the number of partitions for a stream has to be explicitly specified. Where do you suggest putting this information without breaking SQL syntax? (One hypothetical spelling is sketched after the quoted mail below.)

On Thu, Jan 29, 2015 at 3:32 PM, Julian Hyde <jul...@hydromatic.net> wrote:
>
> > On Jan 29, 2015, at 3:04 PM, Yi Pan <nickpa...@gmail.com> wrote:
> >
> > Hi, Julian,
> >
> > Thanks for sharing your idea! It is interesting and well organized. Let
> > me try to summarize: the main differences between yours and the current
> > proposal are:
> > - removing the '[]' used to define the window specification, using OVER
> > on the stream/table instead
> > - join/select output can be either stream or table, since both become
> > legitimate data sources in a query.
>
> Yes.
>
> > There are still two points that I want to comment on:
> > a. in the current proposal, the window operator can specify a step size
> > for window advancement. With your examples, it seems that all windows
> > will be incremented by step size 1. Hence, the output of the query will
> > be a "continuous" stream, which reports a moving average, instead of a
> > fixed-size window average across all rows. Is it easy to extend your
> > model for that use case?
>
> I presume we are talking about windowed aggregation -- one row out for
> each row in.
>
> My previous example was a "sliding" window over a time interval.
>
> You can also have a sliding window over a number of rows:
>
> select symbol, sum(quantity) over (order by rowtime rows 10 preceding)
> from StockTrades
>
> The SQL syntax supports windows of various sizes, e.g.
>
> select name,
>   sum(sal) over (order by hiredate
>     range between shoesize preceding and age following)
> from emp;
>
> and thus each record defines its own window.
You can also define a "paged" > window, for example the cumulative total trades since the top of the hour: > > select stream rowtime, ticker, amount, > sum(amount) over (order by rowtime partition by ticker, trunc(rowtime to > hour)) > from StockTrades; > > For example, you might get: > > {10:01, ORCL, 10, 10} > {10:02, MSFT, 30, 30} > {10:03, ORCL, 100, 110} > {10:17, MSFT, 45, 75} > {10:59, ORCL, 20, 130} > {11:02, ORCL, 50, 50} > > There are more elaborate ways of defining windows but these cover ~99% of > real world cases, IMO. The remainder could be handled by a "user-defined > window" SPI just as complex aggregate functions could be handled by a > "user-defined aggregate". > > > b. for output streams, there is no definition of "partitions". I remember > > that we talked about it briefly in our early discussion and you commented > > that it should not be there in SQL. I would argue that a specification to > > how to partition the output stream is needed in the current system, since > > we are lacking of two things: i) automatically identify the key that can > be > > used for optimal partitioning of a query's output; ii) auto-scaling of > the > > number of partitions of a stream if the initial partition is not enough. > > Hence, we have to give user a way to tell the system how to partition. I > am > > curious about what exactly are the reasons that you believe that > partition > > should not be added in SQL syntax? > > For the same reason that in an RDBMS you do not say > > INSERT INTO Emp:disk1 VALUES (10, 'Fred', 'Sales'); > INSERT INTO Emp:disk2 VALUES (20, 'Bill', 'Marketing'); > > Data independence. In an RDBMS you say > > CREATE TABLE Emp(empno INTEGER, name VARCHAR(20), department VARCHAR(20)) > PARTITION BY HASHCODE (department); > > and let the system take care of placing the records. You could do the same > for a stream, partitioning based on a column, and could even partition > based on its actual value not its hash code: > > CREATE STREAM Orders(..., p INTEGER) PARTITION BY VALUE (p); > > If Orders has 10 partitions, then > > INSERT INTO Orders VALUES (..., 7); > > will go to partition #7 and > > INSERT INTO Orders VALUES (..., 11); > > will fail. By putting the decisions in the hands of the optimizer guided > by metadata, the system can adapt without you having to change application > code. Moving the partitioning decision into a data value is another > database "trick" -- the context is all in the data. > > > BTW, it seems even with your proposed model, the current physical > operators > > won't be affected, which sounds good. > > Yes, that was my goal. Samza operators are fine as they are. You guys know > what you are doing (and what your users want). > > Julian > >