Re: [DISCUSS] Some quesitons about the new introduced window table functions (TUMBLE, HOP and SESSION)

Rui Wang Wed, 21 Oct 2020 21:01:59 -0700

This seems reasonable to merge window metadata as a column (more compact
format).


Maybe Julian can comment whether there was special consideration to define
window_start and window_end as two separate columns?


-Rui

On Wed, Oct 21, 2020 at 8:20 PM Danny Chan <[email protected]> wrote:

> Some of our fellows suggest the window table value functions to append a
> structure type field:
>
> window: ROW(start: timestamp(3), end: timestamp(3))
>
> For example if
>
> table T has schema: (ts: timestamp(3), b: varchar(20)), then
>
> Tumble(TABLE T, DESCRIPTOR(T.ts), INTERVAL '5' MINUTE)
>
> has return type
>
> Row(ts timestamp(3), b varchar(20), window Row(start timestamp(3), end
> timestamp(3)))
>
> Thus in the aggregate key or the join condition, we can compare the
> structure “window” field directly, e.g.
>
> GROUP BY T.window or ON L.window = R.window
>
> What do you think about this ?
>
> Rui Wang <[email protected]> 于2020年6月9日周二 上午1:34写道：
>
> > On Mon, Jun 8, 2020 at 12:04 AM Viliam Durina <[email protected]>
> > wrote:
> >
> > > > 2. rename the hop_size to emit_frequency. Because that's what this
> > > argument tries to say: e.g. emit in every x time_unit.
> > >
> > > It's not about how often you emit. By "emit" I mean the moment when the
> > > rows are sent out. For example, when working with event time and the
> > input
> > > stalls, you might emit much later. Or you can use it for historical
> data
> > in
> > > which case the actual interval will be much shorter. It's the time
> > interval
> > > by which the window "hops". Two subsequent windows will be this much
> > apart.
> > >
> > > Regarding the DESCRIPTOR, the PTF is supposed to be implemented by up
> to
> > 4
> > > member functions: describe, start, fulfill, finish. The `describe`
> > function
> > > is supposed to be called during validation. It can validate the
> > arguments.
> > > In case of TUMBLE, it can check whether the column described by the
> > > descriptor is present in the input table and if it is of a correct
> type.
> > It
> > > can throw errors which will become validation errors.
> >
> > Yes. Descriptor has a feature to enable a type checking. If considering
> it,
> > functions will be (argument name not decided yet):
> >
> > TUMBLE(
> > data TABLE,
> > time_column DESCRIPTOR(*TIMESTAMP*),
> > window_size INTERVAL DAY TO SECOND,
> > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> >
> > HOP(
> > data TABLE,
> > time_column DESCRIPTOR(*TIMESTAMP*),
> > window_size INTERVAL DAY TO SECOND,
> > emit_frequency INTERVAL DAY TO SECOND,
> > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> >
> >
> >
> >
> > > It also determines
> > > the actual output table type.
> > >
> > > I think it's worthwhile to read the PTF spec before implementing this,
> > it's
> > > freely available here:
> > >
> > >
> >
> https://standards.iso.org/ittf/PubliclyAvailableStandards/c069776_ISO_IEC_TR_19075-7_2017.zip
> > >
> > > Viliam
> > >
> > > On Sun, 7 Jun 2020 at 06:23, Rui Wang <[email protected]> wrote:
> > >
> > > > Thanks Danny for your summary!
> > > >
> > > > For question one, I prefer the ordering of parameter in option two.
> > > > However, for argument names, I think it might be better to
> > > > 1. name the window size explicitly by "window_size".
> > > > 2. rename the hop_size to emit_frequency. Because that's what this
> > > > argument tries to say: e.g. emit in every x time_unit.
> > > > 3. don't use "table" as the first argument name, because "table" is a
> > > > keyword. So we can replace it by "data"
> > > >
> > > > so it might be
> > > > TUMBLE(data, DESCRIPTOR(time_column), window_size[, offset])
> > > > HOP(data, DESCRIPTOR(time_column), window_size, emit_frequency[,
> > offset])
> > > >
> > > > Meanwhile, one thing worths mentioning is, like Viliam always
> discussed
> > > in
> > > > CALCITE-4000[1], we can also finalize the data types for each
> > signature:
> > > > TUMBLE(
> > > > data TABLE,
> > > > time_column DESCRIPTOR,
> > > > window_size INTERVAL DAY TO SECOND,
> > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > >
> > > > HOP(
> > > > data TABLE,
> > > > time_column DESCRIPTOR,
> > > > window_size INTERVAL DAY TO SECOND,
> > > > emit_frequency INTERVAL DAY TO SECOND,
> > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > >
> > > > So those intervals have to be INTERVAL DAY TO SECOND, not INTERVAL
> YEAR
> > > TO
> > > > MONTH, because year to month varies in seconds.
> > > >
> > > >
> > > > Regarding question two, I tried to dig into codebase and I think it
> > could
> > > > be a RexCorrelVariable to replace that table_name, so it becomes
> > > > table_function(RexCorrelVariable, DESCRIPTOR($1), internal:INTERVAL
> > > MINUTE)
> > > >
> > > > I know RexCorrelVariable might be not designed for this purpose, but
> I
> > > > find RexCorrelVariable is really useful because:
> > > > 1. It can provide an id, to indicate which input of TableFunctionScan
> > > that
> > > > this table_function is applied on. Note that TableFunctionScan might
> > not
> > > > have one input.
> > > > 2. It can provide RelDataType, to save the referenced input's schema.
> > The
> > > > input schema is useful during operator argument type validation (for
> > > > DESCRIPTOR) and operator return type inference.
> > > >
> > > >
> > > > [1]:
> > > >
> > > >
> > >
> >
> https://jira.apache.org/jira/browse/CALCITE-4000?focusedCommentId=17107665&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17107665
> > > >
> > > >
> > > > -Rui
> > > >
> > > > On Thu, Jun 4, 2020 at 6:02 AM Danny Chan <[email protected]>
> > wrote:
> > > >
> > > > > Hi, all, in the last 2 releases, Rui Wang has contributed the
> window
> > > > table
> > > > > functions to replace the legacy group window syntax, the idea comes
> > > from
> > > > > the SQL-2016 polymorphic table functions. But the current
> > > implementation
> > > > > also got some feedbacks(or confusion), this thread tries to have a
> > > > > discussion on theses questions that are undecided.
> > > > >
> > > > > One is about the window function argument names:
> > > > >
> > > > > === option1
> > > > > TUMBLE(table, DESCRIPTOR(datetime), interval, offset)
> > > > > HOP(table, DESCRIPTOR(time_column), slide, size)
> > > > >
> > > > > === option2
> > > > > TUMBLE(table, DESCRIPTOR(time_column), size, offset)
> > > > > HOP(table, DESCRIPTOR(time_column), size, hop_size, offset)
> > > > >
> > > > > I would prefer for option2 for the reasons pointed out by Viliam in
> > > > > CALCITE-3737.
> > > > >
> > > > >
> > > > > Another is about how to translate the query
> > > > >
> > > > > For query, select * from TUMBLE(table, DESCRIPTOR(time_column),
> > > > interval);
> > > > >
> > > > > Currently our outputs plan is:
> > > > >
> > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1], window_start=[$2],
> > > > > window_end=[$3])
> > > > > LogicalTableFunctionScan(invocation=[TUMBLE(DESCRIPTOR($1),
> > > > > interal:INTERVAL MINUTE)], rowType=[RecordType(INTEGER ORDERID,
> > > > > TIMESTAMP(0) time_column, TIMESTAMP(0) window_start, TIMESTAMP(0)
> > > > > window_end)])
> > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1])
> > > > > LogicalTableScan(table=[[CATALOG, SALES, SHIPMENTS]])
> > > > >
> > > > > But this seems not right, because the "table table_name" argument
> is
> > > not
> > > > a
> > > > > relational expression, in CALCITE-3955, Julian has pointed out that
> > we
> > > > > should translate the "table" argument as a CURSOR (or TABLE by
> > Viliam),
> > > > but
> > > > > another question is how to translate the referenced name
> > "time_column",
> > > > to
> > > > > a correlate variable ? which is also confusing because there is no
> > > > > correlation in the plan actually.
> > > > >
> > > > > Any suggestions are appreciated !
> > > > >
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/CALCITE-3955
> > > > > [2] https://issues.apache.org/jira/browse/CALCITE-3737
> > > > >
> > > > > Best,
> > > > > Danny Chan
> > > > >
> > > >
> > >
> > >
> > > --
> > > Viliam Durina
> > > Jet Developer
> > >       hazelcast®
> > >
> > >   <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA
> > 94402 |
> > > USA
> > > +1 (650) 521-5453 <(650)%20521-5453> <(650)%20521-5453> |
> hazelcast.com <
> > > https://www.hazelcast.com>
> > >
> > > --
> > > This message contains confidential information and is intended only for
> > > the
> > > individuals named. If you are not the named addressee you should not
> > > disseminate, distribute or copy this e-mail. Please notify the sender
> > > immediately by e-mail if you have received this e-mail by mistake and
> > > delete this e-mail from your system. E-mail transmission cannot be
> > > guaranteed to be secure or error-free as information could be
> > intercepted,
> > > corrupted, lost, destroyed, arrive late or incomplete, or contain
> > viruses.
> > > The sender therefore does not accept liability for any errors or
> > omissions
> > > in the contents of this message, which arise as a result of e-mail
> > > transmission. If verification is required, please request a hard-copy
> > > version. -Hazelcast
> > >
> >
>

Re: [DISCUSS] Some quesitons about the new introduced window table functions (TUMBLE, HOP and SESSION)

Reply via email to