Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function

Benchao Li Fri, 09 Oct 2020 23:03:35 -0700

Hi Jark,

Thanks for your reply, this makes sense to me.


The scenario I used above is just a case to explain what I'm concerned
about,
not necessarily a production use case. We can leave it to the future to see
whether
other users have these use cases.

Then I have no other concerns, +1 to start the VOTE.


Jark Wu <[email protected]> 于2020年10月10日周六 下午1:44写道：

> Hi Benchao,
>
> That's a good question.
>
> IMO, the new windowed operators and the current time operators are two
> different sets of functions,
> just like time operators and non-time operators are two different sets of
> functions.
> I think it's fine if we don't support integrating them, just like time
> operators can't be applied on non-windowed aggregate.
> If users want to use time operators in the whole pipeline, then he/she can
> use the grouped window aggregates instead of the window TVFs.
>
> The key idea of window TVF is that all the operators in the pipeline are
> based on the **windows**.
> In terms of syntax, if the key clause (e.g. group by, partitioned by, join
> on, order by) contains window_start and window_end,
> it can be translated into windowed operators.
> Thus, we will have windowed CEP, windowed sort, windowed over aggregate in
> the future to make it possible to build a windowed pipeline.
>
> But I think we can elaborate the integration more in the future if users
> need it. Actually, I don't fully understand the scenario of integrating
> window TVF and time operators at this point.
> For example, interval join an input stream and a window join result. I
> don't see why it can't be expressed by nested window join and why users
> have to use interval join here.
> Maybe we can wait for more inputs from users when the window TVF is
> released and we can elaborate it again.
>
> Best,
> Jark
>
> On Sat, 10 Oct 2020 at 12:01, 刘 芃成 <[email protected]> wrote:
>
>> Hi, Benchao,
>>        I think I got your point, actually, in current implementation for
>> group window aggregation, the value of time attributes(e.g.
>> TUMBLE_ROWTIME/TUMBLE_PROCTIME) is calculated as (window_end – 1), so I
>> think we can just use it directly if you need this. But I think this time
>> attributes is mainly suggested to use in case of cascaded window operations.
>> Regarding the example you provided, I think the semantics of the SQL in
>> your example which doing interval join(e.g. with TUMBLE_ROWTIME) after
>> window aggregation is not clear in the current implementation, and I think
>> that’s a strong reason why we need the new TVFs syntax.
>>       With the new syntax, users should understand which time column to
>> use and how to generate it when doing interval join and etc.
>>
>> Best,
>> Pengcheng
>>
>> 发件人: Benchao Li <[email protected]>
>> 日期: 2020年10月10日 星期六 上午11:02
>> 收件人: pengcheng Liu <[email protected]>
>> 抄送: dev <[email protected]>
>> 主题: Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function
>>
>> Hi pengcheng,
>>
>> Thanks for your response.
>> I knew that the original time attribute column will be retained after the
>> TVF,
>> what I'm questioning is how do we get the time attribute column after
>> Aggregation.
>> Your answer did not remove my doubts about this.
>>
>> It's ok if we did not plan to integrate new TVF aggregate with old "time
>> attribute scenarios"
>> listed in my previous email in this FLIP. However it's good to elaborate
>> more on this, and
>> leave it to the future plan.
>>
>> pengcheng Liu <[email protected]<mailto:
>> [email protected]>> 于2020年10月10日周六 上午10:45写道：
>> Hi，Benchao,
>>     In TVFs, the time attributes is just passed through from parent rels,
>> and the TVFs just add two
>>     additional window attributes(i.e. window_start & window_end). Also, I
>> think the time columns can be not only a time attribute
>>     with type of `TimeIndicatorType` but also a regular column with type
>> of `Timestamp`.
>>
>>     For cascaded window operations, we can use window_start/window_end of
>> the previous window result directly to
>>     indicate operating on the same window, or use  new DESCRIPTOR column
>> to assign new windows, in case of the change of
>>     the time column(e.g. in some case, the original timestamp is
>> inaccurate and need some conversion to be used).
>>
>>     You can check the definition or signature of these TVFs in the FLIP.
>>      e.g.
>>   SELECT * FROM TABLE(
>>    TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
>>     In the example, the `bidtime` is the time attribute column, which is
>> the first operand of the DESCRIPTOR function.
>>
>>     +1 start voting.
>>
>> Benchao Li <[email protected]<mailto:[email protected]>>
>> 于2020年10月10日周六 上午10:08写道：
>> Hi Jark,
>>
>> 2 & 3 sounds good to me.
>>
>> Regarding time attribute,
>> I still have some questions, I knew it's easy to support cascaded window
>> aggregate using new TVFs.
>> However there are some other places where need time attribute:
>> - CEP
>> - interval join
>> - order by
>> - over window
>> If there is no time attribute column, how do we integrate these old
>> features with the new TVFs.
>> E.g.
>> StreamA -> new window aggregate -> interval join -> Sink
>>                                                          /
>> StreamB -----------------------------------
>>
>>
>> Jark Wu <[email protected]<mailto:[email protected]>> 于2020年10月9日周五
>> 下午11:51写道：
>> Hi Benchao,
>>
>> 1) time attribute
>> Yes. We don't need time attribute auxiliary function. Because the new
>> window operations are all based on the
>>  window_start and window_end columns instead of on the time attributes. So
>> we don't need to propagate time attributes.
>> Cascaded window aggregate can be expressed by simply GROUP BY the
>> window_start and window_end of the previous window result.
>> I have added a cascaded window aggregate example in the Tumbling Window
>> section in the FLIP.
>> If you want to define proctime window aggregate, the time column in TVF
>> should be a proctime attribute field (or PROCTIME() function).
>>
>> 2) batch support
>> Yes. The proposed syntax/API are unified for batch and streaming. Batch
>> support is in the plan, but may not have enough time to catch up 1.12.
>>
>> 3) support `grouping sets`
>> This is not included in the FLIP, but I think it's great if we can support
>> `grouping sets`.
>> The existing window impl doesn't support this because we convert the
>> LogicalAggregate into WindowAggregate in the beginning,
>> the expand grouping sets rule can't be applied in this situation.
>> Fortunately, with the new window impl, the conversion to WindowAggregate
>> will happen at the end, so I think the expand rule can be
>>  applied and support this feature naturally.
>> Therefore, IMO, we don't need to include this feature in this FLIP to
>> avoid
>> the FLIP being too large.
>> This can be a follow-up issue (maybe just add tests and docs) after the
>> FLIP.
>>
>> Best,
>> Jark
>>
>>
>> On Fri, 9 Oct 2020 at 19:09, 刘 芃成 <[email protected]<mailto:
>> [email protected]>> wrote:
>>
>> > Hi，Benchao,
>> >         Welcome to join the discussion, yes, this new syntax can make
>> SQL
>> > more clear and simpler.
>> >         For your first question, the `window_start` and `window_end`
>> > columns will be added automatically,
>> >         so we don't need to use auxiliary group functions to infer or
>> > access the window properties.
>> >
>> >         For the `grouping sets` on TVFs, I think it's interesting if we
>> > can support it, as we already supported `grouping sets`
>> >         on streaming aggregates in blink planner. But I'm not sure if it
>> > will be included into this FLIP.
>> >
>> >         cc @Jark Wu
>> >
>> > Best,
>> > Pengcheng
>> >
>> >
>> > 在 2020/10/9 下午5:25，“Benchao Li”<[email protected]<mailto:
>> [email protected]>> 写入:
>> >
>> >     Thanks Jark for bringing this discussion, I like this FLIP very
>> much.
>> >
>> >     Especially the cumulate window, it's much like the current TUMBLE
>> > window +
>> >     Fast Emit (which is an undocumented experimental feature), however,
>> > it's
>> >     more powerful.
>> >
>> >     And This will make the SQL semantic more standard, especially for
>> the
>> >     HOPPING window.
>> >
>> >     Regarding time attribute,
>> >     It seems that we don't need a specific function to infer the time
>> > attribute
>> >     like
>> >     `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and
>> >     `window_end`
>> >     column a time attribute column automatically?
>> >     - If not, what will be the time attribute of the result relation of
>> > these
>> >     TVFs?
>> >       Especially after the window aggregation.
>> >     - If yes, then how do we handle proctime?
>> >
>> >     Regarding batch operators,
>> >     It's great to hear that we can reuse the batch operators in
>> continuous
>> >     batch mode
>> >     as you mentioned in the FLIP.
>> >     Current window aggregate could also be used in batch mode with
>> > rowtime. Do
>> >     you plan
>> >     to support these TVFs for batch mode in this FLIP? Hence the
>> Table/SQL
>> > is a
>> >     unified
>> >     API, it's great if we can keep the features complete both in
>> streaming
>> > and
>> >     batch mode.
>> >
>> >     There is one more question, I don't know whether it should be
>> > considered in
>> >     this FLIP.
>> >     Does the new window support `grouping sets`? (It's not supported in
>> old
>> >     window impl).
>> >
>> >     Jark Wu <[email protected]<mailto:[email protected]>> 于2020年10月9日周五
>> 下午4:14写道：
>> >
>> >     > Hi all,
>> >     >
>> >     > I know we have a lot of discussion and development on going right
>> > now but
>> >     > it would be great if we can get FLIP-145 into a votable state.
>> >     > If there are no objections, I would like to start voting in the
>> next
>> > days.
>> >     >
>> >     > Best,
>> >     > Jark
>> >     >
>> >     > On Thu, 1 Oct 2020 at 14:29, Jark Wu <[email protected]<mailto:
>> [email protected]>> wrote:
>> >     >
>> >     > > Hi everyone,
>> >     > >
>> >     > > I have added a section for Performance Optimization to describe
>> > how to
>> >     > > improve the performance in the short-term and long-term
>> >     > > and sketch the future performance potential under the new window
>> > API.
>> >     > > Introducing the window API is just the first step, we will
>> >     > > continuously improve the performance to make it powerful and
>> > useful.
>> >     > >
>> >     > > Best,
>> >     > > Jark
>> >     > >
>> >     > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <[email protected]<mailto:
>> [email protected]>> wrote:
>> >     > >
>> >     > >> Hi Pengcheng,
>> >     > >>
>> >     > >> Yes, the window TVF is part of the FLIP. Welcome to contribute
>> > and join
>> >     > >> the discussion.
>> >     > >> Regarding the SESSION window aggregation, users can use the
>> > existing
>> >     > >> grouped session window function.
>> >     > >>
>> >     > >> Best,
>> >     > >> Jark
>> >     > >>
>> >     > >> On Sun, 27 Sep 2020 at 21:24, liupengcheng <
>> > [email protected]<mailto:[email protected]>
>> >     > >
>> >     > >> wrote:
>> >     > >>
>> >     > >>> Hi Jark,
>> >     > >>>         Thanks for reply, yes, I think it's a good feature, it
>> > can
>> >     > >>> improve the NRT scenarios
>> >     > >>>         as you mentioned in the FLIP. Also, I think it can
>> > improve the
>> >     > >>> streaming SQL greatly,
>> >     > >>>         it can support richer window operations in flink SQL
>> and
>> > bring
>> >     > >>> great convenience to users.
>> >     > >>>         (we are now only supported group window in flink).
>> >     > >>>
>> >     > >>>         Regarding the SESSION window, I think it's especially
>> > useful
>> >     > for
>> >     > >>> user behavior analysis(e.g.
>> >     > >>>         counting user visits on a news website or social
>> > platform), but
>> >     > >>> I agree that we can keep it
>> >     > >>>         out of the FLIP now to catch up 1.12.
>> >     > >>>
>> >     > >>>         Recently, I've done some work on the stream planner
>> with
>> > the
>> >     > >>> TVFs, and I'm willing to contribute
>> >     > >>>         to this part. Is it in the plan of this FLIP?
>> >     > >>>
>> >     > >>>         Best,
>> >     > >>>         PengchengLiu
>> >     > >>>
>> >     > >>>
>> >     > >>> 在 2020/9/26 下午11:09，“Jark Wu”<[email protected]<mailto:
>> [email protected]>> 写入:
>> >     > >>>
>> >     > >>>     Hi pengcheng,
>> >     > >>>
>> >     > >>>     That's great to see you also have the need of window join.
>> >     > >>>     You are right, the windowing TVF is a powerful feature
>> which
>> > can
>> >     > >>> support
>> >     > >>>     more operations in the future.
>> >     > >>>     I think it as of the date time "partition" selection in
>> > batch SQL
>> >     > >>> jobs,
>> >     > >>>     with this new syntax, I think it is possible
>> >     > >>>      to migrate traditional batch SQL jobs to Flink SQL by
>> > changing a
>> >     > >>> few lines.
>> >     > >>>
>> >     > >>>     Regarding the SESSION window, this is on purpose to keep
>> it
>> > out of
>> >     > >>> the
>> >     > >>>     FLIP, because we want to keep the
>> >     > >>>     FLIP small to catch up 1.12 and SESSION TVF is rarely
>> useful
>> > (e.g.
>> >     > >>> session
>> >     > >>>     window join?).
>> >     > >>>
>> >     > >>>     Best,
>> >     > >>>     Jark
>> >     > >>>
>> >     > >>>     On Fri, 25 Sep 2020 at 22:59, liupengcheng <
>> >     > >>> [email protected]<mailto:
>> [email protected]>>
>> >     > >>>     wrote:
>> >     > >>>
>> >     > >>>     > Hi, Jark,
>> >     > >>>     >         I'm very interested in this feature, and I'm
>> also
>> > working
>> >     > >>> on this
>> >     > >>>     > recently.
>> >     > >>>     >         I just have a glance at the FLIP, it's good,
>> but I
>> > found
>> >     > >>> that
>> >     > >>>     > there is no plan to add SESSION windows.
>> >     > >>>     >         Also, I think there can be more things we can do
>> > based on
>> >     > >>> this new
>> >     > >>>     > syntax. For example,
>> >     > >>>     >         - window sort support
>> >     > >>>     >         - window union/intersect/minus support
>> >     > >>>     >         - Improve dimension table join
>> >     > >>>     >         We can have more deep discussion on this new
>> > feature
>> >     > later
>> >     > >>> .
>> >     > >>>     >         I've also opened an jira that is related to this
>> > feature
>> >     > >>> recently:
>> >     > >>>     > https://issues.apache.org/jira/browse/FLINK-18830
>> >     > >>>     >
>> >     > >>>     > Best!
>> >     > >>>     > PengchengLiu
>> >     > >>>     >
>> >     > >>>     > 在 2020/9/25 下午10:30，“Jark Wu”<[email protected]<mailto:
>> [email protected]>> 写入:
>> >     > >>>     >
>> >     > >>>     >     Hi everyone,
>> >     > >>>     >
>> >     > >>>     >     I want to start a FLIP about supporting windowing
>> >     > table-valued
>> >     > >>>     > functions
>> >     > >>>     >     (TVF).
>> >     > >>>     >     The main purpose of this FLIP is to improve the near
>> >     > real-time
>> >     > >>> (NRT)
>> >     > >>>     >     experience of Flink.
>> >     > >>>     >
>> >     > >>>     >     FLIP-145:
>> >     > >>>     >
>> >     > >>>     >
>> >     > >>>
>> >     >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function
>> >     > >>>     >
>> >     > >>>     >     We want to introduce TUMBLE, HOP, CUMULATE windowing
>> > TVFs,
>> >     > the
>> >     > >>>     > CUMULATE is
>> >     > >>>     >     a new kind of window.
>> >     > >>>     >     With the windowing TVFs, we can support richer
>> > operations on
>> >     > >>> windows,
>> >     > >>>     >     including window join, window TopN and so on.
>> >     > >>>     >     This makes things simple: we only need to assign
>> > windows at
>> >     > the
>> >     > >>>     > beginning
>> >     > >>>     >     of the query, and then apply operations after that
>> like
>> >     > >>> traditional
>> >     > >>>     > batch
>> >     > >>>     >     SQL.
>> >     > >>>     >     We hope it can help to reduce the learning curve of
>> > windows,
>> >     > >>> improve
>> >     > >>>     > NRT
>> >     > >>>     >     for Flink, and attract more batch users.
>> >     > >>>     >
>> >     > >>>     >     A simple code snippet for 10 minutes tumbling window
>> >     > aggregate:
>> >     > >>>     >
>> >     > >>>     >     SELECT window_start, window_end, SUM(price)
>> >     > >>>     >     FROM TABLE(
>> >     > >>>     >         TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL
>> > '10'
>> >     > >>> MINUTES))
>> >     > >>>     >     GROUP BY window_start, window_end;
>> >     > >>>     >
>> >     > >>>     >     I'm looking forward to your feedback.
>> >     > >>>     >
>> >     > >>>     >     Best,
>> >     > >>>     >     Jark
>> >     > >>>     >
>> >     > >>>     >
>> >     > >>>     >
>> >     > >>>
>> >     > >>>
>> >     > >>>
>> >     >
>> >
>> >
>> >     --
>> >
>> >     Best,
>> >     Benchao Li
>> >
>>
>>
>> --
>>
>> Best,
>> Benchao Li
>>
>>
>> --
>>
>> Best,
>> Benchao Li
>>
>

-- 

Best,
Benchao Li

Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function

Reply via email to