Re: How to contribute to Streaming Table API and StreamSQL

Jark Wu Fri, 17 Jun 2016 04:19:37 -0700

Hi Fabian, 

Yea, we can immediately start to work on non-windowed aggregates. But it seems 
that Calcite’s StreamSQL doesn’t support non-windowed aggregates (also not 
included in roadmap). So we may need to propose this function back to Calcite 
community?


- Jark Wu 

> 在 2016年6月17日，下午5:41，Fabian Hueske <fhue...@gmail.com> 写道：
> 
> Hi Jark Wu,
> 
> I agree about the non-windowed aggregates. If there are actual use cases
> for this operator, we should definitely support it.
> Since it does not depend on windows or time, we can immediately start to
> work on it. In principle, it should be rather easy to implement.
> However, we have to check how well it integrates with the current state of
> Calcite.
> 
> I think forking off a feature branch is a good idea. We have done that
> before (e.g., for porting the Table API on top of Calcite), but it is not
> so common in the Flink community.
> So I would first send a note to the dev list and check that nobody objects.
> 
> I think we can decouple the development of the Table API and SQL. Although
> it is desirable to have the same feature set in both APIs, I would not be
> strict about it.
> However, the Table API does also depend on Calcite because all Table API
> queries go through Calcite's logical plan representation and optimizer. By
> decoupling the SQL and Table API feature development, we do not need to
> wait for the SQL parser but still might still need certain features in the
> logical plan or optimizer. I hope we can solve a lot with custom RelNodes
> and optimizer rules which should eventually be contributed back to Calcite.
> 
> Best, Fabian
> 
> 
> 2016-06-17 9:48 GMT+02:00 Jark Wu <wuchong...@alibaba-inc.com>:
> 
>> Hi Fabian,
>> 
>> There are a lot of our business are using non-windowed aggregations. And
>> there is a little difference between non-windowed aggregate and Row window
>> operator, as the later is bound to a certain window and emit the result of
>> the N rows preceding for every incoming row. However the former emit the
>> aggregate result of the whole elements. So I suggest to add them for more
>> complete semantic.
>> 
>> Regarding the windowed aggregate task, I’m agree with that and I'm looking
>> forward as soon as possible to see the corresponding JIRA issues created.
>> After that, we can start working on an independent branch without waiting
>> for 1.1 released. But I’m still a little concerned about Calcite’s support,
>> as we must waiting for Calcite supporting correspond syntax and the
>> version released. If we can separate the task into Table API and SQL  , we
>> may not be blocked by Calcite too much.
>> 
>> What do you think?
>> 
>> - Jark Wu
>> 
>>> 在 2016年6月16日，下午8:37，Fabian Hueske <fhue...@gmail.com> 写道：
>>> 
>>> Hi Jark,
>>> 
>>> thanks for sharing Blink's Streaming Table API. It seems to be close to
>> the
>>> DataStream API, while the Table API draft I shared is more similar to
>>> Calcite's proposal.
>>> You are right, the current draft does not include running (non-windowed)
>>> aggregates. We were not sure how useful these are since these aggregates
>>> are unbound and might become meaningless after being applied on a very
>> long
>>> stream. However, we can certainly add them, if users request them.
>>> An alternative to running aggregates could be what I called "Row window
>>> operators" in the streaming Table API draft. These operators emit an
>>> aggregate for each incoming row, however the aggregate is bound to a
>>> certain window around the row like the 10 rows preceding the row for
>> which
>>> the aggregate is computed. Calcite calls these windows "Sliding windows"
>>> (Attention: This is different from Flink's terminology, in Flink sliding
>>> windows are something different). Row windows are similar to running
>>> aggregates in that they emit a row for each incoming row. You can also
>>> think of them as a (Flink) sliding count window which is evaluate for
>> each
>>> incoming record.
>>> 
>>> Further differences are the support of Scalar UDFs in the Table API and
>> the
>>> support for joins which have not been drafted for the Table API yet.
>>> Scalar UDFs are definitely also on our roadmap and with upcoming support
>>> for side inputs, the DataStream API will also support more types of
>> joins.
>>> 
>>> Regarding the current state of Stream SQL in Calcite I am not up to date.
>>> 
>>> I would propose to start with the effort of adding support for windowed
>>> aggregates as follows:
>>> 
>>> 1) Add support to define a timestamp / watermark extractor to tables.
>> This
>>> includes to define a "quasi-monotone" column in a Table's schema. Calcite
>>> will use this information to reason about the validity of a query (making
>>> sure that grouping includes at least one monotone attribute).
>>> 2) Add support for sorting a stream on the timestamp attribute. While
>>> sorting itself is not very exciting, it is an easy operation and can be
>>> immediately implemented without worrying about API questions. This will
>>> also show how well Calcite supports the reasoning about monotone
>> attributes.
>>> 3) Add support for tumbling windows.
>>> 
>>> In each of these steps we might need to get involved with the Calcite
>>> community, depending on Calcite's current support for "quasi-monotone"
>>> attributes, etc.
>>> 
>>> What do you think?
>>> 
>>> Best, Fabian
>>> 
>>> 
>>> 2016-06-14 11:03 GMT+02:00 Jark Wu <wuchong...@alibaba-inc.com>:
>>> 
>>>> Hi Fabian,
>>>> 
>>>> It’s great to hear that we are going to start it!
>>>> 
>>>> I’m glad to share our current Streaming Table API [1]. I find that that
>>>> all aggregation functions are scoped to the defined window in Flink
>> Stream
>>>> Table API design [2] and Calcite StreamSQL desgin [3]. I’m thinking
>> that do
>>>> we need global aggregation? The global aggregation means that
>> aggregation
>>>> is applied only on grouped key not including window which is supported
>> in
>>>> DataStream `datastream.keyBy(f1).sum(f2)`.
>>>> 
>>>> Since the window syntax of StreamSQL is not implemented yet, so will we
>>>> help Calcite community with that first or work code for window+agg Table
>>>> API first ?
>>>> 
>>>> 
>>>> [1]
>>>> 
>> https://docs.google.com/document/d/1KMUzvBAWSyQ39T8MyxUi0zNHyvLUnyGMPA7_RLSDpFw/edit?usp=sharing
>>>> <
>>>> 
>> https://docs.google.com/document/d/1KMUzvBAWSyQ39T8MyxUi0zNHyvLUnyGMPA7_RLSDpFw/edit?usp=sharing
>>>>> 
>>>> [2]
>>>> 
>> https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E/edit#
>>>> <
>>>> 
>> https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E/edit#
>>>>> 
>>>> [3] https://calcite.apache.org/docs/stream.html#tumbling-windows <
>>>> https://calcite.apache.org/docs/stream.html#tumbling-windows>
>>>> 
>>>> 
>>>> - Jark Wu
>>>> 
>>>>> 在 2016年6月14日，上午1:10，Fabian Hueske <fhue...@gmail.com> 写道：
>>>>> 
>>>>> Hi Jark,
>>>>> 
>>>>> wow, that's good news!
>>>>> You are right, the streaming Table API is currently very limited. In
>> the
>>>>> last month's we changed the internal architecture and put everything on
>>>> top
>>>>> of Apache Calcite.
>>>>> For the upcoming 1.1 release, we won't add new features to the Table
>> API
>>>> /
>>>>> SQL. However for the 1.2 release, it we plan to focus on the streaming
>>>>> Table API and Stream SQL to add support for windowed aggregates and
>>>> joins,
>>>>> which corresponds to Task 7 and 9 in the design document. You are
>>>>> completely right, that we should start to add tickets to JIRA for
>> this. I
>>>>> will do that tomorrow.
>>>>> 
>>>>> It is great that you have already working code for windowed aggregates
>>>> and
>>>>> joins! Here is a link to our current API draft for windows in the Table
>>>> API
>>>>> [1]. Would be great if you could share how your API looks like. As you
>>>>> said, the internals have changed a lot by now, but we might want to
>> reuse
>>>>> your API for Table API windows and maybe the code of the runtime.
>>>> However,
>>>>> we need to go through Calcite for optimization and SQL support, so some
>>>>> parts need to be definitely changed. Stream SQL is also on the roadmap
>> of
>>>>> the Calcite community, but it might be that some features that we will
>>>> need
>>>>> are not completed yet. So, maybe we help the Calcite community with
>> that
>>>> as
>>>>> well.
>>>>> 
>>>>> If you want to contribute, you should first read the how to contribute
>>>>> guide [2] and guide for code contributions [3].
>>>>> The general rule is to first open a JIRA and later a pull request.
>>>> Changes
>>>>> that are extensive or modify current behavior (except bugs) should be
>>>>> discussed before starting to work on them.
>>>>> 
>>>>> Looking forward to work with you on Flink,
>>>>> Fabian
>>>>> 
>>>>> [1]
>>>>> 
>>>> 
>> https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E/edit#heading=h.3iw7frfjdcb2
>>>>> [2] http://flink.apache.org/how-to-contribute.html
>>>>> [3] http://flink.apache.org/contribute-code.html
>>>>> 
>>>>> 
>>>>> 2016-06-13 11:31 GMT+02:00 Jark Wu <wuchong...@alibaba-inc.com>:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> We have a great interest in the new Table API & SQL. In Alibaba, we
>> have
>>>>>> added a lot of features (groupBy, agg, window, join, UDF …) to
>> Streaming
>>>>>> Table API (base on Flink 1.0). Now, many jobs run on Table API in
>>>>>> production environment. But we want to keep pace with the community,
>>>> and we
>>>>>> have noticed that Flink Community reworked the Table API and also
>>>> supported
>>>>>> SQL. That is really cool. However, the Streaming Table API is still so
>>>>>> weak. So we want to contribute to accelerate the Streaming Table API
>> and
>>>>>> StreamSQL growth.
>>>>>> 
>>>>>> It seems that we have complete Task-5 and Task-6 mentioned in the Work
>>>>>> Plan <
>>>>>> 
>>>> 
>> https://docs.google.com/document/d/1TLayJNOTBle_-m1rQfgA6Ouj1oYsfqRjPcp1h2TVqdI/edit#
>>>>> .
>>>>>> So can we start Task-7 and Task-9 now? Is there any more specific
>>>> plans? I
>>>>>> think it’s better to create an umbrella JIRA like FLINK-3221 to make
>> the
>>>>>> develop plan clearer.
>>>>>> 
>>>>>> If I want to contribute code for groupBy and agg function, what
>> should I
>>>>>> do? As I didn’t find related JIRAs, can I create JIRA and pull a
>> request
>>>>>> directly?
>>>>>> 
>>>>>> Sorry for so many questions at a time.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> - Jark Wu (wuchong)
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>>

Re: How to contribute to Streaming Table API and StreamSQL

Reply via email to