Hi Jark,
thanks for the deep investigation and communication with Calcite and
Beam folks.
Given the new findings, +1 to vote.
Regards,
Timo
On 09.11.20 05:22, Jark Wu wrote:
Hi all,
After some offline discussion and investigation with Timo and Danny, I have
updated FLIP-145.
FLIP-145:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function
Here are the updates:
1. Add SESSION window syntax and examples.
2. Time attribute: the window TVF now returns 3 window columns, the additional
window_time column being a time attribute. Add a "Time Attribute Propagate"
section to explain how time attributes are propagated, with examples (see the
sketch after this list).
3. The old window syntax will be deprecated. We may drop the old syntax in
the future but that needs another discussion.
4. Add future work about simplifying the TABLE() keyword (we have already
started a discussion in Calcite [1]) and supporting COUNT windows.
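To make update 2 concrete, here is a rough sketch of a query against the
updated window TVF. The table name Bid, the columns bidtime/price/item, and
the exact TUMBLE signature are only illustrative; the FLIP page is the
authoritative reference.

    -- Illustrative only: the window TVF appends window_start, window_end and
    -- window_time, where window_time is a time attribute that can be used by
    -- follow-up time-based operations (e.g. another window or an interval join).
    SELECT bidtime, price, item, window_start, window_end, window_time
    FROM TABLE(
      TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));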
Besides, we also investigated whether it is possible to use a nested type
"window(start, end, time)" instead of 3 columns.
However, there are some problems that make this impossible for now:
- `window.start` can’t be selected in a GROUP BY query, because it is not a
grouping expression (see the sketch below).
  Postgres supports selecting nested fields of grouped ROW columns; we could
fix this in Calcite, but that is not trivial work.
- WINDOW is a token in the parser and can’t be used as a column name;
otherwise, parsing of the OVER WINDOW clause would fail.
- Apache Beam also considered putting wstart and wend in a separate nested
row [2]. However, that would limit these extensions
to engines that support nested rows, and many systems don't support nested
rows well.
Therefore, we still insist on using three fields.
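For illustration only (table and column names are made up), the first problem
would look like this with a nested "window" column, while the proposed flat
columns are accepted:

    -- Hypothetical nested-row design (NOT what the FLIP proposes): selecting
    -- window.start is rejected because it is not a grouping expression, and
    -- "window" itself clashes with the parser token.
    SELECT window.start, window.end, SUM(price)
    FROM TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
    GROUP BY window;

    -- Proposed flat columns: both fields are grouping expressions, so this works.
    SELECT window_start, window_end, SUM(price)
    FROM TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
    GROUP BY window_start, window_end;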
I would like to start a new VOTE for the updated FLIP-145 if there are no
objections.
Best,
Jark
[1]:
https://lists.apache.org/x/thread.html/ra98db08e280ddd9adeef62f456f61aedfdf7756e215cb4d66e2a52c9@%3Cdev.calcite.apache.org%3E
[2]:
https://docs.google.com/document/d/138uA7VTpbF84CFrd--cz3YVe0-AQ9ALnsavaSE2JeE4/edit?disco=AAAAHJ0EnGI
On Thu, 15 Oct 2020 at 21:03, Danny Chan <yuzhao....@gmail.com> wrote:
Hi, Timo ~
We are not forced by
the standard to do it as stated in the `One SQL to Rule Them All` paper
No, aligning with the SQL standard is always better. I think this is a common
practice in Flink SQL now; without a standard, everyone can state a
preference and the discussion easily drifts too far apart.
We can align the SQL windows more towards our regular DataStream API
windows, where you keyBy first and then apply a window operator.
I don't think the current DataStream window join implements the window
semantics correctly: it joins the data sets first and then windows the LHS and
RHS data together, whereas each input should actually window its data set
separately.
As for the "key by data set first", current window TVF appends just window
attributes and thus it is very light-weight and orthorhombic, we can
combine the window TVFs with additional joins, aggregations, TopN and so on.
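As a rough sketch of what I mean (table and column names are illustrative),
the same windowed relation can feed e.g. a per-window Top-N:

    -- Illustrative only: the TVF just appends window columns, so a regular
    -- ROW_NUMBER-based Top-N (or a join, aggregation, ...) can run on top of it.
    SELECT *
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY window_start, window_end
                                ORDER BY price DESC) AS rownum
      FROM TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
    )
    WHERE rownum <= 3;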
In SQL, people usually express "KEY BY" with the "GROUP BY" clause; that
means we would strongly bind the window TVF and the aggregate operator
together, which I would definitely vote -1 on.
As for PARTITION BY: there are specific cases for the SESSION window
because a session often has a logical key there, and we can extend the
PARTITION BY syntax because it is already in the SQL standard (see the sketch
below). But I'm confused why a TUMBLE window would have a partition key there.
What is the real use case?
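Just to illustrate what I mean by a session's logical key (the PARTITION BY
extension and all names here are only a sketch, not finalized syntax):

    -- Hypothetical sketch: per-user sessions with a 5 minute gap; the
    -- PARTITION BY extension would need to follow the SQL standard.
    SELECT userId, window_start, window_end, COUNT(*) AS clicks
    FROM TABLE(
      SESSION(TABLE Clicks PARTITION BY userId,
              DESCRIPTOR(clicktime), INTERVAL '5' MINUTES))
    GROUP BY userId, window_start, window_end;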
-1 for "ORDER BY" because sort on un-bounded data set does not have
meanings. For un-bounded data set we already has the watermark to handle
the out-of-orderness data, and for bounded data set, we can use the regular
sort here because current table argument allows any query actually.
Best,
Danny Chan
On 15 Oct 2020 at 5:16 PM +0800, dev@flink.apache.org wrote:
Personally, I find this easier to explain to users than telling them
why a session window has SET semantic input tables while
tumble/sliding windows have ROW semantic input tables.