Hi Fabian,

Thanks for your suggestions. I agree with you that `UNQ(f2, f3)` is clearer.
A table can have only ONE primary key, and this primary key can consist of one or multiple columns. [1]
If the primary key consists of a single column, we can simply use `PRI` (or `PRI(xx)`) to represent it.
If the primary key consists of multiple columns, we should use `PRI(xx, yy, ...)` to represent it.

A table may have multiple unique keys, and each unique key can consist of one or multiple columns. [2]
If there is only one unique key and this unique key has only a single column, we can simply use `UNQ` (or `UNQ(xx)`) to represent it.
Otherwise, we should use `UNQ(xx, yy, ...)` to represent it.
(A corner case: two unique keys with the same columns, like `UNQ(f2, f3)` and `UNQ(f2, f3)`. We can forbid this case, or add a unique name for each key in the future.)

An example with a multi-column primary key and a multi-column unique key:

create table MyTable (
  f0 BIGINT NOT NULL,
  f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
  f2 VARCHAR(256),
  f3 AS f0 + 1,
  f4 TIMESTAMP(3) NOT NULL,
  f5 BIGINT NOT NULL,
  PRIMARY KEY (f0, f5),
  UNIQUE (f3, f2),
  WATERMARK f4 AS f4 - INTERVAL '3' SECOND
) with (...)
+------+---------------------------------+-------+-------------+----------------+--------------------------+
| name | type                            | null  | key         | compute column | watermark                |
+------+---------------------------------+-------+-------------+----------------+--------------------------+
| f0   | BIGINT                          | false | PRI(f0, f5) | (NULL)         | (NULL)                   |
| f1   | ROW<q1 STRING, q2 TIMESTAMP(3)> | true  | (NULL)      | (NULL)         | (NULL)                   |
| f2   | VARCHAR(256)                    | true  | UNQ(f2, f3) | (NULL)         | (NULL)                   |
| f3   | BIGINT                          | false | UNQ(f2, f3) | f0 + 1         | (NULL)                   |
| f4   | TIMESTAMP(3)                    | false | (NULL)      | (NULL)         | f4 - INTERVAL '3' SECOND |
| f5   | BIGINT                          | false | PRI(f0, f5) | (NULL)         | (NULL)                   |
+------+---------------------------------+-------+-------------+----------------+--------------------------+

Regarding "the watermark on nested columns": that is a good approach, which can both support watermarks on nested columns in the future and keep the current table form.
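To make the derivation of the `key` column concrete, here is a small Python sketch (illustrative only, not Flink code; all names are made up for this example) that computes the marker for each field from the constraint definitions, following the `PRI(...)`/`UNQ(...)` convention above:

```python
# Illustrative sketch (not Flink API): derive the DESCRIBE 'key' markers.
def key_markers(columns, primary_key, unique_keys):
    """Compute the 'key' cell for each column.

    Every column of the primary key shares the marker 'PRI(col, ...)';
    every column of a unique key shares 'UNQ(col, ...)'; all other
    columns get '(NULL)'.
    """
    markers = {}
    for col in columns:
        if col in primary_key:
            markers[col] = "PRI(%s)" % ", ".join(primary_key)
        else:
            uk = next((k for k in unique_keys if col in k), None)
            markers[col] = ("UNQ(%s)" % ", ".join(uk)) if uk else "(NULL)"
    return markers

# The example table above: PRIMARY KEY (f0, f5), UNIQUE over (f2, f3)
m = key_markers(["f0", "f1", "f2", "f3", "f4", "f5"],
                ["f0", "f5"], [["f2", "f3"]])
# m["f0"] == m["f5"] == "PRI(f0, f5)"; m["f2"] == m["f3"] == "UNQ(f2, f3)"
```

Note that the marker string is shared by every column that participates in the constraint, which is what lets a reader reassemble the composite key from the per-column rows.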
[1] https://www.w3schools.com/sql/sql_primarykey.asp
[2] https://www.w3schools.com/sql/sql_unique.ASP

Best,
Godfrey

Fabian Hueske <fhue...@gmail.com> wrote on Thu, May 7, 2020 at 12:03 AM:

> Hi Godfrey,
>
> This looks good to me.
>
> One side note: indicating unique constraints with "UNQ" is probably not
> enough. There might be multiple unique constraints, and users would like
> to know which field combinations are unique.
> So in your example above, "UNQ(f2, f3)" might be a better marker.
>
> Just as a thought, if we later add support for watermarks on nested
> columns, we could add a row just for the nested field (in addition to the
> top-level field) like this:
>
> +-------------------+--------------+-------+--------+----------------+-----------------------------------------+
> | f4.nested.rowtime | TIMESTAMP(3) | false | (NULL) | (NULL)         | f4.nested.rowtime - INTERVAL '3' SECOND |
> +-------------------+--------------+-------+--------+----------------+-----------------------------------------+
>
> Thanks,
> Fabian
>
> godfrey he <godfre...@gmail.com> wrote on Wed, May 6, 2020 at 5:51 PM:
>
>> Hi @fhue...@gmail.com, @Timo Walther <twal...@apache.org>, @Dawid
>> Wysakowicz <dwysakow...@apache.org>
>>
>> What do you think about requiring that a watermark must be defined on a
>> top-level column?
>>
>> If we do that, we can add an expression column to represent the watermark,
>> just like the compute column. An example covering all cases:
>>
>> create table MyTable (
>>   f0 BIGINT NOT NULL,
>>   f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>>   f2 VARCHAR(256),
>>   f3 AS f0 + 1,
>>   f4 TIMESTAMP(3) NOT NULL,
>>   PRIMARY KEY (f0),
>>   UNIQUE (f3, f2),
>>   WATERMARK f4 AS f4 - INTERVAL '3' SECOND
>> ) with (...)
>>
>> +------+---------------------------------+-------+--------+----------------+--------------------------+
>> | name | type                            | null  | key    | compute column | watermark                |
>> +------+---------------------------------+-------+--------+----------------+--------------------------+
>> | f0   | BIGINT                          | false | PRI    | (NULL)         | (NULL)                   |
>> | f1   | ROW<q1 STRING, q2 TIMESTAMP(3)> | true  | (NULL) | (NULL)         | (NULL)                   |
>> | f2   | VARCHAR(256)                    | true  | UNQ    | (NULL)         | (NULL)                   |
>> | f3   | BIGINT                          | false | UNQ    | f0 + 1         | (NULL)                   |
>> | f4   | TIMESTAMP(3)                    | false | (NULL) | (NULL)         | f4 - INTERVAL '3' SECOND |
>> +------+---------------------------------+-------+--------+----------------+--------------------------+
>>
>> WDYT?
>>
>> Best,
>> Godfrey
>>
>> godfrey he <godfre...@gmail.com> wrote on Thu, Apr 30, 2020 at 11:57 PM:
>>
>>> Hi Fabian,
>>>
>>> the broken example is:
>>>
>>> create table MyTable (
>>>   f0 BIGINT NOT NULL,
>>>   f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>>>   f2 VARCHAR(256),
>>>   f3 AS f0 + 1,
>>>   PRIMARY KEY (f0),
>>>   UNIQUE (f3, f2),
>>>   WATERMARK f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND)
>>> ) with (...)
>>>
>>> +------+-------------------------------------+--------+----------------+------------------------------------------+
>>> | name | type                                | key    | compute column | watermark                                |
>>> +------+-------------------------------------+--------+----------------+------------------------------------------+
>>> | f0   | BIGINT NOT NULL                     | PRI    | (NULL)         |                                          |
>>> | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>> | f2   | VARCHAR(256)                        | (NULL) | NULL           |                                          |
>>> | f3   | BIGINT NOT NULL                     | UNQ    | f0 + 1         |                                          |
>>> +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>
>>> or we add a column to represent nullability:
>>>
>>> +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>> | name | type                                | null  | key    | compute column | watermark                                |
>>> +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>> | f0   | BIGINT                              | false | PRI    | (NULL)         |                                          |
>>> | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | true  | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>> | f2   | VARCHAR(256)                        | true  | (NULL) | NULL           |                                          |
>>> | f3   | BIGINT                              | false | UNQ    | f0 + 1         |                                          |
>>> +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>
>>> Hi Jark,
>>> If we can limit watermarks to be defined only on top-level columns,
>>> this becomes much simpler.
>>>
>>> Best,
>>> Godfrey
>>>
>>> Jark Wu <imj...@gmail.com> wrote on Thu, Apr 30, 2020 at 11:38 PM:
>>>
>>>> Hi,
>>>>
>>>> I'm in favor of Fabian's proposal.
>>>> First, a watermark is not a column but metadata, just like a primary key,
>>>> so it shouldn't stand with the columns.
>>>> Second, AFAIK, a primary key can only be defined on top-level columns.
>>>> Third, I think watermarks can follow primary keys and only be allowed
>>>> on top-level columns.
>>>>
>>>> I have to admit that in FLIP-66, a watermark can be defined on nested
>>>> fields. However, during implementation, I found that it's too complicated
>>>> to do that. We have to refactor the time-based physical nodes, use code
>>>> generation to access the event time, and refactor FlinkTypeFactory to
>>>> support complex nested rowtime attributes.
>>>> There is not much value in this feature, but it introduces a lot of
>>>> complexity into the code base.
>>>> So I think we can force watermarks to be defined on top-level columns.
>>>> If users want to define one on a nested column, they can use a computed
>>>> column to turn it into a top-level column.
>>>>
>>>> Best,
>>>> Jark
>>>>
>>>> On Thu, 30 Apr 2020 at 17:55, Fabian Hueske <fhue...@gmail.com> wrote:
>>>>
>>>> > Hi Godfrey,
>>>> >
>>>> > The formatting of your examples seems to be broken.
>>>> > Could you send them again please?
>>>> >
>>>> > Regarding your points:
>>>> > > because the watermark expression can refer to a sub-column, just like
>>>> > > `f1.q2` in the example I gave.
>>>> >
>>>> > I would put the watermark information in the row of the top-level field
>>>> > and indicate which nested field the watermark refers to.
>>>> > Don't we have to solve the same issue for primary keys that are defined
>>>> > on a nested field?
>>>> >
>>>> > > A boolean flag can't represent such info, and I don't know whether we
>>>> > > will support complex watermark expressions involving multiple columns
>>>> > > in the future, such as: "WATERMARK FOR ts AS ts + f1 + INTERVAL '1' SECOND"
>>>> >
>>>> > You are right, a simple binary flag is definitely not sufficient to
>>>> > display the watermark information.
>>>> > I would put the expression string into the field, i.e., "ts + f1 +
>>>> > INTERVAL '1' SECOND".
>>>> >
>>>> > For me the most important reason not to show the watermark as a row in
>>>> > the table is that it is not a field that can be queried, but meta
>>>> > information on an existing field.
>>>> > For the user it is important to know that a certain field has a
>>>> > watermark. Otherwise, certain queries cannot be correctly specified.
>>>> > Also, there might be support for multiple watermarks defined on
>>>> > different fields at some point. Would those be printed in multiple rows?
>>>> >
>>>> > Best,
>>>> > Fabian
>>>> >
>>>> > On Thu, Apr 30, 2020 at 11:25 AM, godfrey he <godfre...@gmail.com> wrote:
>>>> >
>>>> > > Hi Fabian, Aljoscha,
>>>> > >
>>>> > > Thanks for the feedback.
>>>> > >
>>>> > > Agree with you that we can deal with the primary key as you mentioned.
>>>> > > Right now, the type column contains the nullability attribute, e.g.
>>>> > > BIGINT NOT NULL.
>>>> > > (I'm also OK with using two columns to represent the type, just like
>>>> > > MySQL.)
>>>> > >
>>>> > > > Why do I treat `watermark` as a special row?
>>>> > > Because the watermark expression can refer to a sub-column, just like
>>>> > > `f1.q2` in the example I gave.
>>>> > > A boolean flag can't represent such info, and I don't know whether we
>>>> > > will support complex watermark expressions involving multiple columns
>>>> > > in the future, such as: "WATERMARK FOR ts AS ts + f1 + INTERVAL '1' SECOND".
>>>> > >
>>>> > > If we do not support complex watermark expressions, we can add a
>>>> > > watermark column.
>>>> > >
>>>> > > For example:
>>>> > >
>>>> > > create table MyTable (
>>>> > >   f0 BIGINT NOT NULL,
>>>> > >   f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>>>> > >   f2 VARCHAR(256),
>>>> > >   f3 AS f0 + 1,
>>>> > >   PRIMARY KEY (f0),
>>>> > >   UNIQUE (f3, f2),
>>>> > >   WATERMARK f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND)
>>>> > > ) with (...)
>>>> > >
>>>> > > +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>> > > | name | type                                | key    | compute column | watermark                                |
>>>> > > +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>> > > | f0   | BIGINT NOT NULL                     | PRI    | (NULL)         |                                          |
>>>> > > | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>>> > > | f2   | VARCHAR(256)                        | (NULL) | NULL           |                                          |
>>>> > > | f3   | BIGINT NOT NULL                     | UNQ    | f0 + 1         |                                          |
>>>> > > +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>> > >
>>>> > > or we add a column to represent nullability:
>>>> > >
>>>> > > +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>> > > | name | type                                | null  | key    | compute column | watermark                                |
>>>> > > +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>> > > | f0   | BIGINT                              | false | PRI    | (NULL)         |                                          |
>>>> > > | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | true  | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>>> > > | f2   | VARCHAR(256)                        | true  | (NULL) | NULL           |                                          |
>>>> > > | f3   | BIGINT                              | false | UNQ    | f0 + 1         |                                          |
>>>> > > +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>> > >
>>>> > > Personally, I like the second one. (We need to make some changes to
>>>> > > LogicalType to get the type name without nullability.)
>>>> > >
>>>> > > Best,
>>>> > > Godfrey
>>>> > >
>>>> > > Aljoscha Krettek <aljos...@apache.org> wrote on Wed, Apr 29, 2020 at 5:47 PM:
>>>> > >
>>>> > > > +1 I like the general idea of printing the results as a table.
>>>> > > >
>>>> > > > On the specifics I don't know enough, but Fabian's suggestions seem
>>>> > > > to make sense to me.
>>>> > > >
>>>> > > > Aljoscha
>>>> > > >
>>>> > > > On 29.04.20 10:56, Fabian Hueske wrote:
>>>> > > > > Hi Godfrey,
>>>> > > > >
>>>> > > > > Thanks for starting this discussion!
>>>> > > > >
>>>> > > > > In my mind, WATERMARK is a property (or constraint) of a field,
>>>> > > > > just like PRIMARY KEY.
>>>> > > > > Take this example from MySQL:
>>>> > > > >
>>>> > > > > mysql> CREATE TABLE people (id INT NOT NULL, name VARCHAR(128)
>>>> > > > > NOT NULL, age INT, PRIMARY KEY (id));
>>>> > > > > Query OK, 0 rows affected (0.06 sec)
>>>> > > > >
>>>> > > > > mysql> describe people;
>>>> > > > > +-------+--------------+------+-----+---------+-------+
>>>> > > > > | Field | Type         | Null | Key | Default | Extra |
>>>> > > > > +-------+--------------+------+-----+---------+-------+
>>>> > > > > | id    | int          | NO   | PRI | NULL    |       |
>>>> > > > > | name  | varchar(128) | NO   |     | NULL    |       |
>>>> > > > > | age   | int          | YES  |     | NULL    |       |
>>>> > > > > +-------+--------------+------+-----+---------+-------+
>>>> > > > > 3 rows in set (0.01 sec)
>>>> > > > >
>>>> > > > > Here, PRIMARY KEY is marked in the Key column of the id field.
>>>> > > > > We could do the same for watermarks by adding a Watermark column.
>>>> > > > >
>>>> > > > > Best, Fabian
>>>> > > > >
>>>> > > > > godfrey he <godfre...@gmail.com> wrote on Wed, Apr 29, 2020 at 10:43 AM:
>>>> > > > >
>>>> > > > >> Hi everyone,
>>>> > > > >>
>>>> > > > >> I would like to bring up a discussion about the result type of
>>>> > > > >> the DESCRIBE statement, which is introduced in FLIP-84 [1].
>>>> > > > >> In the previous version, we defined the result type of the
>>>> > > > >> `describe` statement as a single column, as follows:
>>>> > > > >>
>>>> > > > >> +-------------+---------------------------------+----------------------------------+----------------------+---------------------+
>>>> > > > >> | Statement   | Result Schema                   | Result Value                     | Result Kind          | Examples            |
>>>> > > > >> +-------------+---------------------------------+----------------------------------+----------------------+---------------------+
>>>> > > > >> | DESCRIBE xx | field name: result              | describe the detail of an object | SUCCESS_WITH_CONTENT | DESCRIBE table_name |
>>>> > > > >> |             | field type: VARCHAR(n)          | (single row)                     |                      |                     |
>>>> > > > >> |             | (n is the max length of values) |                                  |                      |                     |
>>>> > > > >> +-------------+---------------------------------+----------------------------------+----------------------+---------------------+
>>>> > > > >>
>>>> > > > >> For "DESCRIBE table_name", the result value is the `toString`
>>>> > > > >> value of `TableSchema`, which is unstructured data.
>>>> > > > >> It's hard for users to use this info.
>>>> > > > >>
>>>> > > > >> For example:
>>>> > > > >>
>>>> > > > >> TableSchema schema = TableSchema.builder()
>>>> > > > >>     .field("f0", DataTypes.BIGINT())
>>>> > > > >>     .field("f1", DataTypes.ROW(
>>>> > > > >>         DataTypes.FIELD("q1", DataTypes.STRING()),
>>>> > > > >>         DataTypes.FIELD("q2", DataTypes.TIMESTAMP(3))))
>>>> > > > >>     .field("f2", DataTypes.STRING())
>>>> > > > >>     .field("f3", DataTypes.BIGINT(), "f0 + 1")
>>>> > > > >>     .watermark("f1.q2", WATERMARK_EXPRESSION, WATERMARK_DATATYPE)
>>>> > > > >>     .build();
>>>> > > > >>
>>>> > > > >> Its `toString` value is:
>>>> > > > >>
>>>> > > > >> root
>>>> > > > >>  |-- f0: BIGINT
>>>> > > > >>  |-- f1: ROW<`q1` STRING, `q2` TIMESTAMP(3)>
>>>> > > > >>  |-- f2: STRING
>>>> > > > >>  |-- f3: BIGINT AS f0 + 1
>>>> > > > >>  |-- WATERMARK FOR f1.q2 AS now()
>>>> > > > >>
>>>> > > > >> For Hive, MySQL, etc., the describe result is in table form,
>>>> > > > >> including field names and field types, which is more familiar
>>>> > > > >> to users.
>>>> > > > >> TableSchema [2] also has the watermark expression and computed
>>>> > > > >> columns; we should put them into the table as well:
>>>> > > > >> for a computed column, which is column-level, we add a new
>>>> > > > >> column named `expr`;
>>>> > > > >> for the watermark expression, which is table-level, we add a
>>>> > > > >> special row named `WATERMARK` to represent it.
>>>> > > > >>
>>>> > > > >> For the example above, the result will look like:
>>>> > > > >>
>>>> > > > >> +-----------+-------------------------------------+----------------+
>>>> > > > >> | name      | type                                | expr           |
>>>> > > > >> +-----------+-------------------------------------+----------------+
>>>> > > > >> | f0        | BIGINT                              | (NULL)         |
>>>> > > > >> | f1        | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | (NULL)         |
>>>> > > > >> | f2        | STRING                              | NULL           |
>>>> > > > >> | f3        | BIGINT                              | f0 + 1         |
>>>> > > > >> | WATERMARK | (NULL)                              | f1.q2 AS now() |
>>>> > > > >> +-----------+-------------------------------------+----------------+
>>>> > > > >>
>>>> > > > >> There is now a PR, FLINK-17112 [3], to implement the DESCRIBE
>>>> > > > >> statement.
>>>> > > > >>
>>>> > > > >> What do you think about this update?
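The layout proposed above (a computed-column `expr` column plus a special table-level `WATERMARK` row) can be sketched in a few lines of Python. This is purely illustrative, not the Flink implementation; the function name and tuple shapes are assumptions made for this example:

```python
# Illustrative sketch (not the Flink implementation): render the proposed
# DESCRIBE layout, with computed columns in an 'expr' column and the
# watermark as a special table-level row.
def describe_rows(fields, watermark=None):
    """fields: list of (name, type, computed_expr_or_None);
    watermark: optional (rowtime_attr, expression) pair appended as a
    special 'WATERMARK' row at the end."""
    rows = [(name, typ, expr if expr is not None else "(NULL)")
            for name, typ, expr in fields]
    if watermark is not None:
        attr, expr = watermark
        rows.append(("WATERMARK", "(NULL)", "%s AS %s" % (attr, expr)))
    return rows

# The schema from the example above
rows = describe_rows(
    [("f0", "BIGINT", None),
     ("f1", "ROW<`q1` STRING, `q2` TIMESTAMP(3)>", None),
     ("f2", "STRING", None),
     ("f3", "BIGINT", "f0 + 1")],
    watermark=("f1.q2", "now()"))
# rows[-1] == ("WATERMARK", "(NULL)", "f1.q2 AS now()")
```

The point of the sketch is that the watermark is appended as one extra row rather than attached to a field, which is exactly the property debated later in the thread.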
>>>> > > > >> Any feedback is welcome~
>>>> > > > >>
>>>> > > > >> Best,
>>>> > > > >> Godfrey
>>>> > > > >>
>>>> > > > >> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=134745878
>>>> > > > >> [2] https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/api/TableSchema.java
>>>> > > > >> [3] https://github.com/apache/flink/pull/11892
>>>> > > > >>
>>>> > > > >> godfrey he <godfre...@gmail.com> wrote on Mon, Apr 6, 2020 at 10:38 PM:
>>>> > > > >>
>>>> > > > >>> Hi Timo,
>>>> > > > >>>
>>>> > > > >>> Sorry for the late reply, and thanks for your correction.
>>>> > > > >>> I missed DQL for the job submission scenario.
>>>> > > > >>> I'll fix the document right away.
>>>> > > > >>>
>>>> > > > >>> Best,
>>>> > > > >>> Godfrey
>>>> > > > >>>
>>>> > > > >>> Timo Walther <twal...@apache.org> wrote on Fri, Apr 3, 2020 at 9:53 PM:
>>>> > > > >>>
>>>> > > > >>>> Hi Godfrey,
>>>> > > > >>>>
>>>> > > > >>>> I'm sorry to jump in again, but I still need to clarify some
>>>> > > > >>>> things around TableResult.
>>>> > > > >>>>
>>>> > > > >>>> The FLIP says:
>>>> > > > >>>> "For DML, this method returns TableResult until the job is
>>>> > > > >>>> submitted. For other statements, TableResult is returned until
>>>> > > > >>>> the execution is finished."
>>>> > > > >>>>
>>>> > > > >>>> I thought we agreed on making every execution async? This also
>>>> > > > >>>> means returning a TableResult for DQLs even though the execution
>>>> > > > >>>> is not done yet. People need access to the JobClient also for
>>>> > > > >>>> batch jobs in order to cancel long-lasting queries. If people
>>>> > > > >>>> want to wait for completion, they can hook into JobClient or
>>>> > > > >>>> collect().
>>>> > > > >>>>
>>>> > > > >>>> Can we rephrase this part to:
>>>> > > > >>>> "For DML and DQL, this method returns TableResult once the job
>>>> > > > >>>> has been submitted. For DDL and DCL statements, TableResult is
>>>> > > > >>>> returned once the operation has finished."
>>>> > > > >>>>
>>>> > > > >>>> Regards,
>>>> > > > >>>> Timo
>>>> > > > >>>>
>>>> > > > >>>> On 02.04.20 05:27, godfrey he wrote:
>>>> > > > >>>>> Hi Aljoscha, Dawid, Timo,
>>>> > > > >>>>>
>>>> > > > >>>>> Thanks so much for the detailed explanation.
>>>> > > > >>>>> Agree with you that the multiline story is not complete yet,
>>>> > > > >>>>> and we can keep discussing it.
>>>> > > > >>>>> I will add the current discussions and conclusions to the FLIP.
>>>> > > > >>>>>
>>>> > > > >>>>> Best,
>>>> > > > >>>>> Godfrey
>>>> > > > >>>>>
>>>> > > > >>>>> Timo Walther <twal...@apache.org> wrote on Wed, Apr 1, 2020 at 11:27 PM:
>>>> > > > >>>>>
>>>> > > > >>>>>> Hi Godfrey,
>>>> > > > >>>>>>
>>>> > > > >>>>>> First of all, I agree with Dawid. The multiline story is not
>>>> > > > >>>>>> completed by this FLIP. It just verifies the big picture.
>>>> > > > >>>>>>
>>>> > > > >>>>>> 1. "control the execution logic through the proposed method
>>>> > > > >>>>>> if they know what the statements are"
>>>> > > > >>>>>>
>>>> > > > >>>>>> This is a good point that Fabian also raised in the linked
>>>> > > > >>>>>> Google doc. I could also imagine returning a more complicated
>>>> > > > >>>>>> POJO when calling `executeMultiSql()`.
>>>> > > > >>>>>>
>>>> > > > >>>>>> The POJO would include some `getSqlProperties()` such that a
>>>> > > > >>>>>> platform gets insights into the query before executing.
>>>> > > > >>>>>> We could also trigger the execution more explicitly instead
>>>> > > > >>>>>> of hiding it behind an iterator.
>>>> > > > >>>>>>
>>>> > > > >>>>>> 2. "there are some special commands introduced in the SQL Client"
>>>> > > > >>>>>>
>>>> > > > >>>>>> For platforms and SQL Client specific commands, we could
>>>> > > > >>>>>> offer a hook to the parser, or a fallback parser in case the
>>>> > > > >>>>>> regular table environment parser cannot deal with the statement.
>>>> > > > >>>>>>
>>>> > > > >>>>>> However, all of that is future work and can be discussed in
>>>> > > > >>>>>> a separate FLIP.
>>>> > > > >>>>>>
>>>> > > > >>>>>> 3. +1 for the `Iterator` instead of `Iterable`.
>>>> > > > >>>>>>
>>>> > > > >>>>>> 4. "we should convert the checked exception to unchecked exception"
>>>> > > > >>>>>>
>>>> > > > >>>>>> Yes, I meant using a runtime exception instead of a checked
>>>> > > > >>>>>> exception. There was no consensus on putting the exception
>>>> > > > >>>>>> into the `TableResult`.
>>>> > > > >>>>>>
>>>> > > > >>>>>> Regards,
>>>> > > > >>>>>> Timo
>>>> > > > >>>>>>
>>>> > > > >>>>>> On 01.04.20 15:35, Dawid Wysakowicz wrote:
>>>> > > > >>>>>>> When considering the multi-line support, I think it is
>>>> > > > >>>>>>> helpful to start with a use case in mind. In my opinion,
>>>> > > > >>>>>>> consumers of this method will be:
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> 1. the SQL Client
>>>> > > > >>>>>>> 2. third-party SQL-based platforms
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> @Godfrey As for the quit/source/... commands: I think those
>>>> > > > >>>>>>> belong to the responsibility of the aforementioned. I think
>>>> > > > >>>>>>> they should not be understandable by the TableEnvironment.
>>>> > > > >>>>>>> What would quit on a TableEnvironment do?
>>>> > > > >>>>>>> Moreover, I think such commands should be prefixed
>>>> > > > >>>>>>> appropriately. I think it's a common practice to e.g. prefix
>>>> > > > >>>>>>> those with ! or : to say they are meta commands of the tool
>>>> > > > >>>>>>> rather than a query.
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> I also don't necessarily understand why platform users need
>>>> > > > >>>>>>> to know the kind of the query to use the proposed method.
>>>> > > > >>>>>>> They should get the type from the TableResult#ResultKind.
>>>> > > > >>>>>>> If the ResultKind is SUCCESS, it was a DCL/DDL. If
>>>> > > > >>>>>>> SUCCESS_WITH_CONTENT, it was a DML/DQL. If that's not
>>>> > > > >>>>>>> enough, we can enrich the TableResult with a more explicit
>>>> > > > >>>>>>> kind of query, but so far I don't see such a need.
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> @Kurt In those cases I would assume the developers want to
>>>> > > > >>>>>>> present the results of the queries anyway. Moreover, I think
>>>> > > > >>>>>>> it is safe to assume they can adhere to such a contract that
>>>> > > > >>>>>>> the results must be iterated.
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> For direct users of TableEnvironment/Table API this method
>>>> > > > >>>>>>> does not make much sense anyway, in my opinion. I think we
>>>> > > > >>>>>>> can rather safely assume in this scenario that they do not
>>>> > > > >>>>>>> want to submit multiple queries at a single time.
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> Best,
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> Dawid
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> On 01/04/2020 15:07, Kurt Young wrote:
>>>> > > > >>>>>>>> One comment on `executeMultilineSql`: I'm afraid users
>>>> > > > >>>>>>>> might sometimes forget to iterate the returned iterators,
>>>> > > > >>>>>>>> e.g. a user submits a bunch of DDLs and expects the
>>>> > > > >>>>>>>> framework to execute them one by one. But it didn't.
>>>> > > > >>>>>>>>
>>>> > > > >>>>>>>> Best,
>>>> > > > >>>>>>>> Kurt
>>>> > > > >>>>>>>>
>>>> > > > >>>>>>>> On Wed, Apr 1, 2020 at 5:10 PM Aljoscha Krettek <aljos...@apache.org> wrote:
>>>> > > > >>>>>>>>
>>>> > > > >>>>>>>>> Agreed with what Dawid and Timo said.
>>>> > > > >>>>>>>>>
>>>> > > > >>>>>>>>> To answer your question about multi-line SQL: no, we don't
>>>> > > > >>>>>>>>> think we need this in Flink 1.11; we only wanted to make
>>>> > > > >>>>>>>>> sure that the interfaces that we now put in place will
>>>> > > > >>>>>>>>> potentially allow this in the future.
>>>> > > > >>>>>>>>>
>>>> > > > >>>>>>>>> Best,
>>>> > > > >>>>>>>>> Aljoscha
>>>> > > > >>>>>>>>>
>>>> > > > >>>>>>>>> On 01.04.20 09:31, godfrey he wrote:
>>>> > > > >>>>>>>>>> Hi Timo & Dawid,
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>> Thanks so much for the effort on `multiline statements support`.
>>>> > > > >>>>>>>>>> I have a few questions about this method:
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>> 1. Users can well control the execution logic through the
>>>> > > > >>>>>>>>>> proposed method if they know what the statements are
>>>> > > > >>>>>>>>>> (whether a statement is a DDL, a DML or something else).
>>>> > > > >>>>>>>>>> But if a statement comes from a file, users do not know
>>>> > > > >>>>>>>>>> what the statements are, so the execution behavior is unclear.
>>>> > > > >>>>>>>>>> As a platform user, I think this method is hard to use,
>>>> > > > >>>>>>>>>> unless the platform defines a set of rules about the
>>>> > > > >>>>>>>>>> statement order, such as: no SELECT in the middle, DML
>>>> > > > >>>>>>>>>> must be at the tail of the SQL file (which may be the most
>>>> > > > >>>>>>>>>> common case in a production environment).
>>>> > > > >>>>>>>>>> Otherwise the platform must parse the SQL first, and only
>>>> > > > >>>>>>>>>> then does it know what the statements are.
>>>> > > > >>>>>>>>>> If it does that, the platform can handle all cases through
>>>> > > > >>>>>>>>>> `executeSql` and `StatementSet`.
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>> 2. The SQL Client also can't use `executeMultilineSql` to
>>>> > > > >>>>>>>>>> support multiline statements, because there are some
>>>> > > > >>>>>>>>>> special commands introduced in the SQL Client, such as
>>>> > > > >>>>>>>>>> `quit`, `source`, `load jar` (the last does not exist now,
>>>> > > > >>>>>>>>>> but maybe we need this command to support dynamic table
>>>> > > > >>>>>>>>>> sources and UDFs).
>>>> > > > >>>>>>>>>> Does TableEnvironment also support those commands?
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>> 3. BTW, must we have this feature in release 1.11? I find
>>>> > > > >>>>>>>>>> there are a few user cases in the feedback document whose
>>>> > > > >>>>>>>>>> behavior is unclear now.
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>> Regarding "change the return value from `Iterable<Row>`
>>>> > > > >>>>>>>>>> to `Iterator<Row>`": I couldn't agree more with this
>>>> > > > >>>>>>>>>> change. Just as Dawid mentioned, "The contract of the
>>>> > > > >>>>>>>>>> Iterable#iterator is that it returns a new iterator each
>>>> > > > >>>>>>>>>> time, which effectively means we can iterate the results
>>>> > > > >>>>>>>>>> multiple times." We do not provide iterating the results
>>>> > > > >>>>>>>>>> multiple times.
>>>> > > > >>>>>>>>>> If we wanted to do that, the client would have to buffer
>>>> > > > >>>>>>>>>> all results, which is impossible for a streaming job.
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>> Best,
>>>> > > > >>>>>>>>>> Godfrey
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>> Dawid Wysakowicz <dwysakow...@apache.org> wrote on Wed, Apr 1, 2020 at 3:14 AM:
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>>> Thank you Timo for the great summary! It covers (almost)
>>>> > > > >>>>>>>>>>> all the topics. Even though in the end we are not
>>>> > > > >>>>>>>>>>> suggesting many changes to the current state of the FLIP,
>>>> > > > >>>>>>>>>>> I think it is important to lay out all possible use cases
>>>> > > > >>>>>>>>>>> so that we do not change the execution model every release.
>>>> > > > >>>>>>>>>>>
>>>> > > > >>>>>>>>>>> There is one additional thing we discussed. Could we
>>>> > > > >>>>>>>>>>> change the result type of TableResult#collect to
>>>> > > > >>>>>>>>>>> Iterator<Row>? Even though those interfaces do not differ
>>>> > > > >>>>>>>>>>> much, I think Iterator better describes that the results
>>>> > > > >>>>>>>>>>> might not be materialized on the client side, but can be
>>>> > > > >>>>>>>>>>> retrieved on a per-record basis. The contract of the
>>>> > > > >>>>>>>>>>> Iterable#iterator is that it returns a new iterator each
>>>> > > > >>>>>>>>>>> time, which effectively means we can iterate the results
>>>> > > > >>>>>>>>>>> multiple times. Iterating the results multiple times is
>>>> > > > >>>>>>>>>>> not possible when we don't retrieve all the results from
>>>> > > > >>>>>>>>>>> the cluster at once.
>>>> > > > >>>>>>>>>>> I think we should also use Iterator for
>>>> > > > >>>>>>>>>>> TableEnvironment#executeMultilineSql(String statements):
>>>> > > > >>>>>>>>>>> Iterator<TableResult>.
>>>> > > > >>>>>>>>>>>
>>>> > > > >>>>>>>>>>> Best,
>>>> > > > >>>>>>>>>>>
>>>> > > > >>>>>>>>>>> Dawid
>>>> > > > >>>>>>>>>>>
>>>> > > > >>>>>>>>>>> On 31/03/2020 19:27, Timo Walther wrote:
>>>> > > > >>>>>>>>>>>> Hi Godfrey,
>>>> > > > >>>>>>>>>>>>
>>>> > > > >>>>>>>>>>>> Aljoscha, Dawid, Klou, and I had another discussion
>>>> > > > >>>>>>>>>>>> around FLIP-84. In particular, we discussed how the
>>>> > > > >>>>>>>>>>>> current status of the FLIP and the future requirements
>>>> > > > >>>>>>>>>>>> around multiline statements, async/sync, and collect()
>>>> > > > >>>>>>>>>>>> fit together.
>>>> > > > >>>>>>>>>>>>
>>>> > > > >>>>>>>>>>>> We also updated the FLIP-84 Feedback Summary document [1]
>>>> > > > >>>>>>>>>>>> with some use cases.
>>>> > > > >>>>>>>>>>>>
>>>> > > > >>>>>>>>>>>> We believe that we found a good solution that also fits to what is in
So no bigger changes are necessary, which is great!

Our findings were:

1. Async vs. sync submission of Flink jobs:

Having a blocking `execute()` in the DataStream API was rather a mistake. Instead, all submissions should be async because this allows supporting both modes if necessary. Thus, submitting all queries async sounds good to us. If users want to run a job sync, they can use the JobClient and wait for completion (or collect() in the case of batch jobs).

2. Multi-statement execution:

For multi-statement execution, we don't see a contradiction with the async execution behavior. We imagine a method like:

TableEnvironment#executeMultilineSql(String statements): Iterable<TableResult>

where the `Iterator#next()` method would trigger the next statement submission. This allows a caller to decide synchronously when to submit statements async to the cluster. Thus, a service such as the SQL Client can handle the result of each statement individually and process statement by statement sequentially.

3.
The role of TableResult and result retrieval in general:

`TableResult` is similar to `JobClient`. Instead of returning a `CompletableFuture` of something, it is a concrete util class where some methods have the behavior of a completable future (e.g. collect(), print()) and some are already completed (getTableSchema(), getResultKind()).

`StatementSet#execute()` returns a single `TableResult` because the order is undefined in a set and all statements have the same schema. Its `collect()` will return a row for each executed `INSERT INTO`, in the order of statement definition.

For a simple `SELECT * FROM ...`, the query execution might block until `collect()` is called to pull buffered rows from the job (from a socket/REST API, whatever we will use in the future). We can say that a statement finished successfully when `collect#Iterator#hasNext` has returned false.

I hope this summarizes our discussion, @Dawid/Aljoscha/Klou?

It would be great if we can add these findings to the FLIP before we start voting.
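The `TableResult` shape described above (metadata accessors already completed, `collect()` future-like and single-pass) can be sketched roughly as follows. The interface, class, and values here are illustrative stand-ins for the FLIP discussion, not Flink's real API, and `Row` is simplified to `String`:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

/**
 * Hypothetical, simplified shape of the TableResult discussed above.
 * Metadata is available immediately; collect() hands out results lazily
 * and only once. Not the actual Flink interface.
 */
interface SketchedTableResult {
    List<String> getTableSchema();      // already completed on return
    String getResultKind();             // already completed on return
    Optional<String> getJobClient();    // present for async DQL/DML only
    Iterator<String> collect();         // future-like: may block per record
}

/** A StatementSet-style result: one row per executed INSERT INTO. */
final class StatementSetResult implements SketchedTableResult {
    private final List<String> affectedRowCounts;

    StatementSetResult(List<String> affectedRowCounts) {
        this.affectedRowCounts = affectedRowCounts;
    }

    @Override public List<String> getTableSchema() { return List.of("affected_rowcount"); }
    @Override public String getResultKind() { return "SUCCESS_WITH_CONTENT"; }
    @Override public Optional<String> getJobClient() { return Optional.of("job-handle"); }
    @Override public Iterator<String> collect() { return affectedRowCounts.iterator(); }
}
```

The one-result-per-`INSERT INTO` design keeps `StatementSet#execute()` returning a single object even though several jobs may run behind it.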
One minor thing: some `execute()` methods still throw a checked exception; can we remove that from the FLIP? Also, the above-mentioned `Iterator#next()` would trigger an execution without throwing a checked exception.

Thanks,
Timo

[1] https://docs.google.com/document/d/1ueLjQWRPdLTFB_TReAyhseAX-1N3j4WYWD0F02Uau0E/edit#

On 31.03.20 06:28, godfrey he wrote:

Hi Timo & Jark,

Thanks for your explanation. I agree with you that async execution should always be async, and the sync execution scenario can be covered by async execution. It helps provide a unified entry point for batch and streaming. I think we can also use sync execution for some testing.

So, I agree with you that we provide an `executeSql` method and that it is an async method. If we want a sync method in the future, we can add a method named `executeSqlSync`.

I think we've reached an agreement. I will update the document and start the voting process.
Best,
Godfrey

Jark Wu <imj...@gmail.com> wrote on Tuesday, March 31, 2020, at 12:46 AM:

Hi,

I didn't follow the full discussion, but I share the same concern with Timo that streaming queries should always be async. Otherwise, I can imagine it will cause a lot of confusion and problems if users don't deeply keep the "sync" in mind (e.g. the client hangs). Besides, the streaming mode is still the majority of use cases of Flink and Flink SQL. We should put usability at a high priority.

Best,
Jark

On Mon, 30 Mar 2020 at 23:27, Timo Walther <twal...@apache.org> wrote:

Hi Godfrey,

maybe I wasn't expressing my biggest concern clearly enough in my last mail. Even in a single-line and sync execution, I think that streaming queries should not block the execution. Otherwise it is not possible to call collect() or print() on them afterwards.
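Timo's point, that submission returns immediately and blocking is the caller's explicit choice, can be illustrated with a plain `CompletableFuture` standing in for a job handle. The class and method names here are illustrative assumptions, not Flink's JobClient API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Sketch: execution always returns immediately with a handle. A caller
 * wanting sync semantics simply waits on the handle; a caller streaming
 * results keeps its thread free to call collect()/print() instead.
 */
public class AsyncFirstExecution {

    /** Pretend submission: runs the "job" on a background executor. */
    public static CompletableFuture<String> executeSql(ExecutorService cluster, String statement) {
        return CompletableFuture.supplyAsync(() -> "FINISHED: " + statement, cluster);
    }

    public static void main(String[] args) {
        ExecutorService cluster = Executors.newSingleThreadExecutor();
        CompletableFuture<String> handle =
                executeSql(cluster, "INSERT INTO t1 SELECT * FROM s");
        // Async by default; blocking for completion is opt-in.
        System.out.println(handle.join());
        cluster.shutdown();
    }
}
```

If the only entry point blocked until job completion, an unbounded streaming statement would never return, and the handle-based design above sidesteps that.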
"there are too many things that need to be discussed for multiline":

True, I don't want to solve all of them right now. But what I know is that our newly introduced methods should fit into a multiline execution. There is no big difference between calling `executeSql(A); executeSql(B)` and processing a multiline file `A;\nB;`.

I think the example that you mentioned can simply be undefined for now. Currently, no catalog is modifying data but just metadata. This is a separate discussion.

"result of the second statement is indeterministic":

Sure, this is indeterministic. But this is the implementer's fault, and we cannot forbid such pipelines.

How about we always execute streaming queries async? It would unblock executeSql() and multiline statements.

Having an `executeSqlAsync()` is useful for batch. However, I don't want `sync/async` to be the new batch/stream flag.
The execution behavior should come from the query itself.

Regards,
Timo

On 30.03.20 11:12, godfrey he wrote:

Hi Timo,

I agree with you that streaming queries are our top priority, but I think there are too many things that need to be discussed for multiline statements, e.g.:

1. What is the behavior of mixing DDL and DML for async execution?

create table t1 xxx;
create table t2 xxx;
insert into t2 select * from t1 where xxx;
drop table t1; // t1 may be a MySQL table; the data will also be deleted.

Here t1 is dropped while the "insert" job is running.

2. What is the behavior of the unified scenario for async execution (as you mentioned)?

INSERT INTO t1 SELECT * FROM s;
INSERT INTO t2 SELECT * FROM s JOIN t1 EMIT STREAM;

The result of the second statement is indeterministic, because the first statement may still be running.
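One way to reconcile multiline scripts with async submission, as suggested earlier in the thread, is an iterator whose `next()` only then submits the following statement, so the caller controls sequencing statement by statement. A toy sketch with fake submission; the method signature and names are assumptions for illustration, not Flink API:

```java
import java.util.Iterator;
import java.util.List;

/**
 * Sketch of executeMultilineSql returning an iterator whose next() is
 * what submits the next statement, so a caller such as the SQL Client
 * can decide per statement when (and whether) to continue. Submission
 * is faked by recording the statement; no real cluster is involved.
 */
public class LazyMultiline {

    public static Iterator<String> executeMultilineSql(String statements, List<String> submissionLog) {
        Iterator<String> stmts = List.of(statements.split(";")).iterator();
        return new Iterator<>() {
            public boolean hasNext() { return stmts.hasNext(); }
            public String next() {
                String stmt = stmts.next().trim();
                submissionLog.add(stmt); // "submission" happens here, not up front
                return "RESULT OF: " + stmt;
            }
        };
    }
}
```

Under this model a client that stops iterating after a failed statement never submits the rest of the script, which is one possible answer to the drop-while-running hazard above.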
I think we need to put a lot of effort into defining the behavior of logically related queries.

In this FLIP, I suggest we only handle single statements, and we also introduce an async execute method, which is more important and more often used by users.

For the sync methods (like `TableEnvironment.executeSql` and `StatementSet.execute`), the result will not be returned until the job is finished. The following methods will be introduced in this FLIP:

    /**
     * Asynchronously execute the given single statement
     */
    TableEnvironment.executeSqlAsync(String statement): TableResult

    /**
     * Asynchronously execute the dml statements as a batch
     */
    StatementSet.executeAsync(): TableResult

    public interface TableResult {
        /**
         * return JobClient for DQL and DML in async mode, else return Optional.empty
         */
        Optional<JobClient> getJobClient();
    }

What do you think?
Best,
Godfrey

Timo Walther <twal...@apache.org> wrote on Thursday, March 26, 2020, at 9:15 PM:

Hi Godfrey,

executing streaming queries must be our top priority because this is what distinguishes Flink from competitors. If we change the execution behavior, we should think about the other cases as well, so as not to break the API a third time.

I fear that just having an async execute method will not be enough, because users should be able to mix streaming and batch queries in a unified scenario.

If I remember correctly, we had some discussions in the past about what decides the execution mode of a query. Currently, we would like to let the query decide, not derive it from the sources.
So I could imagine a multiline pipeline as:

USE CATALOG 'mycat';
INSERT INTO t1 SELECT * FROM s;
INSERT INTO t2 SELECT * FROM s JOIN t1 EMIT STREAM;

For executeMultilineSql():

sync because regular SQL
sync because regular Batch SQL
async because Streaming SQL

For executeAsyncMultilineSql():

async because everything should be async
async because everything should be async
async because everything should be async

What we should not start for executeAsyncMultilineSql():

sync because DDL
async because everything should be async
async because everything should be async

What are your thoughts here?

Regards,
Timo

On 26.03.20 12:50, godfrey he wrote:

Hi Timo,

I agree with you that streaming queries mostly need async execution.
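Timo's per-statement table can be read as a simple dispatch rule: the entry point either forces async for everything, or derives the mode from the statement itself. A toy sketch, where the `EMIT STREAM` check is a placeholder classification and not real Flink parsing:

```java
/**
 * Toy sketch of the sync/async table above: in executeMultilineSql(),
 * whether to wait for a statement is derived from the statement itself,
 * while executeAsyncMultilineSql() never waits. The streaming check is
 * a placeholder heuristic for illustration only.
 */
public class ExecutionModeDecision {

    enum Mode { SYNC, ASYNC }

    static Mode modeFor(String statement, boolean asyncEntryPoint) {
        if (asyncEntryPoint) {
            return Mode.ASYNC; // everything should be async
        }
        boolean streaming = statement.toUpperCase().contains("EMIT STREAM");
        return streaming ? Mode.ASYNC : Mode.SYNC; // only streaming queries detach
    }
}
```

The point of the rule is that the caller's choice of entry point never makes a streaming query block, matching the "what we should not start" row in the table.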
In fact, our original plan was to introduce only sync methods in this FLIP, and async methods (like "executeSqlAsync") would be introduced in the future, which is mentioned in the appendix. Maybe the async methods also need to be considered in this FLIP.

I think sync methods are also useful for streaming, which