Thanks for the update Godfrey! +1 to this approach.
Since there can be only one primary key, I'd also be fine with just using `PRI` even if it is composite, but `PRI(f0, f5)` might be more convenient for users.

Thanks,
Fabian

On Thu, 7 May 2020 at 09:31, godfrey he <godfre...@gmail.com> wrote:

> Hi Fabian,
> Thanks for your suggestions.
>
> Agree with you that `UNQ(f2, f3)` is clearer.
>
> A table can have only ONE primary key; this primary key can consist of a
> single column or multiple columns. [1]
> If the primary key consists of a single column, we can simply use `PRI`
> (or `PRI(xx)`) to represent it.
> If the primary key has multiple columns, we should use `PRI(xx, yy, ...)`
> to represent it.
>
> A table may have multiple unique keys, and each unique key can consist of
> a single column or multiple columns. [2]
> If there is only one unique key and this unique key has only a single
> column, we can simply use `UNQ` (or `UNQ(xx)`) to represent it.
> Otherwise, we should use `UNQ(xx, yy, ...)` to represent it.
> (A corner case: two unique keys with the same columns, like `UNQ(f2, f3)`,
> `UNQ(f2, f3)`; we can forbid this case or add a unique name for each key
> in the future.)
>
> Example with a multi-column primary key and unique key:
>
> create table MyTable (
>   f0 BIGINT NOT NULL,
>   f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>   f2 VARCHAR<256>,
>   f3 AS f0 + 1,
>   f4 TIMESTAMP(3) NOT NULL,
>   f5 BIGINT NOT NULL,
>   PRIMARY KEY (f0, f5),
>   UNIQUE (f3, f2),
>   WATERMARK f4 AS f4 - INTERVAL '3' SECOND
> ) with (...)
> +------+---------------------------------+-------+-------------+----------------+--------------------------+
> | name | type                            | null  | key         | compute column | watermark                |
> +------+---------------------------------+-------+-------------+----------------+--------------------------+
> | f0   | BIGINT                          | false | PRI(f0, f5) | (NULL)         | (NULL)                   |
> | f1   | ROW<q1 STRING, q2 TIMESTAMP(3)> | true  | (NULL)      | (NULL)         | (NULL)                   |
> | f2   | VARCHAR<256>                    | true  | UNQ(f2, f3) | (NULL)         | (NULL)                   |
> | f3   | BIGINT                          | false | UNQ(f2, f3) | f0 + 1         | (NULL)                   |
> | f4   | TIMESTAMP(3)                    | false | (NULL)      | (NULL)         | f4 - INTERVAL '3' SECOND |
> | f5   | BIGINT                          | false | PRI(f0, f5) | (NULL)         | (NULL)                   |
> +------+---------------------------------+-------+-------------+----------------+--------------------------+
>
> Regarding the watermark on nested columns: that's a good approach which
> can both support watermarks on nested columns in the future and keep the
> current table form.
> [1] https://www.w3schools.com/sql/sql_primarykey.asp
> [2] https://www.w3schools.com/sql/sql_unique.ASP
>
> Best,
> Godfrey
>
> On Thu, 7 May 2020 at 00:03, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Godfrey,
>>
>> This looks good to me.
>>
>> One side note: indicating unique constraints with "UNQ" is probably not
>> enough. There might be multiple unique constraints, and users would like
>> to know which field combinations are unique. So in your example above,
>> "UNQ(f2, f3)" might be a better marker.
>>
>> Just as a thought, if we would later add support for watermarks on nested
>> columns, we could add a row just for the nested field (in addition to the
>> top-level field) like this:
>>
>> +-------------------+--------------+-------+--------+----------------+-----------------------------------------+
>> | f4.nested.rowtime | TIMESTAMP(3) | false | (NULL) | (NULL)         | f4.nested.rowtime - INTERVAL '3' SECOND |
>> +-------------------+--------------+-------+--------+----------------+-----------------------------------------+
>>
>> Thanks,
>> Fabian
>>
>> On Wed, 6 May 2020 at 17:51, godfrey he <godfre...@gmail.com> wrote:
>>
>>> Hi @fhue...@gmail.com @Timo Walther <twal...@apache.org> @Dawid
>>> Wysakowicz <dwysakow...@apache.org>
>>> What do you think about limiting watermarks so that they must be
>>> defined on a top-level column?
>>>
>>> If we do that, we can add an expression column to represent the
>>> watermark, just like a computed column.
>>> An example of all cases:
>>>
>>> create table MyTable (
>>>   f0 BIGINT NOT NULL,
>>>   f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>>>   f2 VARCHAR<256>,
>>>   f3 AS f0 + 1,
>>>   f4 TIMESTAMP(3) NOT NULL,
>>>   PRIMARY KEY (f0),
>>>   UNIQUE (f3, f2),
>>>   WATERMARK f4 AS f4 - INTERVAL '3' SECOND
>>> ) with (...)
>>> +------+---------------------------------+-------+--------+----------------+--------------------------+
>>> | name | type                            | null  | key    | compute column | watermark                |
>>> +------+---------------------------------+-------+--------+----------------+--------------------------+
>>> | f0   | BIGINT                          | false | PRI    | (NULL)         | (NULL)                   |
>>> | f1   | ROW<q1 STRING, q2 TIMESTAMP(3)> | true  | (NULL) | (NULL)         | (NULL)                   |
>>> | f2   | VARCHAR<256>                    | true  | UNQ    | (NULL)         | (NULL)                   |
>>> | f3   | BIGINT                          | false | UNQ    | f0 + 1         | (NULL)                   |
>>> | f4   | TIMESTAMP(3)                    | false | (NULL) | (NULL)         | f4 - INTERVAL '3' SECOND |
>>> +------+---------------------------------+-------+--------+----------------+--------------------------+
>>>
>>> WDYT?
>>>
>>> Best,
>>> Godfrey
>>>
>>> On Thu, 30 Apr 2020 at 23:57, godfrey he <godfre...@gmail.com> wrote:
>>>
>>>> Hi Fabian,
>>>>
>>>> the broken example is:
>>>>
>>>> create table MyTable (
>>>>   f0 BIGINT NOT NULL,
>>>>   f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>>>>   f2 VARCHAR<256>,
>>>>   f3 AS f0 + 1,
>>>>   PRIMARY KEY (f0),
>>>>   UNIQUE (f3, f2),
>>>>   WATERMARK f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND)
>>>> ) with (...)
>>>> +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>> | name | type                                | key    | compute column | watermark                                |
>>>> +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>> | f0   | BIGINT NOT NULL                     | PRI    | (NULL)         |                                          |
>>>> | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>>> | f2   | VARCHAR<256>                        | (NULL) | NULL           |                                          |
>>>> | f3   | BIGINT NOT NULL                     | UNQ    | f0 + 1         |                                          |
>>>> +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>>
>>>> or we add a column to represent nullability:
>>>>
>>>> +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>> | name | type                                | null  | key    | compute column | watermark                                |
>>>> +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>> | f0   | BIGINT                              | false | PRI    | (NULL)         |                                          |
>>>> | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | true  | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>>> | f2   | VARCHAR<256>                        | true  | (NULL) | NULL           |                                          |
>>>> | f3   | BIGINT                              | false | UNQ    | f0 + 1         |                                          |
>>>> +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>>
>>>> Hi Jark,
>>>> If we limit watermarks so that they must be defined on a top-level
>>>> column, this will become much simpler.
>>>>
>>>> Best,
>>>> Godfrey
>>>>
>>>> On Thu, 30 Apr 2020 at 23:38, Jark Wu <imj...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm in favor of Fabian's proposal.
>>>>> First, a watermark is not a column but metadata, just like a primary
>>>>> key, so it shouldn't stand with the columns.
>>>>> Second, AFAIK, a primary key can only be defined on top-level columns.
>>>>> Third, I think watermarks can follow primary keys and likewise only be
>>>>> allowed on top-level columns.
>>>>>
>>>>> I have to admit that in FLIP-66, a watermark can be defined on nested
>>>>> fields. However, during implementation, I found that it's too
>>>>> complicated to do that: we would have to refactor the time-based
>>>>> physical nodes, use code generation to access the event time, and
>>>>> refactor FlinkTypeFactory to support a complex nested rowtime.
>>>>> There is not much value in this feature, but it introduces a lot of
>>>>> complexity in the code base. So I think we can force watermarks to be
>>>>> defined on top-level columns. If a user wants to define one on a
>>>>> nested column, he/she can use a computed column to turn it into a
>>>>> top-level column.
>>>>>
>>>>> Best,
>>>>> Jark
>>>>>
>>>>> On Thu, 30 Apr 2020 at 17:55, Fabian Hueske <fhue...@gmail.com> wrote:
>>>>>
>>>>> > Hi Godfrey,
>>>>> >
>>>>> > The formatting of your example seems to be broken.
>>>>> > Could you send it again please?
>>>>> >
>>>>> > Regarding your points:
>>>>> > > because the watermark expression can be a sub-column, just like
>>>>> > > `f1.q2` in the example I gave above.
>>>>> >
>>>>> > I would put the watermark information in the row of the top-level
>>>>> > field and indicate to which nested field the watermark refers.
>>>>> > Don't we have to solve the same issue for primary keys that are
>>>>> > defined on a nested field?
>>>>> >
>>>>> > > A boolean flag can't represent such info, and I don't know
>>>>> > > whether we will support complex watermark expressions involving
>>>>> > > multiple columns in the future, such as:
>>>>> > > "WATERMARK FOR ts as ts + f1 + interval '1' second"
>>>>> >
>>>>> > You are right, a simple binary flag is definitely not sufficient to
>>>>> > display the watermark information.
>>>>> > I would put the expression string into the field, i.e.,
>>>>> > "ts + f1 + interval '1' second"
>>>>> >
>>>>> > For me the most important reason not to show the watermark as a row
>>>>> > in the table is that it is not a field that can be queried but meta
>>>>> > information on an existing field.
>>>>> > For the user it is important to know that a certain field has a
>>>>> > watermark. Otherwise, certain queries cannot be correctly specified.
>>>>> > Also, there might be support for multiple watermarks that are
>>>>> > defined on different fields at some point.
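Jark's computed-column workaround above could look like the following sketch. It reuses the DDL style of the examples in this thread; the column name `ts` is an assumption, not part of the original proposal:

```sql
-- Sketch only: instead of declaring a watermark on the nested field f1.q2,
-- a computed column (hypothetically named `ts`) hoists it to the top level
-- and the watermark is declared on that top-level column.
create table MyTable (
  f0 BIGINT NOT NULL,
  f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
  ts AS f1.q2,                            -- computed top-level column
  WATERMARK ts AS ts - INTERVAL '3' SECOND
) with (...)
```

With this shape, the DESCRIBE output only ever has to show watermarks on top-level rows, which is exactly what makes the table form simple.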
>>>>> > Would those be printed in multiple rows?
>>>>> >
>>>>> > Best,
>>>>> > Fabian
>>>>> >
>>>>> > On Thu, 30 Apr 2020 at 11:25, godfrey he <godfre...@gmail.com> wrote:
>>>>> >
>>>>> > > Hi Fabian, Aljoscha,
>>>>> > >
>>>>> > > Thanks for the feedback.
>>>>> > >
>>>>> > > Agree with you that we can deal with the primary key as you
>>>>> > > mentioned. Now, the type column already contains the nullability
>>>>> > > attribute, e.g. BIGINT NOT NULL.
>>>>> > > (I'm also fine with using two columns to represent the type, just
>>>>> > > like MySQL.)
>>>>> > >
>>>>> > > Why do I treat `watermark` as a special row? Because the watermark
>>>>> > > expression can refer to a sub-column, just like `f1.q2` in the
>>>>> > > example I gave above. A boolean flag can't represent such info,
>>>>> > > and I don't know whether we will support complex watermark
>>>>> > > expressions involving multiple columns in the future, such as:
>>>>> > > "WATERMARK FOR ts as ts + f1 + interval '1' second"
>>>>> > >
>>>>> > > If we do not support complex watermark expressions, we can add a
>>>>> > > watermark column.
>>>>> > >
>>>>> > > For example:
>>>>> > >
>>>>> > > create table MyTable (
>>>>> > >   f0 BIGINT NOT NULL,
>>>>> > >   f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>>>>> > >   f2 VARCHAR<256>,
>>>>> > >   f3 AS f0 + 1,
>>>>> > >   PRIMARY KEY (f0),
>>>>> > >   UNIQUE (f3, f2),
>>>>> > >   WATERMARK f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND)
>>>>> > > ) with (...)
>>>>> > > +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>>> > > | name | type                                | key    | compute column | watermark                                |
>>>>> > > +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>>> > > | f0   | BIGINT NOT NULL                     | PRI    | (NULL)         |                                          |
>>>>> > > | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>>>> > > | f2   | VARCHAR<256>                        | (NULL) | NULL           |                                          |
>>>>> > > | f3   | BIGINT NOT NULL                     | UNQ    | f0 + 1         |                                          |
>>>>> > > +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>>> > >
>>>>> > > or we add a column to represent nullability:
>>>>> > >
>>>>> > > +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>>> > > | name | type                                | null  | key    | compute column | watermark                                |
>>>>> > > +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>>> > > | f0   | BIGINT                              | false | PRI    | (NULL)         |                                          |
>>>>> > > | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | true  | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>>>> > > | f2   | VARCHAR<256>                        | true  | (NULL) | NULL           |                                          |
>>>>> > > | f3   | BIGINT                              | false | UNQ    | f0 + 1         |                                          |
>>>>> > > +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>>> > >
>>>>> > > Personally, I like the second one. (We need to make some changes
>>>>> > > to LogicalType to get the type name without nullability.)
>>>>> > >
>>>>> > > Best,
>>>>> > > Godfrey
>>>>> > >
>>>>> > > On Wed, 29 Apr 2020 at 17:47, Aljoscha Krettek <aljos...@apache.org> wrote:
>>>>> > >
>>>>> > > > +1 I like the general idea of printing the results as a table.
>>>>> > > > On the specifics I don't know enough, but Fabian's suggestions
>>>>> > > > seem to make sense to me.
>>>>> > > >
>>>>> > > > Aljoscha
>>>>> > > >
>>>>> > > > On 29.04.20 10:56, Fabian Hueske wrote:
>>>>> > > > > Hi Godfrey,
>>>>> > > > >
>>>>> > > > > Thanks for starting this discussion!
>>>>> > > > >
>>>>> > > > > In my mind, WATERMARK is a property (or constraint) of a
>>>>> > > > > field, just like PRIMARY KEY.
>>>>> > > > > Take this example from MySQL:
>>>>> > > > >
>>>>> > > > > mysql> CREATE TABLE people (id INT NOT NULL, name VARCHAR(128)
>>>>> > > > > NOT NULL, age INT, PRIMARY KEY (id));
>>>>> > > > > Query OK, 0 rows affected (0.06 sec)
>>>>> > > > >
>>>>> > > > > mysql> describe people;
>>>>> > > > > +-------+--------------+------+-----+---------+-------+
>>>>> > > > > | Field | Type         | Null | Key | Default | Extra |
>>>>> > > > > +-------+--------------+------+-----+---------+-------+
>>>>> > > > > | id    | int          | NO   | PRI | NULL    |       |
>>>>> > > > > | name  | varchar(128) | NO   |     | NULL    |       |
>>>>> > > > > | age   | int          | YES  |     | NULL    |       |
>>>>> > > > > +-------+--------------+------+-----+---------+-------+
>>>>> > > > > 3 rows in set (0.01 sec)
>>>>> > > > >
>>>>> > > > > Here, PRIMARY KEY is marked in the Key column of the id field.
>>>>> > > > > We could do the same for watermarks by adding a Watermark
>>>>> > > > > column.
>>>>> > > > >
>>>>> > > > > Best, Fabian
>>>>> > > > >
>>>>> > > > > On Wed, 29 Apr 2020 at 10:43, godfrey he <godfre...@gmail.com> wrote:
>>>>> > > > >
>>>>> > > > >> Hi everyone,
>>>>> > > > >>
>>>>> > > > >> I would like to bring up a discussion about the result type
>>>>> > > > >> of the DESCRIBE statement, which is introduced in FLIP-84[1].
>>>>> > > > >> In the previous version, we defined the result type of the
>>>>> > > > >> `describe` statement as a single column, as follows:
>>>>> > > > >>
>>>>> > > > >> Statement:     DESCRIBE xx
>>>>> > > > >> Result Schema: field name: result
>>>>> > > > >>                field type: VARCHAR(n)
>>>>> > > > >>                (n is the max length of values)
>>>>> > > > >> Result Value:  describe the detail of an object (single row)
>>>>> > > > >> Result Kind:   SUCCESS_WITH_CONTENT
>>>>> > > > >> Examples:      DESCRIBE table_name
>>>>> > > > >>
>>>>> > > > >> For "describe table_name", the result value is the `toString`
>>>>> > > > >> value of `TableSchema`, which is unstructured data.
>>>>> > > > >> It's hard for users to use this info.
>>>>> > > > >>
>>>>> > > > >> For example:
>>>>> > > > >>
>>>>> > > > >> TableSchema schema = TableSchema.builder()
>>>>> > > > >>     .field("f0", DataTypes.BIGINT())
>>>>> > > > >>     .field("f1", DataTypes.ROW(
>>>>> > > > >>         DataTypes.FIELD("q1", DataTypes.STRING()),
>>>>> > > > >>         DataTypes.FIELD("q2", DataTypes.TIMESTAMP(3))))
>>>>> > > > >>     .field("f2", DataTypes.STRING())
>>>>> > > > >>     .field("f3", DataTypes.BIGINT(), "f0 + 1")
>>>>> > > > >>     .watermark("f1.q2", WATERMARK_EXPRESSION, WATERMARK_DATATYPE)
>>>>> > > > >>     .build();
>>>>> > > > >>
>>>>> > > > >> its `toString` value is:
>>>>> > > > >>
>>>>> > > > >> root
>>>>> > > > >>  |-- f0: BIGINT
>>>>> > > > >>  |-- f1: ROW<`q1` STRING, `q2` TIMESTAMP(3)>
>>>>> > > > >>  |-- f2: STRING
>>>>> > > > >>  |-- f3: BIGINT AS f0 + 1
>>>>> > > > >>  |-- WATERMARK FOR f1.q2 AS now()
>>>>> > > > >>
>>>>> > > > >> For Hive, MySQL, etc., the describe result is in table form,
>>>>> > > > >> including field names and field types,
>>>>> > > > >> which is more familiar to users.
>>>>> > > > >> TableSchema[2] has watermark expressions and computed
>>>>> > > > >> columns; we should also put them into the table:
>>>>> > > > >> for a computed column, which is column-level, we add a new
>>>>> > > > >> column named `expr`;
>>>>> > > > >> for a watermark expression, which is table-level, we add a
>>>>> > > > >> special row named `WATERMARK` to represent it.
>>>>> > > > >>
>>>>> > > > >> The result for the above example will look like:
>>>>> > > > >>
>>>>> > > > >> +-----------+-------------------------------------+----------------+
>>>>> > > > >> | name      | type                                | expr           |
>>>>> > > > >> +-----------+-------------------------------------+----------------+
>>>>> > > > >> | f0        | BIGINT                              | (NULL)         |
>>>>> > > > >> | f1        | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | (NULL)         |
>>>>> > > > >> | f2        | STRING                              | NULL           |
>>>>> > > > >> | f3        | BIGINT                              | f0 + 1         |
>>>>> > > > >> | WATERMARK | (NULL)                              | f1.q2 AS now() |
>>>>> > > > >> +-----------+-------------------------------------+----------------+
>>>>> > > > >>
>>>>> > > > >> Now there is a PR, FLINK-17112 [3], to implement the DESCRIBE
>>>>> > > > >> statement.
>>>>> > > > >>
>>>>> > > > >> What do you think about this update?
>>>>> > > > >> Any feedback is welcome~
>>>>> > > > >>
>>>>> > > > >> Best,
>>>>> > > > >> Godfrey
>>>>> > > > >>
>>>>> > > > >> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=134745878
>>>>> > > > >> [2] https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/api/TableSchema.java
>>>>> > > > >> [3] https://github.com/apache/flink/pull/11892
>>>>> > > > >>
>>>>> > > > >> On Mon, 6 Apr 2020 at 22:38, godfrey he <godfre...@gmail.com> wrote:
>>>>> > > > >>
>>>>> > > > >>> Hi Timo,
>>>>> > > > >>>
>>>>> > > > >>> Sorry for the late reply, and thanks for your correction.
>>>>> > > > >>> I missed DQL for the job submission scenario.
>>>>> > > > >>> I'll fix the document right away.
>>>>> > > > >>>
>>>>> > > > >>> Best,
>>>>> > > > >>> Godfrey
>>>>> > > > >>>
>>>>> > > > >>> On Fri, 3 Apr 2020 at 21:53, Timo Walther <twal...@apache.org> wrote:
>>>>> > > > >>>
>>>>> > > > >>>> Hi Godfrey,
>>>>> > > > >>>>
>>>>> > > > >>>> I'm sorry to jump in again, but I still need to clarify
>>>>> > > > >>>> some things around TableResult.
>>>>> > > > >>>>
>>>>> > > > >>>> The FLIP says:
>>>>> > > > >>>> "For DML, this method returns TableResult until the job is
>>>>> > > > >>>> submitted. For other statements, TableResult is returned
>>>>> > > > >>>> until the execution is finished."
>>>>> > > > >>>>
>>>>> > > > >>>> I thought we agreed on making every execution async? This
>>>>> > > > >>>> also means returning a TableResult for DQLs even though the
>>>>> > > > >>>> execution is not done yet. People need access to the
>>>>> > > > >>>> JobClient also for batch jobs in order to cancel
>>>>> > > > >>>> long-lasting queries. If people want to wait for
>>>>> > > > >>>> completion, they can hook into JobClient or collect().
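The async-by-default contract Timo describes above can be sketched in plain Java, without any Flink classes (all names here are hypothetical stand-ins, not the real API): submission returns a handle immediately, and callers who want blocking behavior wait on that handle explicitly.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Sketch of the async submission contract: executeSql returns right away
// with a handle (playing the role of TableResult/JobClient in the thread);
// "sync" execution is just the caller choosing to wait on the handle.
public class AsyncSubmission {

    static CompletableFuture<String> executeSql(String stmt) {
        // the job runs in the background; the caller is not blocked
        return CompletableFuture.supplyAsync(() -> "done: " + stmt);
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> handle = executeSql("SELECT ...");
        // async by default; to run synchronously, hook into the handle
        String result = handle.get(10, TimeUnit.SECONDS);
        System.out.println(result);
    }
}
```

The same handle could expose a cancel operation, which is why access to it matters even for batch jobs.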
>>>>> > > > >>>> Can we rephrase this part to:
>>>>> > > > >>>>
>>>>> > > > >>>> "For DML and DQL, this method returns TableResult once the
>>>>> > > > >>>> job has been submitted. For DDL and DCL statements,
>>>>> > > > >>>> TableResult is returned once the operation has finished."
>>>>> > > > >>>>
>>>>> > > > >>>> Regards,
>>>>> > > > >>>> Timo
>>>>> > > > >>>>
>>>>> > > > >>>> On 02.04.20 05:27, godfrey he wrote:
>>>>> > > > >>>>> Hi Aljoscha, Dawid, Timo,
>>>>> > > > >>>>>
>>>>> > > > >>>>> Thanks so much for the detailed explanation.
>>>>> > > > >>>>> Agree with you that the multiline story is not complete
>>>>> > > > >>>>> yet, and we can keep discussing.
>>>>> > > > >>>>> I will add the current discussions and conclusions to the
>>>>> > > > >>>>> FLIP.
>>>>> > > > >>>>>
>>>>> > > > >>>>> Best,
>>>>> > > > >>>>> Godfrey
>>>>> > > > >>>>>
>>>>> > > > >>>>> On Wed, 1 Apr 2020 at 23:27, Timo Walther <twal...@apache.org> wrote:
>>>>> > > > >>>>>
>>>>> > > > >>>>>> Hi Godfrey,
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> first of all, I agree with Dawid. The multiline story is
>>>>> > > > >>>>>> not completed by this FLIP. It just verifies the big
>>>>> > > > >>>>>> picture.
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> 1. "control the execution logic through the proposed
>>>>> > > > >>>>>> method if they know what the statements are"
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> This is a good point that Fabian also raised in the
>>>>> > > > >>>>>> linked Google doc. I could also imagine returning a more
>>>>> > > > >>>>>> complicated POJO when calling `executeMultiSql()`.
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> The POJO would include some `getSqlProperties()` such
>>>>> > > > >>>>>> that a platform gets insights into the query before
>>>>> > > > >>>>>> executing.
>>>>> > > > >>>>>> We could also trigger the execution more explicitly
>>>>> > > > >>>>>> instead of hiding it behind an iterator.
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> 2. "there are some special commands introduced in SQL
>>>>> > > > >>>>>> client"
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> For platform and SQL Client specific commands, we could
>>>>> > > > >>>>>> offer a hook into the parser, or a fallback parser in
>>>>> > > > >>>>>> case the regular table environment parser cannot deal
>>>>> > > > >>>>>> with the statement.
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> However, all of that is future work and can be discussed
>>>>> > > > >>>>>> in a separate FLIP.
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> 3. +1 for the `Iterator` instead of `Iterable`.
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> 4. "we should convert the checked exception to unchecked
>>>>> > > > >>>>>> exception"
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> Yes, I meant using a runtime exception instead of a
>>>>> > > > >>>>>> checked exception. There was no consensus on putting the
>>>>> > > > >>>>>> exception into the `TableResult`.
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> Regards,
>>>>> > > > >>>>>> Timo
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> On 01.04.20 15:35, Dawid Wysakowicz wrote:
>>>>> > > > >>>>>>> When considering the multi-line support, I think it is
>>>>> > > > >>>>>>> helpful to start with a use case in mind. In my opinion,
>>>>> > > > >>>>>>> consumers of this method will be:
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> 1. sql-client
>>>>> > > > >>>>>>> 2. third-party SQL-based platforms
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> @Godfrey As for the quit/source/... commands: I think
>>>>> > > > >>>>>>> those belong to the responsibility of the
>>>>> > > > >>>>>>> aforementioned. They should not be understood by the
>>>>> > > > >>>>>>> TableEnvironment.
>>>>> > > > >>>>>>> What would quit on a TableEnvironment do? Moreover, I
>>>>> > > > >>>>>>> think such commands should be prefixed appropriately.
>>>>> > > > >>>>>>> It's a common practice to prefix those with e.g. ! or :
>>>>> > > > >>>>>>> to say they are meta commands of the tool rather than a
>>>>> > > > >>>>>>> query.
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> I also don't necessarily understand why platform users
>>>>> > > > >>>>>>> need to know the kind of the query to use the proposed
>>>>> > > > >>>>>>> method. They should get the type from
>>>>> > > > >>>>>>> TableResult#ResultKind. If the ResultKind is SUCCESS, it
>>>>> > > > >>>>>>> was a DCL/DDL. If SUCCESS_WITH_CONTENT, it was a
>>>>> > > > >>>>>>> DML/DQL. If that's not enough, we can enrich the
>>>>> > > > >>>>>>> TableResult with a more explicit kind of query, but so
>>>>> > > > >>>>>>> far I don't see such a need.
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> @Kurt In those cases I would assume the developers want
>>>>> > > > >>>>>>> to present the results of the queries anyway. Moreover,
>>>>> > > > >>>>>>> I think it is safe to assume they can adhere to such a
>>>>> > > > >>>>>>> contract that the results must be iterated.
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> For direct users of TableEnvironment/Table API this
>>>>> > > > >>>>>>> method does not make much sense anyway, in my opinion.
>>>>> > > > >>>>>>> I think we can rather safely assume in this scenario
>>>>> > > > >>>>>>> that they do not want to submit multiple queries at a
>>>>> > > > >>>>>>> single time.
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> Best,
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> Dawid
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> On 01/04/2020 15:07, Kurt Young wrote:
>>>>> > > > >>>>>>>> One comment on `executeMultilineSql`: I'm afraid
>>>>> > > > >>>>>>>> sometimes users might forget to iterate the returned
>>>>> > > > >>>>>>>> iterators, e.g. a user submits a bunch of DDLs and
>>>>> > > > >>>>>>>> expects the framework to execute them one by one. But
>>>>> > > > >>>>>>>> it didn't.
>>>>> > > > >>>>>>>>
>>>>> > > > >>>>>>>> Best,
>>>>> > > > >>>>>>>> Kurt
>>>>> > > > >>>>>>>>
>>>>> > > > >>>>>>>> On Wed, Apr 1, 2020 at 5:10 PM Aljoscha Krettek <aljos...@apache.org> wrote:
>>>>> > > > >>>>>>>>
>>>>> > > > >>>>>>>>> Agreed to what Dawid and Timo said.
>>>>> > > > >>>>>>>>>
>>>>> > > > >>>>>>>>> To answer your question about multi-line SQL: no, we
>>>>> > > > >>>>>>>>> don't think we need this in Flink 1.11; we only wanted
>>>>> > > > >>>>>>>>> to make sure that the interfaces that we now put in
>>>>> > > > >>>>>>>>> place will potentially allow this in the future.
>>>>> > > > >>>>>>>>>
>>>>> > > > >>>>>>>>> Best,
>>>>> > > > >>>>>>>>> Aljoscha
>>>>> > > > >>>>>>>>>
>>>>> > > > >>>>>>>>> On 01.04.20 09:31, godfrey he wrote:
>>>>> > > > >>>>>>>>>> Hi Timo & Dawid,
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>> Thanks so much for the effort on multiline statement
>>>>> > > > >>>>>>>>>> support. I have a few questions about this method:
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>> 1. Users can well control the execution logic through
>>>>> > > > >>>>>>>>>> the proposed method if they know what the statements
>>>>> > > > >>>>>>>>>> are (whether a statement is a DDL, a DML, or
>>>>> > > > >>>>>>>>>> something else).
>>>>> > > > >>>>>>>>>> But if a statement comes from a file, users do not
>>>>> > > > >>>>>>>>>> know what the statements are, so the execution
>>>>> > > > >>>>>>>>>> behavior is unclear.
>>>>> > > > >>>>>>>>>> As a platform user, I think this method is hard to
>>>>> > > > >>>>>>>>>> use unless the platform defines a set of rules about
>>>>> > > > >>>>>>>>>> statement order, such as: no SELECT in the middle,
>>>>> > > > >>>>>>>>>> DML must be at the tail of the SQL file (which may be
>>>>> > > > >>>>>>>>>> the most common case in production environments).
>>>>> > > > >>>>>>>>>> Otherwise, the platform must parse the SQL first to
>>>>> > > > >>>>>>>>>> know what the statements are. If it does that, the
>>>>> > > > >>>>>>>>>> platform can handle all cases through `executeSql`
>>>>> > > > >>>>>>>>>> and `StatementSet`.
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>> 2. The SQL client also can't use
>>>>> > > > >>>>>>>>>> `executeMultilineSql` to support multiline
>>>>> > > > >>>>>>>>>> statements, because there are some special commands
>>>>> > > > >>>>>>>>>> introduced in the SQL client, such as `quit`,
>>>>> > > > >>>>>>>>>> `source`, `load jar` (the latter does not exist yet,
>>>>> > > > >>>>>>>>>> but we may need it to support dynamic table sources
>>>>> > > > >>>>>>>>>> and UDFs).
>>>>> > > > >>>>>>>>>> Does TableEnvironment also support those commands?
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>> 3. By the way, must we have this feature in
>>>>> > > > >>>>>>>>>> release-1.11? I find there are a few user cases in
>>>>> > > > >>>>>>>>>> the feedback document whose behavior is unclear now.
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>> Regarding "change the return value from
>>>>> > > > >>>>>>>>>> `Iterable<Row>` to `Iterator<Row>`", I couldn't agree
>>>>> > > > >>>>>>>>>> more with this change. Just as Dawid mentioned, "The
>>>>> > > > >>>>>>>>>> contract of the Iterable#iterator is that it returns
>>>>> > > > >>>>>>>>>> a new iterator each time, which effectively means we
>>>>> > > > >>>>>>>>>> can iterate the results multiple times."
>>>>> > > > >>>>>>>>>> We do not provide iterating the results multiple
>>>>> > > > >>>>>>>>>> times. If we wanted to do that, the client would have
>>>>> > > > >>>>>>>>>> to buffer all results, but that's impossible for a
>>>>> > > > >>>>>>>>>> streaming job.
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>> Best,
>>>>> > > > >>>>>>>>>> Godfrey
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>> On Wed, 1 Apr 2020 at 03:14, Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>>> Thank you Timo for the great summary! It covers
>>>>> > > > >>>>>>>>>>> (almost) all the topics.
>>>>> > > > >>>>>>>>>>> Even though in the end we are not suggesting many
>>>>> > > > >>>>>>>>>>> changes to the current state of the FLIP, I think it
>>>>> > > > >>>>>>>>>>> is important to lay out all possible use cases so
>>>>> > > > >>>>>>>>>>> that we do not change the execution model every
>>>>> > > > >>>>>>>>>>> release.
There is one additional thing we discussed. Could we change the result type of TableResult#collect to Iterator<Row>? Even though those interfaces do not differ much, I think Iterator better describes that the results might not be materialized on the client side but can be retrieved on a per-record basis. The contract of Iterable#iterator is that it returns a new iterator each time, which effectively means we can iterate the results multiple times. Iterating the results multiple times is not possible when we don't retrieve all the results from the cluster at once. I think we should also use Iterator for TableEnvironment#executeMultilineSql(String statements): Iterator<TableResult>.

Best,
Dawid

On 31/03/2020 19:27, Timo Walther wrote:

Hi Godfrey,

Aljoscha, Dawid, Klou, and I had another discussion around FLIP-84. In particular, we discussed how the current status of the FLIP and the future requirements around multiline statements, async/sync, and collect() fit together.
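The single-pass contract discussed above can be sketched in plain Java (a hypothetical mock, not Flink's implementation): rows arrive from the cluster once, so the handle is an `Iterator` that can be consumed exactly one time.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Queue;

/** Hypothetical single-pass result handle: rows are pulled from the cluster
 *  once and are not buffered, so the results cannot be iterated twice. */
class StreamingResultIterator implements Iterator<String> {
    private final Queue<String> fromCluster; // stands in for a socket/REST stream
    private String next;

    StreamingResultIterator(List<String> rows) {
        this.fromCluster = new ArrayDeque<>(rows);
        advance();
    }

    private void advance() {
        // "END" marks job completion; a real client would block until rows
        // arrive or the cluster signals the end of the stream.
        String row = fromCluster.poll();
        next = (row == null || row.equals("END")) ? null : row;
    }

    @Override public boolean hasNext() { return next != null; }

    @Override public String next() {
        if (next == null) throw new NoSuchElementException();
        String row = next;
        advance();
        return row;
    }
}
```

Exposing this as an `Iterable` would promise a fresh iterator on every call, which the client could only honor by buffering everything — exactly the problem pointed out for streaming jobs.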
We also updated the FLIP-84 Feedback Summary document [1] with some use cases.

We believe that we found a good solution that also fits what is in the current FLIP. So no bigger changes are necessary, which is great!

Our findings were:

1. Async vs. sync submission of Flink jobs:

Having a blocking `execute()` in the DataStream API was rather a mistake. Instead, all submissions should be async because this allows supporting both modes if necessary. Thus, submitting all queries async sounds good to us. If users want to run a job sync, they can use the JobClient and wait for completion (or collect() in the case of batch jobs).

2. Multi-statement execution:

For multi-statement execution, we don't see a contradiction with the async execution behavior. We imagine a method like:

TableEnvironment#executeMultilineSql(String statements): Iterable<TableResult>

where the `Iterator#next()` method would trigger the next statement submission.
This allows a caller to decide synchronously when to submit statements async to the cluster. Thus, a service such as the SQL Client can handle the result of each statement individually and process the statements sequentially, one by one.

3. The role of TableResult and result retrieval in general:

`TableResult` is similar to `JobClient`. Instead of returning a `CompletableFuture` of something, it is a concrete util class where some methods have the behavior of a completable future (e.g. collect(), print()) and some are already completed (getTableSchema(), getResultKind()).

`StatementSet#execute()` returns a single `TableResult` because the order is undefined in a set and all statements have the same schema. Its `collect()` will return a row for each executed `INSERT INTO` in the order of statement definition.

For a simple `SELECT * FROM ...`, the query execution might block until `collect()` is called to pull buffered rows from the job (via socket/REST API or whatever we will use in the future).
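The lazy-submission idea can be sketched in plain Java (hypothetical names, not the actual Flink API): the iterator holds the remaining statements, and each `next()` call submits exactly one more.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical sketch of executeMultilineSql: nothing is submitted up front;
 *  each next() call submits one more statement and returns its result handle. */
class MultilineExecutor implements Iterator<String> {
    private final List<String> statements;
    private int pos = 0;

    MultilineExecutor(String multilineSql) {
        // Naive split for illustration; a real parser must respect string
        // literals, comments, and client-specific commands.
        this.statements = List.of(multilineSql.split(";"));
    }

    @Override public boolean hasNext() { return pos < statements.size(); }

    @Override public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        String stmt = statements.get(pos++).trim();
        // A real implementation would submit stmt asynchronously and return
        // a TableResult; we return a fake submission marker instead.
        return "submitted: " + stmt;
    }
}
```

A caller such as the SQL Client can thus pause between `next()` calls, e.g. to print each statement's result before triggering the following one.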
We can say that a statement finished successfully when `collect()`'s `Iterator#hasNext` has returned false.

I hope this summarizes our discussion, @Dawid/Aljoscha/Klou?

It would be great if we can add these findings to the FLIP before we start voting.

One minor thing: some `execute()` methods still throw a checked exception; can we remove that from the FLIP? Also, the above-mentioned `Iterator#next()` would trigger an execution without throwing a checked exception.

Thanks,
Timo

[1] https://docs.google.com/document/d/1ueLjQWRPdLTFB_TReAyhseAX-1N3j4WYWD0F02Uau0E/edit#

On 31.03.20 06:28, godfrey he wrote:

Hi Timo & Jark,

Thanks for your explanation. I agree with you that execution should always be async, and that the sync execution scenario can be covered by async execution. It helps provide a unified entry point for batch and streaming. I think we can also use sync execution for some testing.
So I agree that we provide an `executeSql` method and that it is an async method. If we want a sync method in the future, we can add one named `executeSqlSync`.

I think we've reached an agreement. I will update the document and start the voting process.

Best,
Godfrey

Jark Wu <imj...@gmail.com> wrote on Tue, Mar 31, 2020 at 12:46 AM:

Hi,

I didn't follow the full discussion, but I share the same concern with Timo that streaming queries should always be async. Otherwise, I can imagine it will cause a lot of confusion and problems if users don't deeply keep the "sync" in mind (e.g. the client hangs). Besides, streaming mode is still the majority of the use cases of Flink and Flink SQL. We should put usability at a high priority.
Best,
Jark

On Mon, 30 Mar 2020 at 23:27, Timo Walther <twal...@apache.org> wrote:

Hi Godfrey,

Maybe I wasn't expressing my biggest concern clearly enough in my last mail. Even in a single-line and sync execution, I think that streaming queries should not block the execution. Otherwise it is not possible to call collect() or print() on them afterwards.

"there are too many things need to discuss for multiline":

True, I don't want to solve all of them right now. But what I do know is that our newly introduced methods should fit into a multiline execution. There is no big difference between calling `executeSql(A); executeSql(B)` and processing a multiline file `A;\nB;`.

I think the example that you mentioned can simply be undefined for now. Currently, no catalog is modifying data, just metadata. This is a separate discussion.
"result of the second statement is indeterministic":

Sure, this is indeterministic. But this is the implementer's fault, and we cannot forbid such pipelines.

How about we always execute streaming queries async? It would unblock executeSql() and multiline statements.

Having an `executeSqlAsync()` is useful for batch. However, I don't want `sync/async` to become the new batch/stream flag. The execution behavior should come from the query itself.

Regards,
Timo

On 30.03.20 11:12, godfrey he wrote:

Hi Timo,

I agree with you that streaming queries are our top priority, but I think there are too many things to discuss for multiline statements. For example:
1. What is the behavior of mixing DDL and DML under async execution?

create table t1 xxx;
create table t2 xxx;
insert into t2 select * from t1 where xxx;
drop table t1; -- t1 may be a MySQL table; its data would also be deleted

Here t1 is dropped while the "insert" job is still running.

2. What is the behavior of the unified scenario under async execution (as you mentioned)?

INSERT INTO t1 SELECT * FROM s;
INSERT INTO t2 SELECT * FROM s JOIN t1 EMIT STREAM;

The result of the second statement is indeterministic, because the first statement may still be running. I think we would need to put a lot of effort into defining the behavior of logically related queries.

In this FLIP, I suggest we only handle single statements, and we also introduce an async execute method, which is more important and more often used by users.

For the sync methods (like `TableEnvironment.executeSql` and `StatementSet.execute`), the result will not be returned until the job is finished.
The following methods will be introduced in this FLIP:

/**
 * Asynchronously executes the given single statement.
 */
TableEnvironment.executeSqlAsync(String statement): TableResult

/**
 * Asynchronously executes the DML statements as a batch.
 */
StatementSet.executeAsync(): TableResult

public interface TableResult {
    /**
     * Returns the JobClient for DQL and DML in async mode, otherwise Optional.empty().
     */
    Optional<JobClient> getJobClient();
}

What do you think?

Best,
Godfrey

Timo Walther <twal...@apache.org> wrote on Thu, Mar 26, 2020 at 9:15 PM:

Hi Godfrey,

Executing streaming queries must be our top priority because this is what distinguishes Flink from competitors.
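The `TableResult` proposal above can be mocked with only the JDK's `CompletableFuture` (hypothetical names; the real `TableResult`/`JobClient` live in Flink): submission returns immediately, DDL-like results carry no job handle, and a caller gets sync behavior by joining the handle.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for the proposed TableResult: only async DML/DQL
 *  expose a job handle; DDL-like statements are already completed. */
class MockTableResult {
    private final CompletableFuture<String> job; // stands in for JobClient

    private MockTableResult(CompletableFuture<String> job) { this.job = job; }

    /** Result of a statement that completes eagerly (e.g. a DDL). */
    static MockTableResult completed() { return new MockTableResult(null); }

    /** Result of an asynchronously submitted DML/DQL job. */
    static MockTableResult async(CompletableFuture<String> job) { return new MockTableResult(job); }

    /** Mirrors the proposed Optional&lt;JobClient&gt; getJobClient(). */
    Optional<CompletableFuture<String>> getJobClient() { return Optional.ofNullable(job); }
}
```

A user wanting sync semantics would call `result.getJobClient().get().join()`; a DDL result returns `Optional.empty()`, so there is nothing to wait for.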
If we change the execution behavior, we should think about the other cases as well, so as not to break the API a third time.

I fear that just having an async execute method will not be enough, because users should be able to mix streaming and batch queries in a unified scenario.

If I remember correctly, we had some discussions in the past about what decides the execution mode of a query. Currently, we would like to let the query decide, not derive it from the sources.
So I could imagine a multiline pipeline as:

USE CATALOG 'mycat';
INSERT INTO t1 SELECT * FROM s;
INSERT INTO t2 SELECT * FROM s JOIN t1 EMIT STREAM;

For executeMultilineSql():

sync because regular SQL
sync because regular Batch SQL
async because Streaming SQL

For executeAsyncMultilineSql():

async because everything should be async
async because everything should be async
async because everything should be async

What we should not do for executeAsyncMultilineSql():

sync because DDL
async because everything should be async
async because everything should be async

What are your thoughts here?

Regards,
Timo

On 26.03.20 12:50, godfrey he wrote:

Hi Timo,

I agree with you that streaming queries mostly need async execution.
In fact, our original plan was to introduce only sync methods in this FLIP; async methods (like "executeSqlAsync") would be introduced in the future, as mentioned in the appendix.

Maybe the async methods also need to be considered in this FLIP.

I think sync methods are also useful for streaming, which