Re: [DISCUSS] FLIP-84 Feedback Summary

Fabian Hueske Wed, 06 May 2020 09:04:20 -0700

Hi Godfrey,

This looks good to me.


One side note, indicating unique constraints with "UNQ" is probably not
enough.
There might be multiple unique constraints and users would like to know
which field combinations are unique.
So in your example above, "UNQ(f2, f3)" might be a better marker.

Just as a thought, if we would later add support for watermark on nested
columns, we could add a row just for the nested field (in addition to the
top-level field) like this:

+------------------------+---------------------------+-------+-----------+-------------+-----------------------------------------------------------+
| f4.nested.rowtime | TIMESTAMP(3)        | false | (NULL) |  (NULL)
| f4.nested.rowtime - INTERVAL '3' SECOND |
+------------------------+---------------------------+-------+-----------+-------------+-----------------------------------------------------------+

Thanks,
Fabian

Am Mi., 6. Mai 2020 um 17:51 Uhr schrieb godfrey he <godfre...@gmail.com>:

> Hi @fhue...@gmail.com @Timo Walther <twal...@apache.org>  @Dawid
> Wysakowicz <dwysakow...@apache.org>
> What do you think we limit watermark must be defined on top-level column ?
>
> if we do that, we can add an expression column to represent watermark like
> compute column,
> An example of all cases:
> create table MyTable (
>   f0 BIGINT NOT NULL,
>   f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>   f2 VARCHAR<256>,
>   f3 AS f0 + 1,
>   f4 TIMESTAMP(3) NOT NULL,
>   PRIMARY KEY (f0),
>   UNIQUE (f3, f2),
>   WATERMARK f4 AS f4 - INTERVAL '3' SECOND
> ) with (...)
>
>
> +--------+------------------------------------------------------+-------+-----------+-----------------------+--------------------------------------+
> | name | type                                                          |
> null   | key        | compute column | watermark
>  |
>
> +--------+------------------------------------------------------+-------+-----------+-----------------------+--------------------------------------+
> | f0       | BIGINT                                                     |
> false | PRI       |  (NULL)               |   (NULL)
>         |
>
> +--------+------------------------------------------------------+-------+-----------+-----------------------+--------------------------------------+
> | f1       | ROW<q1 STRING, q2 TIMESTAMP(3)> | true   | (NULL) |  (NULL)
>              |  (NULL)                                 |
>
> +--------+------------------------------------------------------+-------+-----------+-----------------------+--------------------------------------+
> | f2       | VARCHAR<256>                                     | true   |
> UNQ     |  (NULL)               |  (NULL)                                 |
>
> +--------+------------------------------------------------------+-------+-----------+-----------------------+--------------------------------------+
> | f3       | BIGINT                                                     |
> false | UNQ     |  f0 + 1                  |  (NULL)
>          |
>
> +--------+------------------------------------------------------+-------+-----------+-----------------------+--------------------------------------+
> | f4       | TIMESTAMP(3)                                        | false |
> (NULL) |  (NULL)                | f4 - INTERVAL '3' SECOND |
>
> +--------+------------------------------------------------------+-------+-----------+-----------------------+--------------------------------------+
>
> WDYT ?
>
> Best,
> Godfrey
>
>
>
> godfrey he <godfre...@gmail.com> 于2020年4月30日周四 下午11:57写道：
>
>> Hi Fabian,
>>
>> the broken example is:
>>
>> create table MyTable (
>>
>>     f0 BIGINT NOT NULL,
>>
>>     f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>>
>>     f2 VARCHAR<256>,
>>
>>     f3 AS f0 + 1,
>>
>>     PRIMARY KEY (f0),
>>
>> UNIQUE (f3, f2),
>>
>>     WATERMARK f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND)
>>
>> ) with (...)
>>
>>
>> name
>>
>> type
>>
>> key
>>
>> compute column
>>
>> watermark
>>
>> f0
>>
>> BIGINT NOT NULL
>>
>> PRI
>>
>> (NULL)
>>
>> f1
>>
>> ROW<`q1` STRING, `q2` TIMESTAMP(3)>
>>
>> UNQ
>>
>> (NULL)
>>
>> f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND)
>>
>> f2
>>
>> VARCHAR<256>
>>
>> (NULL)
>>
>> NULL
>>
>> f3
>>
>> BIGINT NOT NULL
>>
>> UNQ
>>
>> f0 + 1
>>
>>
>> or we add a column to represent nullability.
>>
>> name
>>
>> type
>>
>> null
>>
>> key
>>
>> compute column
>>
>> watermark
>>
>> f0
>>
>> BIGINT
>>
>> false
>>
>> PRI
>>
>> (NULL)
>>
>> f1
>>
>> ROW<`q1` STRING, `q2` TIMESTAMP(3)>
>>
>> true
>>
>> UNQ
>>
>> (NULL)
>>
>> f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND)
>>
>> f2
>>
>> VARCHAR<256>
>>
>> true
>>
>> (NULL)
>>
>> NULL
>>
>> f3
>>
>> BIGINT
>>
>> false
>>
>> UNQ
>>
>> f0 + 1
>>
>>
>>
>>
>> Hi Jark,
>> If we can limit watermark must be defined on top-level column,
>> this will become more simple.
>>
>> Best,
>> Godfrey
>>
>> Jark Wu <imj...@gmail.com> 于2020年4月30日周四 下午11:38写道：
>>
>>> Hi,
>>>
>>> I'm in favor of Fabian's proposal.
>>> First, watermark is not a column, but a metadata just like primary key,
>>> so
>>> shouldn't stand with columns.
>>> Second, AFAIK, primary key can only be defined on top-level columns.
>>> Third, I think watermark can also follow primary key than only allow to
>>> define on top-level columns.
>>>
>>> I have to admit that in FLIP-66, watermark can define on nested fields.
>>> However, during implementation, I found that it's too complicated to do
>>> that. We have refactor time-based physical nodes,
>>> we have to use code generation to access event-time, we have to refactor
>>> FlinkTypeFactory to support a complex nested rowtime.
>>> There is not much value of this feature, but introduce a lot of
>>> complexity
>>> in code base.
>>> So I think we can force watermark define on top-level columns. If user
>>> want
>>> to define on nested columns,
>>> he/she can use computed column to be a top-level column.
>>>
>>> Best,
>>> Jark
>>>
>>>
>>> On Thu, 30 Apr 2020 at 17:55, Fabian Hueske <fhue...@gmail.com> wrote:
>>>
>>> > Hi Godfrey,
>>> >
>>> > The formatting of your example seems to be broken.
>>> > Could you send them again please?
>>> >
>>> > Regarding your points
>>> > > because watermark express can be a sub-column, just like `f1.q2` in
>>> above
>>> > example I give.
>>> >
>>> > I would put the watermark information in the row of the top-level
>>> field and
>>> > indicate to which nested field the watermark refers.
>>> > Don't we have to solve the same issue for primary keys that are
>>> defined on
>>> > a nested field?
>>> >
>>> > > A boolean flag can't represent such info. and I do know whether we
>>> will
>>> > support complex watermark expression involving multiple columns in the
>>> > future. such as: "WATERMARK FOR ts as ts + f1 + interval '1' second"
>>> >
>>> > You are right, a simple binary flag is definitely not sufficient to
>>> display
>>> > the watermark information.
>>> > I would put the expression string into the field, i.e., "ts + f1 +
>>> interval
>>> > '1' second"
>>> >
>>> >
>>> > For me the most important point of why to not show the watermark as a
>>> row
>>> > in the table is that it is not field that can be queried but meta
>>> > information on an existing field.
>>> > For the user it is important to know that a certain field has a
>>> watermark.
>>> > Otherwise, certain queries cannot be correctly specified.
>>> > Also there might be support for multiple watermarks that are defined of
>>> > different fields at some point. Would those be printed in multiple
>>> rows?
>>> >
>>> > Best,
>>> > Fabian
>>> >
>>> >
>>> > Am Do., 30. Apr. 2020 um 11:25 Uhr schrieb godfrey he <
>>> godfre...@gmail.com
>>> > >:
>>> >
>>> > > Hi Fabian, Aljoscha
>>> > >
>>> > > Thanks for the feedback.
>>> > >
>>> > > Agree with you that we can deal with primary key as you mentioned.
>>> > > now, the type column has contained the nullability attribute, e.g.
>>> BIGINT
>>> > > NOT NULL.
>>> > > (I'm also ok that we use two columns to represent type just like
>>> mysql)
>>> > >
>>> > > >Why I treat `watermark` as a special row ?
>>> > > because watermark express can be a sub-column, just like `f1.q2` in
>>> above
>>> > > example I give.
>>> > > A boolean flag can't represent such info. and I do know whether we
>>> will
>>> > > support complex
>>> > > watermark expression involving multiple columns in the future. such
>>> as:
>>> > > "WATERMARK FOR ts as ts + f1 + interval '1' second"
>>> > >
>>> > > If we do not support complex watermark expression, we can add a
>>> watermark
>>> > > column.
>>> > >
>>> > > for example:
>>> > >
>>> > > create table MyTable (
>>> > >
>>> > >     f0 BIGINT NOT NULL,
>>> > >
>>> > >     f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>>> > >
>>> > >     f2 VARCHAR<256>,
>>> > >
>>> > >     f3 AS f0 + 1,
>>> > >
>>> > >     PRIMARY KEY (f0),
>>> > >
>>> > > UNIQUE (f3, f2),
>>> > >
>>> > >     WATERMARK f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND)
>>> > >
>>> > > ) with (...)
>>> > >
>>> > >
>>> > > name
>>> > >
>>> > > type
>>> > >
>>> > > key
>>> > >
>>> > > compute column
>>> > >
>>> > > watermark
>>> > >
>>> > > f0
>>> > >
>>> > > BIGINT NOT NULL
>>> > >
>>> > > PRI
>>> > >
>>> > > (NULL)
>>> > >
>>> > > f1
>>> > >
>>> > > ROW<`q1` STRING, `q2` TIMESTAMP(3)>
>>> > >
>>> > > UNQ
>>> > >
>>> > > (NULL)
>>> > >
>>> > > f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND)
>>> > >
>>> > > f2
>>> > >
>>> > > VARCHAR<256>
>>> > >
>>> > > (NULL)
>>> > >
>>> > > NULL
>>> > >
>>> > > f3
>>> > >
>>> > > BIGINT NOT NULL
>>> > >
>>> > > UNQ
>>> > >
>>> > > f0 + 1
>>> > >
>>> > >
>>> > > or we add a column to represent nullability.
>>> > >
>>> > > name
>>> > >
>>> > > type
>>> > >
>>> > > null
>>> > >
>>> > > key
>>> > >
>>> > > compute column
>>> > >
>>> > > watermark
>>> > >
>>> > > f0
>>> > >
>>> > > BIGINT
>>> > >
>>> > > false
>>> > >
>>> > > PRI
>>> > >
>>> > > (NULL)
>>> > >
>>> > > f1
>>> > >
>>> > > ROW<`q1` STRING, `q2` TIMESTAMP(3)>
>>> > >
>>> > > true
>>> > >
>>> > > UNQ
>>> > >
>>> > > (NULL)
>>> > >
>>> > > f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND)
>>> > >
>>> > > f2
>>> > >
>>> > > VARCHAR<256>
>>> > >
>>> > > true
>>> > >
>>> > > (NULL)
>>> > >
>>> > > NULL
>>> > >
>>> > > f3
>>> > >
>>> > > BIGINT
>>> > >
>>> > > false
>>> > >
>>> > > UNQ
>>> > >
>>> > > f0 + 1
>>> > >
>>> > >
>>> > > Personally, I like the second one. (we need do some changes on
>>> > LogicalType
>>> > > to get type name without nullability)
>>> > >
>>> > >
>>> > > Best,
>>> > > Godfrey
>>> > >
>>> > >
>>> > > Aljoscha Krettek <aljos...@apache.org> 于2020年4月29日周三 下午5:47写道：
>>> > >
>>> > > > +1 I like the general idea of printing the results as a table.
>>> > > >
>>> > > > On the specifics I don't know enough but Fabians suggestions seems
>>> to
>>> > > > make sense to me.
>>> > > >
>>> > > > Aljoscha
>>> > > >
>>> > > > On 29.04.20 10:56, Fabian Hueske wrote:
>>> > > > > Hi Godfrey,
>>> > > > >
>>> > > > > Thanks for starting this discussion!
>>> > > > >
>>> > > > > In my mind, WATERMARK is a property (or constraint) of a field,
>>> just
>>> > > like
>>> > > > > PRIMARY KEY.
>>> > > > > Take this example from MySQL:
>>> > > > >
>>> > > > > mysql> CREATE TABLE people (id INT NOT NULL, name VARCHAR(128)
>>> NOT
>>> > > NULL,
>>> > > > > age INT, PRIMARY KEY (id));
>>> > > > > Query OK, 0 rows affected (0.06 sec)
>>> > > > >
>>> > > > > mysql> describe people;
>>> > > > > +-------+--------------+------+-----+---------+-------+
>>> > > > > | Field | Type         | Null | Key | Default | Extra |
>>> > > > > +-------+--------------+------+-----+---------+-------+
>>> > > > > | id    | int          | NO   | PRI | NULL    |       |
>>> > > > > | name  | varchar(128) | NO   |     | NULL    |       |
>>> > > > > | age   | int          | YES  |     | NULL    |       |
>>> > > > > +-------+--------------+------+-----+---------+-------+
>>> > > > > 3 rows in set (0.01 sec)
>>> > > > >
>>> > > > > Here, PRIMARY KEY is marked in the Key column of the id field.
>>> > > > > We could do the same for watermarks by adding a Watermark column.
>>> > > > >
>>> > > > > Best, Fabian
>>> > > > >
>>> > > > >
>>> > > > > Am Mi., 29. Apr. 2020 um 10:43 Uhr schrieb godfrey he <
>>> > > > godfre...@gmail.com>:
>>> > > > >
>>> > > > >> Hi everyone,
>>> > > > >>
>>> > > > >> I would like to bring up a discussion about the result type of
>>> > > describe
>>> > > > >> statement,
>>> > > > >> which is introduced in FLIP-84[1].
>>> > > > >> In previous version, we define the result type of `describe`
>>> > statement
>>> > > > is a
>>> > > > >> single column as following
>>> > > > >>
>>> > > > >> Statement
>>> > > > >>
>>> > > > >> Result Schema
>>> > > > >>
>>> > > > >> Result Value
>>> > > > >>
>>> > > > >> Result Kind
>>> > > > >>
>>> > > > >> Examples
>>> > > > >>
>>> > > > >> DESCRIBE xx
>>> > > > >>
>>> > > > >> field name: result
>>> > > > >>
>>> > > > >> field type: VARCHAR(n)
>>> > > > >>
>>> > > > >> (n is the max length of values)
>>> > > > >>
>>> > > > >> describe the detail of an object
>>> > > > >>
>>> > > > >> (single row)
>>> > > > >>
>>> > > > >> SUCCESS_WITH_CONTENT
>>> > > > >>
>>> > > > >> DESCRIBE table_name
>>> > > > >>
>>> > > > >> for "describe table_name", the result value is the `toString`
>>> value
>>> > of
>>> > > > >> `TableSchema`, which is an unstructured data.
>>> > > > >> It's hard to for user to use this info.
>>> > > > >>
>>> > > > >> for example:
>>> > > > >>
>>> > > > >> TableSchema schema = TableSchema.builder()
>>> > > > >>     .field("f0", DataTypes.BIGINT())
>>> > > > >>     .field("f1", DataTypes.ROW(
>>> > > > >>        DataTypes.FIELD("q1", DataTypes.STRING()),
>>> > > > >>        DataTypes.FIELD("q2", DataTypes.TIMESTAMP(3))))
>>> > > > >>     .field("f2", DataTypes.STRING())
>>> > > > >>     .field("f3", DataTypes.BIGINT(), "f0 + 1")
>>> > > > >>     .watermark("f1.q2", WATERMARK_EXPRESSION,
>>> WATERMARK_DATATYPE)
>>> > > > >>     .build();
>>> > > > >>
>>> > > > >> its `toString` value is:
>>> > > > >> root
>>> > > > >>   |-- f0: BIGINT
>>> > > > >>   |-- f1: ROW<`q1` STRING, `q2` TIMESTAMP(3)>
>>> > > > >>   |-- f2: STRING
>>> > > > >>   |-- f3: BIGINT AS f0 + 1
>>> > > > >>   |-- WATERMARK FOR f1.q2 AS now()
>>> > > > >>
>>> > > > >> For hive, MySQL, etc., the describe result is table form
>>> including
>>> > > field
>>> > > > >> names and field types.
>>> > > > >> which is more familiar with users.
>>> > > > >> TableSchema[2] has watermark expression and compute column, we
>>> > should
>>> > > > also
>>> > > > >> put them into the table:
>>> > > > >> for compute column, it's a column level, we add a new column
>>> named
>>> > > > `expr`.
>>> > > > >>   for watermark expression, it's a table level, we add a
>>> special row
>>> > > > named
>>> > > > >> `WATERMARK` to represent it.
>>> > > > >>
>>> > > > >> The result will look like about above example:
>>> > > > >>
>>> > > > >> name
>>> > > > >>
>>> > > > >> type
>>> > > > >>
>>> > > > >> expr
>>> > > > >>
>>> > > > >> f0
>>> > > > >>
>>> > > > >> BIGINT
>>> > > > >>
>>> > > > >> (NULL)
>>> > > > >>
>>> > > > >> f1
>>> > > > >>
>>> > > > >> ROW<`q1` STRING, `q2` TIMESTAMP(3)>
>>> > > > >>
>>> > > > >> (NULL)
>>> > > > >>
>>> > > > >> f2
>>> > > > >>
>>> > > > >> STRING
>>> > > > >>
>>> > > > >> NULL
>>> > > > >>
>>> > > > >> f3
>>> > > > >>
>>> > > > >> BIGINT
>>> > > > >>
>>> > > > >> f0 + 1
>>> > > > >>
>>> > > > >> WATERMARK
>>> > > > >>
>>> > > > >> (NULL)
>>> > > > >>
>>> > > > >> f1.q2 AS now()
>>> > > > >>
>>> > > > >> now there is a pr FLINK-17112 [3] to implement DESCRIBE
>>> statement.
>>> > > > >>
>>> > > > >> What do you think about this update?
>>> > > > >> Any feedback are welcome~
>>> > > > >>
>>> > > > >> Best,
>>> > > > >> Godfrey
>>> > > > >>
>>> > > > >> [1]
>>> > > > >>
>>> > > >
>>> > >
>>> >
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=134745878
>>> > > > >> [2]
>>> > > > >>
>>> > > > >>
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/api/TableSchema.java
>>> > > > >> [3] https://github.com/apache/flink/pull/11892
>>> > > > >>
>>> > > > >>
>>> > > > >> godfrey he <godfre...@gmail.com> 于2020年4月6日周一 下午10:38写道：
>>> > > > >>
>>> > > > >>> Hi Timo,
>>> > > > >>>
>>> > > > >>> Sorry for the late reply, and thanks for your correction.
>>> > > > >>> I missed DQL for job submission scenario.
>>> > > > >>> I'll fix the document right away.
>>> > > > >>>
>>> > > > >>> Best,
>>> > > > >>> Godfrey
>>> > > > >>>
>>> > > > >>> Timo Walther <twal...@apache.org> 于2020年4月3日周五 下午9:53写道：
>>> > > > >>>
>>> > > > >>>> Hi Godfrey,
>>> > > > >>>>
>>> > > > >>>> I'm sorry to jump in again but I still need to clarify some
>>> things
>>> > > > >>>> around TableResult.
>>> > > > >>>>
>>> > > > >>>> The FLIP says:
>>> > > > >>>> "For DML, this method returns TableResult until the job is
>>> > > submitted.
>>> > > > >>>> For other statements, TableResult is returned until the
>>> execution
>>> > is
>>> > > > >>>> finished."
>>> > > > >>>>
>>> > > > >>>> I thought we agreed on making every execution async? This also
>>> > means
>>> > > > >>>> returning a TableResult for DQLs even though the execution is
>>> not
>>> > > done
>>> > > > >>>> yet. People need access to the JobClient also for batch jobs
>>> in
>>> > > order
>>> > > > to
>>> > > > >>>> cancel long lasting queries. If people want to wait for the
>>> > > completion
>>> > > > >>>> they can hook into JobClient or collect().
>>> > > > >>>>
>>> > > > >>>> Can we rephrase this part to:
>>> > > > >>>>
>>> > > > >>>> The FLIP says:
>>> > > > >>>> "For DML and DQL, this method returns TableResult once the
>>> job has
>>> > > > been
>>> > > > >>>> submitted. For DDL and DCL statements, TableResult is returned
>>> > once
>>> > > > the
>>> > > > >>>> operation has finished."
>>> > > > >>>>
>>> > > > >>>> Regards,
>>> > > > >>>> Timo
>>> > > > >>>>
>>> > > > >>>>
>>> > > > >>>> On 02.04.20 05:27, godfrey he wrote:
>>> > > > >>>>> Hi Aljoscha, Dawid, Timo,
>>> > > > >>>>>
>>> > > > >>>>> Thanks so much for the detailed explanation.
>>> > > > >>>>> Agree with you that the multiline story is not completed
>>> now, and
>>> > > we
>>> > > > >> can
>>> > > > >>>>> keep discussion.
>>> > > > >>>>> I will add current discussions and conclusions to the FLIP.
>>> > > > >>>>>
>>> > > > >>>>> Best,
>>> > > > >>>>> Godfrey
>>> > > > >>>>>
>>> > > > >>>>>
>>> > > > >>>>>
>>> > > > >>>>> Timo Walther <twal...@apache.org> 于2020年4月1日周三 下午11:27写道：
>>> > > > >>>>>
>>> > > > >>>>>> Hi Godfrey,
>>> > > > >>>>>>
>>> > > > >>>>>> first of all, I agree with Dawid. The multiline story is not
>>> > > > >> completed
>>> > > > >>>>>> by this FLIP. It just verifies the big picture.
>>> > > > >>>>>>
>>> > > > >>>>>> 1. "control the execution logic through the proposed method
>>> if
>>> > > they
>>> > > > >>>> know
>>> > > > >>>>>> what the statements are"
>>> > > > >>>>>>
>>> > > > >>>>>> This is a good point that also Fabian raised in the linked
>>> > google
>>> > > > >> doc.
>>> > > > >>>> I
>>> > > > >>>>>> could also imagine to return a more complicated POJO when
>>> > calling
>>> > > > >>>>>> `executeMultiSql()`.
>>> > > > >>>>>>
>>> > > > >>>>>> The POJO would include some `getSqlProperties()` such that a
>>> > > > platform
>>> > > > >>>>>> gets insights into the query before executing. We could also
>>> > > trigger
>>> > > > >>>> the
>>> > > > >>>>>> execution more explicitly instead of hiding it behind an
>>> > iterator.
>>> > > > >>>>>>
>>> > > > >>>>>> 2. "there are some special commands introduced in SQL
>>> client"
>>> > > > >>>>>>
>>> > > > >>>>>> For platforms and SQL Client specific commands, we could
>>> offer a
>>> > > > hook
>>> > > > >>>> to
>>> > > > >>>>>> the parser or a fallback parser in case the regular table
>>> > > > environment
>>> > > > >>>>>> parser cannot deal with the statement.
>>> > > > >>>>>>
>>> > > > >>>>>> However, all of that is future work and can be discussed in
>>> a
>>> > > > >> separate
>>> > > > >>>>>> FLIP.
>>> > > > >>>>>>
>>> > > > >>>>>> 3. +1 for the `Iterator` instead of `Iterable`.
>>> > > > >>>>>>
>>> > > > >>>>>> 4. "we should convert the checked exception to unchecked
>>> > > exception"
>>> > > > >>>>>>
>>> > > > >>>>>> Yes, I meant using a runtime exception instead of a checked
>>> > > > >> exception.
>>> > > > >>>>>> There was no consensus on putting the exception into the
>>> > > > >> `TableResult`.
>>> > > > >>>>>>
>>> > > > >>>>>> Regards,
>>> > > > >>>>>> Timo
>>> > > > >>>>>>
>>> > > > >>>>>> On 01.04.20 15:35, Dawid Wysakowicz wrote:
>>> > > > >>>>>>> When considering the multi-line support I think it is
>>> helpful
>>> > to
>>> > > > >> start
>>> > > > >>>>>>> with a use case in mind. In my opinion consumers of this
>>> method
>>> > > > will
>>> > > > >>>> be:
>>> > > > >>>>>>>
>>> > > > >>>>>>>    1. sql-client
>>> > > > >>>>>>>    2. third-part sql based platforms
>>> > > > >>>>>>>
>>> > > > >>>>>>> @Godfrey As for the quit/source/... commands. I think those
>>> > > belong
>>> > > > >> to
>>> > > > >>>>>>> the responsibility of aforementioned. I think they should
>>> not
>>> > be
>>> > > > >>>>>>> understandable by the TableEnvironment. What would quit on
>>> a
>>> > > > >>>>>>> TableEnvironment do? Moreover I think such commands should
>>> be
>>> > > > >> prefixed
>>> > > > >>>>>>> appropriately. I think it's a common practice to e.g.
>>> prefix
>>> > > those
>>> > > > >>>> with
>>> > > > >>>>>>> ! or : to say they are meta commands of the tool rather
>>> than a
>>> > > > >> query.
>>> > > > >>>>>>>
>>> > > > >>>>>>> I also don't necessarily understand why platform users
>>> need to
>>> > > know
>>> > > > >>>> the
>>> > > > >>>>>>> kind of the query to use the proposed method. They should
>>> get
>>> > the
>>> > > > >> type
>>> > > > >>>>>>> from the TableResult#ResultKind. If the ResultKind is
>>> SUCCESS,
>>> > it
>>> > > > >> was
>>> > > > >>>> a
>>> > > > >>>>>>> DCL/DDL. If SUCCESS_WITH_CONTENT it was a DML/DQL. If
>>> that's
>>> > not
>>> > > > >>>> enough
>>> > > > >>>>>>> we can enrich the TableResult with more explicit kind of
>>> query,
>>> > > but
>>> > > > >> so
>>> > > > >>>>>>> far I don't see such a need.
>>> > > > >>>>>>>
>>> > > > >>>>>>> @Kurt In those cases I would assume the developers want to
>>> > > present
>>> > > > >>>>>>> results of the queries anyway. Moreover I think it is safe
>>> to
>>> > > > assume
>>> > > > >>>>>>> they can adhere to such a contract that the results must be
>>> > > > >> iterated.
>>> > > > >>>>>>>
>>> > > > >>>>>>> For direct users of TableEnvironment/Table API this method
>>> does
>>> > > not
>>> > > > >>>> make
>>> > > > >>>>>>> much sense anyway, in my opinion. I think we can rather
>>> safely
>>> > > > >> assume
>>> > > > >>>> in
>>> > > > >>>>>>> this scenario they do not want to submit multiple queries
>>> at a
>>> > > > >> single
>>> > > > >>>>>> time.
>>> > > > >>>>>>>
>>> > > > >>>>>>> Best,
>>> > > > >>>>>>>
>>> > > > >>>>>>> Dawid
>>> > > > >>>>>>>
>>> > > > >>>>>>>
>>> > > > >>>>>>> On 01/04/2020 15:07, Kurt Young wrote:
>>> > > > >>>>>>>> One comment to `executeMultilineSql`, I'm afraid sometimes
>>> > user
>>> > > > >> might
>>> > > > >>>>>>>> forget to
>>> > > > >>>>>>>> iterate the returned iterators, e.g. user submits a bunch
>>> of
>>> > > DDLs
>>> > > > >> and
>>> > > > >>>>>>>> expect the
>>> > > > >>>>>>>> framework will execute them one by one. But it didn't.
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> Best,
>>> > > > >>>>>>>> Kurt
>>> > > > >>>>>>>>
>>> > > > >>>>>>>>
>>> > > > >>>>>>>> On Wed, Apr 1, 2020 at 5:10 PM Aljoscha Krettek<
>>> > > > >> aljos...@apache.org>
>>> > > > >>>>>> wrote:
>>> > > > >>>>>>>>
>>> > > > >>>>>>>>> Agreed to what Dawid and Timo said.
>>> > > > >>>>>>>>>
>>> > > > >>>>>>>>> To answer your question about multi line SQL: no, we
>>> don't
>>> > > think
>>> > > > >> we
>>> > > > >>>>>> need
>>> > > > >>>>>>>>> this in Flink 1.11, we only wanted to make sure that the
>>> > > > >> interfaces
>>> > > > >>>>>> that
>>> > > > >>>>>>>>> we now put in place will potentially allow this in the
>>> > future.
>>> > > > >>>>>>>>>
>>> > > > >>>>>>>>> Best,
>>> > > > >>>>>>>>> Aljoscha
>>> > > > >>>>>>>>>
>>> > > > >>>>>>>>> On 01.04.20 09:31, godfrey he wrote:
>>> > > > >>>>>>>>>> Hi, Timo & Dawid,
>>> > > > >>>>>>>>>>
>>> > > > >>>>>>>>>> Thanks so much for the effort of `multiline statements
>>> > > > >> supporting`,
>>> > > > >>>>>>>>>> I have a few questions about this method:
>>> > > > >>>>>>>>>>
>>> > > > >>>>>>>>>> 1. users can well control the execution logic through
>>> the
>>> > > > >> proposed
>>> > > > >>>>>> method
>>> > > > >>>>>>>>>>      if they know what the statements are (a statement
>>> is a
>>> > > > DDL, a
>>> > > > >>>> DML
>>> > > > >>>>>> or
>>> > > > >>>>>>>>>> others).
>>> > > > >>>>>>>>>> but if a statement is from a file, that means users do
>>> not
>>> > > know
>>> > > > >>>> what
>>> > > > >>>>>> the
>>> > > > >>>>>>>>>> statements are,
>>> > > > >>>>>>>>>> the execution behavior is unclear.
>>> > > > >>>>>>>>>> As a platform user, I think this method is hard to use,
>>> > unless
>>> > > > >> the
>>> > > > >>>>>>>>> platform
>>> > > > >>>>>>>>>> defines
>>> > > > >>>>>>>>>> a set of rule about the statements order, such as: no
>>> select
>>> > > in
>>> > > > >> the
>>> > > > >>>>>>>>> middle,
>>> > > > >>>>>>>>>> dml must be at tail of sql file (which may be the most
>>> case
>>> > in
>>> > > > >>>> product
>>> > > > >>>>>>>>>> env).
>>> > > > >>>>>>>>>> Otherwise the platform must parse the sql first, then
>>> know
>>> > > what
>>> > > > >> the
>>> > > > >>>>>>>>>> statements are.
>>> > > > >>>>>>>>>> If do like that, the platform can handle all cases
>>> through
>>> > > > >>>>>> `executeSql`
>>> > > > >>>>>>>>> and
>>> > > > >>>>>>>>>> `StatementSet`.
>>> > > > >>>>>>>>>>
>>> > > > >>>>>>>>>> 2. SQL client can't also use `executeMultilineSql` to
>>> > supports
>>> > > > >>>>>> multiline
>>> > > > >>>>>>>>>> statements,
>>> > > > >>>>>>>>>>      because there are some special commands introduced
>>> in
>>> > SQL
>>> > > > >>>> client,
>>> > > > >>>>>>>>>> such as `quit`, `source`, `load jar` (not exist now, but
>>> > maybe
>>> > > > we
>>> > > > >>>> need
>>> > > > >>>>>>>>> this
>>> > > > >>>>>>>>>> command
>>> > > > >>>>>>>>>>      to support dynamic table source and udf).
>>> > > > >>>>>>>>>> Does TableEnvironment also supports those commands?
>>> > > > >>>>>>>>>>
>>> > > > >>>>>>>>>> 3. btw, we must have this feature in release-1.11? I
>>> find
>>> > > there
>>> > > > >> are
>>> > > > >>>>>> few
>>> > > > >>>>>>>>>> user cases
>>> > > > >>>>>>>>>>      in the feedback document which behavior is unclear
>>> now.
>>> > > > >>>>>>>>>>
>>> > > > >>>>>>>>>> regarding to "change the return value from
>>> `Iterable<Row` to
>>> > > > >>>>>>>>>> `Iterator<Row`",
>>> > > > >>>>>>>>>> I couldn't agree more with this change. Just as Dawid
>>> > > mentioned
>>> > > > >>>>>>>>>> "The contract of the Iterable#iterator is that it
>>> returns a
>>> > > new
>>> > > > >>>>>> iterator
>>> > > > >>>>>>>>>> each time,
>>> > > > >>>>>>>>>>      which effectively means we can iterate the results
>>> > > multiple
>>> > > > >>>>>> times.",
>>> > > > >>>>>>>>>> we does not provide iterate the results multiple times.
>>> > > > >>>>>>>>>> If we want do that, the client must buffer all results.
>>> but
>>> > > it's
>>> > > > >>>>>>>>> impossible
>>> > > > >>>>>>>>>> for streaming job.
>>> > > > >>>>>>>>>>
>>> > > > >>>>>>>>>> Best,
>>> > > > >>>>>>>>>> Godfrey
>>> > > > >>>>>>>>>>
>>> > > > >>>>>>>>>> Dawid Wysakowicz<dwysakow...@apache.org>  于2020年4月1日周三
>>> > > > 上午3:14写道：
>>> > > > >>>>>>>>>>
>>> > > > >>>>>>>>>>> Thank you Timo for the great summary! It covers
>>> (almost)
>>> > all
>>> > > > the
>>> > > > >>>>>> topics.
>>> > > > >>>>>>>>>>> Even though in the end we are not suggesting much
>>> changes
>>> > to
>>> > > > the
>>> > > > >>>>>> current
>>> > > > >>>>>>>>>>> state of FLIP I think it is important to lay out all
>>> > possible
>>> > > > >> use
>>> > > > >>>>>> cases
>>> > > > >>>>>>>>>>> so that we do not change the execution model every
>>> release.
>>> > > > >>>>>>>>>>>
>>> > > > >>>>>>>>>>> There is one additional thing we discussed. Could we
>>> change
>>> > > the
>>> > > > >>>>>> result
>>> > > > >>>>>>>>>>> type of TableResult#collect to Iterator<Row>? Even
>>> though
>>> > > those
>>> > > > >>>>>>>>>>> interfaces do not differ much. I think Iterator better
>>> > > > describes
>>> > > > >>>> that
>>> > > > >>>>>>>>>>> the results might not be materialized on the client
>>> side,
>>> > but
>>> > > > >> can
>>> > > > >>>> be
>>> > > > >>>>>>>>>>> retrieved on a per record basis. The contract of the
>>> > > > >>>>>> Iterable#iterator
>>> > > > >>>>>>>>>>> is that it returns a new iterator each time, which
>>> > > effectively
>>> > > > >>>> means
>>> > > > >>>>>> we
>>> > > > >>>>>>>>>>> can iterate the results multiple times. Iterating the
>>> > results
>>> > > > is
>>> > > > >>>> not
>>> > > > >>>>>>>>>>> possible when we don't retrieve all the results from
>>> the
>>> > > > cluster
>>> > > > >>>> at
>>> > > > >>>>>>>>> once.
>>> > > > >>>>>>>>>>> I think we should also use Iterator for
>>> > > > >>>>>>>>>>> TableEnvironment#executeMultilineSql(String
>>> statements):
>>> > > > >>>>>>>>>>> Iterator<TableResult>.
>>> > > > >>>>>>>>>>>
>>> > > > >>>>>>>>>>> Best,
>>> > > > >>>>>>>>>>>
>>> > > > >>>>>>>>>>> Dawid
>>> > > > >>>>>>>>>>>
>>> > > > >>>>>>>>>>> On 31/03/2020 19:27, Timo Walther wrote:
>>> > > > >>>>>>>>>>>> Hi Godfrey,
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> Aljoscha, Dawid, Klou, and I had another discussion
>>> around
>>> > > > >>>> FLIP-84.
>>> > > > >>>>>> In
>>> > > > >>>>>>>>>>>> particular, we discussed how the current status of the
>>> > FLIP
>>> > > > and
>>> > > > >>>> the
>>> > > > >>>>>>>>>>>> future requirements around multiline statements,
>>> > async/sync,
>>> > > > >>>>>> collect()
>>> > > > >>>>>>>>>>>> fit together.
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> We also updated the FLIP-84 Feedback Summary document
>>> [1]
>>> > > with
>>> > > > >>>> some
>>> > > > >>>>>>>>>>>> use cases.
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> We believe that we found a good solution that also
>>> fits to
>>> > > > what
>>> > > > >>>> is
>>> > > > >>>>>> in
>>> > > > >>>>>>>>>>>> the current FLIP. So no bigger changes necessary,
>>> which is
>>> > > > >> great!
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> Our findings were:
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> 1. Async vs sync submission of Flink jobs:
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> Having a blocking `execute()` in DataStream API was
>>> > rather a
>>> > > > >>>>>> mistake.
>>> > > > >>>>>>>>>>>> Instead all submissions should be async because this
>>> > allows
>>> > > > >>>>>> supporting
>>> > > > >>>>>>>>>>>> both modes if necessary. Thus, submitting all queries
>>> > async
>>> > > > >>>> sounds
>>> > > > >>>>>>>>>>>> good to us. If users want to run a job sync, they can
>>> use
>>> > > the
>>> > > > >>>>>>>>>>>> JobClient and wait for completion (or collect() in
>>> case of
>>> > > > >> batch
>>> > > > >>>>>> jobs).
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> 2. Multi-statement execution:
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> For the multi-statement execution, we don't see a
>>> > > > >> contradication
>>> > > > >>>>>> with
>>> > > > >>>>>>>>>>>> the async execution behavior. We imagine a method
>>> like:
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> TableEnvironment#executeMultilineSql(String
>>> statements):
>>> > > > >>>>>>>>>>>> Iterable<TableResult>
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> Where the `Iterator#next()` method would trigger the
>>> next
>>> > > > >>>> statement
>>> > > > >>>>>>>>>>>> submission. This allows a caller to decide
>>> synchronously
>>> > > when
>>> > > > >> to
>>> > > > >>>>>>>>>>>> submit statements async to the cluster. Thus, a
>>> service
>>> > such
>>> > > > as
>>> > > > >>>> the
>>> > > > >>>>>>>>>>>> SQL Client can handle the result of each statement
>>> > > > individually
>>> > > > >>>> and
>>> > > > >>>>>>>>>>>> process statement by statement sequentially.
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> 3. The role of TableResult and result retrieval in
>>> general
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> `TableResult` is similar to `JobClient`. Instead of
>>> > > returning
>>> > > > a
>>> > > > >>>>>>>>>>>> `CompletableFuture` of something, it is a concrete
>>> util
>>> > > class
>>> > > > >>>> where
>>> > > > >>>>>>>>>>>> some methods have the behavior of completable future
>>> (e.g.
>>> > > > >>>>>> collect(),
>>> > > > >>>>>>>>>>>> print()) and some are already completed
>>> (getTableSchema(),
>>> > > > >>>>>>>>>>>> getResultKind()).
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> `StatementSet#execute()` returns a single
>>> `TableResult`
>>> > > > because
>>> > > > >>>> the
>>> > > > >>>>>>>>>>>> order is undefined in a set and all statements have
>>> the
>>> > same
>>> > > > >>>> schema.
>>> > > > >>>>>>>>>>>> Its `collect()` will return a row for each executed
>>> > `INSERT
>>> > > > >>>> INTO` in
>>> > > > >>>>>>>>>>>> the order of statement definition.
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> For simple `SELECT * FROM ...`, the query execution
>>> might
>>> > > > block
>>> > > > >>>>>> until
>>> > > > >>>>>>>>>>>> `collect()` is called to pull buffered rows from the
>>> job
>>> > > (from
>>> > > > >>>>>>>>>>>> socket/REST API what ever we will use in the future).
>>> We
>>> > can
>>> > > > >> say
>>> > > > >>>>>> that
>>> > > > >>>>>>>>>>>> a statement finished successfully, when the
>>> > > > >>>>>> `collect#Iterator#hasNext`
>>> > > > >>>>>>>>>>>> has returned false.
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> I hope this summarizes our discussion
>>> > @Dawid/Aljoscha/Klou?
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> It would be great if we can add these findings to the
>>> FLIP
>>> > > > >>>> before we
>>> > > > >>>>>>>>>>>> start voting.
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> One minor thing: some `execute()` methods still throw
>>> a
>>> > > > checked
>>> > > > >>>>>>>>>>>> exception; can we remove that from the FLIP? Also the
>>> > above
>>> > > > >>>>>> mentioned
>>> > > > >>>>>>>>>>>> `Iterator#next()` would trigger an execution without
>>> > > throwing
>>> > > > a
>>> > > > >>>>>>>>>>>> checked exception.
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> Thanks,
>>> > > > >>>>>>>>>>>> Timo
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>>>> [1]
>>> > > > >>>>>>>>>>>>
>>> > > > >>>>>>>>>
>>> > > > >>>>>>
>>> > > > >>>>
>>> > > > >>
>>> > > >
>>> > >
>>> >
>>> https://docs.google.com/document/d/1ueLjQWRPdLTFB_TReAyhseAX-1N3j4WYWD0F02Uau0E/edit#
>>> > > > >>>>>>>>>>>> On 31.03.20 06:28, godfrey he wrote:
>>> > > > >>>>>>>>>>>>> Hi, Timo & Jark
>>> > > > >>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>> Thanks for your explanation.
>>> > > > >>>>>>>>>>>>> Agree with you that async execution should always be
>>> > async,
>>> > > > >>>>>>>>>>>>> and sync execution scenario can be covered  by async
>>> > > > >> execution.
>>> > > > >>>>>>>>>>>>> It helps provide an unified entry point for batch and
>>> > > > >> streaming.
>>> > > > >>>>>>>>>>>>> I think we can also use sync execution for some
>>> testing.
>>> > > > >>>>>>>>>>>>> So, I agree with you that we provide `executeSql`
>>> method
>>> > > and
>>> > > > >>>> it's
>>> > > > >>>>>>>>> async
>>> > > > >>>>>>>>>>>>> method.
>>> > > > >>>>>>>>>>>>> If we want sync method in the future, we can add
>>> method
>>> > > named
>>> > > > >>>>>>>>>>>>> `executeSqlSync`.
>>> > > > >>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>> I think we've reached an agreement. I will update the
>>> > > > >> document,
>>> > > > >>>> and
>>> > > > >>>>>>>>>>>>> start
>>> > > > >>>>>>>>>>>>> voting process.
>>> > > > >>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>> Best,
>>> > > > >>>>>>>>>>>>> Godfrey
>>> > > > >>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>> Jark Wu<imj...@gmail.com>  于2020年3月31日周二 上午12:46写道：
>>> > > > >>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>> Hi,
>>> > > > >>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>> I didn't follow the full discussion.
>>> > > > >>>>>>>>>>>>>> But I share the same concern with Timo that
>>> streaming
>>> > > > queries
>>> > > > >>>>>> should
>>> > > > >>>>>>>>>>>>>> always
>>> > > > >>>>>>>>>>>>>> be async.
>>> > > > >>>>>>>>>>>>>> Otherwise, I can image it will cause a lot of
>>> confusion
>>> > > and
>>> > > > >>>>>> problems
>>> > > > >>>>>>>>> if
>>> > > > >>>>>>>>>>>>>> users don't deeply keep the "sync" in mind (e.g.
>>> client
>>> > > > >> hangs).
>>> > > > >>>>>>>>>>>>>> Besides, the streaming mode is still the majority
>>> use
>>> > > cases
>>> > > > >> of
>>> > > > >>>>>> Flink
>>> > > > >>>>>>>>>>>>>> and
>>> > > > >>>>>>>>>>>>>> Flink SQL. We should put the usability at a high
>>> > priority.
>>> > > > >>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>> Best,
>>> > > > >>>>>>>>>>>>>> Jark
>>> > > > >>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>> On Mon, 30 Mar 2020 at 23:27, Timo Walther<
>>> > > > >> twal...@apache.org>
>>> > > > >>>>>>>>> wrote:
>>> > > > >>>>>>>>>>>>>>> Hi Godfrey,
>>> > > > >>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>> maybe I wasn't expressing my biggest concern
>>> enough in
>>> > my
>>> > > > >> last
>>> > > > >>>>>> mail.
>>> > > > >>>>>>>>>>>>>>> Even in a singleline and sync execution, I think
>>> that
>>> > > > >>>> streaming
>>> > > > >>>>>>>>>>>>>>> queries
>>> > > > >>>>>>>>>>>>>>> should not block the execution. Otherwise it is not
>>> > > > possible
>>> > > > >>>> to
>>> > > > >>>>>> call
>>> > > > >>>>>>>>>>>>>>> collect() or print() on them afterwards.
>>> > > > >>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>> "there are too many things need to discuss for
>>> > > multiline":
>>> > > > >>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>> True, I don't want to solve all of them right now.
>>> But
>>> > > what
>>> > > > >> I
>>> > > > >>>>>> know
>>> > > > >>>>>>>>> is
>>> > > > >>>>>>>>>>>>>>> that our newly introduced methods should fit into a
>>> > > > >> multiline
>>> > > > >>>>>>>>>>>>>>> execution.
>>> > > > >>>>>>>>>>>>>>> There is no big difference of calling
>>> `executeSql(A),
>>> > > > >>>>>>>>>>>>>>> executeSql(B)` and
>>> > > > >>>>>>>>>>>>>>> processing a multiline file `A;\nB;`.
>>> > > > >>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>> I think the example that you mentioned can simply
>>> be
>>> > > > >> undefined
>>> > > > >>>>>> for
>>> > > > >>>>>>>>>>>>>>> now.
>>> > > > >>>>>>>>>>>>>>> Currently, no catalog is modifying data but just
>>> > > metadata.
>>> > > > >>>> This
>>> > > > >>>>>> is a
>>> > > > >>>>>>>>>>>>>>> separate discussion.
>>> > > > >>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>> "result of the second statement is
>>> indeterministic":
>>> > > > >>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>> Sure this is indeterministic. But this is the
>>> > > implementers
>>> > > > >>>> fault
>>> > > > >>>>>>>>>>>>>>> and we
>>> > > > >>>>>>>>>>>>>>> cannot forbid such pipelines.
>>> > > > >>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>> How about we always execute streaming queries
>>> async? It
>>> > > > >> would
>>> > > > >>>>>>>>> unblock
>>> > > > >>>>>>>>>>>>>>> executeSql() and multiline statements.
>>> > > > >>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>> Having a `executeSqlAsync()` is useful for batch.
>>> > > However,
>>> > > > I
>>> > > > >>>>>> don't
>>> > > > >>>>>>>>>>>>>>> want
>>> > > > >>>>>>>>>>>>>>> `sync/async` be the new batch/stream flag. The
>>> > execution
>>> > > > >>>> behavior
>>> > > > >>>>>>>>>>>>>>> should
>>> > > > >>>>>>>>>>>>>>> come from the query itself.
>>> > > > >>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>> Regards,
>>> > > > >>>>>>>>>>>>>>> Timo
>>> > > > >>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>> On 30.03.20 11:12, godfrey he wrote:
>>> > > > >>>>>>>>>>>>>>>> Hi Timo,
>>> > > > >>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>> Agree with you that streaming queries is our top
>>> > > priority,
>>> > > > >>>>>>>>>>>>>>>> but I think there are too many things need to
>>> discuss
>>> > > for
>>> > > > >>>>>> multiline
>>> > > > >>>>>>>>>>>>>>>> statements:
>>> > > > >>>>>>>>>>>>>>>> e.g.
>>> > > > >>>>>>>>>>>>>>>> 1. what's the behaivor of DDL and DML mixing for
>>> async
>>> > > > >>>>>> execution:
>>> > > > >>>>>>>>>>>>>>>> create table t1 xxx;
>>> > > > >>>>>>>>>>>>>>>> create table t2 xxx;
>>> > > > >>>>>>>>>>>>>>>> insert into t2 select * from t1 where xxx;
>>> > > > >>>>>>>>>>>>>>>> drop table t1; // t1 may be a MySQL table, the
>>> data
>>> > will
>>> > > > >>>> also be
>>> > > > >>>>>>>>>>>>>> deleted.
>>> > > > >>>>>>>>>>>>>>>> t1 is dropped when "insert" job is running.
>>> > > > >>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>> 2. what's the behaivor of unified scenario for
>>> async
>>> > > > >>>> execution:
>>> > > > >>>>>>>>>>>>>>>> (as you
>>> > > > >>>>>>>>>>>>>>>> mentioned)
>>> > > > >>>>>>>>>>>>>>>> INSERT INTO t1 SELECT * FROM s;
>>> > > > >>>>>>>>>>>>>>>> INSERT INTO t2 SELECT * FROM s JOIN t1 EMIT
>>> STREAM;
>>> > > > >>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>> The result of the second statement is
>>> indeterministic,
>>> > > > >>>> because
>>> > > > >>>>>> the
>>> > > > >>>>>>>>>>>>>> first
>>> > > > >>>>>>>>>>>>>>>> statement maybe is running.
>>> > > > >>>>>>>>>>>>>>>> I think we need to put a lot of effort to define
>>> the
>>> > > > >>>> behavior of
>>> > > > >>>>>>>>>>>>>>> logically
>>> > > > >>>>>>>>>>>>>>>> related queries.
>>> > > > >>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>> In this FLIP, I suggest we only handle single
>>> > statement,
>>> > > > >> and
>>> > > > >>>> we
>>> > > > >>>>>>>>> also
>>> > > > >>>>>>>>>>>>>>>> introduce an async execute method
>>> > > > >>>>>>>>>>>>>>>> which is more important and more often used for
>>> users.
>>> > > > >>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>> Dor the sync methods (like
>>> > `TableEnvironment.executeSql`
>>> > > > >> and
>>> > > > >>>>>>>>>>>>>>>> `StatementSet.execute`),
>>> > > > >>>>>>>>>>>>>>>> the result will be returned until the job is
>>> finished.
>>> > > The
>>> > > > >>>>>>>>> following
>>> > > > >>>>>>>>>>>>>>>> methods will be introduced in this FLIP:
>>> > > > >>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>        /**
>>> > > > >>>>>>>>>>>>>>>>         * Asynchronously execute the given single
>>> > > > statement
>>> > > > >>>>>>>>>>>>>>>>         */
>>> > > > >>>>>>>>>>>>>>>> TableEnvironment.executeSqlAsync(String
>>> statement):
>>> > > > >>>> TableResult
>>> > > > >>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>> /**
>>> > > > >>>>>>>>>>>>>>>>        * Asynchronously execute the dml
>>> statements as
>>> > a
>>> > > > >> batch
>>> > > > >>>>>>>>>>>>>>>>        */
>>> > > > >>>>>>>>>>>>>>>> StatementSet.executeAsync(): TableResult
>>> > > > >>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>> public interface TableResult {
>>> > > > >>>>>>>>>>>>>>>>          /**
>>> > > > >>>>>>>>>>>>>>>>           * return JobClient for DQL and DML in
>>> async
>>> > > > mode,
>>> > > > >>>> else
>>> > > > >>>>>>>>> return
>>> > > > >>>>>>>>>>>>>>>> Optional.empty
>>> > > > >>>>>>>>>>>>>>>>           */
>>> > > > >>>>>>>>>>>>>>>>          Optional<JobClient> getJobClient();
>>> > > > >>>>>>>>>>>>>>>> }
>>> > > > >>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>> what do you think?
>>> > > > >>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>> Best,
>>> > > > >>>>>>>>>>>>>>>> Godfrey
>>> > > > >>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>> Timo Walther<twal...@apache.org>  于2020年3月26日周四
>>> > > 下午9:15写道：
>>> > > > >>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>> Hi Godfrey,
>>> > > > >>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>> executing streaming queries must be our top
>>> priority
>>> > > > >> because
>>> > > > >>>>>> this
>>> > > > >>>>>>>>> is
>>> > > > >>>>>>>>>>>>>>>>> what distinguishes Flink from competitors. If we
>>> > change
>>> > > > >> the
>>> > > > >>>>>>>>>>>>>>>>> execution
>>> > > > >>>>>>>>>>>>>>>>> behavior, we should think about the other cases
>>> as
>>> > well
>>> > > > to
>>> > > > >>>> not
>>> > > > >>>>>>>>> break
>>> > > > >>>>>>>>>>>>>> the
>>> > > > >>>>>>>>>>>>>>>>> API a third time.
>>> > > > >>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>> I fear that just having an async execute method
>>> will
>>> > > not
>>> > > > >> be
>>> > > > >>>>>> enough
>>> > > > >>>>>>>>>>>>>>>>> because users should be able to mix streaming and
>>> > batch
>>> > > > >>>> queries
>>> > > > >>>>>>>>> in a
>>> > > > >>>>>>>>>>>>>>>>> unified scenario.
>>> > > > >>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>> If I remember it correctly, we had some
>>> discussions
>>> > in
>>> > > > the
>>> > > > >>>> past
>>> > > > >>>>>>>>>>>>>>>>> about
>>> > > > >>>>>>>>>>>>>>>>> what decides about the execution mode of a query.
>>> > > > >>>> Currently, we
>>> > > > >>>>>>>>>>>>>>>>> would
>>> > > > >>>>>>>>>>>>>>>>> like to let the query decide, not derive it from
>>> the
>>> > > > >>>> sources.
>>> > > > >>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>> So I could image a multiline pipeline as:
>>> > > > >>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>> USE CATALOG 'mycat';
>>> > > > >>>>>>>>>>>>>>>>> INSERT INTO t1 SELECT * FROM s;
>>> > > > >>>>>>>>>>>>>>>>> INSERT INTO t2 SELECT * FROM s JOIN t1 EMIT
>>> STREAM;
>>> > > > >>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>> For executeMultilineSql():
>>> > > > >>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>> sync because regular SQL
>>> > > > >>>>>>>>>>>>>>>>> sync because regular Batch SQL
>>> > > > >>>>>>>>>>>>>>>>> async because Streaming SQL
>>> > > > >>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>> For executeAsyncMultilineSql():
>>> > > > >>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>> async because everything should be async
>>> > > > >>>>>>>>>>>>>>>>> async because everything should be async
>>> > > > >>>>>>>>>>>>>>>>> async because everything should be async
>>> > > > >>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>> What we should not start for
>>> > > executeAsyncMultilineSql():
>>> > > > >>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>> sync because DDL
>>> > > > >>>>>>>>>>>>>>>>> async because everything should be async
>>> > > > >>>>>>>>>>>>>>>>> async because everything should be async
>>> > > > >>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>> What are you thoughts here?
>>> > > > >>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>> Regards,
>>> > > > >>>>>>>>>>>>>>>>> Timo
>>> > > > >>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>> On 26.03.20 12:50, godfrey he wrote:
>>> > > > >>>>>>>>>>>>>>>>>> Hi Timo,
>>> > > > >>>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>>> I agree with you that streaming queries mostly
>>> need
>>> > > > async
>>> > > > >>>>>>>>>>>>>>>>>> execution.
>>> > > > >>>>>>>>>>>>>>>>>> In fact, our original plan is only introducing
>>> sync
>>> > > > >>>> methods in
>>> > > > >>>>>>>>> this
>>> > > > >>>>>>>>>>>>>>> FLIP,
>>> > > > >>>>>>>>>>>>>>>>>> and async methods (like "executeSqlAsync") will
>>> be
>>> > > > >>>> introduced
>>> > > > >>>>>> in
>>> > > > >>>>>>>>>>>>>>>>>> the
>>> > > > >>>>>>>>>>>>>>>>> future
>>> > > > >>>>>>>>>>>>>>>>>> which is mentioned in the appendix.
>>> > > > >>>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>>> Maybe the async methods also need to be
>>> considered
>>> > in
>>> > > > >> this
>>> > > > >>>>>> FLIP.
>>> > > > >>>>>>>>>>>>>>>>>>
>>> > > > >>>>>>>>>>>>>>>>>> I think sync methods is also useful for
>>> streaming
>>> > > which
>>>
>>

Re: [DISCUSS] FLIP-84 Feedback Summary

Reply via email to