Thanks for the update Godfrey! +1 to this approach.
Since there can be only one primary key, I'd also be fine with just using `PRI` even if it is composite, but `PRI(f0, f5)` might be more convenient for users.

Thanks,
Fabian

On Thu, 7 May 2020 at 09:31, godfrey he <godfre...@gmail.com> wrote:

> Hi Fabian,
> Thanks for your suggestions.
>
> Agree with you that `UNQ(f2, f3)` is clearer.
>
> A table can have only ONE primary key; this primary key can consist of a
> single column or multiple columns. [1]
> If the primary key consists of a single column, we can simply use `PRI`
> (or `PRI(xx)`) to represent it.
> If the primary key has multiple columns, we should use `PRI(xx, yy, ...)`
> to represent it.
>
> A table may have multiple unique keys, and each unique key can consist of
> a single column or multiple columns. [2]
> If there is only one unique key and this unique key has only a single
> column, we can simply use `UNQ` (or `UNQ(xx)`) to represent it.
> Otherwise, we should use `UNQ(xx, yy, ...)` to represent it.
> (A corner case: two unique keys with the same columns, like `UNQ(f2, f3)`,
> `UNQ(f2, f3)`; we can forbid this case or add a unique name for each key
> in the future.)
>
> Example with a multi-column primary key and unique key:
>
> create table MyTable (
>   f0 BIGINT NOT NULL,
>   f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>   f2 VARCHAR<256>,
>   f3 AS f0 + 1,
>   f4 TIMESTAMP(3) NOT NULL,
>   f5 BIGINT NOT NULL,
>   PRIMARY KEY (f0, f5),
>   UNIQUE (f3, f2),
>   WATERMARK f4 AS f4 - INTERVAL '3' SECOND
> ) with (...)
> +------+---------------------------------+-------+-------------+----------------+--------------------------+
> | name | type                            | null  | key         | compute column | watermark                |
> +------+---------------------------------+-------+-------------+----------------+--------------------------+
> | f0   | BIGINT                          | false | PRI(f0, f5) | (NULL)         | (NULL)                   |
> | f1   | ROW<q1 STRING, q2 TIMESTAMP(3)> | true  | (NULL)      | (NULL)         | (NULL)                   |
> | f2   | VARCHAR<256>                    | true  | UNQ(f2, f3) | (NULL)         | (NULL)                   |
> | f3   | BIGINT                          | false | UNQ(f2, f3) | f0 + 1         | (NULL)                   |
> | f4   | TIMESTAMP(3)                    | false | (NULL)      | (NULL)         | f4 - INTERVAL '3' SECOND |
> | f5   | BIGINT                          | false | PRI(f0, f5) | (NULL)         | (NULL)                   |
> +------+---------------------------------+-------+-------------+----------------+--------------------------+
>
> Regarding the watermark on nested columns: that's a good approach which
> can both support watermarks on nested columns in the future and keep the
> current table form.
> [1] https://www.w3schools.com/sql/sql_primarykey.asp
> [2] https://www.w3schools.com/sql/sql_unique.ASP
>
> Best,
> Godfrey
>
> On Thu, 7 May 2020 at 00:03, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Godfrey,
>>
>> This looks good to me.
>>
>> One side note: indicating unique constraints with "UNQ" is probably not
>> enough. There might be multiple unique constraints, and users would like
>> to know which field combinations are unique. So in your example above,
>> "UNQ(f2, f3)" might be a better marker.
>>
>> Just as a thought, if we would later add support for watermarks on nested
>> columns, we could add a row just for the nested field (in addition to the
>> top-level field) like this:
>>
>> +-------------------+--------------+-------+--------+----------------+-----------------------------------------+
>> | f4.nested.rowtime | TIMESTAMP(3) | false | (NULL) | (NULL)         | f4.nested.rowtime - INTERVAL '3' SECOND |
>> +-------------------+--------------+-------+--------+----------------+-----------------------------------------+
>>
>> Thanks,
>> Fabian
>>
>> On Wed, 6 May 2020 at 17:51, godfrey he <godfre...@gmail.com> wrote:
>>
>>> Hi @fhue...@gmail.com @Timo Walther <twal...@apache.org> @Dawid
>>> Wysakowicz <dwysakow...@apache.org>
>>> What do you think about limiting watermarks so that they must be
>>> defined on a top-level column?
>>>
>>> If we do that, we can add an expression column to represent the
>>> watermark, just like a computed column.
>>> An example of all cases:
>>>
>>> create table MyTable (
>>>   f0 BIGINT NOT NULL,
>>>   f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>>>   f2 VARCHAR<256>,
>>>   f3 AS f0 + 1,
>>>   f4 TIMESTAMP(3) NOT NULL,
>>>   PRIMARY KEY (f0),
>>>   UNIQUE (f3, f2),
>>>   WATERMARK f4 AS f4 - INTERVAL '3' SECOND
>>> ) with (...)
>>> +------+---------------------------------+-------+--------+----------------+--------------------------+
>>> | name | type                            | null  | key    | compute column | watermark                |
>>> +------+---------------------------------+-------+--------+----------------+--------------------------+
>>> | f0   | BIGINT                          | false | PRI    | (NULL)         | (NULL)                   |
>>> | f1   | ROW<q1 STRING, q2 TIMESTAMP(3)> | true  | (NULL) | (NULL)         | (NULL)                   |
>>> | f2   | VARCHAR<256>                    | true  | UNQ    | (NULL)         | (NULL)                   |
>>> | f3   | BIGINT                          | false | UNQ    | f0 + 1         | (NULL)                   |
>>> | f4   | TIMESTAMP(3)                    | false | (NULL) | (NULL)         | f4 - INTERVAL '3' SECOND |
>>> +------+---------------------------------+-------+--------+----------------+--------------------------+
>>>
>>> WDYT?
>>>
>>> Best,
>>> Godfrey
>>>
>>> On Thu, 30 Apr 2020 at 23:57, godfrey he <godfre...@gmail.com> wrote:
>>>
>>>> Hi Fabian,
>>>>
>>>> the broken example is:
>>>>
>>>> create table MyTable (
>>>>   f0 BIGINT NOT NULL,
>>>>   f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>>>>   f2 VARCHAR<256>,
>>>>   f3 AS f0 + 1,
>>>>   PRIMARY KEY (f0),
>>>>   UNIQUE (f3, f2),
>>>>   WATERMARK f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND)
>>>> ) with (...)
>>>> +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>> | name | type                                | key    | compute column | watermark                                |
>>>> +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>> | f0   | BIGINT NOT NULL                     | PRI    | (NULL)         |                                          |
>>>> | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>>> | f2   | VARCHAR<256>                        | (NULL) | NULL           |                                          |
>>>> | f3   | BIGINT NOT NULL                     | UNQ    | f0 + 1         |                                          |
>>>> +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>>
>>>> or we add a column to represent nullability:
>>>>
>>>> +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>> | name | type                                | null  | key    | compute column | watermark                                |
>>>> +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>> | f0   | BIGINT                              | false | PRI    | (NULL)         |                                          |
>>>> | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | true  | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>>> | f2   | VARCHAR<256>                        | true  | (NULL) | NULL           |                                          |
>>>> | f3   | BIGINT                              | false | UNQ    | f0 + 1         |                                          |
>>>> +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>>
>>>> Hi Jark,
>>>> If we limit watermarks so that they must be defined on a top-level
>>>> column, this will become much simpler.
>>>>
>>>> Best,
>>>> Godfrey
>>>>
>>>> On Thu, 30 Apr 2020 at 23:38, Jark Wu <imj...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm in favor of Fabian's proposal.
>>>>> First, a watermark is not a column but metadata, just like a primary
>>>>> key, so it shouldn't stand with the columns.
>>>>> Second, AFAIK, a primary key can only be defined on top-level columns.
>>>>> Third, I think watermarks can follow primary keys and likewise only be
>>>>> allowed on top-level columns.
>>>>>
>>>>> I have to admit that in FLIP-66, a watermark can be defined on nested
>>>>> fields. However, during implementation, I found that it's too
>>>>> complicated to do that: we would have to refactor the time-based
>>>>> physical nodes, use code generation to access the event time, and
>>>>> refactor FlinkTypeFactory to support a complex nested rowtime.
>>>>> There is not much value in this feature, but it introduces a lot of
>>>>> complexity in the code base. So I think we can force watermarks to be
>>>>> defined on top-level columns. If a user wants to define one on a
>>>>> nested column, he/she can use a computed column to turn it into a
>>>>> top-level column.
>>>>>
>>>>> Best,
>>>>> Jark
>>>>>
>>>>> On Thu, 30 Apr 2020 at 17:55, Fabian Hueske <fhue...@gmail.com> wrote:
>>>>>
>>>>> > Hi Godfrey,
>>>>> >
>>>>> > The formatting of your example seems to be broken.
>>>>> > Could you send it again please?
>>>>> >
>>>>> > Regarding your points:
>>>>> > > because the watermark expression can be a sub-column, just like
>>>>> > > `f1.q2` in the example I gave above.
>>>>> >
>>>>> > I would put the watermark information in the row of the top-level
>>>>> > field and indicate to which nested field the watermark refers.
>>>>> > Don't we have to solve the same issue for primary keys that are
>>>>> > defined on a nested field?
>>>>> >
>>>>> > > A boolean flag can't represent such info, and I don't know
>>>>> > > whether we will support complex watermark expressions involving
>>>>> > > multiple columns in the future, such as:
>>>>> > > "WATERMARK FOR ts as ts + f1 + interval '1' second"
>>>>> >
>>>>> > You are right, a simple binary flag is definitely not sufficient to
>>>>> > display the watermark information.
>>>>> > I would put the expression string into the field, i.e.,
>>>>> > "ts + f1 + interval '1' second"
>>>>> >
>>>>> > For me the most important reason not to show the watermark as a row
>>>>> > in the table is that it is not a field that can be queried but meta
>>>>> > information on an existing field.
>>>>> > For the user it is important to know that a certain field has a
>>>>> > watermark. Otherwise, certain queries cannot be correctly specified.
>>>>> > Also, there might be support for multiple watermarks that are
>>>>> > defined on different fields at some point.
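Jark's computed-column workaround above could look like the following sketch. It reuses the DDL style of the examples in this thread; the column name `ts` is an assumption, not part of the original proposal:

```sql
-- Sketch only: instead of declaring a watermark on the nested field f1.q2,
-- a computed column (hypothetically named `ts`) hoists it to the top level
-- and the watermark is declared on that top-level column.
create table MyTable (
  f0 BIGINT NOT NULL,
  f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
  ts AS f1.q2,                            -- computed top-level column
  WATERMARK ts AS ts - INTERVAL '3' SECOND
) with (...)
```

With this shape, the DESCRIBE output only ever has to show watermarks on top-level rows, which is exactly what makes the table form simple.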
>>>>> > Would those be printed in multiple rows?
>>>>> >
>>>>> > Best,
>>>>> > Fabian
>>>>> >
>>>>> > On Thu, 30 Apr 2020 at 11:25, godfrey he <godfre...@gmail.com> wrote:
>>>>> >
>>>>> > > Hi Fabian, Aljoscha,
>>>>> > >
>>>>> > > Thanks for the feedback.
>>>>> > >
>>>>> > > Agree with you that we can deal with the primary key as you
>>>>> > > mentioned. Now, the type column already contains the nullability
>>>>> > > attribute, e.g. BIGINT NOT NULL.
>>>>> > > (I'm also fine with using two columns to represent the type, just
>>>>> > > like MySQL.)
>>>>> > >
>>>>> > > Why do I treat `watermark` as a special row? Because the watermark
>>>>> > > expression can refer to a sub-column, just like `f1.q2` in the
>>>>> > > example I gave above. A boolean flag can't represent such info,
>>>>> > > and I don't know whether we will support complex watermark
>>>>> > > expressions involving multiple columns in the future, such as:
>>>>> > > "WATERMARK FOR ts as ts + f1 + interval '1' second"
>>>>> > >
>>>>> > > If we do not support complex watermark expressions, we can add a
>>>>> > > watermark column.
>>>>> > >
>>>>> > > For example:
>>>>> > >
>>>>> > > create table MyTable (
>>>>> > >   f0 BIGINT NOT NULL,
>>>>> > >   f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>>>>> > >   f2 VARCHAR<256>,
>>>>> > >   f3 AS f0 + 1,
>>>>> > >   PRIMARY KEY (f0),
>>>>> > >   UNIQUE (f3, f2),
>>>>> > >   WATERMARK f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND)
>>>>> > > ) with (...)
>>>>> > > +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>>> > > | name | type                                | key    | compute column | watermark                                |
>>>>> > > +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>>> > > | f0   | BIGINT NOT NULL                     | PRI    | (NULL)         |                                          |
>>>>> > > | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>>>> > > | f2   | VARCHAR<256>                        | (NULL) | NULL           |                                          |
>>>>> > > | f3   | BIGINT NOT NULL                     | UNQ    | f0 + 1         |                                          |
>>>>> > > +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>>> > >
>>>>> > > or we add a column to represent nullability:
>>>>> > >
>>>>> > > +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>>> > > | name | type                                | null  | key    | compute column | watermark                                |
>>>>> > > +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>>> > > | f0   | BIGINT                              | false | PRI    | (NULL)         |                                          |
>>>>> > > | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | true  | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>>>> > > | f2   | VARCHAR<256>                        | true  | (NULL) | NULL           |                                          |
>>>>> > > | f3   | BIGINT                              | false | UNQ    | f0 + 1         |                                          |
>>>>> > > +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>>> > >
>>>>> > > Personally, I like the second one. (We need to make some changes
>>>>> > > to LogicalType to get the type name without nullability.)
>>>>> > >
>>>>> > > Best,
>>>>> > > Godfrey
>>>>> > >
>>>>> > > On Wed, 29 Apr 2020 at 17:47, Aljoscha Krettek <aljos...@apache.org> wrote:
>>>>> > >
>>>>> > > > +1 I like the general idea of printing the results as a table.
>>>>> > > > On the specifics I don't know enough, but Fabian's suggestions
>>>>> > > > seem to make sense to me.
>>>>> > > >
>>>>> > > > Aljoscha
>>>>> > > >
>>>>> > > > On 29.04.20 10:56, Fabian Hueske wrote:
>>>>> > > > > Hi Godfrey,
>>>>> > > > >
>>>>> > > > > Thanks for starting this discussion!
>>>>> > > > >
>>>>> > > > > In my mind, WATERMARK is a property (or constraint) of a
>>>>> > > > > field, just like PRIMARY KEY.
>>>>> > > > > Take this example from MySQL:
>>>>> > > > >
>>>>> > > > > mysql> CREATE TABLE people (id INT NOT NULL, name VARCHAR(128)
>>>>> > > > > NOT NULL, age INT, PRIMARY KEY (id));
>>>>> > > > > Query OK, 0 rows affected (0.06 sec)
>>>>> > > > >
>>>>> > > > > mysql> describe people;
>>>>> > > > > +-------+--------------+------+-----+---------+-------+
>>>>> > > > > | Field | Type         | Null | Key | Default | Extra |
>>>>> > > > > +-------+--------------+------+-----+---------+-------+
>>>>> > > > > | id    | int          | NO   | PRI | NULL    |       |
>>>>> > > > > | name  | varchar(128) | NO   |     | NULL    |       |
>>>>> > > > > | age   | int          | YES  |     | NULL    |       |
>>>>> > > > > +-------+--------------+------+-----+---------+-------+
>>>>> > > > > 3 rows in set (0.01 sec)
>>>>> > > > >
>>>>> > > > > Here, PRIMARY KEY is marked in the Key column of the id field.
>>>>> > > > > We could do the same for watermarks by adding a Watermark
>>>>> > > > > column.
>>>>> > > > >
>>>>> > > > > Best, Fabian
>>>>> > > > >
>>>>> > > > > On Wed, 29 Apr 2020 at 10:43, godfrey he <godfre...@gmail.com> wrote:
>>>>> > > > >
>>>>> > > > >> Hi everyone,
>>>>> > > > >>
>>>>> > > > >> I would like to bring up a discussion about the result type
>>>>> > > > >> of the DESCRIBE statement, which is introduced in FLIP-84[1].
>>>>> > > > >> In the previous version, we defined the result type of the
>>>>> > > > >> `describe` statement as a single column, as follows:
>>>>> > > > >>
>>>>> > > > >> Statement:     DESCRIBE xx
>>>>> > > > >> Result Schema: field name: result
>>>>> > > > >>                field type: VARCHAR(n)
>>>>> > > > >>                (n is the max length of values)
>>>>> > > > >> Result Value:  describe the detail of an object (single row)
>>>>> > > > >> Result Kind:   SUCCESS_WITH_CONTENT
>>>>> > > > >> Examples:      DESCRIBE table_name
>>>>> > > > >>
>>>>> > > > >> For "describe table_name", the result value is the `toString`
>>>>> > > > >> value of `TableSchema`, which is unstructured data.
>>>>> > > > >> It's hard for users to use this info.
>>>>> > > > >>
>>>>> > > > >> For example:
>>>>> > > > >>
>>>>> > > > >> TableSchema schema = TableSchema.builder()
>>>>> > > > >>     .field("f0", DataTypes.BIGINT())
>>>>> > > > >>     .field("f1", DataTypes.ROW(
>>>>> > > > >>         DataTypes.FIELD("q1", DataTypes.STRING()),
>>>>> > > > >>         DataTypes.FIELD("q2", DataTypes.TIMESTAMP(3))))
>>>>> > > > >>     .field("f2", DataTypes.STRING())
>>>>> > > > >>     .field("f3", DataTypes.BIGINT(), "f0 + 1")
>>>>> > > > >>     .watermark("f1.q2", WATERMARK_EXPRESSION, WATERMARK_DATATYPE)
>>>>> > > > >>     .build();
>>>>> > > > >>
>>>>> > > > >> its `toString` value is:
>>>>> > > > >>
>>>>> > > > >> root
>>>>> > > > >>  |-- f0: BIGINT
>>>>> > > > >>  |-- f1: ROW<`q1` STRING, `q2` TIMESTAMP(3)>
>>>>> > > > >>  |-- f2: STRING
>>>>> > > > >>  |-- f3: BIGINT AS f0 + 1
>>>>> > > > >>  |-- WATERMARK FOR f1.q2 AS now()
>>>>> > > > >>
>>>>> > > > >> For Hive, MySQL, etc., the describe result is in table form,
>>>>> > > > >> including field names and field types,
>>>>> > > > >> which is more familiar to users.
>>>>> > > > >> TableSchema[2] has watermark expressions and computed
>>>>> > > > >> columns; we should also put them into the table:
>>>>> > > > >> for a computed column, which is column-level, we add a new
>>>>> > > > >> column named `expr`;
>>>>> > > > >> for a watermark expression, which is table-level, we add a
>>>>> > > > >> special row named `WATERMARK` to represent it.
>>>>> > > > >>
>>>>> > > > >> The result for the above example will look like:
>>>>> > > > >>
>>>>> > > > >> +-----------+-------------------------------------+----------------+
>>>>> > > > >> | name      | type                                | expr           |
>>>>> > > > >> +-----------+-------------------------------------+----------------+
>>>>> > > > >> | f0        | BIGINT                              | (NULL)         |
>>>>> > > > >> | f1        | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | (NULL)         |
>>>>> > > > >> | f2        | STRING                              | NULL           |
>>>>> > > > >> | f3        | BIGINT                              | f0 + 1         |
>>>>> > > > >> | WATERMARK | (NULL)                              | f1.q2 AS now() |
>>>>> > > > >> +-----------+-------------------------------------+----------------+
>>>>> > > > >>
>>>>> > > > >> Now there is a PR, FLINK-17112 [3], to implement the DESCRIBE
>>>>> > > > >> statement.
>>>>> > > > >>
>>>>> > > > >> What do you think about this update?
>>>>> > > > >> Any feedback is welcome~
>>>>> > > > >>
>>>>> > > > >> Best,
>>>>> > > > >> Godfrey
>>>>> > > > >>
>>>>> > > > >> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=134745878
>>>>> > > > >> [2] https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/api/TableSchema.java
>>>>> > > > >> [3] https://github.com/apache/flink/pull/11892
>>>>> > > > >>
>>>>> > > > >> On Mon, 6 Apr 2020 at 22:38, godfrey he <godfre...@gmail.com> wrote:
>>>>> > > > >>
>>>>> > > > >>> Hi Timo,
>>>>> > > > >>>
>>>>> > > > >>> Sorry for the late reply, and thanks for your correction.
>>>>> > > > >>> I missed DQL for the job submission scenario.
>>>>> > > > >>> I'll fix the document right away.
>>>>> > > > >>>
>>>>> > > > >>> Best,
>>>>> > > > >>> Godfrey
>>>>> > > > >>>
>>>>> > > > >>> On Fri, 3 Apr 2020 at 21:53, Timo Walther <twal...@apache.org> wrote:
>>>>> > > > >>>
>>>>> > > > >>>> Hi Godfrey,
>>>>> > > > >>>>
>>>>> > > > >>>> I'm sorry to jump in again, but I still need to clarify
>>>>> > > > >>>> some things around TableResult.
>>>>> > > > >>>>
>>>>> > > > >>>> The FLIP says:
>>>>> > > > >>>> "For DML, this method returns TableResult until the job is
>>>>> > > > >>>> submitted. For other statements, TableResult is returned
>>>>> > > > >>>> until the execution is finished."
>>>>> > > > >>>>
>>>>> > > > >>>> I thought we agreed on making every execution async? This
>>>>> > > > >>>> also means returning a TableResult for DQLs even though the
>>>>> > > > >>>> execution is not done yet. People need access to the
>>>>> > > > >>>> JobClient also for batch jobs in order to cancel
>>>>> > > > >>>> long-lasting queries. If people want to wait for
>>>>> > > > >>>> completion, they can hook into JobClient or collect().
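The async-by-default contract Timo describes above can be sketched in plain Java, without any Flink classes (all names here are hypothetical stand-ins, not the real API): submission returns a handle immediately, and callers who want blocking behavior wait on that handle explicitly.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Sketch of the async submission contract: executeSql returns right away
// with a handle (playing the role of TableResult/JobClient in the thread);
// "sync" execution is just the caller choosing to wait on the handle.
public class AsyncSubmission {

    static CompletableFuture<String> executeSql(String stmt) {
        // the job runs in the background; the caller is not blocked
        return CompletableFuture.supplyAsync(() -> "done: " + stmt);
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> handle = executeSql("SELECT ...");
        // async by default; to run synchronously, hook into the handle
        String result = handle.get(10, TimeUnit.SECONDS);
        System.out.println(result);
    }
}
```

The same handle could expose a cancel operation, which is why access to it matters even for batch jobs.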
>>>>> > > > >>>> Can we rephrase this part to:
>>>>> > > > >>>>
>>>>> > > > >>>> "For DML and DQL, this method returns TableResult once the
>>>>> > > > >>>> job has been submitted. For DDL and DCL statements,
>>>>> > > > >>>> TableResult is returned once the operation has finished."
>>>>> > > > >>>>
>>>>> > > > >>>> Regards,
>>>>> > > > >>>> Timo
>>>>> > > > >>>>
>>>>> > > > >>>> On 02.04.20 05:27, godfrey he wrote:
>>>>> > > > >>>>> Hi Aljoscha, Dawid, Timo,
>>>>> > > > >>>>>
>>>>> > > > >>>>> Thanks so much for the detailed explanation.
>>>>> > > > >>>>> Agree with you that the multiline story is not complete
>>>>> > > > >>>>> yet, and we can keep discussing.
>>>>> > > > >>>>> I will add the current discussions and conclusions to the
>>>>> > > > >>>>> FLIP.
>>>>> > > > >>>>>
>>>>> > > > >>>>> Best,
>>>>> > > > >>>>> Godfrey
>>>>> > > > >>>>>
>>>>> > > > >>>>> On Wed, 1 Apr 2020 at 23:27, Timo Walther <twal...@apache.org> wrote:
>>>>> > > > >>>>>
>>>>> > > > >>>>>> Hi Godfrey,
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> first of all, I agree with Dawid. The multiline story is
>>>>> > > > >>>>>> not completed by this FLIP. It just verifies the big
>>>>> > > > >>>>>> picture.
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> 1. "control the execution logic through the proposed
>>>>> > > > >>>>>> method if they know what the statements are"
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> This is a good point that Fabian also raised in the
>>>>> > > > >>>>>> linked Google doc. I could also imagine returning a more
>>>>> > > > >>>>>> complicated POJO when calling `executeMultiSql()`.
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> The POJO would include some `getSqlProperties()` such
>>>>> > > > >>>>>> that a platform gets insights into the query before
>>>>> > > > >>>>>> executing.
>>>>> > > > >>>>>> We could also trigger the execution more explicitly
>>>>> > > > >>>>>> instead of hiding it behind an iterator.
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> 2. "there are some special commands introduced in SQL
>>>>> > > > >>>>>> client"
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> For platform and SQL Client specific commands, we could
>>>>> > > > >>>>>> offer a hook into the parser, or a fallback parser in
>>>>> > > > >>>>>> case the regular table environment parser cannot deal
>>>>> > > > >>>>>> with the statement.
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> However, all of that is future work and can be discussed
>>>>> > > > >>>>>> in a separate FLIP.
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> 3. +1 for the `Iterator` instead of `Iterable`.
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> 4. "we should convert the checked exception to unchecked
>>>>> > > > >>>>>> exception"
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> Yes, I meant using a runtime exception instead of a
>>>>> > > > >>>>>> checked exception. There was no consensus on putting the
>>>>> > > > >>>>>> exception into the `TableResult`.
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> Regards,
>>>>> > > > >>>>>> Timo
>>>>> > > > >>>>>>
>>>>> > > > >>>>>> On 01.04.20 15:35, Dawid Wysakowicz wrote:
>>>>> > > > >>>>>>> When considering the multi-line support, I think it is
>>>>> > > > >>>>>>> helpful to start with a use case in mind. In my opinion,
>>>>> > > > >>>>>>> consumers of this method will be:
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> 1. sql-client
>>>>> > > > >>>>>>> 2. third-party SQL-based platforms
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> @Godfrey As for the quit/source/... commands: I think
>>>>> > > > >>>>>>> those belong to the responsibility of the
>>>>> > > > >>>>>>> aforementioned. They should not be understood by the
>>>>> > > > >>>>>>> TableEnvironment.
>>>>> > > > >>>>>>> What would quit on a TableEnvironment do? Moreover, I
>>>>> > > > >>>>>>> think such commands should be prefixed appropriately.
>>>>> > > > >>>>>>> It's a common practice to prefix those with e.g. ! or :
>>>>> > > > >>>>>>> to say they are meta commands of the tool rather than a
>>>>> > > > >>>>>>> query.
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> I also don't necessarily understand why platform users
>>>>> > > > >>>>>>> need to know the kind of the query to use the proposed
>>>>> > > > >>>>>>> method. They should get the type from
>>>>> > > > >>>>>>> TableResult#ResultKind. If the ResultKind is SUCCESS, it
>>>>> > > > >>>>>>> was a DCL/DDL. If SUCCESS_WITH_CONTENT, it was a
>>>>> > > > >>>>>>> DML/DQL. If that's not enough, we can enrich the
>>>>> > > > >>>>>>> TableResult with a more explicit kind of query, but so
>>>>> > > > >>>>>>> far I don't see such a need.
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> @Kurt In those cases I would assume the developers want
>>>>> > > > >>>>>>> to present the results of the queries anyway. Moreover,
>>>>> > > > >>>>>>> I think it is safe to assume they can adhere to such a
>>>>> > > > >>>>>>> contract that the results must be iterated.
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> For direct users of TableEnvironment/Table API this
>>>>> > > > >>>>>>> method does not make much sense anyway, in my opinion.
>>>>> > > > >>>>>>> I think we can rather safely assume in this scenario
>>>>> > > > >>>>>>> that they do not want to submit multiple queries at a
>>>>> > > > >>>>>>> single time.
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> Best,
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> Dawid
>>>>> > > > >>>>>>>
>>>>> > > > >>>>>>> On 01/04/2020 15:07, Kurt Young wrote:
>>>>> > > > >>>>>>>> One comment on `executeMultilineSql`: I'm afraid
>>>>> > > > >>>>>>>> sometimes users might forget to iterate the returned
>>>>> > > > >>>>>>>> iterators, e.g. a user submits a bunch of DDLs and
>>>>> > > > >>>>>>>> expects the framework to execute them one by one. But
>>>>> > > > >>>>>>>> it didn't.
>>>>> > > > >>>>>>>>
>>>>> > > > >>>>>>>> Best,
>>>>> > > > >>>>>>>> Kurt
>>>>> > > > >>>>>>>>
>>>>> > > > >>>>>>>> On Wed, Apr 1, 2020 at 5:10 PM Aljoscha Krettek <aljos...@apache.org> wrote:
>>>>> > > > >>>>>>>>
>>>>> > > > >>>>>>>>> Agreed to what Dawid and Timo said.
>>>>> > > > >>>>>>>>>
>>>>> > > > >>>>>>>>> To answer your question about multi-line SQL: no, we
>>>>> > > > >>>>>>>>> don't think we need this in Flink 1.11; we only wanted
>>>>> > > > >>>>>>>>> to make sure that the interfaces that we now put in
>>>>> > > > >>>>>>>>> place will potentially allow this in the future.
>>>>> > > > >>>>>>>>>
>>>>> > > > >>>>>>>>> Best,
>>>>> > > > >>>>>>>>> Aljoscha
>>>>> > > > >>>>>>>>>
>>>>> > > > >>>>>>>>> On 01.04.20 09:31, godfrey he wrote:
>>>>> > > > >>>>>>>>>> Hi Timo & Dawid,
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>> Thanks so much for the effort on multiline statement
>>>>> > > > >>>>>>>>>> support. I have a few questions about this method:
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>> 1. Users can well control the execution logic through
>>>>> > > > >>>>>>>>>> the proposed method if they know what the statements
>>>>> > > > >>>>>>>>>> are (whether a statement is a DDL, a DML, or
>>>>> > > > >>>>>>>>>> something else).
>>>>> > > > >>>>>>>>>> But if a statement comes from a file, users do not
>>>>> > > > >>>>>>>>>> know what the statements are, so the execution
>>>>> > > > >>>>>>>>>> behavior is unclear.
>>>>> > > > >>>>>>>>>> As a platform user, I think this method is hard to
>>>>> > > > >>>>>>>>>> use unless the platform defines a set of rules about
>>>>> > > > >>>>>>>>>> statement order, such as: no SELECT in the middle,
>>>>> > > > >>>>>>>>>> DML must be at the tail of the SQL file (which may be
>>>>> > > > >>>>>>>>>> the most common case in production environments).
>>>>> > > > >>>>>>>>>> Otherwise, the platform must parse the SQL first to
>>>>> > > > >>>>>>>>>> know what the statements are. If it does that, the
>>>>> > > > >>>>>>>>>> platform can handle all cases through `executeSql`
>>>>> > > > >>>>>>>>>> and `StatementSet`.
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>> 2. The SQL client also can't use
>>>>> > > > >>>>>>>>>> `executeMultilineSql` to support multiline
>>>>> > > > >>>>>>>>>> statements, because there are some special commands
>>>>> > > > >>>>>>>>>> introduced in the SQL client, such as `quit`,
>>>>> > > > >>>>>>>>>> `source`, `load jar` (the latter does not exist yet,
>>>>> > > > >>>>>>>>>> but we may need it to support dynamic table sources
>>>>> > > > >>>>>>>>>> and UDFs).
>>>>> > > > >>>>>>>>>> Does TableEnvironment also support those commands?
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>> 3. By the way, must we have this feature in
>>>>> > > > >>>>>>>>>> release-1.11? I find there are a few user cases in
>>>>> > > > >>>>>>>>>> the feedback document whose behavior is unclear now.
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>> Regarding "change the return value from
>>>>> > > > >>>>>>>>>> `Iterable<Row>` to `Iterator<Row>`", I couldn't agree
>>>>> > > > >>>>>>>>>> more with this change. Just as Dawid mentioned, "The
>>>>> > > > >>>>>>>>>> contract of the Iterable#iterator is that it returns
>>>>> > > > >>>>>>>>>> a new iterator each time, which effectively means we
>>>>> > > > >>>>>>>>>> can iterate the results multiple times."
>>>>> > > > >>>>>>>>>> We do not provide iterating the results multiple
>>>>> > > > >>>>>>>>>> times. If we wanted to do that, the client would have
>>>>> > > > >>>>>>>>>> to buffer all results, but that's impossible for a
>>>>> > > > >>>>>>>>>> streaming job.
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>> Best,
>>>>> > > > >>>>>>>>>> Godfrey
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>> On Wed, 1 Apr 2020 at 03:14, Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>>>> > > > >>>>>>>>>>
>>>>> > > > >>>>>>>>>>> Thank you Timo for the great summary! It covers
>>>>> > > > >>>>>>>>>>> (almost) all the topics.
>>>>> > > > >>>>>>>>>>> Even though in the end we are not suggesting many
>>>>> > > > >>>>>>>>>>> changes to the current state of the FLIP, I think it
>>>>> > > > >>>>>>>>>>> is important to lay out all possible use cases so
>>>>> > > > >>>>>>>>>>> that we do not change the execution model every
>>>>> > > > >>>>>>>>>>> release.
There is one additional thing we discussed. Could we change the result type of TableResult#collect to Iterator<Row>? Even though those interfaces do not differ much, I think Iterator better describes that the results might not be materialized on the client side but can be retrieved on a per-record basis. The contract of Iterable#iterator is that it returns a new iterator each time, which effectively means we can iterate the results multiple times. Iterating the results multiple times is not possible when we don't retrieve all the results from the cluster at once. I think we should also use Iterator for TableEnvironment#executeMultilineSql(String statements): Iterator<TableResult>.

Best,
Dawid

On 31/03/2020 19:27, Timo Walther wrote:

Hi Godfrey,

Aljoscha, Dawid, Klou, and I had another discussion around FLIP-84. In particular, we discussed how the current status of the FLIP and the future requirements around multiline statements, async/sync, and collect() fit together.
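The single-pass contract discussed above can be sketched in plain Java (a hypothetical mock, not Flink's implementation): rows arrive from the cluster once, so the handle is an `Iterator` that can be consumed exactly one time.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Queue;

/** Hypothetical single-pass result handle: rows are pulled from the cluster
 *  once and are not buffered, so the results cannot be iterated twice. */
class StreamingResultIterator implements Iterator<String> {
    private final Queue<String> fromCluster; // stands in for a socket/REST stream
    private String next;

    StreamingResultIterator(List<String> rows) {
        this.fromCluster = new ArrayDeque<>(rows);
        advance();
    }

    private void advance() {
        // "END" marks job completion; a real client would block until rows
        // arrive or the cluster signals the end of the stream.
        String row = fromCluster.poll();
        next = (row == null || row.equals("END")) ? null : row;
    }

    @Override public boolean hasNext() { return next != null; }

    @Override public String next() {
        if (next == null) throw new NoSuchElementException();
        String row = next;
        advance();
        return row;
    }
}
```

Exposing this as an `Iterable` would promise a fresh iterator on every call, which the client could only honor by buffering everything — exactly the problem pointed out for streaming jobs.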
We also updated the FLIP-84 Feedback Summary document [1] with some use cases.

We believe that we found a good solution that also fits what is in the current FLIP. So no bigger changes are necessary, which is great!

Our findings were:

1. Async vs. sync submission of Flink jobs:

Having a blocking `execute()` in the DataStream API was rather a mistake. Instead, all submissions should be async because this allows supporting both modes if necessary. Thus, submitting all queries async sounds good to us. If users want to run a job sync, they can use the JobClient and wait for completion (or collect() in the case of batch jobs).

2. Multi-statement execution:

For multi-statement execution, we don't see a contradiction with the async execution behavior. We imagine a method like:

TableEnvironment#executeMultilineSql(String statements): Iterable<TableResult>

where the `Iterator#next()` method would trigger the next statement submission.
This allows a caller to decide synchronously when to submit statements async to the cluster. Thus, a service such as the SQL Client can handle the result of each statement individually and process the statements sequentially, one by one.

3. The role of TableResult and result retrieval in general:

`TableResult` is similar to `JobClient`. Instead of returning a `CompletableFuture` of something, it is a concrete util class where some methods have the behavior of a completable future (e.g. collect(), print()) and some are already completed (getTableSchema(), getResultKind()).

`StatementSet#execute()` returns a single `TableResult` because the order is undefined in a set and all statements have the same schema. Its `collect()` will return a row for each executed `INSERT INTO` in the order of statement definition.

For a simple `SELECT * FROM ...`, the query execution might block until `collect()` is called to pull buffered rows from the job (via socket/REST API or whatever we will use in the future).
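The lazy-submission idea can be sketched in plain Java (hypothetical names, not the actual Flink API): the iterator holds the remaining statements, and each `next()` call submits exactly one more.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical sketch of executeMultilineSql: nothing is submitted up front;
 *  each next() call submits one more statement and returns its result handle. */
class MultilineExecutor implements Iterator<String> {
    private final List<String> statements;
    private int pos = 0;

    MultilineExecutor(String multilineSql) {
        // Naive split for illustration; a real parser must respect string
        // literals, comments, and client-specific commands.
        this.statements = List.of(multilineSql.split(";"));
    }

    @Override public boolean hasNext() { return pos < statements.size(); }

    @Override public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        String stmt = statements.get(pos++).trim();
        // A real implementation would submit stmt asynchronously and return
        // a TableResult; we return a fake submission marker instead.
        return "submitted: " + stmt;
    }
}
```

A caller such as the SQL Client can thus pause between `next()` calls, e.g. to print each statement's result before triggering the following one.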
We can say that a statement finished successfully when `collect()`'s `Iterator#hasNext` has returned false.

I hope this summarizes our discussion, @Dawid/Aljoscha/Klou?

It would be great if we can add these findings to the FLIP before we start voting.

One minor thing: some `execute()` methods still throw a checked exception; can we remove that from the FLIP? Also, the above-mentioned `Iterator#next()` would trigger an execution without throwing a checked exception.

Thanks,
Timo

[1] https://docs.google.com/document/d/1ueLjQWRPdLTFB_TReAyhseAX-1N3j4WYWD0F02Uau0E/edit#

On 31.03.20 06:28, godfrey he wrote:

Hi Timo & Jark,

Thanks for your explanation. I agree with you that execution should always be async, and that the sync execution scenario can be covered by async execution. It helps provide a unified entry point for batch and streaming. I think we can also use sync execution for some testing.
So I agree that we provide an `executeSql` method and that it is an async method. If we want a sync method in the future, we can add one named `executeSqlSync`.

I think we've reached an agreement. I will update the document and start the voting process.

Best,
Godfrey

Jark Wu <imj...@gmail.com> wrote on Tue, Mar 31, 2020 at 12:46 AM:

Hi,

I didn't follow the full discussion, but I share the same concern with Timo that streaming queries should always be async. Otherwise, I can imagine it will cause a lot of confusion and problems if users don't deeply keep the "sync" in mind (e.g. the client hangs). Besides, streaming mode is still the majority of the use cases of Flink and Flink SQL. We should put usability at a high priority.
Best,
Jark

On Mon, 30 Mar 2020 at 23:27, Timo Walther <twal...@apache.org> wrote:

Hi Godfrey,

Maybe I wasn't expressing my biggest concern clearly enough in my last mail. Even in a single-line and sync execution, I think that streaming queries should not block the execution. Otherwise it is not possible to call collect() or print() on them afterwards.

"there are too many things need to discuss for multiline":

True, I don't want to solve all of them right now. But what I do know is that our newly introduced methods should fit into a multiline execution. There is no big difference between calling `executeSql(A); executeSql(B)` and processing a multiline file `A;\nB;`.

I think the example that you mentioned can simply be undefined for now. Currently, no catalog is modifying data, just metadata. This is a separate discussion.
"result of the second statement is indeterministic":

Sure, this is indeterministic. But this is the implementer's fault, and we cannot forbid such pipelines.

How about we always execute streaming queries async? It would unblock executeSql() and multiline statements.

Having an `executeSqlAsync()` is useful for batch. However, I don't want `sync/async` to become the new batch/stream flag. The execution behavior should come from the query itself.

Regards,
Timo

On 30.03.20 11:12, godfrey he wrote:

Hi Timo,

I agree with you that streaming queries are our top priority, but I think there are too many things to discuss for multiline statements. For example:
1. What is the behavior of mixing DDL and DML under async execution?

create table t1 xxx;
create table t2 xxx;
insert into t2 select * from t1 where xxx;
drop table t1; -- t1 may be a MySQL table; its data would also be deleted

Here t1 is dropped while the "insert" job is still running.

2. What is the behavior of the unified scenario under async execution (as you mentioned)?

INSERT INTO t1 SELECT * FROM s;
INSERT INTO t2 SELECT * FROM s JOIN t1 EMIT STREAM;

The result of the second statement is indeterministic, because the first statement may still be running. I think we would need to put a lot of effort into defining the behavior of logically related queries.

In this FLIP, I suggest we only handle single statements, and we also introduce an async execute method, which is more important and more often used by users.

For the sync methods (like `TableEnvironment.executeSql` and `StatementSet.execute`), the result will not be returned until the job is finished.
The following methods will be introduced in this FLIP:

/**
 * Asynchronously executes the given single statement.
 */
TableEnvironment.executeSqlAsync(String statement): TableResult

/**
 * Asynchronously executes the DML statements as a batch.
 */
StatementSet.executeAsync(): TableResult

public interface TableResult {
    /**
     * Returns the JobClient for DQL and DML in async mode, otherwise Optional.empty().
     */
    Optional<JobClient> getJobClient();
}

What do you think?

Best,
Godfrey

Timo Walther <twal...@apache.org> wrote on Thu, Mar 26, 2020 at 9:15 PM:

Hi Godfrey,

Executing streaming queries must be our top priority because this is what distinguishes Flink from competitors.
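The `TableResult` proposal above can be mocked with only the JDK's `CompletableFuture` (hypothetical names; the real `TableResult`/`JobClient` live in Flink): submission returns immediately, DDL-like results carry no job handle, and a caller gets sync behavior by joining the handle.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for the proposed TableResult: only async DML/DQL
 *  expose a job handle; DDL-like statements are already completed. */
class MockTableResult {
    private final CompletableFuture<String> job; // stands in for JobClient

    private MockTableResult(CompletableFuture<String> job) { this.job = job; }

    /** Result of a statement that completes eagerly (e.g. a DDL). */
    static MockTableResult completed() { return new MockTableResult(null); }

    /** Result of an asynchronously submitted DML/DQL job. */
    static MockTableResult async(CompletableFuture<String> job) { return new MockTableResult(job); }

    /** Mirrors the proposed Optional&lt;JobClient&gt; getJobClient(). */
    Optional<CompletableFuture<String>> getJobClient() { return Optional.ofNullable(job); }
}
```

A user wanting sync semantics would call `result.getJobClient().get().join()`; a DDL result returns `Optional.empty()`, so there is nothing to wait for.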
If we change the execution behavior, we should think about the other cases as well, so as not to break the API a third time.

I fear that just having an async execute method will not be enough, because users should be able to mix streaming and batch queries in a unified scenario.

If I remember correctly, we had some discussions in the past about what decides the execution mode of a query. Currently, we would like to let the query decide, not derive it from the sources.
So I could imagine a multiline pipeline as:

USE CATALOG 'mycat';
INSERT INTO t1 SELECT * FROM s;
INSERT INTO t2 SELECT * FROM s JOIN t1 EMIT STREAM;

For executeMultilineSql():

sync because regular SQL
sync because regular Batch SQL
async because Streaming SQL

For executeAsyncMultilineSql():

async because everything should be async
async because everything should be async
async because everything should be async

What we should not do for executeAsyncMultilineSql():

sync because DDL
async because everything should be async
async because everything should be async

What are your thoughts here?

Regards,
Timo

On 26.03.20 12:50, godfrey he wrote:

Hi Timo,

I agree with you that streaming queries mostly need async execution.
In fact, our original plan was to introduce only sync methods in this FLIP; async methods (like "executeSqlAsync") would be introduced in the future, as mentioned in the appendix.

Maybe the async methods also need to be considered in this FLIP.

I think sync methods are also useful for streaming, which