Hi Fabian,

Thanks for your suggestions. I agree with you that `UNQ(f2, f3)` is clearer.
A table can have only ONE primary key, and this primary key can consist of one or multiple columns. [1]
If the primary key consists of a single column, we can simply use `PRI` (or `PRI(xx)`) to represent it.
If the primary key consists of multiple columns, we should use `PRI(xx, yy, ...)` to represent it.

A table may have multiple unique keys, and each unique key can consist of one or multiple columns. [2]
If there is only one unique key and this unique key has only a single column, we can simply use `UNQ` (or `UNQ(xx)`) to represent it.
Otherwise, we should use `UNQ(xx, yy, ...)` to represent it.
(A corner case: two unique keys with the same columns, like `UNQ(f2, f3)` and `UNQ(f2, f3)`. We can forbid this case, or add a unique name for each key in the future.)

An example with a multi-column primary key and a multi-column unique key:

create table MyTable (
  f0 BIGINT NOT NULL,
  f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
  f2 VARCHAR(256),
  f3 AS f0 + 1,
  f4 TIMESTAMP(3) NOT NULL,
  f5 BIGINT NOT NULL,
  PRIMARY KEY (f0, f5),
  UNIQUE (f3, f2),
  WATERMARK f4 AS f4 - INTERVAL '3' SECOND
) with (...)
+------+---------------------------------+-------+-------------+----------------+--------------------------+
| name | type                            | null  | key         | compute column | watermark                |
+------+---------------------------------+-------+-------------+----------------+--------------------------+
| f0   | BIGINT                          | false | PRI(f0, f5) | (NULL)         | (NULL)                   |
| f1   | ROW<q1 STRING, q2 TIMESTAMP(3)> | true  | (NULL)      | (NULL)         | (NULL)                   |
| f2   | VARCHAR(256)                    | true  | UNQ(f2, f3) | (NULL)         | (NULL)                   |
| f3   | BIGINT                          | false | UNQ(f2, f3) | f0 + 1         | (NULL)                   |
| f4   | TIMESTAMP(3)                    | false | (NULL)      | (NULL)         | f4 - INTERVAL '3' SECOND |
| f5   | BIGINT                          | false | PRI(f0, f5) | (NULL)         | (NULL)                   |
+------+---------------------------------+-------+-------------+----------------+--------------------------+

Regarding "the watermark on nested columns": that is a good approach, which can both support watermarks on nested columns in the future and keep the current table form.
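To make the derivation of the `key` column concrete, here is a small Python sketch (illustrative only, not Flink code; all names are made up for this example) that computes the marker for each field from the constraint definitions, following the `PRI(...)`/`UNQ(...)` convention above:

```python
# Illustrative sketch (not Flink API): derive the DESCRIBE 'key' markers.
def key_markers(columns, primary_key, unique_keys):
    """Compute the 'key' cell for each column.

    Every column of the primary key shares the marker 'PRI(col, ...)';
    every column of a unique key shares 'UNQ(col, ...)'; all other
    columns get '(NULL)'.
    """
    markers = {}
    for col in columns:
        if col in primary_key:
            markers[col] = "PRI(%s)" % ", ".join(primary_key)
        else:
            uk = next((k for k in unique_keys if col in k), None)
            markers[col] = ("UNQ(%s)" % ", ".join(uk)) if uk else "(NULL)"
    return markers

# The example table above: PRIMARY KEY (f0, f5), UNIQUE over (f2, f3)
m = key_markers(["f0", "f1", "f2", "f3", "f4", "f5"],
                ["f0", "f5"], [["f2", "f3"]])
# m["f0"] == m["f5"] == "PRI(f0, f5)"; m["f2"] == m["f3"] == "UNQ(f2, f3)"
```

Note that the marker string is shared by every column that participates in the constraint, which is what lets a reader reassemble the composite key from the per-column rows.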
[1] https://www.w3schools.com/sql/sql_primarykey.asp
[2] https://www.w3schools.com/sql/sql_unique.ASP

Best,
Godfrey

Fabian Hueske <fhue...@gmail.com> wrote on Thu, May 7, 2020 at 12:03 AM:

> Hi Godfrey,
>
> This looks good to me.
>
> One side note: indicating unique constraints with "UNQ" is probably not
> enough. There might be multiple unique constraints, and users would like
> to know which field combinations are unique.
> So in your example above, "UNQ(f2, f3)" might be a better marker.
>
> Just as a thought, if we later add support for watermarks on nested
> columns, we could add a row just for the nested field (in addition to the
> top-level field) like this:
>
> +-------------------+--------------+-------+--------+----------------+-----------------------------------------+
> | f4.nested.rowtime | TIMESTAMP(3) | false | (NULL) | (NULL)         | f4.nested.rowtime - INTERVAL '3' SECOND |
> +-------------------+--------------+-------+--------+----------------+-----------------------------------------+
>
> Thanks,
> Fabian
>
> godfrey he <godfre...@gmail.com> wrote on Wed, May 6, 2020 at 5:51 PM:
>
>> Hi @fhue...@gmail.com, @Timo Walther <twal...@apache.org>, @Dawid
>> Wysakowicz <dwysakow...@apache.org>
>>
>> What do you think about requiring that a watermark must be defined on a
>> top-level column?
>>
>> If we do that, we can add an expression column to represent the watermark,
>> just like the compute column. An example covering all cases:
>>
>> create table MyTable (
>>   f0 BIGINT NOT NULL,
>>   f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>>   f2 VARCHAR(256),
>>   f3 AS f0 + 1,
>>   f4 TIMESTAMP(3) NOT NULL,
>>   PRIMARY KEY (f0),
>>   UNIQUE (f3, f2),
>>   WATERMARK f4 AS f4 - INTERVAL '3' SECOND
>> ) with (...)
>>
>> +------+---------------------------------+-------+--------+----------------+--------------------------+
>> | name | type                            | null  | key    | compute column | watermark                |
>> +------+---------------------------------+-------+--------+----------------+--------------------------+
>> | f0   | BIGINT                          | false | PRI    | (NULL)         | (NULL)                   |
>> | f1   | ROW<q1 STRING, q2 TIMESTAMP(3)> | true  | (NULL) | (NULL)         | (NULL)                   |
>> | f2   | VARCHAR(256)                    | true  | UNQ    | (NULL)         | (NULL)                   |
>> | f3   | BIGINT                          | false | UNQ    | f0 + 1         | (NULL)                   |
>> | f4   | TIMESTAMP(3)                    | false | (NULL) | (NULL)         | f4 - INTERVAL '3' SECOND |
>> +------+---------------------------------+-------+--------+----------------+--------------------------+
>>
>> WDYT?
>>
>> Best,
>> Godfrey
>>
>> godfrey he <godfre...@gmail.com> wrote on Thu, Apr 30, 2020 at 11:57 PM:
>>
>>> Hi Fabian,
>>>
>>> the broken example is:
>>>
>>> create table MyTable (
>>>   f0 BIGINT NOT NULL,
>>>   f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>>>   f2 VARCHAR(256),
>>>   f3 AS f0 + 1,
>>>   PRIMARY KEY (f0),
>>>   UNIQUE (f3, f2),
>>>   WATERMARK f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND)
>>> ) with (...)
>>>
>>> +------+-------------------------------------+--------+----------------+------------------------------------------+
>>> | name | type                                | key    | compute column | watermark                                |
>>> +------+-------------------------------------+--------+----------------+------------------------------------------+
>>> | f0   | BIGINT NOT NULL                     | PRI    | (NULL)         |                                          |
>>> | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>> | f2   | VARCHAR(256)                        | (NULL) | NULL           |                                          |
>>> | f3   | BIGINT NOT NULL                     | UNQ    | f0 + 1         |                                          |
>>> +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>
>>> or we add a column to represent nullability:
>>>
>>> +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>> | name | type                                | null  | key    | compute column | watermark                                |
>>> +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>> | f0   | BIGINT                              | false | PRI    | (NULL)         |                                          |
>>> | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | true  | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>> | f2   | VARCHAR(256)                        | true  | (NULL) | NULL           |                                          |
>>> | f3   | BIGINT                              | false | UNQ    | f0 + 1         |                                          |
>>> +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>
>>> Hi Jark,
>>> If we can limit watermarks to be defined only on top-level columns,
>>> this becomes much simpler.
>>>
>>> Best,
>>> Godfrey
>>>
>>> Jark Wu <imj...@gmail.com> wrote on Thu, Apr 30, 2020 at 11:38 PM:
>>>
>>>> Hi,
>>>>
>>>> I'm in favor of Fabian's proposal.
>>>> First, a watermark is not a column but metadata, just like a primary key,
>>>> so it shouldn't stand with the columns.
>>>> Second, AFAIK, a primary key can only be defined on top-level columns.
>>>> Third, I think watermarks can follow primary keys and only be allowed
>>>> on top-level columns.
>>>>
>>>> I have to admit that in FLIP-66, a watermark can be defined on nested
>>>> fields. However, during implementation, I found that it's too complicated
>>>> to do that. We have to refactor the time-based physical nodes, use code
>>>> generation to access the event time, and refactor FlinkTypeFactory to
>>>> support complex nested rowtime attributes.
>>>> There is not much value in this feature, but it introduces a lot of
>>>> complexity into the code base.
>>>> So I think we can force watermarks to be defined on top-level columns.
>>>> If users want to define one on a nested column, they can use a computed
>>>> column to turn it into a top-level column.
>>>>
>>>> Best,
>>>> Jark
>>>>
>>>> On Thu, 30 Apr 2020 at 17:55, Fabian Hueske <fhue...@gmail.com> wrote:
>>>>
>>>> > Hi Godfrey,
>>>> >
>>>> > The formatting of your examples seems to be broken.
>>>> > Could you send them again please?
>>>> >
>>>> > Regarding your points:
>>>> > > because the watermark expression can refer to a sub-column, just like
>>>> > > `f1.q2` in the example I gave.
>>>> >
>>>> > I would put the watermark information in the row of the top-level field
>>>> > and indicate which nested field the watermark refers to.
>>>> > Don't we have to solve the same issue for primary keys that are defined
>>>> > on a nested field?
>>>> >
>>>> > > A boolean flag can't represent such info, and I don't know whether we
>>>> > > will support complex watermark expressions involving multiple columns
>>>> > > in the future, such as: "WATERMARK FOR ts AS ts + f1 + INTERVAL '1' SECOND"
>>>> >
>>>> > You are right, a simple binary flag is definitely not sufficient to
>>>> > display the watermark information.
>>>> > I would put the expression string into the field, i.e., "ts + f1 +
>>>> > INTERVAL '1' SECOND".
>>>> >
>>>> > For me the most important reason not to show the watermark as a row in
>>>> > the table is that it is not a field that can be queried, but meta
>>>> > information on an existing field.
>>>> > For the user it is important to know that a certain field has a
>>>> > watermark. Otherwise, certain queries cannot be correctly specified.
>>>> > Also, there might be support for multiple watermarks defined on
>>>> > different fields at some point. Would those be printed in multiple rows?
>>>> >
>>>> > Best,
>>>> > Fabian
>>>> >
>>>> > On Thu, Apr 30, 2020 at 11:25 AM, godfrey he <godfre...@gmail.com> wrote:
>>>> >
>>>> > > Hi Fabian, Aljoscha,
>>>> > >
>>>> > > Thanks for the feedback.
>>>> > >
>>>> > > Agree with you that we can deal with the primary key as you mentioned.
>>>> > > Right now, the type column contains the nullability attribute, e.g.
>>>> > > BIGINT NOT NULL.
>>>> > > (I'm also OK with using two columns to represent the type, just like
>>>> > > MySQL.)
>>>> > >
>>>> > > > Why do I treat `watermark` as a special row?
>>>> > > Because the watermark expression can refer to a sub-column, just like
>>>> > > `f1.q2` in the example I gave.
>>>> > > A boolean flag can't represent such info, and I don't know whether we
>>>> > > will support complex watermark expressions involving multiple columns
>>>> > > in the future, such as: "WATERMARK FOR ts AS ts + f1 + INTERVAL '1' SECOND".
>>>> > >
>>>> > > If we do not support complex watermark expressions, we can add a
>>>> > > watermark column.
>>>> > >
>>>> > > For example:
>>>> > >
>>>> > > create table MyTable (
>>>> > >   f0 BIGINT NOT NULL,
>>>> > >   f1 ROW<q1 STRING, q2 TIMESTAMP(3)>,
>>>> > >   f2 VARCHAR(256),
>>>> > >   f3 AS f0 + 1,
>>>> > >   PRIMARY KEY (f0),
>>>> > >   UNIQUE (f3, f2),
>>>> > >   WATERMARK f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND)
>>>> > > ) with (...)
>>>> > >
>>>> > > +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>> > > | name | type                                | key    | compute column | watermark                                |
>>>> > > +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>> > > | f0   | BIGINT NOT NULL                     | PRI    | (NULL)         |                                          |
>>>> > > | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>>> > > | f2   | VARCHAR(256)                        | (NULL) | NULL           |                                          |
>>>> > > | f3   | BIGINT NOT NULL                     | UNQ    | f0 + 1         |                                          |
>>>> > > +------+-------------------------------------+--------+----------------+------------------------------------------+
>>>> > >
>>>> > > or we add a column to represent nullability:
>>>> > >
>>>> > > +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>> > > | name | type                                | null  | key    | compute column | watermark                                |
>>>> > > +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>> > > | f0   | BIGINT                              | false | PRI    | (NULL)         |                                          |
>>>> > > | f1   | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | true  | UNQ    | (NULL)         | f1.q2 AS (`f1.q2` - INTERVAL '3' SECOND) |
>>>> > > | f2   | VARCHAR(256)                        | true  | (NULL) | NULL           |                                          |
>>>> > > | f3   | BIGINT                              | false | UNQ    | f0 + 1         |                                          |
>>>> > > +------+-------------------------------------+-------+--------+----------------+------------------------------------------+
>>>> > >
>>>> > > Personally, I like the second one. (We need to make some changes to
>>>> > > LogicalType to get the type name without nullability.)
>>>> > >
>>>> > > Best,
>>>> > > Godfrey
>>>> > >
>>>> > > Aljoscha Krettek <aljos...@apache.org> wrote on Wed, Apr 29, 2020 at 5:47 PM:
>>>> > >
>>>> > > > +1 I like the general idea of printing the results as a table.
>>>> > > >
>>>> > > > On the specifics I don't know enough, but Fabian's suggestions seem
>>>> > > > to make sense to me.
>>>> > > >
>>>> > > > Aljoscha
>>>> > > >
>>>> > > > On 29.04.20 10:56, Fabian Hueske wrote:
>>>> > > > > Hi Godfrey,
>>>> > > > >
>>>> > > > > Thanks for starting this discussion!
>>>> > > > >
>>>> > > > > In my mind, WATERMARK is a property (or constraint) of a field,
>>>> > > > > just like PRIMARY KEY.
>>>> > > > > Take this example from MySQL:
>>>> > > > >
>>>> > > > > mysql> CREATE TABLE people (id INT NOT NULL, name VARCHAR(128)
>>>> > > > > NOT NULL, age INT, PRIMARY KEY (id));
>>>> > > > > Query OK, 0 rows affected (0.06 sec)
>>>> > > > >
>>>> > > > > mysql> describe people;
>>>> > > > > +-------+--------------+------+-----+---------+-------+
>>>> > > > > | Field | Type         | Null | Key | Default | Extra |
>>>> > > > > +-------+--------------+------+-----+---------+-------+
>>>> > > > > | id    | int          | NO   | PRI | NULL    |       |
>>>> > > > > | name  | varchar(128) | NO   |     | NULL    |       |
>>>> > > > > | age   | int          | YES  |     | NULL    |       |
>>>> > > > > +-------+--------------+------+-----+---------+-------+
>>>> > > > > 3 rows in set (0.01 sec)
>>>> > > > >
>>>> > > > > Here, PRIMARY KEY is marked in the Key column of the id field.
>>>> > > > > We could do the same for watermarks by adding a Watermark column.
>>>> > > > >
>>>> > > > > Best, Fabian
>>>> > > > >
>>>> > > > > godfrey he <godfre...@gmail.com> wrote on Wed, Apr 29, 2020 at 10:43 AM:
>>>> > > > >
>>>> > > > >> Hi everyone,
>>>> > > > >>
>>>> > > > >> I would like to bring up a discussion about the result type of
>>>> > > > >> the DESCRIBE statement, which is introduced in FLIP-84 [1].
>>>> > > > >> In the previous version, we defined the result type of the
>>>> > > > >> `describe` statement as a single column, as follows:
>>>> > > > >>
>>>> > > > >> +-------------+---------------------------------+----------------------------------+----------------------+---------------------+
>>>> > > > >> | Statement   | Result Schema                   | Result Value                     | Result Kind          | Examples            |
>>>> > > > >> +-------------+---------------------------------+----------------------------------+----------------------+---------------------+
>>>> > > > >> | DESCRIBE xx | field name: result              | describe the detail of an object | SUCCESS_WITH_CONTENT | DESCRIBE table_name |
>>>> > > > >> |             | field type: VARCHAR(n)          | (single row)                     |                      |                     |
>>>> > > > >> |             | (n is the max length of values) |                                  |                      |                     |
>>>> > > > >> +-------------+---------------------------------+----------------------------------+----------------------+---------------------+
>>>> > > > >>
>>>> > > > >> For "DESCRIBE table_name", the result value is the `toString`
>>>> > > > >> value of `TableSchema`, which is unstructured data.
>>>> > > > >> It's hard for users to use this info.
>>>> > > > >>
>>>> > > > >> For example:
>>>> > > > >>
>>>> > > > >> TableSchema schema = TableSchema.builder()
>>>> > > > >>     .field("f0", DataTypes.BIGINT())
>>>> > > > >>     .field("f1", DataTypes.ROW(
>>>> > > > >>         DataTypes.FIELD("q1", DataTypes.STRING()),
>>>> > > > >>         DataTypes.FIELD("q2", DataTypes.TIMESTAMP(3))))
>>>> > > > >>     .field("f2", DataTypes.STRING())
>>>> > > > >>     .field("f3", DataTypes.BIGINT(), "f0 + 1")
>>>> > > > >>     .watermark("f1.q2", WATERMARK_EXPRESSION, WATERMARK_DATATYPE)
>>>> > > > >>     .build();
>>>> > > > >>
>>>> > > > >> Its `toString` value is:
>>>> > > > >>
>>>> > > > >> root
>>>> > > > >>  |-- f0: BIGINT
>>>> > > > >>  |-- f1: ROW<`q1` STRING, `q2` TIMESTAMP(3)>
>>>> > > > >>  |-- f2: STRING
>>>> > > > >>  |-- f3: BIGINT AS f0 + 1
>>>> > > > >>  |-- WATERMARK FOR f1.q2 AS now()
>>>> > > > >>
>>>> > > > >> For Hive, MySQL, etc., the describe result is in table form,
>>>> > > > >> including field names and field types, which is more familiar
>>>> > > > >> to users.
>>>> > > > >> TableSchema [2] also has the watermark expression and computed
>>>> > > > >> columns; we should put them into the table as well:
>>>> > > > >> for a computed column, which is column-level, we add a new
>>>> > > > >> column named `expr`;
>>>> > > > >> for the watermark expression, which is table-level, we add a
>>>> > > > >> special row named `WATERMARK` to represent it.
>>>> > > > >>
>>>> > > > >> For the example above, the result will look like:
>>>> > > > >>
>>>> > > > >> +-----------+-------------------------------------+----------------+
>>>> > > > >> | name      | type                                | expr           |
>>>> > > > >> +-----------+-------------------------------------+----------------+
>>>> > > > >> | f0        | BIGINT                              | (NULL)         |
>>>> > > > >> | f1        | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | (NULL)         |
>>>> > > > >> | f2        | STRING                              | NULL           |
>>>> > > > >> | f3        | BIGINT                              | f0 + 1         |
>>>> > > > >> | WATERMARK | (NULL)                              | f1.q2 AS now() |
>>>> > > > >> +-----------+-------------------------------------+----------------+
>>>> > > > >>
>>>> > > > >> There is now a PR, FLINK-17112 [3], to implement the DESCRIBE
>>>> > > > >> statement.
>>>> > > > >>
>>>> > > > >> What do you think about this update?
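The layout proposed above (a computed-column `expr` column plus a special table-level `WATERMARK` row) can be sketched in a few lines of Python. This is purely illustrative, not the Flink implementation; the function name and tuple shapes are assumptions made for this example:

```python
# Illustrative sketch (not the Flink implementation): render the proposed
# DESCRIBE layout, with computed columns in an 'expr' column and the
# watermark as a special table-level row.
def describe_rows(fields, watermark=None):
    """fields: list of (name, type, computed_expr_or_None);
    watermark: optional (rowtime_attr, expression) pair appended as a
    special 'WATERMARK' row at the end."""
    rows = [(name, typ, expr if expr is not None else "(NULL)")
            for name, typ, expr in fields]
    if watermark is not None:
        attr, expr = watermark
        rows.append(("WATERMARK", "(NULL)", "%s AS %s" % (attr, expr)))
    return rows

# The schema from the example above
rows = describe_rows(
    [("f0", "BIGINT", None),
     ("f1", "ROW<`q1` STRING, `q2` TIMESTAMP(3)>", None),
     ("f2", "STRING", None),
     ("f3", "BIGINT", "f0 + 1")],
    watermark=("f1.q2", "now()"))
# rows[-1] == ("WATERMARK", "(NULL)", "f1.q2 AS now()")
```

The point of the sketch is that the watermark is appended as one extra row rather than attached to a field, which is exactly the property debated later in the thread.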
>>>> > > > >> Any feedback is welcome~
>>>> > > > >>
>>>> > > > >> Best,
>>>> > > > >> Godfrey
>>>> > > > >>
>>>> > > > >> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=134745878
>>>> > > > >> [2] https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/api/TableSchema.java
>>>> > > > >> [3] https://github.com/apache/flink/pull/11892
>>>> > > > >>
>>>> > > > >> godfrey he <godfre...@gmail.com> wrote on Mon, Apr 6, 2020 at 10:38 PM:
>>>> > > > >>
>>>> > > > >>> Hi Timo,
>>>> > > > >>>
>>>> > > > >>> Sorry for the late reply, and thanks for your correction.
>>>> > > > >>> I missed DQL for the job submission scenario.
>>>> > > > >>> I'll fix the document right away.
>>>> > > > >>>
>>>> > > > >>> Best,
>>>> > > > >>> Godfrey
>>>> > > > >>>
>>>> > > > >>> Timo Walther <twal...@apache.org> wrote on Fri, Apr 3, 2020 at 9:53 PM:
>>>> > > > >>>
>>>> > > > >>>> Hi Godfrey,
>>>> > > > >>>>
>>>> > > > >>>> I'm sorry to jump in again, but I still need to clarify some
>>>> > > > >>>> things around TableResult.
>>>> > > > >>>>
>>>> > > > >>>> The FLIP says:
>>>> > > > >>>> "For DML, this method returns TableResult until the job is
>>>> > > > >>>> submitted. For other statements, TableResult is returned until
>>>> > > > >>>> the execution is finished."
>>>> > > > >>>>
>>>> > > > >>>> I thought we agreed on making every execution async? This also
>>>> > > > >>>> means returning a TableResult for DQLs even though the execution
>>>> > > > >>>> is not done yet. People need access to the JobClient also for
>>>> > > > >>>> batch jobs in order to cancel long-lasting queries. If people
>>>> > > > >>>> want to wait for completion, they can hook into JobClient or
>>>> > > > >>>> collect().
>>>> > > > >>>>
>>>> > > > >>>> Can we rephrase this part to:
>>>> > > > >>>> "For DML and DQL, this method returns TableResult once the job
>>>> > > > >>>> has been submitted. For DDL and DCL statements, TableResult is
>>>> > > > >>>> returned once the operation has finished."
>>>> > > > >>>>
>>>> > > > >>>> Regards,
>>>> > > > >>>> Timo
>>>> > > > >>>>
>>>> > > > >>>> On 02.04.20 05:27, godfrey he wrote:
>>>> > > > >>>>> Hi Aljoscha, Dawid, Timo,
>>>> > > > >>>>>
>>>> > > > >>>>> Thanks so much for the detailed explanation.
>>>> > > > >>>>> Agree with you that the multiline story is not complete yet,
>>>> > > > >>>>> and we can keep discussing it.
>>>> > > > >>>>> I will add the current discussions and conclusions to the FLIP.
>>>> > > > >>>>>
>>>> > > > >>>>> Best,
>>>> > > > >>>>> Godfrey
>>>> > > > >>>>>
>>>> > > > >>>>> Timo Walther <twal...@apache.org> wrote on Wed, Apr 1, 2020 at 11:27 PM:
>>>> > > > >>>>>
>>>> > > > >>>>>> Hi Godfrey,
>>>> > > > >>>>>>
>>>> > > > >>>>>> First of all, I agree with Dawid. The multiline story is not
>>>> > > > >>>>>> completed by this FLIP. It just verifies the big picture.
>>>> > > > >>>>>>
>>>> > > > >>>>>> 1. "control the execution logic through the proposed method
>>>> > > > >>>>>> if they know what the statements are"
>>>> > > > >>>>>>
>>>> > > > >>>>>> This is a good point that Fabian also raised in the linked
>>>> > > > >>>>>> Google doc. I could also imagine returning a more complicated
>>>> > > > >>>>>> POJO when calling `executeMultiSql()`.
>>>> > > > >>>>>>
>>>> > > > >>>>>> The POJO would include some `getSqlProperties()` such that a
>>>> > > > >>>>>> platform gets insights into the query before executing.
>>>> > > > >>>>>> We could also trigger the execution more explicitly instead
>>>> > > > >>>>>> of hiding it behind an iterator.
>>>> > > > >>>>>>
>>>> > > > >>>>>> 2. "there are some special commands introduced in the SQL Client"
>>>> > > > >>>>>>
>>>> > > > >>>>>> For platforms and SQL Client specific commands, we could
>>>> > > > >>>>>> offer a hook to the parser, or a fallback parser in case the
>>>> > > > >>>>>> regular table environment parser cannot deal with the statement.
>>>> > > > >>>>>>
>>>> > > > >>>>>> However, all of that is future work and can be discussed in
>>>> > > > >>>>>> a separate FLIP.
>>>> > > > >>>>>>
>>>> > > > >>>>>> 3. +1 for the `Iterator` instead of `Iterable`.
>>>> > > > >>>>>>
>>>> > > > >>>>>> 4. "we should convert the checked exception to unchecked exception"
>>>> > > > >>>>>>
>>>> > > > >>>>>> Yes, I meant using a runtime exception instead of a checked
>>>> > > > >>>>>> exception. There was no consensus on putting the exception
>>>> > > > >>>>>> into the `TableResult`.
>>>> > > > >>>>>>
>>>> > > > >>>>>> Regards,
>>>> > > > >>>>>> Timo
>>>> > > > >>>>>>
>>>> > > > >>>>>> On 01.04.20 15:35, Dawid Wysakowicz wrote:
>>>> > > > >>>>>>> When considering the multi-line support, I think it is
>>>> > > > >>>>>>> helpful to start with a use case in mind. In my opinion,
>>>> > > > >>>>>>> consumers of this method will be:
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> 1. the SQL Client
>>>> > > > >>>>>>> 2. third-party SQL-based platforms
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> @Godfrey As for the quit/source/... commands: I think those
>>>> > > > >>>>>>> belong to the responsibility of the aforementioned. I think
>>>> > > > >>>>>>> they should not be understandable by the TableEnvironment.
>>>> > > > >>>>>>> What would quit on a TableEnvironment do?
>>>> > > > >>>>>>> Moreover, I think such commands should be prefixed
>>>> > > > >>>>>>> appropriately. I think it's a common practice to e.g. prefix
>>>> > > > >>>>>>> those with ! or : to say they are meta commands of the tool
>>>> > > > >>>>>>> rather than a query.
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> I also don't necessarily understand why platform users need
>>>> > > > >>>>>>> to know the kind of the query to use the proposed method.
>>>> > > > >>>>>>> They should get the type from the TableResult#ResultKind.
>>>> > > > >>>>>>> If the ResultKind is SUCCESS, it was a DCL/DDL. If
>>>> > > > >>>>>>> SUCCESS_WITH_CONTENT, it was a DML/DQL. If that's not
>>>> > > > >>>>>>> enough, we can enrich the TableResult with a more explicit
>>>> > > > >>>>>>> kind of query, but so far I don't see such a need.
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> @Kurt In those cases I would assume the developers want to
>>>> > > > >>>>>>> present the results of the queries anyway. Moreover, I think
>>>> > > > >>>>>>> it is safe to assume they can adhere to such a contract that
>>>> > > > >>>>>>> the results must be iterated.
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> For direct users of TableEnvironment/Table API this method
>>>> > > > >>>>>>> does not make much sense anyway, in my opinion. I think we
>>>> > > > >>>>>>> can rather safely assume in this scenario that they do not
>>>> > > > >>>>>>> want to submit multiple queries at a single time.
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> Best,
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> Dawid
>>>> > > > >>>>>>>
>>>> > > > >>>>>>> On 01/04/2020 15:07, Kurt Young wrote:
>>>> > > > >>>>>>>> One comment on `executeMultilineSql`: I'm afraid users
>>>> > > > >>>>>>>> might sometimes forget to iterate the returned iterators,
>>>> > > > >>>>>>>> e.g. a user submits a bunch of DDLs and expects the
>>>> > > > >>>>>>>> framework to execute them one by one. But it didn't.
>>>> > > > >>>>>>>>
>>>> > > > >>>>>>>> Best,
>>>> > > > >>>>>>>> Kurt
>>>> > > > >>>>>>>>
>>>> > > > >>>>>>>> On Wed, Apr 1, 2020 at 5:10 PM Aljoscha Krettek <aljos...@apache.org> wrote:
>>>> > > > >>>>>>>>
>>>> > > > >>>>>>>>> Agreed with what Dawid and Timo said.
>>>> > > > >>>>>>>>>
>>>> > > > >>>>>>>>> To answer your question about multi-line SQL: no, we don't
>>>> > > > >>>>>>>>> think we need this in Flink 1.11; we only wanted to make
>>>> > > > >>>>>>>>> sure that the interfaces that we now put in place will
>>>> > > > >>>>>>>>> potentially allow this in the future.
>>>> > > > >>>>>>>>>
>>>> > > > >>>>>>>>> Best,
>>>> > > > >>>>>>>>> Aljoscha
>>>> > > > >>>>>>>>>
>>>> > > > >>>>>>>>> On 01.04.20 09:31, godfrey he wrote:
>>>> > > > >>>>>>>>>> Hi Timo & Dawid,
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>> Thanks so much for the effort on `multiline statements support`.
>>>> > > > >>>>>>>>>> I have a few questions about this method:
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>> 1. Users can well control the execution logic through the
>>>> > > > >>>>>>>>>> proposed method if they know what the statements are
>>>> > > > >>>>>>>>>> (whether a statement is a DDL, a DML or something else).
>>>> > > > >>>>>>>>>> But if a statement comes from a file, users do not know
>>>> > > > >>>>>>>>>> what the statements are, so the execution behavior is unclear.
>>>> > > > >>>>>>>>>> As a platform user, I think this method is hard to use,
>>>> > > > >>>>>>>>>> unless the platform defines a set of rules about the
>>>> > > > >>>>>>>>>> statement order, such as: no SELECT in the middle, DML
>>>> > > > >>>>>>>>>> must be at the tail of the SQL file (which may be the most
>>>> > > > >>>>>>>>>> common case in a production environment).
>>>> > > > >>>>>>>>>> Otherwise the platform must parse the SQL first, and only
>>>> > > > >>>>>>>>>> then does it know what the statements are.
>>>> > > > >>>>>>>>>> If it does that, the platform can handle all cases through
>>>> > > > >>>>>>>>>> `executeSql` and `StatementSet`.
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>> 2. The SQL Client also can't use `executeMultilineSql` to
>>>> > > > >>>>>>>>>> support multiline statements, because there are some
>>>> > > > >>>>>>>>>> special commands introduced in the SQL Client, such as
>>>> > > > >>>>>>>>>> `quit`, `source`, `load jar` (the last does not exist now,
>>>> > > > >>>>>>>>>> but maybe we need this command to support dynamic table
>>>> > > > >>>>>>>>>> sources and UDFs).
>>>> > > > >>>>>>>>>> Does TableEnvironment also support those commands?
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>> 3. BTW, must we have this feature in release 1.11? I find
>>>> > > > >>>>>>>>>> there are a few user cases in the feedback document whose
>>>> > > > >>>>>>>>>> behavior is unclear now.
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>> Regarding "change the return value from `Iterable<Row>`
>>>> > > > >>>>>>>>>> to `Iterator<Row>`": I couldn't agree more with this
>>>> > > > >>>>>>>>>> change. Just as Dawid mentioned, "The contract of the
>>>> > > > >>>>>>>>>> Iterable#iterator is that it returns a new iterator each
>>>> > > > >>>>>>>>>> time, which effectively means we can iterate the results
>>>> > > > >>>>>>>>>> multiple times." We do not provide iterating the results
>>>> > > > >>>>>>>>>> multiple times.
>>>> > > > >>>>>>>>>> If we wanted to do that, the client would have to buffer
>>>> > > > >>>>>>>>>> all results, which is impossible for a streaming job.
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>> Best,
>>>> > > > >>>>>>>>>> Godfrey
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>> Dawid Wysakowicz <dwysakow...@apache.org> wrote on Wed, Apr 1, 2020 at 3:14 AM:
>>>> > > > >>>>>>>>>>
>>>> > > > >>>>>>>>>>> Thank you Timo for the great summary! It covers (almost)
>>>> > > > >>>>>>>>>>> all the topics. Even though in the end we are not
>>>> > > > >>>>>>>>>>> suggesting many changes to the current state of the FLIP,
>>>> > > > >>>>>>>>>>> I think it is important to lay out all possible use cases
>>>> > > > >>>>>>>>>>> so that we do not change the execution model every release.
>>>> > > > >>>>>>>>>>>
>>>> > > > >>>>>>>>>>> There is one additional thing we discussed. Could we
>>>> > > > >>>>>>>>>>> change the result type of TableResult#collect to
>>>> > > > >>>>>>>>>>> Iterator<Row>? Even though those interfaces do not differ
>>>> > > > >>>>>>>>>>> much, I think Iterator better describes that the results
>>>> > > > >>>>>>>>>>> might not be materialized on the client side, but can be
>>>> > > > >>>>>>>>>>> retrieved on a per-record basis. The contract of the
>>>> > > > >>>>>>>>>>> Iterable#iterator is that it returns a new iterator each
>>>> > > > >>>>>>>>>>> time, which effectively means we can iterate the results
>>>> > > > >>>>>>>>>>> multiple times. Iterating the results multiple times is
>>>> > > > >>>>>>>>>>> not possible when we don't retrieve all the results from
>>>> > > > >>>>>>>>>>> the cluster at once.
>>>> > > > >>>>>>>>>>> I think we should also use Iterator for
>>>> > > > >>>>>>>>>>> TableEnvironment#executeMultilineSql(String statements):
>>>> > > > >>>>>>>>>>> Iterator<TableResult>.
>>>> > > > >>>>>>>>>>>
>>>> > > > >>>>>>>>>>> Best,
>>>> > > > >>>>>>>>>>>
>>>> > > > >>>>>>>>>>> Dawid
>>>> > > > >>>>>>>>>>>
>>>> > > > >>>>>>>>>>> On 31/03/2020 19:27, Timo Walther wrote:
>>>> > > > >>>>>>>>>>>> Hi Godfrey,
>>>> > > > >>>>>>>>>>>>
>>>> > > > >>>>>>>>>>>> Aljoscha, Dawid, Klou, and I had another discussion
>>>> > > > >>>>>>>>>>>> around FLIP-84. In particular, we discussed how the
>>>> > > > >>>>>>>>>>>> current status of the FLIP and the future requirements
>>>> > > > >>>>>>>>>>>> around multiline statements, async/sync, and collect()
>>>> > > > >>>>>>>>>>>> fit together.
>>>> > > > >>>>>>>>>>>>
>>>> > > > >>>>>>>>>>>> We also updated the FLIP-84 Feedback Summary document [1]
>>>> > > > >>>>>>>>>>>> with some use cases.
>>>> > > > >>>>>>>>>>>>
>>>> > > > >>>>>>>>>>>> We believe that we found a good solution that also fits to what is in
So no bigger changes are necessary, which is great!

Our findings were:

1. Async vs. sync submission of Flink jobs:

Having a blocking `execute()` in the DataStream API was rather a mistake. Instead, all submissions should be async because this allows supporting both modes if necessary. Thus, submitting all queries async sounds good to us. If users want to run a job sync, they can use the JobClient and wait for completion (or collect() in the case of batch jobs).

2. Multi-statement execution:

For multi-statement execution, we don't see a contradiction with the async execution behavior. We imagine a method like:

TableEnvironment#executeMultilineSql(String statements): Iterable<TableResult>

where the `Iterator#next()` method would trigger the next statement submission. This allows a caller to decide synchronously when to submit statements async to the cluster. Thus, a service such as the SQL Client can handle the result of each statement individually and process statement by statement sequentially.

3.
The role of TableResult and result retrieval in general:

`TableResult` is similar to `JobClient`. Instead of returning a `CompletableFuture` of something, it is a concrete util class where some methods have the behavior of a completable future (e.g. collect(), print()) and some are already completed (getTableSchema(), getResultKind()).

`StatementSet#execute()` returns a single `TableResult` because the order is undefined in a set and all statements have the same schema. Its `collect()` will return a row for each executed `INSERT INTO`, in the order of statement definition.

For a simple `SELECT * FROM ...`, the query execution might block until `collect()` is called to pull buffered rows from the job (from a socket/REST API, whatever we will use in the future). We can say that a statement finished successfully when `collect#Iterator#hasNext` has returned false.

I hope this summarizes our discussion, @Dawid/Aljoscha/Klou?

It would be great if we can add these findings to the FLIP before we start voting.
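The `TableResult` shape described above (metadata accessors already completed, `collect()` future-like and single-pass) can be sketched roughly as follows. The interface, class, and values here are illustrative stand-ins for the FLIP discussion, not Flink's real API, and `Row` is simplified to `String`:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

/**
 * Hypothetical, simplified shape of the TableResult discussed above.
 * Metadata is available immediately; collect() hands out results lazily
 * and only once. Not the actual Flink interface.
 */
interface SketchedTableResult {
    List<String> getTableSchema();      // already completed on return
    String getResultKind();             // already completed on return
    Optional<String> getJobClient();    // present for async DQL/DML only
    Iterator<String> collect();         // future-like: may block per record
}

/** A StatementSet-style result: one row per executed INSERT INTO. */
final class StatementSetResult implements SketchedTableResult {
    private final List<String> affectedRowCounts;

    StatementSetResult(List<String> affectedRowCounts) {
        this.affectedRowCounts = affectedRowCounts;
    }

    @Override public List<String> getTableSchema() { return List.of("affected_rowcount"); }
    @Override public String getResultKind() { return "SUCCESS_WITH_CONTENT"; }
    @Override public Optional<String> getJobClient() { return Optional.of("job-handle"); }
    @Override public Iterator<String> collect() { return affectedRowCounts.iterator(); }
}
```

The one-result-per-`INSERT INTO` design keeps `StatementSet#execute()` returning a single object even though several jobs may run behind it.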
One minor thing: some `execute()` methods still throw a checked exception; can we remove that from the FLIP? Also, the above-mentioned `Iterator#next()` would trigger an execution without throwing a checked exception.

Thanks,
Timo

[1] https://docs.google.com/document/d/1ueLjQWRPdLTFB_TReAyhseAX-1N3j4WYWD0F02Uau0E/edit#

On 31.03.20 06:28, godfrey he wrote:

Hi Timo & Jark,

Thanks for your explanation. I agree with you that async execution should always be async, and the sync execution scenario can be covered by async execution. It helps provide a unified entry point for batch and streaming. I think we can also use sync execution for some testing.

So, I agree with you that we provide an `executeSql` method and that it is an async method. If we want a sync method in the future, we can add a method named `executeSqlSync`.

I think we've reached an agreement. I will update the document and start the voting process.
Best,
Godfrey

Jark Wu <imj...@gmail.com> wrote on Tuesday, March 31, 2020, at 12:46 AM:

Hi,

I didn't follow the full discussion, but I share the same concern with Timo that streaming queries should always be async. Otherwise, I can imagine it will cause a lot of confusion and problems if users don't deeply keep the "sync" in mind (e.g. the client hangs). Besides, the streaming mode is still the majority of use cases of Flink and Flink SQL. We should put usability at a high priority.

Best,
Jark

On Mon, 30 Mar 2020 at 23:27, Timo Walther <twal...@apache.org> wrote:

Hi Godfrey,

maybe I wasn't expressing my biggest concern clearly enough in my last mail. Even in a single-line and sync execution, I think that streaming queries should not block the execution. Otherwise it is not possible to call collect() or print() on them afterwards.
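Timo's point, that submission returns immediately and blocking is the caller's explicit choice, can be illustrated with a plain `CompletableFuture` standing in for a job handle. The class and method names here are illustrative assumptions, not Flink's JobClient API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Sketch: execution always returns immediately with a handle. A caller
 * wanting sync semantics simply waits on the handle; a caller streaming
 * results keeps its thread free to call collect()/print() instead.
 */
public class AsyncFirstExecution {

    /** Pretend submission: runs the "job" on a background executor. */
    public static CompletableFuture<String> executeSql(ExecutorService cluster, String statement) {
        return CompletableFuture.supplyAsync(() -> "FINISHED: " + statement, cluster);
    }

    public static void main(String[] args) {
        ExecutorService cluster = Executors.newSingleThreadExecutor();
        CompletableFuture<String> handle =
                executeSql(cluster, "INSERT INTO t1 SELECT * FROM s");
        // Async by default; blocking for completion is opt-in.
        System.out.println(handle.join());
        cluster.shutdown();
    }
}
```

If the only entry point blocked until job completion, an unbounded streaming statement would never return, and the handle-based design above sidesteps that.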
"there are too many things that need to be discussed for multiline":

True, I don't want to solve all of them right now. But what I know is that our newly introduced methods should fit into a multiline execution. There is no big difference between calling `executeSql(A); executeSql(B)` and processing a multiline file `A;\nB;`.

I think the example that you mentioned can simply be undefined for now. Currently, no catalog is modifying data but just metadata. This is a separate discussion.

"result of the second statement is indeterministic":

Sure, this is indeterministic. But this is the implementer's fault, and we cannot forbid such pipelines.

How about we always execute streaming queries async? It would unblock executeSql() and multiline statements.

Having an `executeSqlAsync()` is useful for batch. However, I don't want `sync/async` to be the new batch/stream flag.
The execution behavior should come from the query itself.

Regards,
Timo

On 30.03.20 11:12, godfrey he wrote:

Hi Timo,

I agree with you that streaming queries are our top priority, but I think there are too many things that need to be discussed for multiline statements, e.g.:

1. What is the behavior of mixing DDL and DML for async execution?

create table t1 xxx;
create table t2 xxx;
insert into t2 select * from t1 where xxx;
drop table t1; // t1 may be a MySQL table; the data will also be deleted.

Here t1 is dropped while the "insert" job is running.

2. What is the behavior of the unified scenario for async execution (as you mentioned)?

INSERT INTO t1 SELECT * FROM s;
INSERT INTO t2 SELECT * FROM s JOIN t1 EMIT STREAM;

The result of the second statement is indeterministic, because the first statement may still be running.
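One way to reconcile multiline scripts with async submission, as suggested earlier in the thread, is an iterator whose `next()` only then submits the following statement, so the caller controls sequencing statement by statement. A toy sketch with fake submission; the method signature and names are assumptions for illustration, not Flink API:

```java
import java.util.Iterator;
import java.util.List;

/**
 * Sketch of executeMultilineSql returning an iterator whose next() is
 * what submits the next statement, so a caller such as the SQL Client
 * can decide per statement when (and whether) to continue. Submission
 * is faked by recording the statement; no real cluster is involved.
 */
public class LazyMultiline {

    public static Iterator<String> executeMultilineSql(String statements, List<String> submissionLog) {
        Iterator<String> stmts = List.of(statements.split(";")).iterator();
        return new Iterator<>() {
            public boolean hasNext() { return stmts.hasNext(); }
            public String next() {
                String stmt = stmts.next().trim();
                submissionLog.add(stmt); // "submission" happens here, not up front
                return "RESULT OF: " + stmt;
            }
        };
    }
}
```

Under this model a client that stops iterating after a failed statement never submits the rest of the script, which is one possible answer to the drop-while-running hazard above.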
I think we need to put a lot of effort into defining the behavior of logically related queries.

In this FLIP, I suggest we only handle single statements, and we also introduce an async execute method, which is more important and more often used by users.

For the sync methods (like `TableEnvironment.executeSql` and `StatementSet.execute`), the result will not be returned until the job is finished. The following methods will be introduced in this FLIP:

    /**
     * Asynchronously execute the given single statement
     */
    TableEnvironment.executeSqlAsync(String statement): TableResult

    /**
     * Asynchronously execute the dml statements as a batch
     */
    StatementSet.executeAsync(): TableResult

    public interface TableResult {
        /**
         * return JobClient for DQL and DML in async mode, else return Optional.empty
         */
        Optional<JobClient> getJobClient();
    }

What do you think?
Best,
Godfrey

Timo Walther <twal...@apache.org> wrote on Thursday, March 26, 2020, at 9:15 PM:

Hi Godfrey,

executing streaming queries must be our top priority because this is what distinguishes Flink from competitors. If we change the execution behavior, we should think about the other cases as well, so as not to break the API a third time.

I fear that just having an async execute method will not be enough, because users should be able to mix streaming and batch queries in a unified scenario.

If I remember correctly, we had some discussions in the past about what decides the execution mode of a query. Currently, we would like to let the query decide, not derive it from the sources.
So I could imagine a multiline pipeline as:

USE CATALOG 'mycat';
INSERT INTO t1 SELECT * FROM s;
INSERT INTO t2 SELECT * FROM s JOIN t1 EMIT STREAM;

For executeMultilineSql():

sync because regular SQL
sync because regular Batch SQL
async because Streaming SQL

For executeAsyncMultilineSql():

async because everything should be async
async because everything should be async
async because everything should be async

What we should not start for executeAsyncMultilineSql():

sync because DDL
async because everything should be async
async because everything should be async

What are your thoughts here?

Regards,
Timo

On 26.03.20 12:50, godfrey he wrote:

Hi Timo,

I agree with you that streaming queries mostly need async execution.
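Timo's per-statement table can be read as a simple dispatch rule: the entry point either forces async for everything, or derives the mode from the statement itself. A toy sketch, where the `EMIT STREAM` check is a placeholder classification and not real Flink parsing:

```java
/**
 * Toy sketch of the sync/async table above: in executeMultilineSql(),
 * whether to wait for a statement is derived from the statement itself,
 * while executeAsyncMultilineSql() never waits. The streaming check is
 * a placeholder heuristic for illustration only.
 */
public class ExecutionModeDecision {

    enum Mode { SYNC, ASYNC }

    static Mode modeFor(String statement, boolean asyncEntryPoint) {
        if (asyncEntryPoint) {
            return Mode.ASYNC; // everything should be async
        }
        boolean streaming = statement.toUpperCase().contains("EMIT STREAM");
        return streaming ? Mode.ASYNC : Mode.SYNC; // only streaming queries detach
    }
}
```

The point of the rule is that the caller's choice of entry point never makes a streaming query block, matching the "what we should not start" row in the table.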
In fact, our original plan was to introduce only sync methods in this FLIP, and async methods (like "executeSqlAsync") would be introduced in the future, which is mentioned in the appendix. Maybe the async methods also need to be considered in this FLIP.

I think sync methods are also useful for streaming, which