+1 I like the general idea of printing the results as a table.

On the specifics I don't know enough, but Fabian's suggestion seems to make sense to me.

Aljoscha

On 29.04.20 10:56, Fabian Hueske wrote:
Hi Godfrey,

Thanks for starting this discussion!

In my mind, WATERMARK is a property (or constraint) of a field, just like
PRIMARY KEY.
Take this example from MySQL:

mysql> CREATE TABLE people (id INT NOT NULL, name VARCHAR(128) NOT NULL,
age INT, PRIMARY KEY (id));
Query OK, 0 rows affected (0.06 sec)

mysql> describe people;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id    | int          | NO   | PRI | NULL    |       |
| name  | varchar(128) | NO   |     | NULL    |       |
| age   | int          | YES  |     | NULL    |       |
+-------+--------------+------+-----+---------+-------+
3 rows in set (0.01 sec)

Here, PRIMARY KEY is marked in the Key column of the id field.
We could do the same for watermarks by adding a Watermark column.
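
Just to sketch the idea (the column name and the example values are of course up for discussion), a table with a watermark on a rowtime field could then be described as:

+---------+--------------+------+-----+--------------------------------+
| Field   | Type         | Null | Key | Watermark                      |
+---------+--------------+------+-----+--------------------------------+
| id      | INT          | NO   | PRI |                                |
| rowtime | TIMESTAMP(3) | YES  |     | rowtime - INTERVAL '5' SECOND  |
+---------+--------------+------+-----+--------------------------------+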

Best, Fabian


On Wed, Apr 29, 2020 at 10:43 AM godfrey he <godfre...@gmail.com> wrote:

Hi everyone,

I would like to bring up a discussion about the result type of the DESCRIBE statement, which was introduced in FLIP-84 [1].
In the previous version, we defined the result type of the DESCRIBE statement as a single column, as follows:

+-------------+---------------------------------+---------------------------+----------------------+---------------------+
| Statement   | Result Schema                   | Result Value              | Result Kind          | Examples            |
+-------------+---------------------------------+---------------------------+----------------------+---------------------+
| DESCRIBE xx | field name: result              | describe the detail of an | SUCCESS_WITH_CONTENT | DESCRIBE table_name |
|             | field type: VARCHAR(n)          | object (single row)       |                      |                     |
|             | (n is the max length of values) |                           |                      |                     |
+-------------+---------------------------------+---------------------------+----------------------+---------------------+

for "describe table_name", the result value is the `toString` value of
`TableSchema`, which is an unstructured data.
It's hard to for user to use this info.

for example:

TableSchema schema = TableSchema.builder()
    .field("f0", DataTypes.BIGINT())
    .field("f1", DataTypes.ROW(
       DataTypes.FIELD("q1", DataTypes.STRING()),
       DataTypes.FIELD("q2", DataTypes.TIMESTAMP(3))))
    .field("f2", DataTypes.STRING())
    .field("f3", DataTypes.BIGINT(), "f0 + 1")
    .watermark("f1.q2", WATERMARK_EXPRESSION, WATERMARK_DATATYPE)
    .build();

its `toString` value is:
root
  |-- f0: BIGINT
  |-- f1: ROW<`q1` STRING, `q2` TIMESTAMP(3)>
  |-- f2: STRING
  |-- f3: BIGINT AS f0 + 1
  |-- WATERMARK FOR f1.q2 AS now()

For Hive, MySQL, etc., the describe result is in table form, including field names and field types, which is more familiar to users.
TableSchema [2] has watermark expressions and computed columns, so we should also put them into the table:
for a computed column, which is column-level, we add a new column named `expr`;
for a watermark expression, which is table-level, we add a special row named `WATERMARK` to represent it.

For the above example, the result will look like:

+-----------+-------------------------------------+----------------+
| name      | type                                | expr           |
+-----------+-------------------------------------+----------------+
| f0        | BIGINT                              | (NULL)         |
| f1        | ROW<`q1` STRING, `q2` TIMESTAMP(3)> | (NULL)         |
| f2        | STRING                              | (NULL)         |
| f3        | BIGINT                              | f0 + 1         |
| WATERMARK | (NULL)                              | f1.q2 AS now() |
+-----------+-------------------------------------+----------------+
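
With a structured result like this, programs can consume the schema row by row instead of parsing a string. A rough sketch (assuming the FLIP-84 `executeSql()` and the proposed `collect(): Iterator<Row>`; `tEnv` is an existing TableEnvironment and `MyTable` is just a placeholder):

import java.util.Iterator;
import org.apache.flink.types.Row;

TableResult result = tEnv.executeSql("DESCRIBE MyTable");
Iterator<Row> rows = result.collect();
while (rows.hasNext()) {
    Row row = rows.next();
    Object name = row.getField(0); // name column
    Object type = row.getField(1); // type column
    Object expr = row.getField(2); // expr column, NULL for plain fields
    System.out.println(name + ": " + type + (expr == null ? "" : " AS " + expr));
}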

There is now a PR, FLINK-17112 [3], to implement the DESCRIBE statement.

What do you think about this update?
Any feedback is welcome~

Best,
Godfrey

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=134745878
[2]

https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/api/TableSchema.java
[3] https://github.com/apache/flink/pull/11892


godfrey he <godfre...@gmail.com> wrote on Mon, Apr 6, 2020 at 10:38 PM:

Hi Timo,

Sorry for the late reply, and thanks for your correction.
I missed DQL for the job submission scenario.
I'll fix the document right away.

Best,
Godfrey

Timo Walther <twal...@apache.org> wrote on Fri, Apr 3, 2020 at 9:53 PM:

Hi Godfrey,

I'm sorry to jump in again but I still need to clarify some things
around TableResult.

The FLIP says:
"For DML, this method returns TableResult until the job is submitted.
For other statements, TableResult is returned until the execution is
finished."

I thought we agreed on making every execution async? This also means returning a TableResult for DQLs even though the execution is not done yet. People need access to the JobClient also for batch jobs in order to cancel long-lasting queries. If people want to wait for the completion they can hook into the JobClient or collect().
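
For example (just a sketch assuming the async behavior proposed here; `tEnv` is a TableEnvironment and the table names are made up):

TableResult result = tEnv.executeSql("INSERT INTO Sink SELECT * FROM Source");
// With async submission this returns immediately; the JobClient controls the job.
JobClient jobClient = result.getJobClient()
    .orElseThrow(() -> new IllegalStateException("no Flink job was submitted"));
jobClient.getJobStatus(); // inspect the running job, also for batch
jobClient.cancel();       // or cancel a long-lasting query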

Can we rephrase this part to:

"For DML and DQL, this method returns TableResult once the job has been
submitted. For DDL and DCL statements, TableResult is returned once the
operation has finished."

Regards,
Timo


On 02.04.20 05:27, godfrey he wrote:
Hi Aljoscha, Dawid, Timo,

Thanks so much for the detailed explanation.
Agree with you that the multiline story is not complete now, and we can keep discussing.
I will add the current discussions and conclusions to the FLIP.

Best,
Godfrey



Timo Walther <twal...@apache.org> wrote on Wed, Apr 1, 2020 at 11:27 PM:

Hi Godfrey,

first of all, I agree with Dawid. The multiline story is not completed by this FLIP. It just verifies the big picture.

1. "control the execution logic through the proposed method if they
know
what the statements are"

This is a good point that also Fabian raised in the linked Google doc. I could also imagine returning a more complicated POJO when calling `executeMultiSql()`.

The POJO would include some `getSqlProperties()` such that a platform gets insights into the query before executing. We could also trigger the execution more explicitly instead of hiding it behind an iterator.
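
None of this is designed yet; purely as an illustration (all names hypothetical), such a POJO could look like:

import java.util.Map;

// Hypothetical sketch, nothing here is part of the FLIP yet.
public interface SqlStatementResult {
    Map<String, String> getSqlProperties(); // e.g. the statement kind, for platform insights
    TableResult execute();                  // trigger the execution explicitly
}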

2. "there are some special commands introduced in SQL client"

For platforms and SQL Client specific commands, we could offer a hook to the parser or a fallback parser in case the regular table environment parser cannot deal with the statement.

However, all of that is future work and can be discussed in a separate FLIP.

3. +1 for the `Iterator` instead of `Iterable`.

4. "we should convert the checked exception to unchecked exception"

Yes, I meant using a runtime exception instead of a checked exception. There was no consensus on putting the exception into the `TableResult`.

Regards,
Timo

On 01.04.20 15:35, Dawid Wysakowicz wrote:
When considering the multi-line support, I think it is helpful to start with a use case in mind. In my opinion, consumers of this method will be:

   1. sql-client
   2. third-part sql based platforms

@Godfrey As for the quit/source/... commands: I think those belong to the responsibility of the aforementioned. They should not be understandable by the TableEnvironment. What would quit on a TableEnvironment do? Moreover, I think such commands should be prefixed appropriately. It's a common practice to e.g. prefix those with ! or : to say they are meta commands of the tool rather than a query.

I also don't necessarily understand why platform users need to know the kind of the query to use the proposed method. They should get the type from TableResult#ResultKind. If the ResultKind is SUCCESS, it was a DCL/DDL. If SUCCESS_WITH_CONTENT, it was a DML/DQL. If that's not enough we can enrich the TableResult with a more explicit kind of query, but so far I don't see such a need.

@Kurt In those cases I would assume the developers want to present the results of the queries anyway. Moreover, I think it is safe to assume they can adhere to such a contract that the results must be iterated.

For direct users of the TableEnvironment/Table API this method does not make much sense anyway, in my opinion. I think we can rather safely assume that in this scenario they do not want to submit multiple queries at a single time.

Best,

Dawid


On 01/04/2020 15:07, Kurt Young wrote:
One comment on `executeMultilineSql`: I'm afraid sometimes users might forget to iterate the returned iterators, e.g. a user submits a bunch of DDLs and expects the framework to execute them one by one. But it didn't.

Best,
Kurt


On Wed, Apr 1, 2020 at 5:10 PM Aljoscha Krettek<
aljos...@apache.org>
wrote:

Agreed to what Dawid and Timo said.

To answer your question about multiline SQL: no, we don't think we need this in Flink 1.11, we only wanted to make sure that the interfaces that we now put in place will potentially allow this in the future.

Best,
Aljoscha

On 01.04.20 09:31, godfrey he wrote:
Hi, Timo & Dawid,

Thanks so much for the effort on `multiline statements` support.
I have a few questions about this method:

1. Users can control the execution logic well through the proposed method if they know what the statements are (whether a statement is a DDL, a DML, or something else).
But if a statement comes from a file, users do not know what the statements are, and the execution behavior is unclear.
As a platform user, I think this method is hard to use unless the platform defines a set of rules about statement order, such as: no SELECT in the middle, DML must be at the tail of the SQL file (which may be the most common case in production environments).
Otherwise the platform must first parse the SQL to know what the statements are. If it does that, the platform can handle all cases through `executeSql` and `StatementSet`.

2. The SQL client also can't use `executeMultilineSql` to support multiline statements, because there are some special commands introduced in the SQL client, such as `quit`, `source`, `load jar` (which does not exist now, but maybe we need this command to support dynamic table sources and UDFs).
Does TableEnvironment also support those commands?

3. BTW, must we have this feature in release 1.11? I find there are a few use cases in the feedback document whose behavior is unclear now.

regarding to "change the return value from `Iterable<Row` to
`Iterator<Row`",
I couldn't agree more with this change. Just as Dawid mentioned
"The contract of the Iterable#iterator is that it returns a new
iterator
each time,
     which effectively means we can iterate the results multiple
times.",
we does not provide iterate the results multiple times.
If we want do that, the client must buffer all results. but it's
impossible
for streaming job.

Best,
Godfrey

Dawid Wysakowicz <dwysakow...@apache.org> wrote on Wed, Apr 1, 2020 at 3:14 AM:

Thank you Timo for the great summary! It covers (almost) all the topics. Even though in the end we are not suggesting many changes to the current state of the FLIP, I think it is important to lay out all possible use cases so that we do not change the execution model every release.

There is one additional thing we discussed. Could we change the result type of TableResult#collect to Iterator<Row>? Even though those interfaces do not differ much, I think Iterator better describes that the results might not be materialized on the client side, but can be retrieved on a per-record basis. The contract of Iterable#iterator is that it returns a new iterator each time, which effectively means we can iterate the results multiple times. Iterating the results multiple times is not possible when we don't retrieve all the results from the cluster at once.
I think we should also use Iterator for TableEnvironment#executeMultilineSql(String statements): Iterator<TableResult>.

Best,

Dawid

On 31/03/2020 19:27, Timo Walther wrote:
Hi Godfrey,

Aljoscha, Dawid, Klou, and I had another discussion around FLIP-84. In particular, we discussed how the current status of the FLIP and the future requirements around multiline statements, async/sync, and collect() fit together.

We also updated the FLIP-84 Feedback Summary document [1] with some use cases.

We believe that we found a good solution that also fits what is in the current FLIP. So no bigger changes are necessary, which is great!

Our findings were:

1. Async vs sync submission of Flink jobs:

Having a blocking `execute()` in the DataStream API was rather a mistake. Instead, all submissions should be async because this allows supporting both modes if necessary. Thus, submitting all queries async sounds good to us. If users want to run a job sync, they can use the JobClient and wait for completion (or collect() in case of batch jobs).

2. Multi-statement execution:

For the multi-statement execution, we don't see a contradiction with the async execution behavior. We imagine a method like:

TableEnvironment#executeMultilineSql(String statements): Iterable<TableResult>

where the `Iterator#next()` method would trigger the next statement submission. This allows a caller to decide synchronously when to submit statements async to the cluster. Thus, a service such as the SQL Client can handle the result of each statement individually and process statement by statement sequentially.
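
In code, the driving loop could look roughly like this (just a sketch assuming the signature above; `tEnv` is a TableEnvironment and the statements are made up):

for (TableResult result : tEnv.executeMultilineSql(
        "CREATE TABLE t1 (a INT);\n" +
        "INSERT INTO t1 VALUES (1);")) {
    // Iterating triggers the submission of the next statement,
    // so the caller controls the pace synchronously.
    result.print();
}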

3. The role of TableResult and result retrieval in general

`TableResult` is similar to `JobClient`. Instead of returning a `CompletableFuture` of something, it is a concrete util class where some methods have the behavior of a completable future (e.g. collect(), print()) and some are already completed (getTableSchema(), getResultKind()).

`StatementSet#execute()` returns a single `TableResult` because the order is undefined in a set and all statements have the same schema. Its `collect()` will return a row for each executed `INSERT INTO` in the order of statement definition.

For a simple `SELECT * FROM ...`, the query execution might block until `collect()` is called to pull buffered rows from the job (from a socket/REST API, whatever we will use in the future). We can say that a statement finished successfully when `collect#Iterator#hasNext` has returned false.

I hope this summarizes our discussion @Dawid/Aljoscha/Klou?

It would be great if we could add these findings to the FLIP before we start voting.

One minor thing: some `execute()` methods still throw a checked exception; can we remove that from the FLIP? Also, the above-mentioned `Iterator#next()` would trigger an execution without throwing a checked exception.

Thanks,
Timo

[1]
https://docs.google.com/document/d/1ueLjQWRPdLTFB_TReAyhseAX-1N3j4WYWD0F02Uau0E/edit#

On 31.03.20 06:28, godfrey he wrote:
Hi, Timo & Jark

Thanks for your explanation.
Agree with you that execution should always be async, and the sync execution scenario can be covered by async execution. It helps provide a unified entry point for batch and streaming. I think we can also use sync execution for some testing.
So, I agree with you that we provide the `executeSql` method as an async method. If we want a sync method in the future, we can add a method named `executeSqlSync`.

I think we've reached an agreement. I will update the document and start the voting process.

Best,
Godfrey


Jark Wu <imj...@gmail.com> wrote on Tue, Mar 31, 2020 at 12:46 AM:

Hi,

I didn't follow the full discussion, but I share the same concern as Timo that streaming queries should always be async. Otherwise, I can imagine it will cause a lot of confusion and problems if users don't deeply keep the "sync" in mind (e.g. the client hangs).
Besides, the streaming mode is still the majority of use cases of Flink and Flink SQL. We should put usability at a high priority.

Best,
Jark


On Mon, 30 Mar 2020 at 23:27, Timo Walther<
twal...@apache.org>
wrote:
Hi Godfrey,

maybe I wasn't expressing my biggest concern enough in my last mail. Even in a single-line and sync execution, I think that streaming queries should not block the execution. Otherwise it is not possible to call collect() or print() on them afterwards.

"there are too many things need to discuss for multiline":

True, I don't want to solve all of them right now. But what
I
know
is
that our newly introduced methods should fit into a
multiline
execution.
There is no big difference of calling `executeSql(A),
executeSql(B)` and
processing a multiline file `A;\nB;`.

I think the example that you mentioned can simply be undefined for now. Currently, no catalog is modifying data but just metadata. This is a separate discussion.

"result of the second statement is indeterministic":

Sure, this is indeterministic. But this is the implementer's fault and we cannot forbid such pipelines.

How about we always execute streaming queries async? It would unblock executeSql() and multiline statements.

Having an `executeSqlAsync()` is useful for batch. However, I don't want `sync/async` to be the new batch/stream flag. The execution behavior should come from the query itself.

Regards,
Timo


On 30.03.20 11:12, godfrey he wrote:
Hi Timo,

Agree with you that streaming queries are our top priority, but I think there are too many things that need to be discussed for multiline statements, e.g.:

1. What's the behavior of mixing DDL and DML for async execution?

create table t1 xxx;
create table t2 xxx;
insert into t2 select * from t1 where xxx;
drop table t1; -- t1 may be a MySQL table; the data will also be deleted.

t1 is dropped while the "insert" job is running.

2. What's the behavior of the unified scenario for async execution? (as you mentioned)

INSERT INTO t1 SELECT * FROM s;
INSERT INTO t2 SELECT * FROM s JOIN t1 EMIT STREAM;

The result of the second statement is indeterministic, because the first statement may still be running. I think we need to put a lot of effort into defining the behavior of logically related queries.

In this FLIP, I suggest we only handle single statements, and we also introduce an async execute method, which is more important and more often used by users.

For the sync methods (like `TableEnvironment.executeSql` and `StatementSet.execute`), the result will be returned once the job is finished. The following methods will be introduced in this FLIP:

/**
 * Asynchronously execute the given single statement
 */
TableEnvironment.executeSqlAsync(String statement): TableResult

/**
 * Asynchronously execute the dml statements as a batch
 */
StatementSet.executeAsync(): TableResult

public interface TableResult {
    /**
     * return JobClient for DQL and DML in async mode, else return Optional.empty
     */
    Optional<JobClient> getJobClient();
}
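
For example (a sketch only, based on the proposed methods above; `tEnv` is a TableEnvironment and the table names are made up):

TableResult result = tEnv.executeSqlAsync("INSERT INTO t2 SELECT * FROM t1");
// Only DQL and DML in async mode carry a JobClient under this proposal.
result.getJobClient().ifPresent(jobClient -> {
    jobClient.cancel(); // e.g. stop the job if it runs too long
});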

what do you think?

Best,
Godfrey

Timo Walther <twal...@apache.org> wrote on Thu, Mar 26, 2020 at 9:15 PM:

Hi Godfrey,

executing streaming queries must be our top priority because this is what distinguishes Flink from competitors. If we change the execution behavior, we should think about the other cases as well to not break the API a third time.

I fear that just having an async execute method will not be enough because users should be able to mix streaming and batch queries in a unified scenario.

If I remember it correctly, we had some discussions in the past about what decides the execution mode of a query. Currently, we would like to let the query decide, not derive it from the sources.

So I could imagine a multiline pipeline as:

USE CATALOG 'mycat';
INSERT INTO t1 SELECT * FROM s;
INSERT INTO t2 SELECT * FROM s JOIN t1 EMIT STREAM;

For executeMultilineSql():

sync because regular SQL
sync because regular Batch SQL
async because Streaming SQL

For executeAsyncMultilineSql():

async because everything should be async
async because everything should be async
async because everything should be async

What we should not start for executeAsyncMultilineSql():

sync because DDL
async because everything should be async
async because everything should be async

What are your thoughts here?

Regards,
Timo


On 26.03.20 12:50, godfrey he wrote:
Hi Timo,

I agree with you that streaming queries mostly need async execution. In fact, our original plan was to only introduce sync methods in this FLIP, and async methods (like "executeSqlAsync") would be introduced in the future, which is mentioned in the appendix.

Maybe the async methods also need to be considered in this FLIP.

I think sync methods are also useful for streaming, as they can be used to run bounded sources. Maybe we should check whether all sources are bounded in sync execution mode.

"Also, if we block for streaming queries, we could never support multiline files. Because the first INSERT INTO would block the further execution."

Agree with you, we need an async method to submit multiline files, and the files should be limited such that DQL and DML are always at the end for streaming.

Best,
Godfrey

Timo Walther <twal...@apache.org> wrote on Thu, Mar 26, 2020 at 4:29 PM:


Hi Godfrey,

having control over the job after submission is a requirement that was requested frequently (some examples are [1], [2]). Users would like to get insights about the running or completed job, including the jobId, jobGraph, etc.; the JobClient summarizes these properties.
properties.

It is good to have a discussion about synchronous/asynchronous submission now to have a complete execution picture.

I thought we submit streaming queries mostly async and just wait for the successful submission. If we block for streaming queries, how can we collect() or print() results?

Also, if we block for streaming queries, we could never support multiline files, because the first INSERT INTO would block the further execution.

If we decide to block entirely on streaming queries, we need the async execution methods in the design already. However, I would rather go for non-blocking streaming queries, also with the `EMIT STREAM` keyword in mind that we might add to SQL statements soon.

Regards,
Timo

[1]https://issues.apache.org/jira/browse/FLINK-16761
[2]https://issues.apache.org/jira/browse/FLINK-12214

On 25.03.20 16:30, godfrey he wrote:
Hi Timo,

Thanks for the update.

Regarding "multiline statement support", I'm also fine that `TableEnvironment.executeSql()` only supports single-line statements, and we can support multiline statements later (this needs more discussion).
Regarding "StatementSet.explain()", I don't have strong opinions about that.

Regarding to "TableResult.getJobClient()", I think it's
unnecessary.
The
reason is: first, many statements (e.g. DDL, show xx,
use
xx)
will
not
submit a Flink job. second,
`TableEnvironment.executeSql()`
and
`StatementSet.execute()` are synchronous method,
`TableResult`
will
be
returned only after the job is finished or failed.

Regarding to "whether StatementSet.execute() needs to
throw
exception", I
think we should choose a unified way to tell whether
the
execution
is
successful. If `TableResult` contains ERROR kind
(non-runtime
exception),
users need to not only check the result but also catch
the
runtime
exception in their code. or `StatementSet.execute()`
does
not
throw
any
exception (including runtime exception), all exception
messages are
in
the
result.  I prefer "StatementSet.execute() needs to
throw
exception".
cc
@Jark
Wu<imj...@gmail.com>

I will update the agreed parts in the document first.

Best,
Godfrey


Timo Walther <twal...@apache.org> wrote on Wed, Mar 25, 2020 at 6:51 PM:

Hi Godfrey,

thanks for starting the discussion on the mailing list. And sorry again for the late reply to FLIP-84. I have updated the Google doc one more time to incorporate the offline discussions.

From Dawid's and my view, it is fine to postpone the multiline support to a separate method. This can be future work even though we will need it rather soon.

If there are no objections, I suggest updating FLIP-84 again and having another voting process.

Thanks,
Timo


On 25.03.20 11:17, godfrey he wrote:
Hi community,

Timo, Fabian, and Dawid have some feedback about FLIP-84 [1]. The feedback is all about the newly introduced methods. We had a discussion yesterday, and most of the feedback has been agreed upon. Here are the conclusions:

*1. about proposed methods in `TableEnvironment`:*

the original proposed methods:

TableEnvironment.createDmlBatch(): DmlBatch
TableEnvironment.executeStatement(String statement): ResultTable

the new proposed methods:

// we should not use abbreviations in the API, and the term "Batch"
// is easily confused with batch/streaming processing
TableEnvironment.createStatementSet(): StatementSet

// every method that takes SQL should have `Sql` in its name
// supports multiline statement ???
TableEnvironment.executeSql(String statement): TableResult

// new method. supports explaining DQL and DML
TableEnvironment.explainSql(String statement, ExplainDetail... details): String


*2. about proposed related classes:*

the original proposed classes:

interface DmlBatch {
    void addInsert(String insert);
    void addInsert(String targetPath, Table table);
    ResultTable execute() throws Exception;
    String explain(boolean extended);
}

public interface ResultTable {
    TableSchema getResultSchema();
    Iterable<Row> getResultRows();
}

the new proposed classes:

interface StatementSet {
    // every method that takes SQL should have `Sql` in its name
    // return StatementSet instance for fluent programming
    addInsertSql(String statement): StatementSet

    // return StatementSet instance for fluent programming
    addInsert(String tablePath, Table table): StatementSet

    // new method. support overwrite mode
    addInsert(String tablePath, Table table, boolean overwrite): StatementSet

    explain(): String

    // new method. supports adding more details for the result
    explain(ExplainDetail... extraDetails): String

    // throw exception ???
    execute(): TableResult
}

interface TableResult {
    getTableSchema(): TableSchema

    // avoid custom parsing of an "OK" row in programming
    getResultKind(): ResultKind

    // instead of `get` make it explicit that this might be triggering an expensive operation
    collect(): Iterable<Row>

    // for fluent programming
    print(): Unit
}

enum ResultKind {
    SUCCESS, // for DDL, DCL and statements with a simple "OK"
    SUCCESS_WITH_CONTENT, // rows with important content are available (DML, DQL)
}


*3. new proposed methods in `Table`*

`Table.insertInto()` will be deprecated, and the following methods are introduced:

Table.executeInsert(String tablePath): TableResult
Table.executeInsert(String tablePath, boolean overwrite): TableResult
Table.explain(ExplainDetail... details): String
Table.execute(): TableResult
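
To make this concrete, a hypothetical usage of the proposed API could look like the following sketch (`tEnv` is a TableEnvironment; the table names and SQL are made up, and the exact method names depend on the final FLIP):

// Sketch of the proposed fluent API; sink1/sink2/src are made-up tables.
StatementSet stmtSet = tEnv.createStatementSet()
    .addInsertSql("INSERT INTO sink1 SELECT * FROM src WHERE a > 10")
    .addInsertSql("INSERT INTO sink2 SELECT b, c FROM src");

System.out.println(stmtSet.explain()); // plan for all statements in the set
TableResult result = stmtSet.execute(); // single result for the whole set
result.print();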

There are two issues that need further discussion: one is whether `TableEnvironment.executeSql(String statement): TableResult` needs to support multiline statements (or whether `TableEnvironment` needs to support multiline statements), and the other is whether `StatementSet.execute()` needs to throw an exception.

Please refer to the feedback document [2] for the details.

Any suggestions are warmly welcomed!

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=134745878
[2]
https://docs.google.com/document/d/1ueLjQWRPdLTFB_TReAyhseAX-1N3j4WYWD0F02Uau0E/edit
Best,
Godfrey








