Hi everyone,

some feedback regarding the open questions. Maybe we can discuss the `TableEnvironment.executeMultiSql` story offline to determine how we proceed with this in the near future.

1) "whether the table environment has the ability to update itself"

Maybe there was some misunderstanding. I don't think that we should support `tEnv.getConfig.getConfiguration.setString("table.planner", "old")`. Instead I'm proposing to support `TableEnvironment.create(Configuration)` where planner and execution mode are read immediately and subsequent changes to these options have no effect. We do something similar in `new StreamExecutionEnvironment(Configuration)`. These two ConfigOptions need not be SQL Client specific but can be part of the core table code base. Many users would like to get a 100% preconfigured environment from just a Configuration, which is not possible right now. We can solve both use cases with one change.
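A minimal sketch of what I have in mind (note that `TableEnvironment.create(Configuration)` and the keys `table.planner` / `table.execution-mode` are proposals of this discussion, not existing API):

```
// Proposed usage (hypothetical API): planner and execution mode are read
// exactly once when the environment is created.
Configuration conf = new Configuration();
conf.setString("table.planner", "blink");          // proposed option key
conf.setString("table.execution-mode", "batch");   // proposed option key

TableEnvironment tEnv = TableEnvironment.create(conf);

// Later changes to these two options would simply be ignored, analogous to
// new StreamExecutionEnvironment(Configuration):
tEnv.getConfig().getConfiguration().setString("table.planner", "old");  // no effect
```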

2) "the sql client, we will maintain two parsers"

I remember we had some discussion about this and decided that we would like to maintain only one parser. In the end it is "One Flink SQL" where commands influence each other, also with respect to keywords. It should be fine to include the SQL Client commands in the Flink parser. Of course, the table environment would not be able to handle the resulting `Operation` instances, but we can introduce hooks to handle those `Operation`s. Or we introduce parser extensions.
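To make the hook idea a bit more concrete, here is a rough sketch (all names are hypothetical, nothing like this exists yet):

```
import java.util.Optional;

import org.apache.flink.table.api.TableResult;
import org.apache.flink.table.operations.Operation;

// Hypothetical extension point: the SQL Client registers an executor for
// Operations that the parser produces but the TableEnvironment itself
// cannot handle (e.g. client-specific SET, ADD JAR, ...).
public interface ClientOperationExecutor {

    // Returns a result if this executor handled the operation, or
    // Optional.empty() so that the TableEnvironment falls back to its
    // default handling.
    Optional<TableResult> execute(Operation operation);
}
```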

Can we skip `table.job.async` in the first version? We should further discuss whether to introduce a special SQL clause for wrapping async behavior or to use a config option. Especially for streaming queries we need to be careful and should force users to either "one INSERT INTO" or "one STATEMENT SET".
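For reference, bundling multiple INSERT INTO statements into one job is already possible programmatically via the existing StatementSet API; a short example (assuming a TableEnvironment `tEnv` and made-up table names):

```
// Multiple INSERT INTO statements grouped into one statement set and
// submitted as a single job.
StatementSet set = tEnv.createStatementSet();
set.addInsertSql("INSERT INTO sink1 SELECT * FROM source1");
set.addInsertSql("INSERT INTO sink2 SELECT * FROM source2");
TableResult result = set.execute();  // one job, one TableResult
```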

3) 4) "HIVE also uses these commands"

In general, Hive is not a good reference. Aligning these commands with the remaining Flink SQL commands should be our goal. We just had a MODULE discussion where we selected SHOW instead of LIST. But it is true that JARs are not part of the catalog, which is why I would not use CREATE/DROP. ADD/REMOVE are common siblings in the English language. Take a look at the Java collections API as another example.

6) "Most of the commands should belong to the table environment"

Thanks for updating the FLIP, this makes things easier to understand. It is good to see that most commands will be available in TableEnvironment. However, I would also support SET and RESET there for consistency. Again, from an architectural point of view, if we allowed some kind of `Operation` hook in the table environment, we could check for SQL Client specific options and forward everything else to the regular `TableConfig.getConfiguration`. What do you think?
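To illustrate the forwarding, a sketch only (the prefix check and the surrounding fields are made up):

```
// Hypothetical handling of a SET statement: client-specific options are
// kept by the SQL Client, everything else is written into the regular
// TableConfig of the TableEnvironment.
void applySet(String key, String value) {
    if (key.startsWith("sql-client.")) {
        clientConfiguration.setString(key, value);  // e.g. result display mode
    } else {
        tableEnv.getConfig().getConfiguration().setString(key, value);
    }
}
```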

Regards,
Timo


On 03.02.21 08:58, Jark Wu wrote:
Hi Timo,

I will respond to some of the questions:

1) SQL client specific options

Whether it starts with "table" or "sql-client" depends on where the configuration takes effect. If it is a table configuration, we should make clear what the behavior is when users change the configuration during the lifecycle of the TableEnvironment.

I agree with Shengkai that `sql-client.planner` and `sql-client.execution.mode` are something special that can't be changed after the TableEnvironment has been initialized. You can see that `StreamExecutionEnvironment` provides a `configure()` method to override the configuration after the StreamExecutionEnvironment has been initialized.

Therefore, I think it would be better to still use `sql-client.planner` and `sql-client.execution.mode`.

2) Execution file

From my point of view, there is a big difference between `sql-client.job.detach` and `TableEnvironment.executeMultiSql()`: `sql-client.job.detach` will affect every single DML statement in the terminal, not only the statements in SQL files. I think a single DML statement in the interactive terminal corresponds to tEnv#executeSql() rather than tEnv#executeMultiSql(). So I don't like the "multi" and "sql" keywords in `table.multi-sql-async`.
I just found that the runtime provides a configuration called "execution.attached" [1], which is false by default and specifies whether the pipeline is submitted in attached or detached mode. It provides exactly the same functionality as `sql-client.job.detach`. What do you think about using this option?

If we also want to support this config in TableEnvironment, I think it should also affect the DML execution of `tEnv#executeSql()`, not only the DMLs in `tEnv#executeMultiSql()`. Therefore, the behavior may look like this:

val tableResult = tEnv.executeSql("INSERT INTO ...")   ==> async by default
tableResult.await()                                     ==> manually block until finish
tEnv.getConfig().getConfiguration().setString("execution.attached", "true")
val tableResult2 = tEnv.executeSql("INSERT INTO ...")  ==> sync, don't need to wait on the TableResult
tEnv.executeMultiSql(
"""
CREATE TABLE ....               ==> always sync
INSERT INTO ...                 ==> sync, because we set the configuration above
SET execution.attached = false;
INSERT INTO ...                 ==> async
""")

On the other hand, I think `sql-client.job.detach` and `TableEnvironment.executeMultiSql()` should be two separate topics. As Shengkai mentioned above, the SQL CLI only depends on `TableEnvironment#executeSql()` to support multi-line statements. I'm fine with clarifying `executeMultiSql()` but don't want it to block this FLIP; maybe we can discuss this in another thread.


Best,
Jark

[1]:
https://ci.apache.org/projects/flink/flink-docs-master/deployment/config.html#execution-attached

On Wed, 3 Feb 2021 at 15:33, Shengkai Fang <fskm...@gmail.com> wrote:

Hi, Timo.
Thanks for your detailed feedback. I have some thoughts about it.

*Regarding #1*: I think the main problem is whether the table environment has the ability to update itself. Let's take a simple program as an example.


```
TableEnvironment tEnv = TableEnvironment.create(...);

tEnv.getConfig().getConfiguration().setString("table.planner", "old");

tEnv.executeSql("...");
```

If we regard this option as a table option, users don't have to create another table environment manually. In that case, tEnv needs to check whether the current mode and planner are still the same as before on every executeSql or explainSql call. I don't think that is easy work for the table environment, especially if users have a StreamExecutionEnvironment but set the old planner and batch mode. But if we make this option a sql client option, users only use the SET command to change the setting, and we can rebuild a new table environment when the SET succeeds.
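(For illustration, roughly how that rebuild could look on the client side; this is only a sketch with made-up names like `sessionConfiguration` and `createTableEnvironment`, not actual SQL Client code:)

```
// Sketch: if a SET changes one of the "frozen" options (planner /
// execution mode), the SQL Client discards the old TableEnvironment
// and builds a fresh one from the merged session configuration.
void set(String key, String value) {
    sessionConfiguration.setString(key, value);
    if ("sql-client.planner".equals(key) || "sql-client.execution.mode".equals(key)) {
        this.tableEnvironment = createTableEnvironment(sessionConfiguration);
    }
}
```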


*Regarding #2*: I think we need to discuss the implementation before continuing this topic. In the sql client, we will maintain two parsers. The first parser (the client parser) will only match the sql client commands. If the client parser can't parse the statement, we will leverage the table environment to execute it. According to our blueprint, TableEnvironment#executeSql is enough for the sql client. Therefore, TableEnvironment#executeMultiSql is out of scope for this FLIP.
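For illustration, the fallback between the client parser and the table environment could look roughly like this (hypothetical client-side code; `ClientCommand` and the surrounding fields are made up):

```
// Sketch of the two-parser approach: the client parser only recognizes
// client commands (SET, RESET, ADD JAR, ...); everything else is delegated
// to the TableEnvironment.
TableResult executeStatement(String statement) {
    Optional<ClientCommand> command = clientParser.parse(statement);
    if (command.isPresent()) {
        return executeClientCommand(command.get());
    }
    // Not a client command -> let the table environment parse and run it.
    return tableEnvironment.executeSql(statement);
}
```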

But if we need to introduce `TableEnvironment.executeMultiSql` in the future, I think it's OK to use the option `table.multi-sql-async` rather than `sql-client.job.detach`. However, we think that name is not suitable because it is confusing: setting the option to false only means blocking the execution of INSERT INTO statements, not DDL or others (other sql statements are always executed synchronously). So how about `table.job.async`? It only works for the sql client and executeMultiSql. If we set this value to false, the table environment will not return the result until the job finishes.


*Regarding #3, #4*: I still think we should use DELETE JAR and LIST JAR because Hive also uses these commands to add a jar to the classpath or delete it. If we use such commands, it can reduce our work on Hive compatibility.

For SHOW JAR, I think the main concern is that the jars are not maintained by the Catalog. If we really need to keep consistent with the SQL grammar, maybe we should use:

`ADD JAR` -> `CREATE JAR`,
`DELETE JAR` -> `DROP JAR`,
`LIST JAR` -> `SHOW JAR`.

*Regarding #5*: I agree with you that we'd better keep consistent.

*Regarding #6*: Yes. Most of the commands should belong to the table environment. In the Summary section, I use the <NOTE> tag to identify which commands belong to the sql client and which belong to the table environment. I also added a new section about implementation details to the FLIP.

Best,
Shengkai

On Tue, 2 Feb 2021 at 18:43, Timo Walther <twal...@apache.org> wrote:

Thanks for this great proposal Shengkai. This will give the SQL Client a
very good update and make it production ready.

Here is some feedback from my side:

1) SQL client specific options

I don't think that `sql-client.planner` and `sql-client.execution.mode` are SQL Client specific. Similar to `StreamExecutionEnvironment` and `ExecutionConfig#configure` that have been added recently, we should offer such a possibility for TableEnvironment. How about we offer `TableEnvironment.create(ReadableConfig)` and add a `table.planner` and `table.execution-mode` to `org.apache.flink.table.api.config.TableConfigOptions`?
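To make that concrete, the two options could be declared roughly like this in `TableConfigOptions` (only a sketch; the names and defaults are up for discussion):

```
// Hypothetical additions to org.apache.flink.table.api.config.TableConfigOptions.
public static final ConfigOption<String> TABLE_PLANNER =
        ConfigOptions.key("table.planner")
                .stringType()
                .defaultValue("blink")
                .withDescription(
                        "The planner to use ('blink' or 'old'). Only read when the "
                                + "TableEnvironment is created.");

public static final ConfigOption<String> TABLE_EXECUTION_MODE =
        ConfigOptions.key("table.execution-mode")
                .stringType()
                .defaultValue("streaming")
                .withDescription(
                        "The execution mode ('streaming' or 'batch'). Only read when the "
                                + "TableEnvironment is created.");
```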

2) Execution file

Did you have a look at the Appendix of FLIP-84 [1], including the mailing list thread at that time? Could you further elaborate on how the multi-statement execution should work for a unified batch/streaming story? According to our past discussions, each line in an execution file should be executed blocking, which means a streaming query needs a statement set to execute multiple INSERT INTO statements, correct? We should also offer this functionality in `TableEnvironment.executeMultiSql()`. Whether `sql-client.job.detach` is SQL Client specific still needs to be determined; it could also be a general `table.multi-sql-async` option.

3) DELETE JAR

Shouldn't the opposite of "ADD" be "REMOVE"? "DELETE" sounds like one is
actively deleting the JAR in the corresponding path.

4) LIST JAR

This should be `SHOW JARS` according to other SQL commands such as `SHOW
CATALOGS`, `SHOW TABLES`, etc. [2].

5) EXPLAIN [ExplainDetail[, ExplainDetail]*]

We should keep the details in sync with `org.apache.flink.table.api.ExplainDetail` and avoid confusion about differently named ExplainDetails. I would vote for `ESTIMATED_COST` instead of `COST`. I'm sure the original author had a reason to call it that way.

6) Implementation details

It would be nice to understand how we plan to implement the given features. Most of the commands and config options should go into TableEnvironment and SqlParser directly, correct? This way users have a unified way of using Flink SQL. TableEnvironment would provide a similar user experience in notebooks or interactive programs as the SQL Client.

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=134745878
[2]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/show.html

Regards,
Timo


On 02.02.21 10:13, Shengkai Fang wrote:
Sorry for the typo. I meant that `RESET` is much better, rather than `UNSET`.

On Tue, 2 Feb 2021 at 16:44, Shengkai Fang <fskm...@gmail.com> wrote:

Hi, Jingsong.

Thanks for your reply. I think `UNSET` is much better.

1. We don't need to introduce another command `UNSET`. `RESET` is already supported in the current sql client; our proposal just extends its grammar and allows users to reset specified keys.
2. Hive Beeline also uses `RESET` to set a key back to its default value [1]. I think it is more friendly for batch users.

Best,
Shengkai

[1]
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients

On Tue, 2 Feb 2021 at 13:56, Jingsong Li <jingsongl...@gmail.com> wrote:

Thanks for the proposal. Yes, the sql client is too outdated. +1 for improving it.

About "SET" and "RESET": why not "SET" and "UNSET"?

Best,
Jingsong

On Mon, Feb 1, 2021 at 2:46 PM Rui Li <lirui.fu...@gmail.com> wrote:

Thanks Shengkai for the update! The proposed changes look good to
me.

On Fri, Jan 29, 2021 at 8:26 PM Shengkai Fang <fskm...@gmail.com>
wrote:

Hi, Rui.
You are right. I have already modified the FLIP.

The main changes:

# The -f parameter has no restriction on the statement type.
Sometimes, users use a pipe to redirect the results of queries for debugging when submitting a job with the -f parameter. That is much more convenient than writing INSERT INTO statements.

# Add a new sql client option `sql-client.job.detach`.
Users prefer to execute jobs one by one in batch mode. Users can set this option to false and the client will not process the next job until the current job finishes. By default, the client will execute the next job as soon as the current job is submitted.

Best,
Shengkai



On Fri, 29 Jan 2021 at 16:52, Rui Li <lirui.fu...@gmail.com> wrote:

Hi Shengkai,

Regarding #2, maybe the -f option in Flink and Hive has different implications, and we should clarify the behavior. For example, if the client just submits the job and exits, what happens if the file contains two INSERT statements? I don't think we should treat them as a statement set, because users should explicitly write BEGIN STATEMENT SET in that case. And the client shouldn't asynchronously submit the two jobs, because the 2nd may depend on the 1st, right?

On Fri, Jan 29, 2021 at 4:30 PM Shengkai Fang <fskm...@gmail.com>
wrote:

Hi Rui,
Thanks for your feedback. I agree with your suggestions.

For suggestion 1: Yes, we plan to strengthen the SET command. In the implementation, it will just put the key-value pair into the `Configuration`, which will be used to generate the table config. If Hive supports reading the settings from the table config, users are able to set the Hive-related settings.

For suggestion 2: The -f parameter will submit the job and exit. If the queries never end, users have to cancel the jobs by themselves, which is not reliable (people may forget their jobs). In most cases, queries are used to analyze the data. Users should use queries in the interactive mode.

Best,
Shengkai

On Fri, 29 Jan 2021 at 15:18, Rui Li <lirui.fu...@gmail.com> wrote:

Thanks Shengkai for bringing up this discussion. I think it
covers a
lot of useful features which will dramatically improve the
usability of our
SQL Client. I have two questions regarding the FLIP.

1. Do you think we can let users set arbitrary configurations
via
the
SET command? A connector may have its own configurations and we
don't have
a way to dynamically change such configurations in SQL Client.
For
example,
users may want to be able to change hive conf when using hive
connector [1].
2. Any reason why we have to forbid queries in SQL files
specified
with
the -f option? Hive supports a similar -f option but allows
queries
in the
file. And a common use case is to run some query and redirect
the
results
to a file. So I think maybe flink users would like to do the
same,
especially in batch scenarios.

[1] https://issues.apache.org/jira/browse/FLINK-20590

On Fri, Jan 29, 2021 at 10:46 AM Sebastian Liu <
liuyang0...@gmail.com>
wrote:

Hi Shengkai,

Glad to see this improvement. And I have some additional
suggestions:

#1. Unify the TableEnvironment in ExecutionContext to
StreamTableEnvironment for both streaming and batch sql.
#2. Improve the way of result retrieval: the sql client currently collects the results locally all at once using accumulators, which may cause memory issues in the JM or locally for big query results. Accumulators are only suitable for testing purposes. We may change to use SelectTableSink, which is based on CollectSinkOperatorCoordinator.
#3. Do we need to consider the Flink SQL gateway proposed in FLIP-91? It seems that this FLIP has not moved forward for a long time. Providing a long-running service out of the box to facilitate sql submission is necessary.

What do you think of these?

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-91%3A+Support+SQL+Client+Gateway


On Thu, 28 Jan 2021 at 20:54, Shengkai Fang <fskm...@gmail.com> wrote:

Hi devs,

Jark and I want to start a discussion about FLIP-163:SQL
Client
Improvements.

Many users have complained about the problems of the sql client. For example, users cannot register the tables proposed by FLIP-95.

The main changes in this FLIP:

- use the -i parameter to specify a sql file to initialize the table environment and deprecate the YAML file;
- add the -f parameter to submit a sql file and deprecate the '-u' parameter;
- add more interactive commands, e.g. ADD JAR;
- support statement set syntax;


For more detailed changes, please refer to FLIP-163[1].

Look forward to your feedback.


Best,
Shengkai

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-163%3A+SQL+Client+Improvements



--

*With kind regards
------------------------------------------------------------
Sebastian Liu 刘洋
Institute of Computing Technology, Chinese Academy of Science
Mobile\WeChat: +86—15201613655
E-mail: liuyang0...@gmail.com <liuyang0...@gmail.com>
QQ: 3239559*



--
Best regards!
Rui Li



--
Best regards!
Rui Li



--
Best regards!
Rui Li



--
Best, Jingsong Lee