Hi,
regarding the (un-)quoted question, compatibility is of course an
important argument, but in terms of consistency I'd find it a bit
surprising that WITH handles it differently than SET, and I wonder if
that could cause friction for developers when writing their SQL.
Regards
Ingo
On Thu, Feb 4, 2021 at 9:38 AM Jark Wu <imj...@gmail.com> wrote:
Hi all,
Regarding "One Parser", I think it's not possible for now because the
Calcite parser can't parse special characters (e.g. "-") unless they
are quoted as string literals. That's why the WITH option keys are
string literals, not identifiers.
SET table.exec.mini-batch.enabled = true and
ADD JAR /local/my-home/test.jar have the same problem. That's why we
propose two parsers: a light-weight one that splits lines into multiple
statements and matches special commands through regex, and that
delegates all other statements to the second parser, which is the
Calcite parser.
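As a rough illustration of the light-weight first parser, here is a
minimal Java sketch. The class name `ClientCommandMatcher`, the two
patterns and the `classify` method are hypothetical and not taken from
the Flink code base; they only show how an unquoted key containing "-"
can be matched by regex before anything reaches the Calcite parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Flink code: a regex-based client parser that only
// recognizes SET and ADD JAR and hands everything else to the SQL parser.
public class ClientCommandMatcher {
    // The unquoted key may contain '.' and '-'.
    private static final Pattern SET = Pattern.compile(
            "^SET\\s+([\\w.-]+)\\s*=\\s*(.+?)\\s*;?$", Pattern.CASE_INSENSITIVE);
    private static final Pattern ADD_JAR = Pattern.compile(
            "^ADD\\s+JAR\\s+(\\S+?)\\s*;?$", Pattern.CASE_INSENSITIVE);

    // Returns a short description of the matched client command,
    // or "DELEGATE" if the statement should go to the second parser.
    public static String classify(String statement) {
        String s = statement.trim();
        Matcher set = SET.matcher(s);
        if (set.matches()) {
            return "SET " + set.group(1) + " -> " + set.group(2);
        }
        Matcher addJar = ADD_JAR.matcher(s);
        if (addJar.matches()) {
            return "ADD JAR " + addJar.group(1);
        }
        return "DELEGATE";
    }
}
```

Anything the sketch classifies as DELEGATE would be passed on unchanged,
so the regex layer never needs to understand regular SQL.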
Note: we should stick to the unquoted
SET table.exec.mini-batch.enabled = true syntax, both for backward
compatibility and for ease of use; all the other systems don't have
quotes on the key either.
Regarding "table.planner" vs "sql-client.planner": if we want to use
"table.planner", I think we should explain clearly in the documentation
the scope in which it can be used. Otherwise, users will complain that
the planner doesn't change when they set the configuration on the
TableEnv. It would be better to throw an exception to tell users that
it's not allowed to change the planner after the TableEnv is
initialized. However, that doesn't seem easy to implement.
Best,
Jark
On Thu, 4 Feb 2021 at 15:49, godfrey he <godfre...@gmail.com> wrote:
Hi everyone,
Regarding "table.planner" and "table.execution-mode":
If we define that those two options are only used to initialize the
TableEnvironment, +1 for introducing table options instead of
sql-client options.
Regarding "the sql client, we will maintain two parsers", I want to
give more input:
We want to introduce sql-gateway into the Flink project (see FLIP-24 &
FLIP-91 for more info [1] [2]). In the "gateway" mode, the CLI client
and the gateway service will communicate through a REST API. The
"ADD JAR /local/path/jar" will be executed on the CLI client machine.
So when we submit a sql file which contains multiple statements, the
CLI client needs to pick out the "ADD JAR" lines, and the statements
also need to be submitted or executed one by one to make sure the
result is correct. The sql file may look like:
SET xxx=yyy;
create table my_table ...;
create table my_sink ...;
ADD JAR /local/path/jar1;
create function my_udf as com....MyUdf;
insert into my_sink select ..., my_udf(xx) from ...;
REMOVE JAR /local/path/jar1;
drop function my_udf;
ADD JAR /local/path/jar2;
create function my_udf as com....MyUdf2;
insert into my_sink select ..., my_udf(xx) from ...;
The lines need to be split into multiple statements first in the CLI
client. There are two approaches:
1. The CLI client depends on the sql-parser: the sql-parser splits the
lines and tells which lines are "ADD JAR".
pro: there is only one parser.
cons: it's a little heavy for the CLI client to depend on the
sql-parser, because the CLI client is just a simple tool which receives
the user commands and displays the result. The non-"ADD JAR" commands
will be parsed twice.
2. The CLI client splits the lines into multiple statements and finds
the ADD JAR command through regex matching.
pro: the CLI client is very light-weight.
cons: there are two parsers.
(personally, I prefer the second option)
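The second approach could be sketched roughly as follows in Java. The
class `ScriptSplitter` and its methods are hypothetical names used only
for illustration, and the split on ";" is deliberately naive: it assumes
that no ";" appears inside string literals or comments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of approach 2, not Flink code.
public class ScriptSplitter {
    private static final Pattern ADD_JAR =
            Pattern.compile("ADD\\s+JAR\\s+\\S+", Pattern.CASE_INSENSITIVE);

    // Naive split on ';'. Assumes no ';' inside string literals or comments.
    public static List<String> split(String script) {
        List<String> statements = new ArrayList<>();
        for (String part : script.split(";")) {
            String stmt = part.trim();
            if (!stmt.isEmpty()) {
                statements.add(stmt);
            }
        }
        return statements;
    }

    // True if the statement is an ADD JAR command, which would have to be
    // handled on the CLI client machine instead of being sent to the gateway.
    public static boolean isClientSide(String statement) {
        return ADD_JAR.matcher(statement.trim()).matches();
    }
}
```

The client would then walk the list in order, handle the statements for
which `isClientSide` returns true locally, and submit the rest one by
one.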
Regarding "SHOW or LIST JARS", I think we can support them both.
For the default dialect, we support SHOW JARS, but if we switch to the
hive dialect, LIST JARS is also supported.
[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-24+-+SQL+Client
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-91%3A+Support+SQL+Client+Gateway
Best,
Godfrey
Rui Li <lirui.fu...@gmail.com> wrote on Thu, Feb 4, 2021 at 10:40 AM:
Hi guys,
Regarding #3 and #4, I agree SHOW JARS is more consistent with other
commands than LIST JARS. I don't have a strong opinion about REMOVE vs
DELETE though.
While Flink doesn't need to follow Hive syntax, as far as I know, most
users who are requesting these features were previously Hive users. So
I wonder whether we can support both LIST/SHOW JARS and REMOVE/DELETE
JARS as synonyms? It's just like lots of systems accept both EXIT and
QUIT as the command to terminate the program. So if that's not hard to
achieve, and will make users happier, I don't see a reason why we must
choose one over the other.
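One way the synonym idea could work is a small lookup that maps each
accepted spelling to a single canonical command, sketched below in Java.
The class `JarCommandSynonyms` is hypothetical, and the choice of
SHOW JARS and REMOVE JAR as the canonical forms is only an assumption
for the example, not a decision from this thread.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not Flink code: map Hive-style verbs onto one
// canonical command so both spellings are accepted.
public class JarCommandSynonyms {
    private static final Map<String, String> CANONICAL = Map.of(
            "SHOW JARS", "SHOW JARS",
            "LIST JARS", "SHOW JARS",
            "REMOVE JAR", "REMOVE JAR",
            "DELETE JAR", "REMOVE JAR");

    // Returns the canonical command prefix, or null if the statement
    // is not one of the known JAR commands.
    public static String canonicalize(String statement) {
        String s = statement.trim();
        if (s.endsWith(";")) {
            s = s.substring(0, s.length() - 1).trim();
        }
        String normalized = s.replaceAll("\\s+", " ").toUpperCase(Locale.ROOT);
        for (Map.Entry<String, String> e : CANONICAL.entrySet()) {
            if (normalized.equals(e.getKey())
                    || normalized.startsWith(e.getKey() + " ")) {
                return e.getValue();
            }
        }
        return null;
    }
}
```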
On Wed, Feb 3, 2021 at 10:33 PM Timo Walther <twal...@apache.org> wrote:
Hi everyone,
some feedback regarding the open questions. Maybe we can discuss the
`TableEnvironment.executeMultiSql` story offline to determine how we
proceed with this in the near future.
1) "whether the table environment has the ability to update itself"
Maybe there was some misunderstanding. I don't think that we should
support
`tEnv.getConfig.getConfiguration.setString("table.planner", "old")`.
Instead I'm proposing to support
`TableEnvironment.create(Configuration)` where planner and execution
mode are read immediately and subsequent changes to these options will
have no effect. We are doing it similarly in
`new StreamExecutionEnvironment(Configuration)`. These two
ConfigOptions must not be SQL Client specific but can be part of the
core table code base. Many users would like to get a 100% preconfigured
environment from just a Configuration, and this is not possible right
now. We can solve both use cases in one change.
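The "read immediately, later changes have no effect" behavior can be
illustrated with a small stand-alone Java sketch. `SnapshotEnv` is a
hypothetical stand-in and not the Flink `TableEnvironment`; the option
keys are the ones proposed in this thread, while the default values
"blink" and "streaming" are assumptions made only for the example.

```java
import java.util.Map;

// Hypothetical stand-in for a table environment; not the Flink API.
public class SnapshotEnv {
    private final String planner;
    private final String executionMode;
    private final Map<String, String> config;

    private SnapshotEnv(Map<String, String> config) {
        this.config = config;
        // Read once at creation; later changes to these two keys are ignored.
        this.planner = config.getOrDefault("table.planner", "blink");
        this.executionMode =
                config.getOrDefault("table.execution-mode", "streaming");
    }

    public static SnapshotEnv create(Map<String, String> config) {
        return new SnapshotEnv(config);
    }

    public String planner() {
        return planner;
    }

    public String executionMode() {
        return executionMode;
    }

    // Other options stay live and can still be changed through this map.
    public Map<String, String> config() {
        return config;
    }
}
```

With this shape, setting "table.planner" on the map after `create` has
returned changes the map but not the value reported by `planner()`.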
2) "the sql client, we will maintain two parsers"
I remember we had some discussion about this and decided that we would
like to maintain only one parser. In the end it is "One Flink SQL"
where commands influence each other, also with respect to keywords. It
should be fine to include the SQL Client commands in the Flink parser.
Of course the table environment would not be able to handle the
`Operation` instance that would be the result, but we can introduce
hooks to handle those `Operation`s. Or we introduce parser extensions.
Can we skip `table.job.async` in the first version? We should further
discuss whether we introduce a special SQL clause for wrapping async
behavior or whether we use a config option. Esp. for streaming queries
we need to be careful and should force users to use either "one INSERT
INTO" or "one STATEMENT SET".
3) 4) "HIVE also uses these commands"
In general, Hive is not a good reference. Aligning the commands more
with the remaining commands should be our goal. We just had a MODULE
discussion where we selected SHOW instead of LIST. But it is true that
JARs are not part of the catalog, which is why I would not use
CREATE/DROP. ADD/REMOVE are commonly siblings in the English language.
Take a look at the Java collection API as another example.
6) "Most of the commands should belong to the table environment"
Thanks for updating the FLIP, this makes things easier to understand.
It is good to see that most commands will be available in
TableEnvironment. However, I would also support SET and RESET for
consistency. Again, from an architectural point of view, if we allowed
some kind of `Operation` hook in the table environment, we could check
for SQL Client specific options and forward to the regular
`TableConfig.getConfiguration` otherwise. What do you think?
Regards,
Timo
On 03.02.21 08:58, Jark Wu wrote:
Hi Timo,
I will respond to some of the questions:
1) SQL client specific options
Whether it starts with "table" or "sql-client" depends on where the
configuration takes effect. If it is a table configuration, we should
make clear what the behavior is when users change the configuration
during the lifecycle of the TableEnvironment.
I agree with Shengkai that `sql-client.planner` and
`sql-client.execution.mode` are something special that can't be changed
after the TableEnvironment has been initialized. You can see that
`StreamExecutionEnvironment` provides a `configure()` method to
override the configuration after the StreamExecutionEnvironment has
been initialized.
Therefore, I think it would be better to still use `sql-client.planner`
and `sql-client.execution.mode`.
2) Execution file
From my point of view, there is a big difference between
`sql-client.job.detach` and `TableEnvironment.executeMultiSql()`:
`sql-client.job.detach` will affect every single DML statement in the
terminal, not only the statements in SQL files. I think a single DML
statement in the interactive terminal is something like
tEnv#executeSql() rather than tEnv#executeMultiSql. So I don't like the
"multi" and "sql" keywords in `table.multi-sql-async`.
I just found that the runtime provides a configuration called
"execution.attached" [1], which is false by default and specifies
whether the pipeline is submitted in attached or detached mode. It
provides exactly the same functionality as `sql-client.job.detach`.
What do you think about using this option?
If we also want to support this config in TableEnvironment, I think it
should also affect the DML execution of `tEnv#executeSql()`, not only
DMLs in `tEnv#executeMultiSql()`.
Therefore, the behavior may look like this:
val tableResult = tEnv.executeSql("INSERT INTO ...")  ==> async by default
tableResult.await()  ==> manually block until finish
tEnv.getConfig().getConfiguration().setString("execution.attached", "true")
val tableResult2 = tEnv.executeSql("INSERT INTO ...")  ==> sync, don't need to wait on the TableResult
tEnv.executeMultiSql(
"""
CREATE TABLE ....  ==> always sync
INSERT INTO ...  ==> sync, because we set configuration above
SET execution.attached = false;
INSERT INTO ...  ==> async
""")
On the other hand, I think `sql-client.job.detach` and
`TableEnvironment.executeMultiSql()` should be two separate topics. As
Shengkai mentioned above, the SQL CLI only depends on
`TableEnvironment#executeSql()` to support multi-line statements.
I'm fine with making `executeMultiSql()` clear but don't want it to
block this FLIP; maybe we can discuss this in another thread.
Best,
Jark
[1]
https://ci.apache.org/projects/flink/flink-docs-master/deployment/config.html#execution-attached
On Wed, 3 Feb 2021 at 15:33, Shengkai Fang <fskm...@gmail.com> wrote:
Hi, Timo.
Thanks for your detailed feedback. I have some thoughts about it.
*Regarding #1*: I think the main problem is whether the table
environment has the ability to update itself. Let's take a simple
program as an example.
```
TableEnvironment tEnv = TableEnvironment.create(...);
// change the planner after the environment has already been created
tEnv.getConfig().getConfiguration().setString("table.planner", "old");
tEnv.executeSql("...");
```
If we regard this option as a table option, users don't have to create
another table environment manually. In that case, tEnv needs to check
whether the current mode and planner are the same as before on every
executeSql or explainSql call. I don't think that's easy work for the
table environment, especially if users have a
StreamExecutionEnvironment but set the old planner and batch mode. But
when we make this option a sql client option, users can only use the
SET command to change the setting. We can rebuild a new table
environment when the SET succeeds.
*Regarding #2*: I think we need to discuss the implementation before
continuing this topic. In the sql client, we will maintain two parsers.
The first parser (client parser) will only match the sql client
commands. If the client parser can't parse the statement, we will
leverage the power of the table environment to execute it. According to
our blueprint, TableEnvironment#executeSql is enough for the sql
client. Therefore, TableEnvironment#executeMultiSql is out-of-scope for
this FLIP.
But if we need to introduce `TableEnvironment.executeMultiSql` in the
future, I think it's OK to use the option `table.multi-sql-async`
rather than the option `sql-client.job.detach`. But we think the name
is not suitable because it is confusing for others. When setting the
option to false, we just mean it will block the execution of the INSERT
INTO statement, not DDL or others (other sql statements are always
executed synchronously). So how about `table.job.async`? It only works
for the sql-client and executeMultiSql. If we set this value to false,
the table environment will not return the result until the job
finishes.
*Regarding #3, #4*: I still think we should use DELETE JAR and LIST
JAR because HIVE also uses these commands to add the jar into the
classpath or delete the jar. If we use such commands, it can reduce our
work for hive compatibility.
For SHOW JAR, I think the main concern is that the jars are not
maintained by the Catalog. If we really need to keep consistent with
the SQL grammar, maybe we should use
`ADD JAR` -> `CREATE JAR`,
`DELETE JAR` -> `DROP JAR`,
`LIST JAR` -> `SHOW JAR`.
*Regarding #5*: I agree with you that we'd better keep consistent.
*Regarding #6*: Yes. Most of the commands should belong to the table
environment. In the Summary section, I use the <NOTE> tag to identify
which commands should belong to the sql client and which commands
should belong to the table environment. I also added a new section
about implementation details to the FLIP.
Best,
Shengkai
Timo Walther <twal...@apache.org> wrote on Tue, Feb 2, 2021 at 6:43 PM:
Thanks for this great proposal Shengkai. This will give the SQL Client
a very good update and make it production ready.
Here is some feedback from my side:
1) SQL client specific options
I don't think that `sql-client.planner` and `sql-client.execution.mode`
are SQL Client specific. Similar to `StreamExecutionEnvironment` and
`ExecutionConfig#configure` that have been added recently, we should
offer a possibility for TableEnvironment. How about we offer
`TableEnvironment.create(ReadableConfig)` and add a `table.planner` and
`table.execution-mode` to
`org.apache.flink.table.api.config.TableConfigOptions`?
2) Execution file
Did you have a look at the Appendix of FLIP-84 [1] including the
mailing list thread at that time? Could you further elaborate how the
multi-statement execution should work for a unified batch/streaming
story? According to our past discussions, each line in an execution
file should be executed blocking, which means a streaming query needs a
statement set to execute multiple INSERT INTO statements, correct? We
should also offer this functionality in
`TableEnvironment.executeMultiSql()`. Whether `sql-client.job.detach`
is SQL Client specific needs to be determined; it could also be a
general `table.multi-sql-async` option?
3) DELETE JAR
Shouldn't the opposite of "ADD" be "REMOVE"? "DELETE" sounds like one
is actively deleting the JAR in the corresponding path.
4) LIST JAR
This should be `SHOW JARS` according to other SQL commands such as
`SHOW CATALOGS`, `SHOW TABLES`, etc. [2].
5) EXPLAIN [ExplainDetail[, ExplainDetail]*]
We should keep the details in sync with
`org.apache.flink.table.api.ExplainDetail` and avoid confusion about
differently named ExplainDetails. I would vote for `ESTIMATED_COST`
instead of `COST`. I'm sure the original author had a reason to call it
that way.
6) Implementation details
It would be nice to understand how we plan to implement the given
features. Most of the commands and config options should go into
TableEnvironment and SqlParser directly, correct? This way users have a
unified way of using Flink SQL. TableEnvironment would provide a
similar user experience in notebooks or interactive programs to the SQL
Client.
[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=134745878
[2]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/show.html
Regards,
Timo
On 02.02.21 10:13, Shengkai Fang wrote:
Sorry for the typo. I mean `RESET` is much better than `UNSET`.
Shengkai Fang <fskm...@gmail.com> wrote on Tue, Feb 2, 2021 at 4:44 PM:
Hi, Jingsong.
Thanks for your reply. I think `UNSET` is much better.
1. We don't need to introduce another command `UNSET`. `RESET` is
already supported in the current sql client. Our proposal just extends
its grammar and allows users to reset the specified keys.
2. Hive beeline also uses `RESET` to set the key to the default
value [1]. I think it is more friendly for batch users.
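To make the extended RESET semantics concrete, here is a small Java
sketch of a session configuration with defaults and per-session
overrides. `SessionConfig` and its methods are hypothetical and only
illustrate the idea of resetting a single key back to its default; they
are not the sql client implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Flink code: session config with
// SET <key>=<value>, RESET and RESET <key>.
public class SessionConfig {
    private final Map<String, String> defaults;
    private final Map<String, String> overrides = new HashMap<>();

    public SessionConfig(Map<String, String> defaults) {
        this.defaults = new HashMap<>(defaults);
    }

    // SET <key> = <value>
    public void set(String key, String value) {
        overrides.put(key, value);
    }

    // RESET without a key: drop all session overrides.
    public void reset() {
        overrides.clear();
    }

    // RESET <key>: drop only this key's override so the default applies again.
    public void reset(String key) {
        overrides.remove(key);
    }

    public String get(String key) {
        return overrides.containsKey(key) ? overrides.get(key) : defaults.get(key);
    }
}
```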
Best,
Shengkai
[1]
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
Jingsong Li <jingsongl...@gmail.com> wrote on Tue, Feb 2, 2021 at 1:56 PM:
Thanks for the proposal, yes, sql-client is too outdated. +1 for
improving it.
About "SET" and "RESET": why not "SET" and "UNSET"?
Best,
Jingsong
On Mon, Feb 1, 2021 at 2:46 PM Rui Li <lirui.fu...@gmail.com> wrote:
Thanks Shengkai for the update! The proposed changes look good to me.
On Fri, Jan 29, 2021 at 8:26 PM Shengkai Fang <fskm...@gmail.com> wrote:
Hi, Rui.
You are right. I have already modified the FLIP.
The main changes:
# The -f parameter has no restriction on the statement type.
Sometimes, users use a pipe to redirect the result of queries for
debugging when submitting a job with the -f parameter. That is much
more convenient than writing INSERT INTO statements.
# Add a new sql client option `sql-client.job.detach`.
Users prefer to execute jobs one by one in batch mode. Users can set
this option to false and the client will not process the next job until
the current job finishes. The default value of this option is false,
which means the client will execute the next job when the current job
is submitted.
Best,
Shengkai
Rui Li <lirui.fu...@gmail.com> wrote on Fri, Jan 29, 2021 at 4:52 PM:
Hi Shengkai,
Regarding #2, maybe the -f options in Flink and Hive have different
implications, and we should clarify the behavior. For example, if the
client just submits the job and exits, what happens if the file
contains two INSERT statements? I don't think we should treat them as a
statement set, because users should explicitly write BEGIN STATEMENT
SET in that case. And the client shouldn't asynchronously submit the
two jobs, because the 2nd may depend on the 1st, right?
On Fri, Jan 29, 2021 at 4:30 PM Shengkai Fang <fskm...@gmail.com> wrote:
Hi Rui,
Thanks for your feedback. I agree with your suggestions.
For suggestion 1: Yes, we plan to strengthen the SET command. In the
implementation, it will just put the key-value pair into the
`Configuration`, which will be used to generate the table config. If
hive supports reading the settings from the table config, users are
able to set the hive-related settings.
For suggestion 2: The -f parameter will submit the job and exit. If the
queries never end, users have to cancel the job by themselves, which is
not reliable (people may forget their jobs). In most cases, queries are
used to analyze the data. Users should use queries in the interactive
mode.
Best,
Shengkai
Rui Li <lirui.fu...@gmail.com> wrote on Fri, Jan 29, 2021 at 3:18 PM:
Thanks Shengkai for bringing up this discussion. I think it covers a
lot of useful features which will dramatically improve the usability of
our SQL Client. I have two questions regarding the FLIP.
1. Do you think we can let users set arbitrary configurations via the
SET command? A connector may have its own configurations and we don't
have a way to dynamically change such configurations in SQL Client. For
example, users may want to be able to change hive conf when using the
hive connector [1].
2. Any reason why we have to forbid queries in SQL files specified with
the -f option? Hive supports a similar -f option but allows queries in
the file. And a common use case is to run some query and redirect the
results to a file. So I think maybe Flink users would like to do the
same, especially in batch scenarios.
[1] https://issues.apache.org/jira/browse/FLINK-20590
On Fri, Jan 29, 2021 at 10:46 AM Sebastian Liu <liuyang0...@gmail.com> wrote:
Hi Shengkai,
Glad to see this improvement. And I have some additional suggestions:
#1. Unify the TableEnvironment in ExecutionContext to
StreamTableEnvironment for both streaming and batch sql.
#2. Improve the way results are retrieved: at present the sql client
collects the results locally all at once using accumulators, which may
cause memory issues in the JM or locally for big query results.
Accumulators are only suitable for testing purposes. We may change to
use SelectTableSink, which is based on CollectSinkOperatorCoordinator.
#3. Do we need to consider the Flink SQL gateway from FLIP-91 [1]? It
seems that this FLIP has not moved forward for a long time. Providing a
long-running service out of the box to facilitate sql submission is
necessary.
What do you think of these?
[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-91%3A+Support+SQL+Client+Gateway
Shengkai Fang <fskm...@gmail.com> wrote on Thu, Jan 28, 2021 at 8:54 PM:
Hi devs,
Jark and I want to start a discussion about FLIP-163: SQL Client
Improvements.
Many users have complained about the problems of the sql client. For
example, users can not register the table proposed by FLIP-95.
The main changes in this FLIP:
- use the -i parameter to specify the sql file to initialize the table
environment and deprecate the YAML file;
- add -f to submit a sql file and deprecate the '-u' parameter;
- add more interactive commands, e.g. ADD JAR;
- support statement set syntax;
For more detailed changes, please refer to FLIP-163 [1].
Look forward to your feedback.
Best,
Shengkai
[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-163%3A+SQL+Client+Improvements
--
*With kind regards
------------------------------------------------------------
Sebastian Liu 刘洋
Institute of Computing Technology, Chinese Academy of
Science
Mobile\WeChat: +86—15201613655
E-mail: liuyang0...@gmail.com <liuyang0...@gmail.com
QQ: 3239559*
--
Best regards!
Rui Li
--
Best, Jingsong Lee