Hi Paul,

Great that you could find something in the SQL standard! I'll try to read the FLIP once more completely next week to see if I have any more concerns.
Best regards,

Martijn

On Fri, 6 May 2022 at 08:21, Paul Lam <paullin3...@gmail.com> wrote:

> I had a look at SQL-2016 that Martijn mentioned, and found that maybe we could follow the transaction savepoint syntax.
>
> - SAVEPOINT <savepoint specifier>
> - RELEASE SAVEPOINT <savepoint specifier>
>
> These savepoint statements are supported in lots of databases, like Oracle[1], PG[2], MariaDB[3].
>
> They're usually used in the middle of a SQL transaction, so the target would be the current transaction. But if used in a Flink SQL session, we would need to add a JOB/QUERY id when creating a savepoint, thus the syntax would be:
>
> - SAVEPOINT <job/query id> <savepoint path>
> - RELEASE SAVEPOINT <savepoint path>
>
> I'm adding it as an alternative in the FLIP.
>
> [1] https://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_10001.htm
> [2] https://www.postgresql.org/docs/current/sql-savepoint.html
> [3] https://mariadb.com/kb/en/savepoint/
>
> Best,
> Paul Lam
>
> On 4 May 2022 at 16:42, Paul Lam <paullin3...@gmail.com> wrote:
>
> Hi Shengkai,
>
> Thanks a lot for your input!
>
> > I just wonder how the users can get the web ui in the application mode. Therefore, it's better we can list the Web UI using the SHOW statement. WDYT?
>
> I think it's a valid approach. I'm adding it to the FLIP.
>
> > After the investigation, I am fine with the QUERY but the keyword JOB is also okay to me.
>
> In addition, CockroachDB has both SHOW QUERIES [1] and SHOW JOBS [2]: the former shows the active running queries, while the latter shows background tasks like schema changes. FYI.
>
> WRT the questions:
>
> > 1. Could you add some details about the behaviour with the different execution.target, e.g. session, application mode?
>
> IMHO, the difference between different `execution.target` values is mostly about cluster startup, which has little relation to the proposed statements.
> These statements rely on the current ClusterClient/JobClient API, which is deployment-mode agnostic. Canceling a job in an application cluster is the same as in a session cluster.
>
> BTW, application mode is still under development ATM [3].
>
> > 2. Considering the SQL Client/Gateway is not limited to submitting the job to the specified cluster, is it able to list jobs in the other clusters?
>
> I think multi-cluster support in SQL Client/Gateway should be aligned with the CLI, at least in the early phase. We may use SET to set a cluster id for a session, and then we have access to that cluster. However, every SHOW statement would involve only one cluster.
>
> Best,
> Paul Lam
>
> [1] https://www.cockroachlabs.com/docs/stable/show-statements.html
> [2] https://www.cockroachlabs.com/docs/v21.2/show-jobs
> [3] https://issues.apache.org/jira/browse/FLINK-26541
>
> On Fri, 29 Apr 2022 at 15:36, Shengkai Fang <fskm...@gmail.com> wrote:
>
>> Hi.
>>
>> Thanks for Paul's update.
>>
>> > It's better we can also get the info about the cluster the job is running on through the DESCRIBE statement.
>>
>> I just wonder how the users can get the web ui in the application mode. Therefore, it's better we can list the Web UI using the SHOW statement. WDYT?
>>
>> > QUERY or other keywords.
>>
>> I list the statements to manage the lifecycle of queries/DMLs in other systems:
>>
>> MySQL[1] allows users to SHOW [FULL] PROCESSLIST and use the KILL command to kill a query.
>>
>> ```
>> mysql> SHOW PROCESSLIST;
>>
>> mysql> KILL 27;
>> ```
>>
>> Postgres uses the following statements to kill queries.
>>
>> ```
>> SELECT pg_cancel_backend(<pid of the process>)
>>
>> SELECT pg_terminate_backend(<pid of the process>)
>> ```
>>
>> KSQL uses the following commands to control the query lifecycle[4].
>>
>> ```
>> SHOW QUERIES;
>>
>> TERMINATE <query id>;
>> ```
>>
>> [1] https://dev.mysql.com/doc/refman/8.0/en/show-processlist.html
>> [2] https://scaledynamix.com/blog/how-to-kill-mysql-queries/
>> [3] https://stackoverflow.com/questions/35319597/how-to-stop-kill-a-query-in-postgresql
>> [4] https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/show-queries/
>> [5] https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/terminate/
>>
>> After the investigation, I am fine with the QUERY but the keyword JOB is also okay to me.
>>
>> We also have two questions here.
>>
>> 1. Could you add some details about the behaviour with the different execution.target, e.g. session, application mode?
>>
>> 2. Considering the SQL Client/Gateway is not limited to submitting the job to the specified cluster, is it able to list jobs in the other clusters?
>>
>> Best,
>> Shengkai
>>
>> On Thu, 28 Apr 2022 at 17:17, Paul Lam <paullin3...@gmail.com> wrote:
>>
>> > Hi Martijn,
>> >
>> > Thanks a lot for your reply! I agree that the scope may be a bit confusing, so let me clarify.
>> >
>> > The FLIP aims to add new SQL statements that are supported only in the sql-client, similar to jar statements [1]. Jar statements can be parsed into jar operations, which are used only in CliClient in the sql-client module and cannot be executed by TableEnvironment (not available in a Table API program that contains SQL, as you mentioned).
>> >
>> > WRT the unchanged CLI client, I mean CliClient instead of the sql-client module, which currently contains the gateway codes (e.g. Executor). The FLIP mainly extends the gateway part, and barely touches CliClient and the REST server (REST endpoint in FLIP-91).
>> >
>> > WRT the syntax, I don't have much experience with SQL standards, and I'd like to hear more opinions from the community.
>> > I prefer Hive-style syntax because I think many users are familiar with Hive, and there are ongoing efforts to improve Flink-Hive integration [2][3]. But my preference is not strong; I'm okay with other options too. Do you think JOB/TASK is a good choice, or do you have other preferred keywords?
>> >
>> > [1] https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/sql/jar/
>> > [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-152%3A+Hive+Query+Syntax+Compatibility
>> > [3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-223%3A+Support+HiveServer2+Endpoint
>> >
>> > Best,
>> > Paul Lam
>> >
>> > On Tue, 26 Apr 2022 at 20:14, Martijn Visser <martijnvis...@apache.org> wrote:
>> >
>> > > Hi Paul,
>> > >
>> > > Thanks for creating the FLIP and opening the discussion. I did get a bit confused about the title, being "query lifecycle statements in SQL client". This sounds like you want to adapt the SQL client, but you want to expand the SQL syntax with lifecycle statements, which could be used from the SQL client, but of course also in a Table API program that contains SQL. Given that you're highlighting the CLI client as unchanged, this adds to the confusion.
>> > >
>> > > I am interested if there's anything listed in the SQL 2016 standard on these types of lifecycle statements. I did a quick scan for "SHOW QUERIES" but couldn't find it. It would be great if we could stay as close as possible to such syntax. Overall I'm not in favour of using QUERIES as a keyword. I think Flink applications are not queries, but short- or long-running applications. Why should we follow Hive's setup and indeed not others such as Snowflake, but also Postgres or MySQL?
>> > >
>> > > Best regards,
>> > >
>> > > Martijn Visser
>> > > https://twitter.com/MartijnVisser82
>> > > https://github.com/MartijnVisser
>> > >
>> > > On Fri, 22 Apr 2022 at 12:06, Paul Lam <paullin3...@gmail.com> wrote:
>> > >
>> > > > Hi Shengkai,
>> > > >
>> > > > Thanks a lot for your opinions!
>> > > >
>> > > > > 1. I think the keyword QUERY may confuse users because the statement also works for the DML statement.
>> > > >
>> > > > I slightly lean toward QUERY, because:
>> > > >
>> > > > - Hive calls DMLs queries. We could be better aligned with Hive using QUERY, especially given that we plan to introduce the Hive endpoint.
>> > > > - QUERY is a more SQL-like concept and friendly to SQL users.
>> > > >
>> > > > In general, my preference is: QUERY > JOB > TASK. I'm okay with JOB, but not very good with TASK, as it conflicts with the task concept in the Flink runtime.
>> > > >
>> > > > We could wait for more feedback from the community.
>> > > >
>> > > > > 2. STOP/CANCEL is not very straightforward for the SQL users to terminate their jobs.
>> > > >
>> > > > Agreed. I'm okay with DROP. And if we want to align with Hive, KILL might be an alternative.
>> > > >
>> > > > > 3. I think the CREATE/DROP SAVEPOINT statements are more SQL-like.
>> > > >
>> > > > Agreed. It's more SQL-like and intuitive. I'm updating the syntax on the FLIP.
>> > > >
>> > > > > 4. SHOW TASKS can just list the job id and use DESCRIBE to get more detailed job info.
>> > > >
>> > > > That is a more SQL-like approach, I think. But considering the ClusterClient APIs, we can fetch the names and the status along in one request, thus it may be more user-friendly to return them all in the SHOW statement?
>> > > >
>> > > > > It's better we can also get the info about the cluster the job is running on through the DESCRIBE statement.
>> > > >
>> > > > I think cluster info could be part of the session properties instead. WDYT?
>> > > >
>> > > > Best,
>> > > > Paul Lam
>> > > >
>> > > > On 22 Apr 2022 at 11:14, Shengkai Fang <fskm...@gmail.com> wrote:
>> > > > >
>> > > > > Hi Paul
>> > > > >
>> > > > > Sorry for the late response. Here are my thoughts.
>> > > > >
>> > > > > 1. I think the keyword QUERY may confuse users because the statement also works for the DML statement. I find that Snowflake[1] supports
>> > > > >
>> > > > > - CREATE TASK
>> > > > > - DROP TASK
>> > > > > - ALTER TASK
>> > > > > - SHOW TASKS
>> > > > > - DESCRIBE TASK
>> > > > >
>> > > > > I think we can follow Snowflake to use `TASK` as the keyword, or use the keyword `JOB`?
>> > > > >
>> > > > > 2. STOP/CANCEL is not very straightforward for the SQL users to terminate their jobs.
>> > > > >
>> > > > > ```
>> > > > > DROP TASK [IF EXISTS] <job id> PURGE; -- Forcibly stop the job with drain
>> > > > >
>> > > > > DROP TASK [IF EXISTS] <job id>; -- Stop the task with savepoints
>> > > > > ```
>> > > > >
>> > > > > Oracle[2] uses PURGE to clean up the table so that users cannot recover it. I think it also works for us to terminate the job permanently.
>> > > > >
>> > > > > 3. I think the CREATE/DROP SAVEPOINT statements are more SQL-like. Users can use the following:
>> > > > >
>> > > > > ```
>> > > > > SET 'state.savepoints.dir' = '<path_to_savepoint>';
>> > > > > SET 'state.savepoints.format' = 'native';
>> > > > > CREATE SAVEPOINT <job id>;
>> > > > >
>> > > > > DROP SAVEPOINT <path_to_savepoint>;
>> > > > > ```
>> > > > >
>> > > > > 4. SHOW TASKS can just list the job id, and use DESCRIBE to get more detailed job info.
>> > > > >
>> > > > > ```
>> > > > > SHOW TASKS;
>> > > > >
>> > > > > +----------------------------------+
>> > > > > |              job_id              |
>> > > > > +----------------------------------+
>> > > > > | 0f6413c33757fbe0277897dd94485f04 |
>> > > > > +----------------------------------+
>> > > > >
>> > > > > DESCRIBE TASK <job id>;
>> > > > >
>> > > > > +-------------+---------+
>> > > > > |  job name   | status  |
>> > > > > +-------------+---------+
>> > > > > | insert-sink | running |
>> > > > > +-------------+---------+
>> > > > > ```
>> > > > >
>> > > > > It's better we can also get the info about the cluster the job is running on through the DESCRIBE statement.
>> > > > >
>> > > > > [1] https://docs.snowflake.com/en/sql-reference/ddl-pipeline.html#task-management
>> > > > > [2] https://docs.oracle.com/cd/E11882_01/server.112/e41084/statements_9003.htm#SQLRF01806
>> > > > >
>> > > > > On Thu, 21 Apr 2022 at 10:36, Paul Lam <paullin3...@gmail.com> wrote:
>> > > > >
>> > > > >> ping @Timo @Jark @Shengkai
>> > > > >>
>> > > > >> Best,
>> > > > >> Paul Lam
>> > > > >>
>> > > > >>> On 18 Apr 2022 at 17:12, Paul Lam <paullin3...@gmail.com> wrote:
>> > > > >>>
>> > > > >>> Hi team,
>> > > > >>>
>> > > > >>> I'd like to start a discussion about FLIP-222 [1], which adds query lifecycle statements to the SQL client.
>> > > > >>>
>> > > > >>> Currently, the SQL client supports submitting queries (queries in a broad sense, including DQLs and DMLs) but no further lifecycle statements, like canceling a query or triggering a savepoint. That makes SQL users have to rely on the CLI or REST API to manage their queries.
>> > > > >>>
>> > > > >>> Thus, I propose to introduce the following statements to fill the gap.
>> > > > >>>
>> > > > >>> - SHOW QUERIES
>> > > > >>> - STOP QUERY <query_id>
>> > > > >>> - CANCEL QUERY <query_id>
>> > > > >>> - TRIGGER SAVEPOINT <savepoint_path>
>> > > > >>> - DISPOSE SAVEPOINT <savepoint_path>
>> > > > >>>
>> > > > >>> These statements would align the SQL client with the CLI, providing full lifecycle management for queries/jobs.
>> > > > >>>
>> > > > >>> Please see the FLIP page [1] for more details. Thanks a lot! (For reference, see the previous discussion thread [2].)
>> > > > >>>
>> > > > >>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-222%3A+Support+full+query+lifecycle+statements+in+SQL+client
>> > > > >>> [2] https://lists.apache.org/thread/wr47ng0m2hdybjkrwjlk9ftwg403odqb
>> > > > >>>
>> > > > >>> Best,
>> > > > >>> Paul Lam
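[Editor's note: taken together, the statements discussed in this thread would let a SQL client session manage a job end to end. The sketch below combines the proposals as they stood at this point in the discussion (SHOW/STOP/CANCEL QUERY from the original FLIP draft and CREATE/DROP SAVEPOINT from Shengkai's suggestion); the job id and savepoint paths are hypothetical, and the final syntax was still under discussion.]

```sql
-- List running queries to find the target job id (hypothetical id below).
SHOW QUERIES;

-- Configure where savepoints are written before creating one.
SET 'state.savepoints.dir' = 'hdfs:///flink/savepoints';

-- Trigger a savepoint for a running job (Shengkai's SQL-like variant;
-- Paul's SQL-standard-inspired alternative was
-- SAVEPOINT <job/query id> <savepoint path>).
CREATE SAVEPOINT '0f6413c33757fbe0277897dd94485f04';

-- Gracefully stop the query, or cancel it outright.
STOP QUERY '0f6413c33757fbe0277897dd94485f04';
-- CANCEL QUERY '0f6413c33757fbe0277897dd94485f04';

-- Clean up a savepoint that is no longer needed (hypothetical path).
DROP SAVEPOINT 'hdfs:///flink/savepoints/savepoint-0f6413';
```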