Re: [DISCUSS] FLIP-222: Support full query lifecycle statements in SQL client

Jark Wu Wed, 08 Jun 2022 05:54:26 -0700

Hi Jing,

Regarding JOBTREE (job lineage), I agree with Paul that this is out of the
scope
 of this FLIP and can be discussed in another FLIP.


Job lineage is a big topic that may involve many problems:
1) how to collect and report job entities, attributes, and lineages?
2) how to integrate with data catalogs, e.g. Apache Atlas, DataHub?
3) how does Flink SQL CLI/Gateway know the lineage information and show
jobtree?
4) ...

Best,
Jark

On Wed, 8 Jun 2022 at 20:44, Jark Wu <[email protected]> wrote:

> Hi Paul,
>
> I'm fine with using JOBS. The only concern is that this may conflict with
> displaying more detailed
> information for query (e.g. query content, plan) in the future, e.g. SHOW
> QUERIES EXTENDED in ksqldb[1].
> This is not a big problem as we can introduce SHOW QUERIES in the future
> if necessary.
>
> > STOP JOBS <job_id> (with options `table.job.stop-with-savepoint` and
> `table.job.stop-with-drain`)
> What about STOP JOB <job_id> [WITH SAVEPOINT] [WITH DRAIN] ?
> It might be trivial and error-prone to set configuration before executing
> a statement,
> and the configuration will affect all statements after that.
>
> > CREATE SAVEPOINT <savepoint_path> FOR JOB <job_id>
> We can simplify the statement to "CREATE SAVEPOINT FOR JOB <job_id>",
> and always use configuration "state.savepoints.dir" as the default
> savepoint dir.
> The concern with using "<savepoint_path>" is here should be savepoint dir,
> and savepoint_path is the returned value.
>
> I'm fine with other changes.
>
> Thanks,
> Jark
>
> [1]:
> https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/show-queries/
>
>
>
>
> On Wed, 8 Jun 2022 at 15:07, Paul Lam <[email protected]> wrote:
>
>> Hi Jing,
>>
>> Thank you for your inputs!
>>
>> TBH, I haven’t considered the ETL scenario that you mentioned. I think
>> they’re managed just like other jobs interns of job lifecycles (please
>> correct me if I’m wrong).
>>
>> WRT to the SQL statements about SQL lineages, I think it might be a
>> little bit out of the scope of the FLIP, since it’s mainly about
>> lifecycles. By the way, do we have these functionalities in Flink CLI or
>> REST API already?
>>
>> WRT `RELEASE SAVEPOINT ALL`, I’m sorry for the deprecated FLIP docs, the
>> community is more in favor of `DROP SAVEPOINT <savepoint_path>`. I’m
>> updating the FLIP arcading to the latest discussions.
>>
>> Best,
>> Paul Lam
>>
>> 2022年6月8日 07:31，Jing Ge <[email protected]> 写道：
>>
>> Hi Paul,
>>
>> Sorry that I am a little bit too late to join this thread. Thanks for
>> driving this and starting this informative discussion. The FLIP looks
>> really interesting. It will help us a lot to manage Flink SQL jobs.
>>
>> Have you considered the ETL scenario with Flink SQL, where multiple SQLs
>> build a DAG for many DAGs?
>>
>> 1)
>> +1 for SHOW JOBS. I think sooner or later we will start to discuss how to
>> support ETL jobs. Briefly speaking, SQLs that used to build the DAG are
>> responsible to *produce* data as the result(cube, materialized view, etc.)
>> for the future consumption by queries. The INSERT INTO SELECT FROM example
>> in FLIP and CTAS are typical SQL in this case. I would prefer to call them
>> Jobs instead of Queries.
>>
>> 2)
>> Speaking of ETL DAG, we might want to see the lineage. Is it possible to
>> support syntax like:
>>
>> SHOW JOBTREE <job_id>  // shows the downstream DAG from the given job_id
>> SHOW JOBTREE <job_id> FULL // shows the whole DAG that contains the given
>> job_id
>> SHOW JOBTREES // shows all DAGs
>> SHOW ANCIENTS <job_id> // shows all parents of the given job_id
>>
>> 3)
>> Could we also support Savepoint housekeeping syntax? We ran into this
>> issue that a lot of savepoints have been created by customers (via their
>> apps). It will take extra (hacking) effort to clean it.
>>
>> RELEASE SAVEPOINT ALL
>>
>> Best regards,
>> Jing
>>
>> On Tue, Jun 7, 2022 at 2:35 PM Martijn Visser <[email protected]>
>> wrote:
>>
>>> Hi Paul,
>>>
>>> I'm still doubting the keyword for the SQL applications. SHOW QUERIES
>>> could
>>> imply that this will actually show the query, but we're returning IDs of
>>> the running application. At first I was also not very much in favour of
>>> SHOW JOBS since I prefer calling it 'Flink applications' and not 'Flink
>>> jobs', but the glossary [1] made me reconsider. I would +1 SHOW/STOP JOBS
>>>
>>> Also +1 for the CREATE/SHOW/DROP SAVEPOINT syntax.
>>>
>>> Best regards,
>>>
>>> Martijn
>>>
>>> [1]
>>>
>>> https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/glossary
>>>
>>> Op za 4 jun. 2022 om 10:38 schreef Paul Lam <[email protected]>:
>>>
>>> > Hi Godfrey,
>>> >
>>> > Sorry for the late reply, I was on vacation.
>>> >
>>> > It looks like we have a variety of preferences on the syntax, how
>>> about we
>>> > choose the most acceptable one?
>>> >
>>> > WRT keyword for SQL jobs, we use JOBS, thus the statements related to
>>> jobs
>>> > would be:
>>> >
>>> > - SHOW JOBS
>>> > - STOP JOBS <job_id> (with options `table.job.stop-with-savepoint` and
>>> > `table.job.stop-with-drain`)
>>> >
>>> > WRT savepoint for SQL jobs, we use the `CREATE/DROP` pattern with `FOR
>>> > JOB`:
>>> >
>>> > - CREATE SAVEPOINT <savepoint_path> FOR JOB <job_id>
>>> > - SHOW SAVEPOINTS FOR JOB <job_id> (show savepoints the current job
>>> > manager remembers)
>>> > - DROP SAVEPOINT <savepoint_path>
>>> >
>>> > cc @Jark @ShengKai @Martijn @Timo .
>>> >
>>> > Best,
>>> > Paul Lam
>>> >
>>> >
>>> > godfrey he <[email protected]> 于2022年5月23日周一 21:34写道：
>>> >
>>> >> Hi Paul,
>>> >>
>>> >> Thanks for the update.
>>> >>
>>> >> >'SHOW QUERIES' lists all jobs in the cluster, no limit on APIs
>>> >> (DataStream or SQL) or
>>> >> clients (SQL client or CLI).
>>> >>
>>> >> Is DataStream job a QUERY? I think not.
>>> >> For a QUERY, the most important concept is the statement. But the
>>> >> result does not contain this info.
>>> >> If we need to contain all jobs in the cluster, I think the name should
>>> >> be JOB or PIPELINE.
>>> >> I learn to SHOW PIPELINES and STOP PIPELINE [IF RUNNING] id.
>>> >>
>>> >> > SHOW SAVEPOINTS
>>> >> To list the savepoint for a specific job, we need to specify a
>>> >> specific pipeline,
>>> >> the syntax should be SHOW SAVEPOINTS FOR PIPELINE id
>>> >>
>>> >> Best,
>>> >> Godfrey
>>> >>
>>> >> Paul Lam <[email protected]> 于2022年5月20日周五 11:25写道：
>>> >> >
>>> >> > Hi Jark,
>>> >> >
>>> >> > WRT “DROP QUERY”, I agree that it’s not very intuitive, and that’s
>>> >> > part of the reason why I proposed “STOP/CANCEL QUERY” at the
>>> >> > beginning. The downside of it is that it’s not ANSI-SQL compatible.
>>> >> >
>>> >> > Another question is, what should be the syntax for ungracefully
>>> >> > canceling a query? As ShengKai pointed out in a offline discussion,
>>> >> > “STOP QUERY” and “CANCEL QUERY” might confuse SQL users.
>>> >> > Flink CLI has both stop and cancel, mostly due to historical
>>> problems.
>>> >> >
>>> >> > WRT “SHOW SAVEPOINT”, I agree it’s a missing part. My concern is
>>> >> > that savepoints are owned by users and beyond the lifecycle of a
>>> Flink
>>> >> > cluster. For example, a user might take a savepoint at a custom path
>>> >> > that’s different than the default savepoint path, I think jobmanager
>>> >> would
>>> >> > not remember that, not to mention the jobmanager may be a fresh new
>>> >> > one after a cluster restart. Thus if we support “SHOW SAVEPOINT”,
>>> it's
>>> >> > probably a best-effort one.
>>> >> >
>>> >> > WRT savepoint syntax, I’m thinking of the semantic of the savepoint.
>>> >> > Savepoints are alias for nested transactions in DB area[1], and
>>> there’s
>>> >> > correspondingly global transactions. If we consider Flink jobs as
>>> >> > global transactions and Flink checkpoints as nested transactions,
>>> >> > then the savepoint semantics are close, thus I think savepoint
>>> syntax
>>> >> > in SQL-standard could be considered. But again, I’m don’t have very
>>> >> > strong preference.
>>> >> >
>>> >> > Ping @Timo to get more inputs.
>>> >> >
>>> >> > [1] https://en.wikipedia.org/wiki/Nested_transaction <
>>> >> https://en.wikipedia.org/wiki/Nested_transaction>
>>> >> >
>>> >> > Best,
>>> >> > Paul Lam
>>> >> >
>>> >> > > 2022年5月18日 17:48，Jark Wu <[email protected]> 写道：
>>> >> > >
>>> >> > > Hi Paul,
>>> >> > >
>>> >> > > 1) SHOW QUERIES
>>> >> > > +1 to add finished time, but it would be better to call it
>>> "end_time"
>>> >> to
>>> >> > > keep aligned with names in Web UI.
>>> >> > >
>>> >> > > 2) DROP QUERY
>>> >> > > I think we shouldn't throw exceptions for batch jobs, otherwise,
>>> how
>>> >> to
>>> >> > > stop batch queries?
>>> >> > > At present, I don't think "DROP" is a suitable keyword for this
>>> >> statement.
>>> >> > > From the perspective of users, "DROP" sounds like the query
>>> should be
>>> >> > > removed from the
>>> >> > > list of "SHOW QUERIES". However, it doesn't. Maybe "STOP QUERY" is
>>> >> more
>>> >> > > suitable and
>>> >> > > compliant with commands of Flink CLI.
>>> >> > >
>>> >> > > 3) SHOW SAVEPOINTS
>>> >> > > I think this statement is needed, otherwise, savepoints are lost
>>> >> after the
>>> >> > > SAVEPOINT
>>> >> > > command is executed. Savepoints can be retrieved from REST API
>>> >> > > "/jobs/:jobid/checkpoints"
>>> >> > > with filtering "checkpoint_type"="savepoint". It's also worth
>>> >> considering
>>> >> > > providing "SHOW CHECKPOINTS"
>>> >> > > to list all checkpoints.
>>> >> > >
>>> >> > > 4) SAVEPOINT & RELEASE SAVEPOINT
>>> >> > > I'm a little concerned with the SAVEPOINT and RELEASE SAVEPOINT
>>> >> statements
>>> >> > > now.
>>> >> > > In the vendors, the parameters of SAVEPOINT and RELEASE SAVEPOINT
>>> are
>>> >> both
>>> >> > > the same savepoint id.
>>> >> > > However, in our syntax, the first one is query id, and the second
>>> one
>>> >> is
>>> >> > > savepoint path, which is confusing and
>>> >> > > not consistent. When I came across SHOW SAVEPOINT, I thought maybe
>>> >> they
>>> >> > > should be in the same syntax set.
>>> >> > > For example, CREATE SAVEPOINT FOR [QUERY] <query_id> & DROP
>>> SAVEPOINT
>>> >> > > <sp_path>.
>>> >> > > That means we don't follow the majority of vendors in SAVEPOINT
>>> >> commands. I
>>> >> > > would say the purpose is different in Flink.
>>> >> > > What other's opinion on this?
>>> >> > >
>>> >> > > Best,
>>> >> > > Jark
>>> >> > >
>>> >> > > [1]:
>>> >> > >
>>> >>
>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid-checkpoints
>>> >> > >
>>> >> > >
>>> >> > > On Wed, 18 May 2022 at 14:43, Paul Lam <[email protected]>
>>> wrote:
>>> >> > >
>>> >> > >> Hi Godfrey,
>>> >> > >>
>>> >> > >> Thanks a lot for your inputs!
>>> >> > >>
>>> >> > >> 'SHOW QUERIES' lists all jobs in the cluster, no limit on APIs
>>> >> (DataStream
>>> >> > >> or SQL) or
>>> >> > >> clients (SQL client or CLI). Under the hook, it’s based on
>>> >> > >> ClusterClient#listJobs, the
>>> >> > >> same with Flink CLI. I think it’s okay to have non-SQL jobs
>>> listed
>>> >> in SQL
>>> >> > >> client, because
>>> >> > >> these jobs can be managed via SQL client too.
>>> >> > >>
>>> >> > >> WRT finished time, I think you’re right. Adding it to the FLIP.
>>> But
>>> >> I’m a
>>> >> > >> bit afraid that the
>>> >> > >> rows would be too long.
>>> >> > >>
>>> >> > >> WRT ‘DROP QUERY’,
>>> >> > >>> What's the behavior for batch jobs and the non-running jobs?
>>> >> > >>
>>> >> > >>
>>> >> > >> In general, the behavior would be aligned with Flink CLI.
>>> Triggering
>>> >> a
>>> >> > >> savepoint for
>>> >> > >> a non-running job would cause errors, and the error message
>>> would be
>>> >> > >> printed to
>>> >> > >> the SQL client. Triggering a savepoint for batch(unbounded) jobs
>>> in
>>> >> > >> streaming
>>> >> > >> execution mode would be the same with streaming jobs. However,
>>> for
>>> >> batch
>>> >> > >> jobs in
>>> >> > >> batch execution mode, I think there would be an error, because
>>> batch
>>> >> > >> execution
>>> >> > >> doesn’t support checkpoints currently (please correct me if I’m
>>> >> wrong).
>>> >> > >>
>>> >> > >> WRT ’SHOW SAVEPOINTS’, I’ve thought about it, but Flink
>>> >> clusterClient/
>>> >> > >> jobClient doesn’t have such a functionality at the moment,
>>> neither do
>>> >> > >> Flink CLI.
>>> >> > >> Maybe we could make it a follow-up FLIP, which includes the
>>> >> modifications
>>> >> > >> to
>>> >> > >> clusterClient/jobClient and Flink CLI. WDYT?
>>> >> > >>
>>> >> > >> Best,
>>> >> > >> Paul Lam
>>> >> > >>
>>> >> > >>> 2022年5月17日 20:34，godfrey he <[email protected]> 写道：
>>> >> > >>>
>>> >> > >>> Godfrey
>>> >> > >>
>>> >> > >>
>>> >> >
>>> >>
>>> >
>>>
>>
>>

Re: [DISCUSS] FLIP-222: Support full query lifecycle statements in SQL client

Reply via email to