Re: Re: [DISCUSS] FLIP-218: Support SELECT clause in CREATE TABLE(CTAS)

godfrey he Mon, 23 May 2022 06:40:12 -0700

Hi Jark,

> "Table#createTableAs(tablePath)" seems a
> little strange to me.


`Table#createTableAs` is a bit misleading; I lean toward `Table#saveAs(tablePath)`.
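
As a rough illustration of the two call shapes under discussion, here is a small sketch. The classes below are hypothetical stand-ins, not Flink's real `Table` or `TableEnvironment`, and the `TableDescriptor` argument from Jark's proposal is left out for brevity. It only shows where the query sits relative to the target path in each variant.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins; none of these types are Flink's real classes. */
public class CtasApiShapes {
    /** Records the statement each call shape corresponds to. */
    static final List<String> LOG = new ArrayList<>();

    /** Stand-in for a query result (the role Flink's Table plays). */
    static final class Table {
        final String query;

        Table(String query) {
            this.query = query;
        }

        // Shape A: method on Table, i.e. "Table#saveAs(tablePath)".
        // The query is the receiver, so it appears before the path.
        void saveAs(String tablePath) {
            LOG.add("CREATE TABLE " + tablePath + " AS " + query);
        }
    }

    /** Stand-in for the table environment. */
    static final class TableEnvironment {
        // Shape B: method on the environment; the query is the last
        // argument, so it follows the path as it does after AS in SQL.
        void createTableAs(String tablePath, Table query) {
            LOG.add("CREATE TABLE " + tablePath + " AS " + query.query);
        }
    }

    public static void main(String[] args) {
        Table query = new Table("SELECT id, name FROM src");
        query.saveAs("T1");
        new TableEnvironment().createTableAs("T1", query);
        LOG.forEach(System.out::println);
    }
}
```

Both calls describe the same statement; the difference is only whether the query is the receiver or an argument.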

Best,
Godfrey

Jark Wu <[email protected]> wrote on Wed, 18 May 2022 at 23:09:
>
> Hi Godfrey,
>
> Regarding Table API for CTAS, "Table#createTableAs(tablePath)" seems a
> little strange to me.
> Usually, the parameter after AS should be the query, but the query is in
> front of AS.
> I slightly prefer a method on TableEnvironment besides "createTable" (i.e.
> a special createTable with writing data).
>
> For example:
> void createTableAs(String path, TableDescriptor descriptor, Table query);
>
> Usage:
> tableEnv.createTableAs(
>                 "T1",
>                 TableDescriptor.forConnector("hive")
>                         .option("format", "parquet")
>                         .build(),
>                 query);
>
>
> Best,
> Jark
>
> On Wed, 18 May 2022 at 22:53, Jark Wu <[email protected]> wrote:
>
> > Hi Mang,
> >
> > Thanks for proposing this, CTAS is a very important API for batch users.
> >
> > I think the key problem of this FLIP is the ACID semantics of the CTAS
> > operation.
> > We care most about two parts of the semantics:
> > 1) Atomicity: the created table should be rolled back if the write
> > fails.
> > 2) Isolation: the created table shouldn't be visible before the write
> > succeeds (i.e. no uncommitted reads).
> >
> > From your investigation, it seems that:
> > - Flink (your FLIP): guarantees neither of them.   ==> LEVEL-1
> > - Spark DataSource v1: is atomic (can roll back), but is not isolated. ==>
> > LEVEL-2
> > - Spark DataSource v2: guarantees both of them.  ==> LEVEL-3
> > - Hive MR: guarantees both of them. ==> LEVEL-3
> >
> > In order to support higher ACID semantics, I agree with Godfrey that we
> > need some hooks in JM
> > which can be called when the job is finished or failed/canceled. It might
> > look like
> > `StreamExecutionEnvironment#registerJobListener(JobListener)`,
> > but JobListener is called on the
> > client side. What we need is an interface called on the JM side, because
> > the job can be submitted in
> > detached mode.
> >
> > With this interface, we can easily support LEVEL-2 semantics by calling
> > `Catalog#dropTable` in the
> > `JobListener#onJobFailed`. We can also support LEVEL-3 by introducing
> > `StagingTableCatalog` like Spark,
> > calling `StagedTable#commitStagedChanges()` in `JobListener#onJobFinished`
> > and
> > calling `StagedTable#abortStagedChanges()` in `JobListener#onJobFailed`.
> >
> > Best,
> > Jark
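
To make the LEVEL-2 and LEVEL-3 behaviour described above more concrete, here is a minimal Java sketch. Every name in it (`JobStatusHook`, `InMemoryCatalog`, `level2`, `level3`) is a hypothetical stand-in and not a Flink or Spark API; it only models when the target table becomes visible and what each hook does when the job ends.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; these types are illustrative, not Flink APIs. */
public class CtasJobHooks {
    /** Assumed hook that would be invoked on the JM side when the job ends. */
    interface JobStatusHook {
        void onJobFinished();

        void onJobFailed();
    }

    /** Minimal in-memory catalog stand-in with visible and staged tables. */
    static final class InMemoryCatalog {
        final Set<String> visible = new HashSet<>();
        final Set<String> staged = new HashSet<>();

        void createTable(String path) { visible.add(path); }

        void dropTable(String path) { visible.remove(path); }

        void stageTable(String path) { staged.add(path); }

        // Publishes a staged table; does nothing if it was never staged.
        void commitStaged(String path) {
            if (staged.remove(path)) {
                visible.add(path);
            }
        }

        void abortStaged(String path) { staged.remove(path); }
    }

    /** LEVEL-2: the table is created (and visible) up front, dropped on failure. */
    static JobStatusHook level2(InMemoryCatalog catalog, String path) {
        catalog.createTable(path);
        return new JobStatusHook() {
            @Override public void onJobFinished() { /* table already exists */ }

            @Override public void onJobFailed() { catalog.dropTable(path); }
        };
    }

    /** LEVEL-3: the table is only staged, and published on success. */
    static JobStatusHook level3(InMemoryCatalog catalog, String path) {
        catalog.stageTable(path);
        return new JobStatusHook() {
            @Override public void onJobFinished() { catalog.commitStaged(path); }

            @Override public void onJobFailed() { catalog.abortStaged(path); }
        };
    }

    public static void main(String[] args) {
        InMemoryCatalog catalog = new InMemoryCatalog();

        JobStatusHook failing = level2(catalog, "T1");
        System.out.println("T1 visible during job: " + catalog.visible.contains("T1"));
        failing.onJobFailed();
        System.out.println("T1 visible after failure: " + catalog.visible.contains("T1"));

        JobStatusHook succeeding = level3(catalog, "T2");
        System.out.println("T2 visible during job: " + catalog.visible.contains("T2"));
        succeeding.onJobFinished();
        System.out.println("T2 visible after success: " + catalog.visible.contains("T2"));
    }
}
```

In this model the LEVEL-2 table can be read while the job is still writing, which is the isolation gap, while the LEVEL-3 table only appears once the commit hook runs.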
> >
> >
> > On Wed, 18 May 2022 at 12:29, godfrey he <[email protected]> wrote:
> >
> >> Hi Mang,
> >>
> >> Thanks for driving this FLIP.
> >>
> >> Please follow the FLIP template[1] style; the `Syntax` part belongs in
> >> the `Public API Changes` section.
> >> `Program research` and `Implementation Plan` belong in the `Proposed
> >> Changes` section,
> >> or `Program research` can be moved to the appendix.
> >>
> >> > Providing methods that are used to execute CTAS for Table API users.
> >> We should introduce `createTable` in `Table` instead of
> >> `TableEnvironment`,
> >> because all table operations are defined in `Table`; see
> >> Table#executeInsert,
> >> Table#insertInto, etc.
> >> About the method name, I prefer to use `createTableAs`.
> >>
> >> > TableSink needs to provide the CleanUp API, developers implement as needed.
> >> I think it's hard for TableSink to implement a clean up operation. For
> >> file system sink,
> >> the data can be written to a temporary directory, but for key/value
> >> sinks, it's hard to
> >> remove the written keys, unless the sink records all written keys.
> >>
> >> > Do not do drop table operations in the framework, drop table is implemented in TableSink according to the needs of specific TableSink
> >> The TM process may crash at any time, in which case the drop operation
> >> would never be executed.
> >>
> >> How about we do the drop table operation and the data cleanup in the
> >> catalog?
> >> As for where to execute the drop operation, one approach is in the
> >> client, the other is in the JM.
> >> 1. In the client: this requires the client to stay alive until the job
> >> has finished or failed.
> >> 2. In the JM: this requires the JM to provide some interfaces/hooks, so
> >> that the planner
> >> implements the logic and the code is executed in the JM.
> >> I prefer approach 2, but it requires a more detailed design with the
> >> runtime folks @gaoyunhaii, @kevin.yingjie
> >>
> >>
> >> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP+Template
> >>
> >> Best,
> >> Godfrey
> >>
> >>
> >> Mang Zhang <[email protected]> wrote on Fri, 6 May 2022 at 11:24:
> >>
> >> >
> >> > Hi, Yuxia
> >> > Thanks for your reply!
> >> > Regarding question 1: we will not support this. FLIP-218[1] aims to
> >> > reduce the complexity of user DDL and make it easier for users to use.
> >> > I have never encountered this case in big data scenarios.
> >> > Regarding question 2: we will provide a public API like the following:
> >> > public void cleanUp();
> >> >
> >> > As for the mechanism of cleanUp, we need professional advice from
> >> > people who are familiar with the runtime module; this is what we need
> >> > to focus on.
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> >
> >> > Best regards,
> >> > Mang Zhang
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > At 2022-04-29 17:00:03, "yuxia" <[email protected]> wrote:
> >> > >Thanks for driving this work; it will be a useful feature.
> >> > >About FLIP-218, I have some questions.
> >> > >
> >> > >1: Does our CTAS syntax support specifying the target table's schema,
> >> > >including column names and data types? I think it may be a useful
> >> > >feature in case we want to change the data types in the target table
> >> > >instead of always copying the source table's schema. It would be more
> >> > >flexible with this feature.
> >> > >Btw, MySQL's "CREATE TABLE ... SELECT Statement"[1] supports this
> >> > >feature.
> >> > >
> >> > >2: It seems this will require the sink to implement a public interface
> >> > >to drop the table, so what will the interface look like?
> >> > >
> >> > >[1] https://dev.mysql.com/doc/refman/8.0/en/create-table-select.html
> >> > >
> >> > >Best regards,
> >> > >Yuxia
> >> > >
> >> > >----- Original Message -----
> >> > >From: "Mang Zhang" <[email protected]>
> >> > >To: "dev" <[email protected]>
> >> > >Sent: Thursday, 28 April 2022, 4:57:24 PM
> >> > >Subject: [DISCUSS] FLIP-218: Support SELECT clause in CREATE TABLE(CTAS)
> >> > >
> >> > >Hi, everyone
> >> > >
> >> > >
> >> > >I would like to open a discussion on supporting the SELECT clause in
> >> > >CREATE TABLE (CTAS).
> >> > >With the development of business and the enhancement of Flink SQL
> >> > >capabilities, queries become more and more complex.
> >> > >Currently the user needs to use the CREATE TABLE statement to create
> >> > >the target table first, and then execute the INSERT statement.
> >> > >However, the target table may have many columns, which brings the user
> >> > >a lot of work outside the business logic.
> >> > >At the same time, the user must ensure that the schema of the created
> >> > >target table is consistent with the schema of the query result.
> >> > >Using a CTAS syntax like Hive/Spark can greatly facilitate the user.
> >> > >
> >> > >
> >> > >
> >> > >You can find more details in FLIP-218[1]. Looking forward to your
> >> > >feedback.
> >> > >
> >> > >
> >> > >
> >> > >[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-218%3A+Support+SELECT+clause+in+CREATE+TABLE(CTAS)
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >--
> >> > >
> >> > >Best regards,
> >> > >Mang Zhang
> >> >
> >> >
> >>
> >
