Re: [DISCUSS] FLIP-463: Schema Definition in CREATE TABLE AS Statement

Jeyhun Karimov Thu, 13 Jun 2024 14:02:38 -0700

Thanks Sergio and Timo for your answers.
Sounds good to me.
Looking forward for this feature.


Regards,
Jeyhun

On Thu, Jun 13, 2024 at 4:48 PM Sergio Pena <[email protected]>
wrote:

> Sure Yuxia, I just added the support for RTAS statements too.
>
> - Sergio
>
> On Wed, Jun 12, 2024 at 8:22 PM yuxia <[email protected]> wrote:
>
> > Hi, Sergio.
> > Thanks for driving the FLIP. Given we also has REPLACE TABLE AS
> > Statement[1] and it's almost same with CREATE TABLE AS Statement,
> > would you mind also supporting schema definition for REPLACE TABLE AS
> > Statement in this FLIP? It'll be a great to align REPLACE TABLE AS
> Statement
> > to CREATE TABLE AS Statement
> >
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-303%3A+Support+REPLACE+TABLE+AS+SELECT+statement
> >
> > Best regards,
> > Yuxia
> >
> > ----- 原始邮件 -----
> > 发件人: "Timo Walther" <[email protected]>
> > 收件人: "dev" <[email protected]>
> > 发送时间: 星期三, 2024年 6 月 12日 下午 10:19:14
> > 主题: Re: [DISCUSS] FLIP-463: Schema Definition in CREATE TABLE AS
> Statement
> >
> > > I just noticed the CREATE TABLE LIKE statement allows the definition
> >  > of new columns in the CREATE part. The difference
> >  > with this CTAS proposal is that TABLE LIKE appends the new columns at
> >  > the end of the schema instead of adding them
> >  > at the beginning like this proposal and Mysql do.
> >
> > This should be fine. The LIKE rather "extends from" the right table.
> > Whereas the SELECT in CTAS rather extends the left schema definition.
> > Given that "the extended part" is always appended, we could argue that
> > the current CTAS behavior in the FLIP is acceptable.
> >
> >  > If you want to rename a column in the create part, then that column
> >  > position must be in the same position as the query column.
> >  > I didn't like the Postgres approach because it does not let us add
> >  > columns that do not exist in the query schema.
> >
> > Flink offers similar functionality in INSERT INTO. INSERT INTO also
> > allows syntax like: `INSERT INTO (b, c) SELECT * FROM t`. So given that
> > our CTAS can be seen as a CREATE TABLE + INSERT INTO. I would adopt
> > Jeyhun comment in the FLIP. If users don't want to add additional schema
> > parts, the same column reordering should be available as if one would
> > write a INSERT INTO.
> >
> > Regards,
> > Timo
> >
> >
> >
> >
> > On 12.06.24 04:30, Yanquan Lv wrote:
> > > Hi Sergio, thanks for driving it, +1 for this.
> > >
> > > I have some comments:
> > > 1. If we have a source table with primary keys and partition keys
> > defined,
> > > what is the default behavior if PARTITIONED and DISTRIBUTED not
> specified
> > > in the CTAS statement, It should not be inherited by default？
> > > 2. I suggest providing a complete syntax that includes table_properties
> > > like FLIP-218.
> > >
> > >
> > > Sergio Pena <[email protected]> 于2024年6月12日周三 03:54写道：
> > >
> > >> I just noticed the CREATE TABLE LIKE statement allows the definition
> of
> > new
> > >> columns in the CREATE part. The difference
> > >> with this CTAS proposal is that TABLE LIKE appends the new columns at
> > the
> > >> end of the schema instead of adding them
> > >> at the beginning like this proposal and Mysql do.
> > >>
> > >>> create table t1(id int, name string);
> > >>>> create table s1(a int, b string) like t1;
> > >>>> describe s1;
> > >>
> > >> +-------------+-----------+----------+--------+
> > >>> | Column Name | Data Type | Nullable | Extras |
> > >>> +-------------+-----------+----------+--------+
> > >>> | id          | INT       | NULL     |        |
> > >>> | name        | STRING    | NULL     |        |
> > >>> | a           | INT       | NULL     |        |
> > >>> | b           | STRING    | NULL     |        |
> > >>> +-------------+-----------+----------+--------+
> > >>
> > >>
> > >>
> > >> The CREATE TABLE LIKE also does not let the definition of existing
> > columns
> > >> in the CREATE part. The statement fails
> > >> that the column already exists.
> > >>
> > >>> create table t1(id int, name string);
> > >>
> > >>> create table s1(id double) like t1;
> > >>> A column named 'id' already exists in the base table.
> > >>>
> > >>
> > >> What do you guys think of making it similar to the CREATE TABLE LIKE?
> > Seems
> > >> the best approach in order to
> > >> be compatible with it.
> > >>
> > >> - Sergio
> > >>
> > >> On Tue, Jun 11, 2024 at 2:10 PM Sergio Pena <[email protected]>
> > wrote:
> > >>
> > >>> Thanks Timo for answering Jeyhun questions.
> > >>>
> > >>> To add info more about your questions Jeyhun. This proposal is not
> > >>> handling NULL/NOT_NULL types. I noticed that
> > >>> the current CTAS impl. (as Timo said) adds this constraint as part of
> > the
> > >>> resulting schema. And when defining
> > >>> a primary key in the CREATE part, if the resulting schema does not
> > have a
> > >>> NOT NULL in the column then the CTAS
> > >>> will fail. This is similar to the CREATE TABLE LIKE which expects the
> > >> LIKE
> > >>> table to have a NOT NULL column if
> > >>> the user defines a primary key in the CREATE part.
> > >>>
> > >>>> In some cases, redefining the column types might be redundant,
> > >> especially
> > >>>> when users dont change the column type. A user just wants to change
> > the
> > >>>> column name from the SELECT clause. Should we also support this
> > >> scenario,
> > >>>> similar to postgres?
> > >>>
> > >>> I looked into Postgres too. Postgres matches the columns based on the
> > >>> order defined in the create and select part.
> > >>> If you want to rename a column in the create part, then that column
> > >>> position must be in the same position as the query column.
> > >>> I didn't like the Postgres approach because it does not let us add
> > >> columns
> > >>> that do not exist in the query schema.
> > >>>
> > >>> i.e. query has schema (a int, b string), now the `a` column is
> renamed
> > to
> > >>> `id` because both are in the same position 0
> > >>> `create table s1(id int) as select a, b from t1`;
> > >>> results in: [id int, b string]
> > >>>
> > >>> I think, if users want to rename then they can use a different alias
> in
> > >>> the select part. They could also do explicit casting
> > >>> for changing the data types, which now makes it redundant (as you
> said)
> > >> to
> > >>> allow redefining the query columns again. But
> > >>> perhaps there are cases where explicit casting does not work and just
> > >>> defining the column would? i.e. making a nullable
> > >>> type to not null? I couldn't make `cast(c1 as int not null)` to work
> > for
> > >>> instance, but it may work in the create part?
> > >>>
> > >>>> Could you also mention the casting rules in the FLIP for this case?
> > >>>
> > >>> I mentioned they're the same as insert/select when doing implicit
> > >> casting.
> > >>> I will search for more info about the insert/select
> > >>> and add the casting rules in the flip..
> > >>>
> > >>> - Sergio
> > >>>
> > >>>
> > >>> On Tue, Jun 11, 2024 at 12:59 AM Timo Walther <[email protected]>
> > >> wrote:
> > >>>
> > >>>> Hi Sergio,
> > >>>>
> > >>>> thanks for proposing this FLIP for finalizing the CTAS statement.
> > >>>> Adopting the logic from MySQL for deriving and potentially
> overwriting
> > >>>> parts of the schema should be easy to understand for everyone. So +1
> > for
> > >>>> the FLIP in general.
> > >>>>
> > >>>>   > How do you handle CTAS statements with SELECT clauses that have
> > >>>>   > (implicit or explicit) NULLABLE or NOT NULLABLE columns?
> > >>>>
> > >>>> @Jeyhun: I don't think there is anything special about this. The
> > current
> > >>>> CTAS implementation should already cover that. It takes the
> > nullability
> > >>>> of the column's data type as a constraint into derived schema. Keep
> in
> > >>>> mind that nullability is part of the data type in Flink, not only a
> > >>>> constraint on the schema. This decision was made due to Calcite
> > >> internals.
> > >>>>
> > >>>>   > redefining the column types might be redundant, especially
> > >>>>   > when users dont change the column type
> > >>>>
> > >>>> This is indeed a good point. On one hand, I think we should avoid
> > >>>> further complicating the syntax. But looking at other vendors [1]
> this
> > >>>> seems indeed a valid use case. If it doesn't cause too many special
> > >>>> cases in the parser (and it's look-ahead), I'm fine with supporting
> a
> > >>>> list of column names as well. However, the most important use case
> > will
> > >>>> be specifying a watermark, metadata columns, or other schema parts
> > that
> > >>>> are not just columns names.
> > >>>>
> > >>>> Regards,
> > >>>> Timo
> > >>>>
> > >>>>
> > >>>> [1]
> > >>>>
> > >>>>
> > >>
> >
> https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-as-select-azure-sql-data-warehouse?view=azure-sqldw-latest
> > >>>>
> > >>>>
> > >>>> On 10.06.24 21:37, Jeyhun Karimov wrote:
> > >>>>> Hi Sergio,
> > >>>>>
> > >>>>> Thanks for driving this FLIP. +1 for it.
> > >>>>> I have a few questions:
> > >>>>>
> > >>>>> - How do you handle CTAS statements with SELECT clauses that have
> > >>>> (implicit
> > >>>>> or explicit) NULLABLE or NOT NULLABLE columns? Could you also
> mention
> > >>>> the
> > >>>>> casting rules in the FLIP for this case?
> > >>>>> - In some cases, redefining the column types might be redundant,
> > >>>> especially
> > >>>>> when users dont change the column type. For example, a user just
> > wants
> > >>>> to
> > >>>>> change the column name from the SELECT clause.
> > >>>>> Should we also support this scenario, similar to the postgres [1] ?
> > >>>>>
> > >>>>> Regards,
> > >>>>> Jeyhun
> > >>>>>
> > >>>>>
> > >>>>> [1] https://www.postgresql.org/docs/8.1/sql-createtableas.html
> > >>>>>
> > >>>>> On Mon, Jun 10, 2024 at 6:28 PM Sergio Pena
> > >> <[email protected]
> > >>>>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi David,
> > >>>>>>
> > >>>>>> The CTAS feature is already part of Flink (proposed in FLIP-218
> > [1]).
> > >>>> The
> > >>>>>> new FLIP-463 is just to extend the CTAS syntax to allow for adding
> > >> new
> > >>>>>> columns to
> > >>>>>> the created table that are not part of the generated schema. I
> think
> > >>>>>> FLIP-218 [1] was discussed in the mail list somewhere, but I
> > couldn't
> > >>>> find
> > >>>>>> the discussion thread.
> > >>>>>> I was hoping it could contain the answers for your questions as
> > >> that's
> > >>>>>> where CTAS was implemented. There's a user doc [2] for it that may
> > >> help
> > >>>>>> too.
> > >>>>>>
> > >>>>>> But, in a nutshell, CTAS is similar to running CREATE and then
> > >>>>>> INSERT/SELECT statements. The CTAS will execute a background job
> > that
> > >>>>>> executes
> > >>>>>> the AS SELECT query and then inserts the data into the newly
> created
> > >>>> table.
> > >>>>>> So the table will be kept up to date as the source data changes as
> > >> you
> > >>>>>> said.
> > >>>>>>
> > >>>>>> Generated columns are from the AS SELECT schema, right? Yes,
> that's
> > >>>> already
> > >>>>>> allowed. FLIP-463 will also allow modifying the schema of the sink
> > >>>> table
> > >>>>>> in case you need to add more columns.
> > >>>>>>
> > >>>>>> I notice amazon [3] talks of the difference between view and CTAS;
> > >> that
> > >>>>>>> CTAS persists the content. Is this the approach we are taking? If
> > so
> > >>>>>> where
> > >>>>>>> are we persisting?
> > >>>>>>
> > >>>>>> This is already supported by FLIP-218 [1]. But yes, the CTAS
> > persists
> > >>>> the
> > >>>>>> content in the new table you specify in the CTAS statement.
> > >>>>>>
> > >>>>>>
> > >>>>>>> is the table read-only, or can we insert/ delete / update into
> it.
> > >> If
> > >>>> it
> > >>>>>>> is read only how will the inserts , deletes updates fail.
> > >>>>>>
> > >>>>>> Are there any restrictions on the select ? Can the select be a
> join,
> > >>>>>>> aggregate, windowed?
> > >>>>>>
> > >>>>>> I notice Azure [2] supports ISNULL. Is there a thought to change
> the
> > >>>>>>> nullability for the CTAS?
> > >>>>>>
> > >>>>>> Amazon [4] ignores ordered by clauses ? is that the same for this
> > >>>> proposal?
> > >>>>>>
> > >>>>>> I suppose all of this is already supported by FLIP-218. I don't
> see
> > >> any
> > >>>>>> restrictions on the query. You can learn more about it in the
> > >> FLIP-218
> > >>>> [1]
> > >>>>>>
> > >>>>>> - Sergio
> > >>>>>>
> > >>>>>> [1]
> > >>>>>>
> > >>>>
> > >>
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=199541185
> > >>>>>> [2]
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#as-select_statement
> > >>>>>>
> > >>>>>> On Mon, Jun 10, 2024 at 3:10 AM David Radley <
> > >> [email protected]>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Hi Sergio,
> > >>>>>>> Sounds good . I am relatively new to this area and had some basic
> > >>>>>>> questions:
> > >>>>>>>
> > >>>>>>> I notice in [1] it talks of materialized views. And CREATE view
> can
> > >>>>>>> already take the AS keyword. It would be useful to me to
> understand
> > >>>> when
> > >>>>>> we
> > >>>>>>> would use each of these.
> > >>>>>>>
> > >>>>>>> - I assume the table will be kept up to date as the source data
> > >>>> changes
> > >>>>>>> like a view.
> > >>>>>>> - Are there any restrictions on the select ? Can the select be a
> > >> join,
> > >>>>>>> aggregate, windowed?
> > >>>>>>> - I assume generated columns are allowed?
> > >>>>>>> - is the table read-only, or can we insert/ delete / update into
> > it.
> > >>>> If
> > >>>>>> it
> > >>>>>>> is read only how will the inserts , deletes updates fail.
> > >>>>>>> - I notice Azure [2] supports ISNULL. Is there a thought to
> change
> > >> the
> > >>>>>>> nullability for the CTAS?
> > >>>>>>> - I notice amazon [3] talks of the difference between view and
> > CTAS;
> > >>>> that
> > >>>>>>> CTAS persists the content. Is this the approach we are taking? If
> > so
> > >>>>>> where
> > >>>>>>> are we persisting?
> > >>>>>>> - Amazon [4] ignores ordered by clauses ? is that the same for
> this
> > >>>>>>> proposal?
> > >>>>>>>
> > >>>>>>> [1]
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/concepts/dynamic_tables/
> > >>>>>>> [2]
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>
> >
> https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-develop-ctas#selectinto-vs-ctas
> > >>>>>>> [3]
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>
> >
> https://docs.aws.amazon.com/athena/latest/ug/ctas-considerations-limitations.html#ctas-considerations-limitations-queries-vs-views
> > >>>>>>> [4]
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>
> >
> https://docs.aws.amazon.com/athena/latest/ug/ctas-considerations-limitations.html#ctas-considerations-limitations-order-by-ignored
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Kind regards, David.
> > >>>>>>>
> > >>>>>>> From: Sergio Pena <[email protected]>
> > >>>>>>> Date: Friday, 7 June 2024 at 16:13
> > >>>>>>> To: [email protected] <[email protected]>
> > >>>>>>> Subject: [EXTERNAL] [DISCUSS] FLIP-463: Schema Definition in
> CREATE
> > >>>> TABLE
> > >>>>>>> AS Statement
> > >>>>>>> HI All,
> > >>>>>>>
> > >>>>>>> I'd like to start a discussion on FLIP-463: Schema Definition in
> > >>>> CREATE
> > >>>>>>> TABLE AS Statement [1]
> > >>>>>>>
> > >>>>>>> The proposal extends the CTAS statement to allow users to define
> > >> their
> > >>>>>> own
> > >>>>>>> schema by adding columns, primary and partition keys, and table
> > >>>>>>> distribution to the CREATE statement.
> > >>>>>>>
> > >>>>>>> Any thoughts are welcome.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> - Sergio Pena
> > >>>>>>>
> > >>>>>>> [1]
> > >>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-463%3A+Schema+Definition+in+CREATE+TABLE+AS+Statement
> > >>>>>>>
> > >>>>>>> Unless otherwise stated above:
> > >>>>>>>
> > >>>>>>> IBM United Kingdom Limited
> > >>>>>>> Registered in England and Wales with number 741598
> > >>>>>>> Registered office: PO Box 41, North Harbour, Portsmouth, Hants.
> PO6
> > >>>> 3AU
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>
> > >
> >
>

Re: [DISCUSS] FLIP-463: Schema Definition in CREATE TABLE AS Statement

Reply via email to