Re: [DISCUSS] FLIP-463: Schema Definition in CREATE TABLE AS Statement

Timo Walther Wed, 19 Jun 2024 05:17:11 -0700

Hi Sergio,

thanks for updating the FLIP and ingesting the feedback from the MLdiscussion.


I would be fine to start voting on this, if there are no objections?

Regards,
Timo

On 18.06.24 19:53, Sergio Pena wrote:

Hi Jeyhun, Timo,

I updated the FLIP to include the option to reorder the columns defined in
the SELECT part.
As Timo mentioned, this is equivalent to the Flink INSERT INTO statement
and also to
Azure CTAS statement.

For example:

CREATE TABLE t1(a INT, b INT, c INT);
CREATE TABLE t1(c, b, a) AS SELECT * FROM t1;
DESCRIBE s1;

+-------------+-----------+----------+--------+
| Column Name | Data Type | Nullable | Extras |
+-------------+-----------+----------+--------+
| c           | INT       | NULL     |        |
| b           | INT       | NULL     |        |
| a           | INT       | NULL     |        |
+-------------+-----------+----------+--------+

Let me know your thoughts.

- Sergio

On Tue, Jun 18, 2024 at 7:34 AM Sergio Pena <ser...@confluent.io> wrote:

Hi Ron,

I think the primary key and unique values are handled by the engine. I
found this document [1]
that explains how Flink handles those records. When using CTAS, you can
specify in the
table options how you want the records to be merged if you define a
primary key.

Regarding CTAS, this is just a wrapper that's similar to executing two
statements manually; a
create table followed by an insert select statements. If a user has a
primary key in the target
table and then execute the insert select, then the Flink engine will
handle unique records
depending on how the target table was configured.

Hope this link [1] helps understand primary keys? It really helped me, I
wasn't sure how
it worked either :).

- Sergio

[1]
https://nightlies.apache.org/flink/flink-table-store-docs-master/docs/concepts/primary-key-table/


On Mon, Jun 17, 2024 at 8:43 PM Ron Liu <ron9....@gmail.com> wrote:

Hi, Sergio

Thanks for your reply.

PKEYS are not inherited from the select query. Also UNIQUE KEY is not

supported on the create statements yet.
All columns queried in the select part form a query schema with only the
column name, type and null/not null properties.
Based on this, I don't see there will be an issue with uniqueness. Users
may define a pkey on the not null columns
only.

Sorry if I wasn't clear about this issue. My question is, if a primary key
is defined on column a in the target table of a CTAS, and the data fetched
by the Select query can't guarantee uniqueness based on column a, then
what
is relied upon to ensure uniqueness?
Is it something that the engine does, or is it dependent on the storage
mechanism of the target table?

Best,
Ron


Sergio Pena <ser...@confluent.io.invalid> 于2024年6月18日周二 01:24写道：

Hi Ron,
Thanks for your feedback.

Is it possible for you to list these systems and make it part of the

Reference chapter? Which would make it easier for everyone to

understand.


I updated the flip to include the ones I found (mysql, postgres,

oracle).

Though postgres/oracle semantics are different, they at least allow you
some schema changes in the create part.
I think spark sql supports it too, but I couldn't find a way to test

it, so

I didn't include it in the list.

One is whether Nullable columns are

allowed to be referenced in the primary key definition, because all the

columns deduced based on Select Query may be Nullable;


I updated the flip. But in short, CTAS will use the same rules and
validations as other CREATE statements.
Pkeys are not allowed on NULL columns, so CTAS would fail too. The CTAS
inherits the NULL and NOT NULL
constraints from the query schema, so PKEY can be used on the NOT NULL

cols

only.

if the UNIQUE KEY cannot be deduced based on Select Query, or the

deduced UNIQUE KEY does not match the defined primary key, which will

lead

to data duplication, the engine needs to what to ensure the

uniqueness of

the data?


PKEYS are not inherited from the select query. Also UNIQUE KEY is not
supported on the create statements yet.
All columns queried in the select part form a query schema with only the
column name, type and null/not null properties.
Based on this, I don't see there will be an issue with uniqueness. Users
may define a pkey on the not null columns
only.

- Sergio


On Sun, Jun 16, 2024 at 9:55 PM Ron Liu <ron9....@gmail.com> wrote:

Hi, Sergio

Sorry for later joining this thread.

Thanks for driving this proposal, it looks great.

I have a few questions:

1. Many SQL-based data processing systems, however, support schema
definitions within their CTAS statements

Is it possible for you to list these systems and make it part of the
Reference chapter? Which would make it easier for everyone to

understand.



2. This proposal proposes to support defining primary keys in CTAS,

then

there are two possible issues here. One is whether Nullable columns

are

allowed to be referenced in the primary key definition, because all

the

columns deduced based on Select Query may be Nullable; and the second

is

that if the UNIQUE KEY cannot be deduced based on Select Query, or the
deduced UNIQUE KEY does not match the defined primary key, which will

lead

to data duplication, the engine needs to what to ensure the

uniqueness of

the data?


Best
Ron


Jeyhun Karimov <je.kari...@gmail.com> 于2024年6月14日周五 01:51写道：

Thanks Sergio and Timo for your answers.
Sounds good to me.
Looking forward for this feature.

Regards,
Jeyhun

On Thu, Jun 13, 2024 at 4:48 PM Sergio Pena

<ser...@confluent.io.invalid


wrote:

Sure Yuxia, I just added the support for RTAS statements too.

- Sergio

On Wed, Jun 12, 2024 at 8:22 PM yuxia <

luoyu...@alumni.sjtu.edu.cn>

wrote:

Hi, Sergio.
Thanks for driving the FLIP. Given we also has REPLACE TABLE AS
Statement[1] and it's almost same with CREATE TABLE AS

Statement,

would you mind also supporting schema definition for REPLACE

TABLE

AS

Statement in this FLIP? It'll be a great to align REPLACE TABLE

AS

Statement

to CREATE TABLE AS Statement


[1]

https://cwiki.apache.org/confluence/display/FLINK/FLIP-303%3A+Support+REPLACE+TABLE+AS+SELECT+statement


Best regards,
Yuxia

----- 原始邮件 -----
发件人: "Timo Walther" <twal...@apache.org>
收件人: "dev" <dev@flink.apache.org>
发送时间: 星期三, 2024年 6 月 12日 下午 10:19:14
主题: Re: [DISCUSS] FLIP-463: Schema Definition in CREATE TABLE AS

Statement

I just noticed the CREATE TABLE LIKE statement allows the

definition

  > of new columns in the CREATE part. The difference
  > with this CTAS proposal is that TABLE LIKE appends the new

columns

at

  > the end of the schema instead of adding them
  > at the beginning like this proposal and Mysql do.

This should be fine. The LIKE rather "extends from" the right

table.

Whereas the SELECT in CTAS rather extends the left schema

definition.

Given that "the extended part" is always appended, we could

argue

that

the current CTAS behavior in the FLIP is acceptable.

  > If you want to rename a column in the create part, then that

column

  > position must be in the same position as the query column.
  > I didn't like the Postgres approach because it does not let

us

add

  > columns that do not exist in the query schema.

Flink offers similar functionality in INSERT INTO. INSERT INTO

also

allows syntax like: `INSERT INTO (b, c) SELECT * FROM t`. So

given

that

our CTAS can be seen as a CREATE TABLE + INSERT INTO. I would

adopt

Jeyhun comment in the FLIP. If users don't want to add

additional

schema

parts, the same column reordering should be available as if one

would

write a INSERT INTO.

Regards,
Timo




On 12.06.24 04:30, Yanquan Lv wrote:

Hi Sergio, thanks for driving it, +1 for this.

I have some comments:
1. If we have a source table with primary keys and partition

keys

defined,

what is the default behavior if PARTITIONED and DISTRIBUTED

not

specified

in the CTAS statement, It should not be inherited by default？
2. I suggest providing a complete syntax that includes

table_properties

like FLIP-218.


Sergio Pena <ser...@confluent.io.invalid> 于2024年6月12日周三

03:54写道：

I just noticed the CREATE TABLE LIKE statement allows the

definition

of

new

columns in the CREATE part. The difference
with this CTAS proposal is that TABLE LIKE appends the new

columns

at

the

end of the schema instead of adding them
at the beginning like this proposal and Mysql do.

create table t1(id int, name string);

create table s1(a int, b string) like t1;
describe s1;


+-------------+-----------+----------+--------+

| Column Name | Data Type | Nullable | Extras |
+-------------+-----------+----------+--------+
| id          | INT       | NULL     |        |
| name        | STRING    | NULL     |        |
| a           | INT       | NULL     |        |
| b           | STRING    | NULL     |        |
+-------------+-----------+----------+--------+




The CREATE TABLE LIKE also does not let the definition of

existing

columns

in the CREATE part. The statement fails
that the column already exists.

create table t1(id int, name string);

create table s1(id double) like t1;
A column named 'id' already exists in the base table.


What do you guys think of making it similar to the CREATE

TABLE

LIKE?

Seems

the best approach in order to
be compatible with it.

- Sergio

On Tue, Jun 11, 2024 at 2:10 PM Sergio Pena <

ser...@confluent.io>

wrote:

Thanks Timo for answering Jeyhun questions.

To add info more about your questions Jeyhun. This proposal

is

not

handling NULL/NOT_NULL types. I noticed that
the current CTAS impl. (as Timo said) adds this constraint

as

part

of

the

resulting schema. And when defining
a primary key in the CREATE part, if the resulting schema

does

not

have a

NOT NULL in the column then the CTAS
will fail. This is similar to the CREATE TABLE LIKE which

expects

the

LIKE

table to have a NOT NULL column if
the user defines a primary key in the CREATE part.

In some cases, redefining the column types might be

redundant,

especially

when users dont change the column type. A user just wants

to

change

the

column name from the SELECT clause. Should we also support

this

scenario,

similar to postgres?


I looked into Postgres too. Postgres matches the columns

based

on

the

order defined in the create and select part.
If you want to rename a column in the create part, then that

column

position must be in the same position as the query column.
I didn't like the Postgres approach because it does not let

us

add

columns

that do not exist in the query schema.

i.e. query has schema (a int, b string), now the `a` column

is

renamed

to

`id` because both are in the same position 0
`create table s1(id int) as select a, b from t1`;
results in: [id int, b string]

I think, if users want to rename then they can use a

different

alias

in

the select part. They could also do explicit casting
for changing the data types, which now makes it redundant

(as

you

said)

to

allow redefining the query columns again. But
perhaps there are cases where explicit casting does not work

and

just

defining the column would? i.e. making a nullable
type to not null? I couldn't make `cast(c1 as int not

null)` to

work

for

instance, but it may work in the create part?

Could you also mention the casting rules in the FLIP for

this

case?


I mentioned they're the same as insert/select when doing

implicit

casting.

I will search for more info about the insert/select
and add the casting rules in the flip..

- Sergio


On Tue, Jun 11, 2024 at 12:59 AM Timo Walther <

twal...@apache.org>

wrote:

Hi Sergio,

thanks for proposing this FLIP for finalizing the CTAS

statement.

Adopting the logic from MySQL for deriving and potentially

overwriting

parts of the schema should be easy to understand for

everyone.

So

+1

for

the FLIP in general.

   > How do you handle CTAS statements with SELECT clauses

that

have

   > (implicit or explicit) NULLABLE or NOT NULLABLE

columns?


@Jeyhun: I don't think there is anything special about

this.

The

current

CTAS implementation should already cover that. It takes the

nullability

of the column's data type as a constraint into derived

schema.

Keep

in

mind that nullability is part of the data type in Flink,

not

only

constraint on the schema. This decision was made due to

Calcite

internals.


   > redefining the column types might be redundant,

especially

   > when users dont change the column type

This is indeed a good point. On one hand, I think we should

avoid

further complicating the syntax. But looking at other

vendors

[1]

this

seems indeed a valid use case. If it doesn't cause too many

special

cases in the parser (and it's look-ahead), I'm fine with

supporting

list of column names as well. However, the most important

use

case

will

be specifying a watermark, metadata columns, or other

schema

parts

that

are not just columns names.

Regards,
Timo


[1]

https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-as-select-azure-sql-data-warehouse?view=azure-sqldw-latest



On 10.06.24 21:37, Jeyhun Karimov wrote:

Hi Sergio,

Thanks for driving this FLIP. +1 for it.
I have a few questions:

- How do you handle CTAS statements with SELECT clauses

that

have

(implicit

or explicit) NULLABLE or NOT NULLABLE columns? Could you

also

mention

the

casting rules in the FLIP for this case?
- In some cases, redefining the column types might be

redundant,

especially

when users dont change the column type. For example, a

user

just

wants

to

change the column name from the SELECT clause.
Should we also support this scenario, similar to the

postgres

[1] ?


Regards,
Jeyhun


[1]

https://www.postgresql.org/docs/8.1/sql-createtableas.html


On Mon, Jun 10, 2024 at 6:28 PM Sergio Pena

<ser...@confluent.io.invalid


wrote:

Hi David,

The CTAS feature is already part of Flink (proposed in

FLIP-218

[1]).

The

new FLIP-463 is just to extend the CTAS syntax to allow

for

adding

new

columns to
the created table that are not part of the generated

schema. I

think

FLIP-218 [1] was discussed in the mail list somewhere,

but I

couldn't

find

the discussion thread.
I was hoping it could contain the answers for your

questions

as

that's

where CTAS was implemented. There's a user doc [2] for it

that

may

help

too.

But, in a nutshell, CTAS is similar to running CREATE and

then

INSERT/SELECT statements. The CTAS will execute a

background

job

that

executes
the AS SELECT query and then inserts the data into the

newly

created

table.

So the table will be kept up to date as the source data

changes

as

you

said.

Generated columns are from the AS SELECT schema, right?

Yes,

that's

already

allowed. FLIP-463 will also allow modifying the schema of

the

sink

table

in case you need to add more columns.

I notice amazon [3] talks of the difference between view

and

CTAS;

that

CTAS persists the content. Is this the approach we are

taking?

If

so

where

are we persisting?


This is already supported by FLIP-218 [1]. But yes, the

CTAS

persists

the

content in the new table you specify in the CTAS

statement.

is the table read-only, or can we insert/ delete /

update

into

it.

If

it

is read only how will the inserts , deletes updates

fail.


Are there any restrictions on the select ? Can the select

be a

join,

aggregate, windowed?


I notice Azure [2] supports ISNULL. Is there a thought to

change

the

nullability for the CTAS?


Amazon [4] ignores ordered by clauses ? is that the same

for

this

proposal?


I suppose all of this is already supported by FLIP-218. I

don't

see

any

restrictions on the query. You can learn more about it in

the

FLIP-218

[1]


- Sergio

[1]

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=199541185

[2]

https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#as-select_statement


On Mon, Jun 10, 2024 at 3:10 AM David Radley <

david_rad...@uk.ibm.com>

wrote:

Hi Sergio,
Sounds good . I am relatively new to this area and had

some

basic

questions:

I notice in [1] it talks of materialized views. And

CREATE

view

can

already take the AS keyword. It would be useful to me to

understand

when

we

would use each of these.

- I assume the table will be kept up to date as the

source

data

changes

like a view.
- Are there any restrictions on the select ? Can the

select

be

join,

aggregate, windowed?
- I assume generated columns are allowed?
- is the table read-only, or can we insert/ delete /

update

into

it.

If

it

is read only how will the inserts , deletes updates

fail.

- I notice Azure [2] supports ISNULL. Is there a

thought to

change

the

nullability for the CTAS?
- I notice amazon [3] talks of the difference between

view

and

CTAS;

that

CTAS persists the content. Is this the approach we are

taking?

If

so

where

are we persisting?
- Amazon [4] ignores ordered by clauses ? is that the

same

for

this

proposal?

[1]

https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/concepts/dynamic_tables/

[2]

https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-develop-ctas#selectinto-vs-ctas

[3]

https://docs.aws.amazon.com/athena/latest/ug/ctas-considerations-limitations.html#ctas-considerations-limitations-queries-vs-views

[4]

https://docs.aws.amazon.com/athena/latest/ug/ctas-considerations-limitations.html#ctas-considerations-limitations-order-by-ignored



Kind regards, David.

From: Sergio Pena <ser...@confluent.io.INVALID>
Date: Friday, 7 June 2024 at 16:13
To: dev@flink.apache.org <dev@flink.apache.org>
Subject: [EXTERNAL] [DISCUSS] FLIP-463: Schema

Definition

in

CREATE

TABLE

AS Statement
HI All,

I'd like to start a discussion on FLIP-463: Schema

Definition

in

CREATE

TABLE AS Statement [1]

The proposal extends the CTAS statement to allow users

to

define

their

own

schema by adding columns, primary and partition keys,

and

table

distribution to the CREATE statement.

Any thoughts are welcome.

Thanks,
- Sergio Pena

[1]

https://cwiki.apache.org/confluence/display/FLINK/FLIP-463%3A+Schema+Definition+in+CREATE+TABLE+AS+Statement


Unless otherwise stated above:

IBM United Kingdom Limited
Registered in England and Wales with number 741598
Registered office: PO Box 41, North Harbour, Portsmouth,

Hants.

PO6

3AU

Re: [DISCUSS] FLIP-463: Schema Definition in CREATE TABLE AS Statement

Reply via email to