Hi Sergio,

thanks for updating the FLIP and ingesting the feedback from the ML discussion.

I would be fine to start voting on this, if there are no objections?

Regards,
Timo

On 18.06.24 19:53, Sergio Pena wrote:
Hi Jeyhun, Timo,

I updated the FLIP to include the option to reorder the columns defined in
the SELECT part.
As Timo mentioned, this is equivalent to the Flink INSERT INTO statement
and also to
Azure CTAS statement.

For example:

CREATE TABLE t1(a INT, b INT, c INT);
CREATE TABLE t1(c, b, a) AS SELECT * FROM t1;
DESCRIBE s1;
+-------------+-----------+----------+--------+
| Column Name | Data Type | Nullable | Extras |
+-------------+-----------+----------+--------+
| c           | INT       | NULL     |        |
| b           | INT       | NULL     |        |
| a           | INT       | NULL     |        |
+-------------+-----------+----------+--------+

Let me know your thoughts.

- Sergio

On Tue, Jun 18, 2024 at 7:34 AM Sergio Pena <ser...@confluent.io> wrote:

Hi Ron,

I think the primary key and unique values are handled by the engine. I
found this document [1]
that explains how Flink handles those records. When using CTAS, you can
specify in the
table options how you want the records to be merged if you define a
primary key.

Regarding CTAS, this is just a wrapper that's similar to executing two
statements manually; a
create table followed by an insert select statements. If a user has a
primary key in the target
table and then execute the insert select, then the Flink engine will
handle unique records
depending on how the target table was configured.

Hope this link [1] helps understand primary keys? It really helped me, I
wasn't sure how
it worked either :).

- Sergio

[1]
https://nightlies.apache.org/flink/flink-table-store-docs-master/docs/concepts/primary-key-table/


On Mon, Jun 17, 2024 at 8:43 PM Ron Liu <ron9....@gmail.com> wrote:

Hi, Sergio

Thanks for your reply.

PKEYS are not inherited from the select query. Also UNIQUE KEY is not
supported on the create statements yet.
All columns queried in the select part form a query schema with only the
column name, type and null/not null properties.
Based on this, I don't see there will be an issue with uniqueness. Users
may define a pkey on the not null columns
only.

Sorry if I wasn't clear about this issue. My question is, if a primary key
is defined on column a in the target table of a CTAS, and the data fetched
by the Select query can't guarantee uniqueness based on column a, then
what
is relied upon to ensure uniqueness?
Is it something that the engine does, or is it dependent on the storage
mechanism of the target table?

Best,
Ron


Sergio Pena <ser...@confluent.io.invalid> 于2024年6月18日周二 01:24写道:

Hi Ron,
Thanks for your feedback.

Is it possible for you to list these systems and make it part of the
Reference chapter? Which would make it easier for everyone to
understand.


I updated the flip to include the ones I found (mysql, postgres,
oracle).
Though postgres/oracle semantics are different, they at least allow you
some schema changes in the create part.
I think spark sql supports it too, but I couldn't find a way to test
it, so
I didn't include it in the list.

One is whether Nullable columns are

allowed to be referenced in the primary key definition, because all the

columns deduced based on Select Query may be Nullable;


I updated the flip. But in short, CTAS will use the same rules and
validations as other CREATE statements.
Pkeys are not allowed on NULL columns, so CTAS would fail too. The CTAS
inherits the NULL and NOT NULL
constraints from the query schema, so PKEY can be used on the NOT NULL
cols
only.

if the UNIQUE KEY cannot be deduced based on Select Query, or the
deduced UNIQUE KEY does not match the defined primary key, which will
lead
to data duplication, the engine needs to what to ensure the
uniqueness of
the data?


PKEYS are not inherited from the select query. Also UNIQUE KEY is not
supported on the create statements yet.
All columns queried in the select part form a query schema with only the
column name, type and null/not null properties.
Based on this, I don't see there will be an issue with uniqueness. Users
may define a pkey on the not null columns
only.

- Sergio


On Sun, Jun 16, 2024 at 9:55 PM Ron Liu <ron9....@gmail.com> wrote:

Hi, Sergio

Sorry for later joining this thread.

Thanks for driving this proposal, it looks great.

I have a few questions:

1. Many SQL-based data processing systems, however, support schema
definitions within their CTAS statements

Is it possible for you to list these systems and make it part of the
Reference chapter? Which would make it easier for everyone to
understand.


2. This proposal proposes to support defining primary keys in CTAS,
then
there are two possible issues here. One is whether Nullable columns
are
allowed to be referenced in the primary key definition, because all
the
columns deduced based on Select Query may be Nullable; and the second
is
that if the UNIQUE KEY cannot be deduced based on Select Query, or the
deduced UNIQUE KEY does not match the defined primary key, which will
lead
to data duplication, the engine needs to what to ensure the
uniqueness of
the data?


Best
Ron


Jeyhun Karimov <je.kari...@gmail.com> 于2024年6月14日周五 01:51写道:

Thanks Sergio and Timo for your answers.
Sounds good to me.
Looking forward for this feature.

Regards,
Jeyhun

On Thu, Jun 13, 2024 at 4:48 PM Sergio Pena
<ser...@confluent.io.invalid

wrote:

Sure Yuxia, I just added the support for RTAS statements too.

- Sergio

On Wed, Jun 12, 2024 at 8:22 PM yuxia <
luoyu...@alumni.sjtu.edu.cn>
wrote:

Hi, Sergio.
Thanks for driving the FLIP. Given we also has REPLACE TABLE AS
Statement[1] and it's almost same with CREATE TABLE AS
Statement,
would you mind also supporting schema definition for REPLACE
TABLE
AS
Statement in this FLIP? It'll be a great to align REPLACE TABLE
AS
Statement
to CREATE TABLE AS Statement


[1]





https://cwiki.apache.org/confluence/display/FLINK/FLIP-303%3A+Support+REPLACE+TABLE+AS+SELECT+statement

Best regards,
Yuxia

----- 原始邮件 -----
发件人: "Timo Walther" <twal...@apache.org>
收件人: "dev" <dev@flink.apache.org>
发送时间: 星期三, 2024年 6 月 12日 下午 10:19:14
主题: Re: [DISCUSS] FLIP-463: Schema Definition in CREATE TABLE AS
Statement

I just noticed the CREATE TABLE LIKE statement allows the
definition
  > of new columns in the CREATE part. The difference
  > with this CTAS proposal is that TABLE LIKE appends the new
columns
at
  > the end of the schema instead of adding them
  > at the beginning like this proposal and Mysql do.

This should be fine. The LIKE rather "extends from" the right
table.
Whereas the SELECT in CTAS rather extends the left schema
definition.
Given that "the extended part" is always appended, we could
argue
that
the current CTAS behavior in the FLIP is acceptable.

  > If you want to rename a column in the create part, then that
column
  > position must be in the same position as the query column.
  > I didn't like the Postgres approach because it does not let
us
add
  > columns that do not exist in the query schema.

Flink offers similar functionality in INSERT INTO. INSERT INTO
also
allows syntax like: `INSERT INTO (b, c) SELECT * FROM t`. So
given
that
our CTAS can be seen as a CREATE TABLE + INSERT INTO. I would
adopt
Jeyhun comment in the FLIP. If users don't want to add
additional
schema
parts, the same column reordering should be available as if one
would
write a INSERT INTO.

Regards,
Timo




On 12.06.24 04:30, Yanquan Lv wrote:
Hi Sergio, thanks for driving it, +1 for this.

I have some comments:
1. If we have a source table with primary keys and partition
keys
defined,
what is the default behavior if PARTITIONED and DISTRIBUTED
not
specified
in the CTAS statement, It should not be inherited by default?
2. I suggest providing a complete syntax that includes
table_properties
like FLIP-218.


Sergio Pena <ser...@confluent.io.invalid> 于2024年6月12日周三
03:54写道:

I just noticed the CREATE TABLE LIKE statement allows the
definition
of
new
columns in the CREATE part. The difference
with this CTAS proposal is that TABLE LIKE appends the new
columns
at
the
end of the schema instead of adding them
at the beginning like this proposal and Mysql do.

create table t1(id int, name string);
create table s1(a int, b string) like t1;
describe s1;

+-------------+-----------+----------+--------+
| Column Name | Data Type | Nullable | Extras |
+-------------+-----------+----------+--------+
| id          | INT       | NULL     |        |
| name        | STRING    | NULL     |        |
| a           | INT       | NULL     |        |
| b           | STRING    | NULL     |        |
+-------------+-----------+----------+--------+



The CREATE TABLE LIKE also does not let the definition of
existing
columns
in the CREATE part. The statement fails
that the column already exists.

create table t1(id int, name string);

create table s1(id double) like t1;
A column named 'id' already exists in the base table.


What do you guys think of making it similar to the CREATE
TABLE
LIKE?
Seems
the best approach in order to
be compatible with it.

- Sergio

On Tue, Jun 11, 2024 at 2:10 PM Sergio Pena <
ser...@confluent.io>
wrote:

Thanks Timo for answering Jeyhun questions.

To add info more about your questions Jeyhun. This proposal
is
not
handling NULL/NOT_NULL types. I noticed that
the current CTAS impl. (as Timo said) adds this constraint
as
part
of
the
resulting schema. And when defining
a primary key in the CREATE part, if the resulting schema
does
not
have a
NOT NULL in the column then the CTAS
will fail. This is similar to the CREATE TABLE LIKE which
expects
the
LIKE
table to have a NOT NULL column if
the user defines a primary key in the CREATE part.

In some cases, redefining the column types might be
redundant,
especially
when users dont change the column type. A user just wants
to
change
the
column name from the SELECT clause. Should we also support
this
scenario,
similar to postgres?

I looked into Postgres too. Postgres matches the columns
based
on
the
order defined in the create and select part.
If you want to rename a column in the create part, then that
column
position must be in the same position as the query column.
I didn't like the Postgres approach because it does not let
us
add
columns
that do not exist in the query schema.

i.e. query has schema (a int, b string), now the `a` column
is
renamed
to
`id` because both are in the same position 0
`create table s1(id int) as select a, b from t1`;
results in: [id int, b string]

I think, if users want to rename then they can use a
different
alias
in
the select part. They could also do explicit casting
for changing the data types, which now makes it redundant
(as
you
said)
to
allow redefining the query columns again. But
perhaps there are cases where explicit casting does not work
and
just
defining the column would? i.e. making a nullable
type to not null? I couldn't make `cast(c1 as int not
null)` to
work
for
instance, but it may work in the create part?

Could you also mention the casting rules in the FLIP for
this
case?

I mentioned they're the same as insert/select when doing
implicit
casting.
I will search for more info about the insert/select
and add the casting rules in the flip..

- Sergio


On Tue, Jun 11, 2024 at 12:59 AM Timo Walther <
twal...@apache.org>
wrote:

Hi Sergio,

thanks for proposing this FLIP for finalizing the CTAS
statement.
Adopting the logic from MySQL for deriving and potentially
overwriting
parts of the schema should be easy to understand for
everyone.
So
+1
for
the FLIP in general.

   > How do you handle CTAS statements with SELECT clauses
that
have
   > (implicit or explicit) NULLABLE or NOT NULLABLE
columns?

@Jeyhun: I don't think there is anything special about
this.
The
current
CTAS implementation should already cover that. It takes the
nullability
of the column's data type as a constraint into derived
schema.
Keep
in
mind that nullability is part of the data type in Flink,
not
only
a
constraint on the schema. This decision was made due to
Calcite
internals.

   > redefining the column types might be redundant,
especially
   > when users dont change the column type

This is indeed a good point. On one hand, I think we should
avoid
further complicating the syntax. But looking at other
vendors
[1]
this
seems indeed a valid use case. If it doesn't cause too many
special
cases in the parser (and it's look-ahead), I'm fine with
supporting
a
list of column names as well. However, the most important
use
case
will
be specifying a watermark, metadata columns, or other
schema
parts
that
are not just columns names.

Regards,
Timo


[1]








https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-as-select-azure-sql-data-warehouse?view=azure-sqldw-latest


On 10.06.24 21:37, Jeyhun Karimov wrote:
Hi Sergio,

Thanks for driving this FLIP. +1 for it.
I have a few questions:

- How do you handle CTAS statements with SELECT clauses
that
have
(implicit
or explicit) NULLABLE or NOT NULLABLE columns? Could you
also
mention
the
casting rules in the FLIP for this case?
- In some cases, redefining the column types might be
redundant,
especially
when users dont change the column type. For example, a
user
just
wants
to
change the column name from the SELECT clause.
Should we also support this scenario, similar to the
postgres
[1] ?

Regards,
Jeyhun


[1]
https://www.postgresql.org/docs/8.1/sql-createtableas.html

On Mon, Jun 10, 2024 at 6:28 PM Sergio Pena
<ser...@confluent.io.invalid

wrote:

Hi David,

The CTAS feature is already part of Flink (proposed in
FLIP-218
[1]).
The
new FLIP-463 is just to extend the CTAS syntax to allow
for
adding
new
columns to
the created table that are not part of the generated
schema. I
think
FLIP-218 [1] was discussed in the mail list somewhere,
but I
couldn't
find
the discussion thread.
I was hoping it could contain the answers for your
questions
as
that's
where CTAS was implemented. There's a user doc [2] for it
that
may
help
too.

But, in a nutshell, CTAS is similar to running CREATE and
then
INSERT/SELECT statements. The CTAS will execute a
background
job
that
executes
the AS SELECT query and then inserts the data into the
newly
created
table.
So the table will be kept up to date as the source data
changes
as
you
said.

Generated columns are from the AS SELECT schema, right?
Yes,
that's
already
allowed. FLIP-463 will also allow modifying the schema of
the
sink
table
in case you need to add more columns.

I notice amazon [3] talks of the difference between view
and
CTAS;
that
CTAS persists the content. Is this the approach we are
taking?
If
so
where
are we persisting?

This is already supported by FLIP-218 [1]. But yes, the
CTAS
persists
the
content in the new table you specify in the CTAS
statement.


is the table read-only, or can we insert/ delete /
update
into
it.
If
it
is read only how will the inserts , deletes updates
fail.

Are there any restrictions on the select ? Can the select
be a
join,
aggregate, windowed?

I notice Azure [2] supports ISNULL. Is there a thought to
change
the
nullability for the CTAS?

Amazon [4] ignores ordered by clauses ? is that the same
for
this
proposal?

I suppose all of this is already supported by FLIP-218. I
don't
see
any
restrictions on the query. You can learn more about it in
the
FLIP-218
[1]

- Sergio

[1]








https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=199541185
[2]









https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#as-select_statement

On Mon, Jun 10, 2024 at 3:10 AM David Radley <
david_rad...@uk.ibm.com>
wrote:

Hi Sergio,
Sounds good . I am relatively new to this area and had
some
basic
questions:

I notice in [1] it talks of materialized views. And
CREATE
view
can
already take the AS keyword. It would be useful to me to
understand
when
we
would use each of these.

- I assume the table will be kept up to date as the
source
data
changes
like a view.
- Are there any restrictions on the select ? Can the
select
be
a
join,
aggregate, windowed?
- I assume generated columns are allowed?
- is the table read-only, or can we insert/ delete /
update
into
it.
If
it
is read only how will the inserts , deletes updates
fail.
- I notice Azure [2] supports ISNULL. Is there a
thought to
change
the
nullability for the CTAS?
- I notice amazon [3] talks of the difference between
view
and
CTAS;
that
CTAS persists the content. Is this the approach we are
taking?
If
so
where
are we persisting?
- Amazon [4] ignores ordered by clauses ? is that the
same
for
this
proposal?

[1]









https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/concepts/dynamic_tables/
[2]









https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-develop-ctas#selectinto-vs-ctas
[3]









https://docs.aws.amazon.com/athena/latest/ug/ctas-considerations-limitations.html#ctas-considerations-limitations-queries-vs-views
[4]









https://docs.aws.amazon.com/athena/latest/ug/ctas-considerations-limitations.html#ctas-considerations-limitations-order-by-ignored


Kind regards, David.

From: Sergio Pena <ser...@confluent.io.INVALID>
Date: Friday, 7 June 2024 at 16:13
To: dev@flink.apache.org <dev@flink.apache.org>
Subject: [EXTERNAL] [DISCUSS] FLIP-463: Schema
Definition
in
CREATE
TABLE
AS Statement
HI All,

I'd like to start a discussion on FLIP-463: Schema
Definition
in
CREATE
TABLE AS Statement [1]

The proposal extends the CTAS statement to allow users
to
define
their
own
schema by adding columns, primary and partition keys,
and
table
distribution to the CREATE statement.

Any thoughts are welcome.

Thanks,
- Sergio Pena

[1]










https://cwiki.apache.org/confluence/display/FLINK/FLIP-463%3A+Schema+Definition+in+CREATE+TABLE+AS+Statement

Unless otherwise stated above:

IBM United Kingdom Limited
Registered in England and Wales with number 741598
Registered office: PO Box 41, North Harbour, Portsmouth,
Hants.
PO6
3AU
















Reply via email to