Re: [DISCUSS] SQL CTAS Syntax

Jark Wu Fri, 28 May 2021 02:30:58 -0700

Thanks Danny for starting the discussion of extending CTAS syntax.

I think this is a very useful feature for data integration and ETL jobs (a
big use case of Flink).
Many users complain a lot that manually defining schemas for sources and
sinks is hard.
CTAS helps users to write ETL jobs without defining any schemas of sources
and sinks.
CTAS automatically creates physical tables in external systems, and
automatically
maps external tables to Flink tables with the help of catalogs (e.g.
PgCatalog, HiveCatalog).


On the other hand, the schema of the SELECT query is fixed after compile
time.
CTAS TABLE extends the syntax which allows dynamic schema during runtime,
semantically it streaming copies the up-to-date structure and data (if run
in streaming mode).
So I think CTAS TABLE is a major step forward for data integration, it
defines a syntax
which allows the underlying streaming pipeline automatically migrate schema
evolution
(e.g. ADD COLUMN) from source tables to sink tables without stopping jobs
or updating SQLs.

Therefore, I'm +1 for the feature.

Best,
Jark

On Fri, 28 May 2021 at 16:22, JING ZHANG <beyond1...@gmail.com> wrote:

> Hi Danny,
>
> Thanks for starting this discussion.
>
>
>
> Big +1 for this feature. Both CTAS AND CREATE TABLE LIKE are very useful
> features. IMO, it is clear to separate them into two parts in the `syntax`
> character. 😀
>
>
>
> First, I have two related problems:
>
>
> 1. Would `create table` in CTAS trigger to create a physical table in
> external storage system?
>
> For example, now normal `create table` would only define a connecting with
> an existed external Kafka topic instead of trigger to create a physical
> kafka topic in kafka cluster. Does this behavior still work for CTAS AND
> CREATE TABLE LIKE?
>
>
> 2. Would the data sync  of CTAS run continuously if select works on a
> unbounded source?
>
> Since sub select query may works on unbounded source in Flink, which is
> different with other system (postgres, spark, hive, mysql). Does data sync
> continuously run or just sync the snapshot at the job submit?
>
>
>
> Besides, I have some minor problems which is mentioned in your email.
>
>
>
> > how to write data into existing table with history data declare [IF NOT
> EXISTS] keywords and we ignore the table creation but the pipeline still
> starts up
>
>
>
> Maybe we should check old schema and new schema. What would happen if
> schema of existed table is different with new schema?
>
>
>
> > How to match sub-database and sub-table ? Use regex style source table
> name
>
>
>
>    1. What would happen if schema of matched tables different with each
> other?
>
>    2. What orders to sync data of all matched table? Sync data from all
> matched tables one by one or at the same time?
>
>
>
> >  AS select_statement: copy source table data into target
>
>
>
> User could  explicitly specify the data type for each column in the CTAS,
> what happened when run the following example. The demo is from MySQL
> document, https://dev.mysql.com/doc/refman/5.6/en/create-table-select.html
> , the result is a bit unexpected, I wonder
>
> What the behavior would be in Flink.
>
>
> [image: image.png]
>
> Best,
> JING ZHANG
>

Re: [DISCUSS] SQL CTAS Syntax

Reply via email to