Re: [DISCUSS] SQL CTAS Syntax

Jingsong Li Tue, 01 Jun 2021 03:06:53 -0700

Thanks Danny for starting the discussion.

+1 for this feature.


I like schema evolution.

A question:

Can we support multiple pipelines? For example, I have three source tables
and just one sink table.

So, I will write three CTAS:
- CTAS IF NOT EXISTS 1
- CTAS IF NOT EXISTS 2
- CTAS IF NOT EXISTS 3

Do I have a chance to configure different sink options? Something like
dynamic table options?

Best,
Jingsong

On Fri, May 28, 2021 at 10:10 PM Kurt Young <ykt...@gmail.com> wrote:

> Hi Konstantin,
>
> From my understanding, this syntax has 2 major benefits:
>
> 1. Just like you said, it saves the effort to specify the schema,
> especially when involving hundreds of fields.
> 2. When using CREATE TABLE xx AS TABLE yy, it gives us the possibility to
> enable schema evolution, and it seems pretty natural to do so.
>
> Best,
> Kurt
>
>
> On Fri, May 28, 2021 at 5:44 PM Konstantin Knauf <kna...@apache.org>
> wrote:
>
> > Hi everyone,
> >
> > quick question for my understanding: how is this different to
> >
> > CREATE TABLE IF NOT EXISTS my_table (
> > ...
> > ) WITH (
> > ...
> > );
> > INSERT INTO my_table SELECT ...;
> >
> > ?
> >
> > Is it only about a) not having to specify the schema and b) a more
> > condensed syntax?
> >
> > Cheers,
> >
> > Konstantin
> >
> > On Fri, May 28, 2021 at 11:30 AM Jark Wu <imj...@gmail.com> wrote:
> >
> > > Thanks Danny for starting the discussion of extending CTAS syntax.
> > >
> > > I think this is a very useful feature for data integration and ETL jobs
> > (a
> > > big use case of Flink).
> > > Many users complain a lot that manually defining schemas for sources
> and
> > > sinks is hard.
> > > CTAS helps users to write ETL jobs without defining any schemas of
> > sources
> > > and sinks.
> > > CTAS automatically creates physical tables in external systems, and
> > > automatically
> > > maps external tables to Flink tables with the help of catalogs (e.g.
> > > PgCatalog, HiveCatalog).
> > >
> > > On the other hand, the schema of the SELECT query is fixed after
> compile
> > > time.
> > > CTAS TABLE extends the syntax which allows dynamic schema during
> runtime,
> > > semantically it streaming copies the up-to-date structure and data (if
> > run
> > > in streaming mode).
> > > So I think CTAS TABLE is a major step forward for data integration, it
> > > defines a syntax
> > > which allows the underlying streaming pipeline automatically migrate
> > schema
> > > evolution
> > > (e.g. ADD COLUMN) from source tables to sink tables without stopping
> jobs
> > > or updating SQLs.
> > >
> > > Therefore, I'm +1 for the feature.
> > >
> > > Best,
> > > Jark
> > >
> > > On Fri, 28 May 2021 at 16:22, JING ZHANG <beyond1...@gmail.com> wrote:
> > >
> > > > Hi Danny,
> > > >
> > > > Thanks for starting this discussion.
> > > >
> > > >
> > > >
> > > > Big +1 for this feature. Both CTAS AND CREATE TABLE LIKE are very
> > useful
> > > > features. IMO, it is clear to separate them into two parts in the
> > > `syntax`
> > > > character. 😀
> > > >
> > > >
> > > >
> > > > First, I have two related problems:
> > > >
> > > >
> > > > 1. Would `create table` in CTAS trigger to create a physical table in
> > > > external storage system?
> > > >
> > > > For example, now normal `create table` would only define a connecting
> > > with
> > > > an existed external Kafka topic instead of trigger to create a
> physical
> > > > kafka topic in kafka cluster. Does this behavior still work for CTAS
> > AND
> > > > CREATE TABLE LIKE?
> > > >
> > > >
> > > > 2. Would the data sync  of CTAS run continuously if select works on a
> > > > unbounded source?
> > > >
> > > > Since sub select query may works on unbounded source in Flink, which
> is
> > > > different with other system (postgres, spark, hive, mysql). Does data
> > > sync
> > > > continuously run or just sync the snapshot at the job submit?
> > > >
> > > >
> > > >
> > > > Besides, I have some minor problems which is mentioned in your email.
> > > >
> > > >
> > > >
> > > > > how to write data into existing table with history data declare [IF
> > NOT
> > > > EXISTS] keywords and we ignore the table creation but the pipeline
> > still
> > > > starts up
> > > >
> > > >
> > > >
> > > > Maybe we should check old schema and new schema. What would happen if
> > > > schema of existed table is different with new schema?
> > > >
> > > >
> > > >
> > > > > How to match sub-database and sub-table ? Use regex style source
> > table
> > > > name
> > > >
> > > >
> > > >
> > > >    1. What would happen if schema of matched tables different with
> each
> > > > other?
> > > >
> > > >    2. What orders to sync data of all matched table? Sync data from
> all
> > > > matched tables one by one or at the same time?
> > > >
> > > >
> > > >
> > > > >  AS select_statement: copy source table data into target
> > > >
> > > >
> > > >
> > > > User could  explicitly specify the data type for each column in the
> > CTAS,
> > > > what happened when run the following example. The demo is from MySQL
> > > > document,
> > > https://dev.mysql.com/doc/refman/5.6/en/create-table-select.html
> > > > , the result is a bit unexpected, I wonder
> > > >
> > > > What the behavior would be in Flink.
> > > >
> > > >
> > > > [image: image.png]
> > > >
> > > > Best,
> > > > JING ZHANG
> > > >
> > >
> >
> >
> > --
> >
> > Konstantin Knauf
> >
> > https://twitter.com/snntrable
> >
> > https://github.com/knaufk
> >
>


-- 
Best, Jingsong Lee

Re: [DISCUSS] SQL CTAS Syntax

Reply via email to