Thanks Danny for starting the discussion. +1 for this feature.
I like schema evolution. A question: Can we support multiple pipelines? For example, I have three source tables and just one sink table. So, I will write three CTAS: - CTAS IF NOT EXISTS 1 - CTAS IF NOT EXISTS 2 - CTAS IF NOT EXISTS 3 Do I have a chance to configure different sink options? Something like dynamic table options? Best, Jingsong On Fri, May 28, 2021 at 10:10 PM Kurt Young <ykt...@gmail.com> wrote: > Hi Konstantin, > > From my understanding, this syntax has 2 major benefits: > > 1. Just like you said, it saves the effort to specify the schema, > especially when involving hundreds of fields. > 2. When using CREATE TABLE xx AS TABLE yy, it gives us the possibility to > enable schema evolution, and it seems pretty natural to do so. > > Best, > Kurt > > > On Fri, May 28, 2021 at 5:44 PM Konstantin Knauf <kna...@apache.org> > wrote: > > > Hi everyone, > > > > quick question for my understanding: how is this different to > > > > CREATE TABLE IF NOT EXISTS my_table ( > > ... > > ) WITH ( > > ... > > ); > > INSERT INTO my_table SELECT ...; > > > > ? > > > > Is it only about a) not having to specify the schema and b) a more > > condensed syntax? > > > > Cheers, > > > > Konstantin > > > > On Fri, May 28, 2021 at 11:30 AM Jark Wu <imj...@gmail.com> wrote: > > > > > Thanks Danny for starting the discussion of extending CTAS syntax. > > > > > > I think this is a very useful feature for data integration and ETL jobs > > (a > > > big use case of Flink). > > > Many users complain a lot that manually defining schemas for sources > and > > > sinks is hard. > > > CTAS helps users to write ETL jobs without defining any schemas of > > sources > > > and sinks. > > > CTAS automatically creates physical tables in external systems, and > > > automatically > > > maps external tables to Flink tables with the help of catalogs (e.g. > > > PgCatalog, HiveCatalog). > > > > > > On the other hand, the schema of the SELECT query is fixed after > compile > > > time. > > > CTAS TABLE extends the syntax which allows dynamic schema during > runtime, > > > semantically it streaming copies the up-to-date structure and data (if > > run > > > in streaming mode). > > > So I think CTAS TABLE is a major step forward for data integration, it > > > defines a syntax > > > which allows the underlying streaming pipeline automatically migrate > > schema > > > evolution > > > (e.g. ADD COLUMN) from source tables to sink tables without stopping > jobs > > > or updating SQLs. > > > > > > Therefore, I'm +1 for the feature. > > > > > > Best, > > > Jark > > > > > > On Fri, 28 May 2021 at 16:22, JING ZHANG <beyond1...@gmail.com> wrote: > > > > > > > Hi Danny, > > > > > > > > Thanks for starting this discussion. > > > > > > > > > > > > > > > > Big +1 for this feature. Both CTAS AND CREATE TABLE LIKE are very > > useful > > > > features. IMO, it is clear to separate them into two parts in the > > > `syntax` > > > > character. 😀 > > > > > > > > > > > > > > > > First, I have two related problems: > > > > > > > > > > > > 1. Would `create table` in CTAS trigger to create a physical table in > > > > external storage system? > > > > > > > > For example, now normal `create table` would only define a connecting > > > with > > > > an existed external Kafka topic instead of trigger to create a > physical > > > > kafka topic in kafka cluster. Does this behavior still work for CTAS > > AND > > > > CREATE TABLE LIKE? > > > > > > > > > > > > 2. Would the data sync of CTAS run continuously if select works on a > > > > unbounded source? > > > > > > > > Since sub select query may works on unbounded source in Flink, which > is > > > > different with other system (postgres, spark, hive, mysql). Does data > > > sync > > > > continuously run or just sync the snapshot at the job submit? > > > > > > > > > > > > > > > > Besides, I have some minor problems which is mentioned in your email. > > > > > > > > > > > > > > > > > how to write data into existing table with history data declare [IF > > NOT > > > > EXISTS] keywords and we ignore the table creation but the pipeline > > still > > > > starts up > > > > > > > > > > > > > > > > Maybe we should check old schema and new schema. What would happen if > > > > schema of existed table is different with new schema? > > > > > > > > > > > > > > > > > How to match sub-database and sub-table ? Use regex style source > > table > > > > name > > > > > > > > > > > > > > > > 1. What would happen if schema of matched tables different with > each > > > > other? > > > > > > > > 2. What orders to sync data of all matched table? Sync data from > all > > > > matched tables one by one or at the same time? > > > > > > > > > > > > > > > > > AS select_statement: copy source table data into target > > > > > > > > > > > > > > > > User could explicitly specify the data type for each column in the > > CTAS, > > > > what happened when run the following example. The demo is from MySQL > > > > document, > > > https://dev.mysql.com/doc/refman/5.6/en/create-table-select.html > > > > , the result is a bit unexpected, I wonder > > > > > > > > What the behavior would be in Flink. > > > > > > > > > > > > [image: image.png] > > > > > > > > Best, > > > > JING ZHANG > > > > > > > > > > > > > -- > > > > Konstantin Knauf > > > > https://twitter.com/snntrable > > > > https://github.com/knaufk > > > -- Best, Jingsong Lee