Russell, to me, the "snapshot" procedure is the perfect place for this feature. Once it is implemented, we could use the "snapshot" procedure to snapshot either a Hive table or an Iceberg table (and maybe make it generic enough to snapshot other table formats, e.g. Delta).
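
To make the idea concrete, here is a rough sketch of how a test table could be requested through the procedure. The named arguments (source_table, table, location) are the ones the existing Spark "snapshot" procedure takes; accepting an Iceberg table as the source is the proposed extension, not something it does today. The helper name build_snapshot_call and the catalog/table names are hypothetical and only for illustration.

```python
from typing import Optional


def build_snapshot_call(catalog: str, source_table: str, dest_table: str,
                        location: Optional[str] = None) -> str:
    """Build a Spark SQL CALL statement for the Iceberg snapshot procedure.

    Hypothetical helper: it only formats the statement and does not run it.
    """
    if source_table == dest_table:
        # The snapshot must be a separate table, never the source itself.
        raise ValueError("destination table must differ from the source table")
    args = [f"source_table => '{source_table}'", f"table => '{dest_table}'"]
    if location is not None:
        # Optional: put newly written files under a separate location.
        args.append(f"location => '{location}'")
    return f"CALL {catalog}.system.snapshot({', '.join(args)})"


# Example with made-up names; the result would be passed to spark.sql(...).
print(build_snapshot_call("spark_catalog", "db.events", "db.events_test"))
```

Passing an explicit location keeps the test table's new files apart from the source table, which is what makes auditing and cleanup simpler.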

On Tue, May 9, 2023 at 10:00 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

> How would Create Table Like be different from our "Snapshot" procedure,
> just enabled for Iceberg tables? Wondering if we should just expand that
> functionality.
>
> On Tue, May 9, 2023 at 11:54 AM Pucheng Yang <py...@pinterest.com.invalid>
> wrote:
>
>> Ryan, when I mentioned "copy of the data", I didn't mean to
>> physically copy the data. I meant a copy of the metadata and configuration
>> such that the created table can also read the data that belongs to the
>> table it was created from. However, I do share the concern that CREATE TABLE
>> LIKE, if we plan to follow what most systems do, will copy some important
>> configuration (such as gc.enabled) that I think we definitely don't want,
>> since it will create a surface for people to mess up the original table. In
>> this regard, I agree we should adopt the approach of having a procedure
>> instead. So I am dropping this CREATE TABLE LIKE feature request.
>>
>> Anton, branching will work, but I would still prefer creating a separate
>> table for these reasons: (1) I consider "branching" a very advanced/new
>> feature for my customers, and it is generally easy and safe to just let
>> them use a separate test table. (2) The newly generated data will be placed
>> under a separate location, making auditing and cleanup easier. (3) If we
>> use branching, there must be coordination between the user who is testing
>> via branching and the platform that is constantly performing table
>> maintenance, which introduces friction.
>>
>> On Thu, Apr 27, 2023 at 2:15 PM Anton Okolnychyi
>> <aokolnyc...@apple.com.invalid> wrote:
>>
>>> Iceberg supports branching so that you can safely perform such tests
>>> without any risk of corrupting the table. No need to create a separate
>>> table and clone the config. Overall, I don’t think it is a good idea to
>>> break the contract of CREATE TABLE LIKE.
>>>
>>> - Anton
>>>
>>> On Apr 27, 2023, at 11:59 AM, Pucheng Yang <py...@pinterest.com.INVALID>
>>> wrote:
>>>
>>> Hi Anton,
>>>
>>> Yes, I want to branch the table state and reuse the data files, but for
>>> test purposes only. Imagine we want to test something related to reading
>>> the Iceberg table or performing row-level updates.
>>>
>>> And I acknowledge the potential risk of the table state being corrupted.
>>> So I am thinking we can consider adding these limitations when running
>>> "create table like":
>>> (1) the created table should have "snapshot=true"
>>> (2) the created table should have "gc.enabled=false" to make sure
>>> existing files don't get messed up
>>> (3) the created table should have a table location different from the
>>> location of the existing Iceberg table it is created from
>>> We can consider "create table like" as a snapshot action for an existing
>>> Iceberg table, similar to the existing snapshot procedure we have for an
>>> existing Hive table.
>>>
>>> I know CREATE TABLE LIKE is supposed to reuse the existing table
>>> definition only. If we have concerns around messing up table state, I hope
>>> we can break the implementation down and at least first implement
>>> the part where we create tables without reusing the existing data files.
>>>
>>> On Wed, Apr 26, 2023 at 8:26 AM Anton Okolnychyi
>>> <aokolnyc...@apple.com.invalid> wrote:
>>>
>>>> Pucheng, you mentioned you want to reuse existing data in the new
>>>> table? Branching Iceberg table state can lead to unexpected situations, as
>>>> there will be multiple pointers in the catalog to the same state, which can
>>>> eventually corrupt the table. Isn’t CREATE TABLE LIKE supposed to just
>>>> reuse the existing table definition without copying the data?
>>>>
>>>> - Anton
>>>>
>>>> On Apr 26, 2023, at 5:41 AM, Zoltán Borók-Nagy <borokna...@apache.org>
>>>> wrote:
>>>>
>>>> As a reference, Impala can also do Hive-style CREATE TABLE x LIKE y for
>>>> Iceberg tables.
>>>> You can see various examples at
>>>> https://github.com/apache/impala/blob/master/testdata/workloads/functional-query/queries/QueryTest/iceberg-create-table-like-table.test
>>>>
>>>> - Zoltan
>>>>
>>>> On Wed, Apr 26, 2023 at 4:10 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> You should be able to see how other DSv2 commands are written and copy
>>>>> them. Look at Drop Table, maybe, and see if you can copy the structure, but
>>>>> instead of dropping, load the table and call createTable with its
>>>>> metadata.
>>>>>
>>>>> On Tue, Apr 25, 2023 at 4:42 PM Pucheng Yang
>>>>> <py...@pinterest.com.invalid> wrote:
>>>>>
>>>>>> Thanks Steve and Ryan for the reply.
>>>>>>
>>>>>> Steve, I am not looking for CTAS; my goal is to create an Iceberg
>>>>>> table and reuse the existing data (same as the create table like
>>>>>> statement above). Also, my question is not about specifying a location
>>>>>> in the create statement.
>>>>>>
>>>>>> Ryan, the engine we are interested in is SparkSQL. Since you
>>>>>> mentioned it is an easy fix, would you please share how that should be
>>>>>> implemented, so that anyone (maybe myself) interested in this can
>>>>>> explore the solution?
>>>>>>
>>>>>> Thanks both again.
>>>>>>
>>>>>> On Tue, Apr 25, 2023 at 4:07 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> Pucheng, what engine are you interested in?
>>>>>>>
>>>>>>> This works fine in Trino: CREATE TABLE table_copy (LIKE
>>>>>>> source_table INCLUDING PROPERTIES)
>>>>>>>
>>>>>>> I don’t know if it works in Hive, and last time I checked it was not
>>>>>>> implemented for DSv2 in Spark. The Spark problem should be an easy fix.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Tue, Apr 25, 2023 at 2:43 PM Steve Zhang
>>>>>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hey Pengcheng,
>>>>>>>>
>>>>>>>> Are you looking for CTAS as in
>>>>>>>> https://iceberg.apache.org/docs/latest/spark-ddl/#create-table--as-select?
>>>>>>>> I think you can also specify an explicit location as part of the
>>>>>>>> create statement, as in
>>>>>>>> https://iceberg.apache.org/docs/latest/spark-ddl/#create-table
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Steve Zhang
>>>>>>>>
>>>>>>>> On Apr 25, 2023, at 1:46 PM, Pucheng Yang
>>>>>>>> <py...@pinterest.com.INVALID> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I wonder how folks in the community deal with the case where you
>>>>>>>> want to create a test table from an existing Iceberg table. In Hive,
>>>>>>>> what we normally do is run a query "create table x like y location z".
>>>>>>>> But we can't do this for an Iceberg table.
>>>>>>>>
>>>>>>>> If this is a feature that is missing, should we collaborate to
>>>>>>>> build a similar feature?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular