Hi Ryan,

Expanding a bit more on the use case.
The main table (non-temporary) is dual-registered as both a Hive table and an Iceberg table. In the Hive case, the main table location is used as the table location, so any new files need to be added to this location atomically to prevent read failures. A temporary table (with a temp location) is used to write these new files, which are then renamed into the main table location and added to the main Iceberg table via the appendFiles API. That is why we are not using the SQL API; our goal is to reuse the files for performance reasons.

Vikram

________________________________
From: Ryan Blue <b...@tabular.io>
Sent: Sunday, August 21, 2022 12:06 PM
To: Walaa Eldin Moustafa <wmoust...@linkedin.com>
Cc: dev@iceberg.apache.org <dev@iceberg.apache.org>; Vikram Bohra <vbo...@linkedin.com>; Sudarshan Vasudevan <suvasude...@linkedin.com>
Subject: Re: Capability to create table without reassigning IDs

Can you expand on that a bit more? How is a table temporary if you intend to reuse its files in a different table? Is this something where you should be using `REPLACE TABLE ... AS SELECT` instead?

On Sun, Aug 21, 2022 at 10:20 AM Walaa Eldin Moustafa <wmoust...@linkedin.com> wrote:

Thanks Ryan! The use case is dropping a temporary table and reusing its files in a new table. I think temporary tables could be a common use case. In addition, I think reassigning field IDs makes it harder to reuse schemas, but does not prevent it. I think we can give users the option and let them reuse the IDs if they know what they are doing. Perhaps the default behavior can be to reassign, with an option to override it?

Thanks,
Walaa.
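The staging workflow described at the top of this thread (write new files to a temporary location, then rename them into the main table location so readers never see a partial file) can be sketched on a local filesystem as below. This is only an illustration under stated assumptions: the class name, directory names, and file name are hypothetical, it uses `java.nio.file` rather than whatever filesystem API the actual setup uses, and it omits the follow-up step of registering the file with the Iceberg table via appendFiles.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: stage a file in a temp location, then publish it atomically.
class AtomicPublishSketch {

    // Writes the contents under tempDir, then moves the finished file into mainDir.
    // ATOMIC_MOVE makes the move all-or-nothing; it throws
    // AtomicMoveNotSupportedException if the filesystem cannot do that
    // (for example, when the two directories live on different file stores).
    static Path publish(Path tempDir, Path mainDir, String fileName, byte[] contents)
            throws IOException {
        Path staged = tempDir.resolve(fileName);
        Files.write(staged, contents);
        Path target = mainDir.resolve(fileName);
        return Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        // Both directories sit under one root so they share a file store.
        Path root = Files.createTempDirectory("publish-sketch");
        Path tempDir = Files.createDirectory(root.resolve("temp_table"));
        Path mainDir = Files.createDirectory(root.resolve("main_table"));

        byte[] contents = "example rows".getBytes(StandardCharsets.UTF_8);
        Path published = publish(tempDir, mainDir, "part-00000.data", contents);

        System.out.println("published: " + published);
        System.out.println("still staged: " + Files.exists(tempDir.resolve("part-00000.data")));
    }
}
```

After `publish` returns, the file exists only under the main location, which is the point at which it would be safe to add it to the table metadata.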
________________________________
From: Ryan Blue <b...@tabular.io>
Sent: Sunday, August 21, 2022 9:38 AM
To: dev@iceberg.apache.org <dev@iceberg.apache.org>
Cc: Walaa Eldin Moustafa <wmoust...@linkedin.com>; Vikram Bohra <vbo...@linkedin.com>; Sudarshan Vasudevan <suvasude...@linkedin.com>
Subject: Re: Capability to create table without reassigning IDs

Hi Raymond,

One of the reasons why Iceberg doesn't currently support this is that it's dangerous to share files between tables. Even if you guarantee that a table has the same schema at some point in time, there's nothing stopping the table schemas from diverging later.

What are you trying to accomplish by creating a table with the same IDs? Are you migrating from one metastore to another? In that case, I'd recommend using `registerTable` instead.

Ryan

On Fri, Aug 19, 2022 at 2:22 PM Raymond Zhang <razh...@linkedin.com.invalid> wrote:

Hi there,

I'm Raymond from LinkedIn's big data platform org. I have a question about the capability to create a new table without assigning new IDs in the schema.

Currently, BaseMetastoreCatalog.create() calls the public TableMetadata.newTableMetadata(), which then calls the package-private newTableMetadata() method. The package-private newTableMetadata() method takes an Iceberg schema and always reassigns the IDs in the schema to get a freshSchema, which it uses to create the new TableMetadata. This means that, currently, the IDs are always reassigned when we create a table.

I wonder if we can expose an option to create a table using the input Iceberg schema as-is (without assigning fresh IDs to it). I have the following arguments to support this:

* It seems that when an Iceberg schema is created, the IDs are already guaranteed to be consistent from creation.
I tried to create a new Schema with duplicate IDs, and it fails at creation time, which means schema creation already takes care of ID consistency. So I wonder whether the ID reassignment step really adds value in making the schema consistent.
* From a user perspective, if we introduce this new capability, we will have a guaranteed way to create Iceberg tables with the IDs we specify. We will then be able to create Iceberg tables with identical schemas (with the same IDs), so their files can be reused across tables. A simple use case is that we can directly use the AppendFiles API to add files from one table to the other without worrying about ID discrepancies.

Let me know whether you think this would be beneficial, or whether I'm missing anything here.

Thanks,
Raymond

--
Ryan Blue
Tabular
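To make the ID question in this thread concrete, here is a toy Java model of why reassigned IDs matter when files are reused. It is a sketch, not Iceberg code: `Field`, `schemaOf`, `assignFreshIds`, and `resolve` are hypothetical names, the schema is flat, and sequential renumbering starting at 1 is an assumption made for illustration. It relies only on the fact that Iceberg resolves columns in data files by field ID rather than by name or position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: these types are hypothetical and are NOT Iceberg classes.
class FreshIdSketch {

    // A flat column with a field ID and a name (no nested types in this sketch).
    static final class Field {
        final int id;
        final String name;

        Field(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Builds a schema and rejects duplicate IDs, mirroring the observation
    // that a schema with duplicate IDs fails at creation time.
    static List<Field> schemaOf(Field... fields) {
        Set<Integer> seen = new HashSet<>();
        for (Field f : fields) {
            if (!seen.add(f.id)) {
                throw new IllegalArgumentException("Duplicate field id: " + f.id);
            }
        }
        return new ArrayList<>(Arrays.asList(fields));
    }

    // Renumbers fields sequentially from 1, ignoring the caller's IDs
    // (the numbering scheme is an assumption of this sketch).
    static List<Field> assignFreshIds(List<Field> schema) {
        List<Field> fresh = new ArrayList<>();
        int nextId = 1;
        for (Field f : schema) {
            fresh.add(new Field(nextId++, f.name));
        }
        return fresh;
    }

    // Looks up the column that a file's field ID maps to in a table schema;
    // returns null when the table has no field with that ID.
    static String resolve(List<Field> tableSchema, int fileFieldId) {
        for (Field f : tableSchema) {
            if (f.id == fileFieldId) {
                return f.name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // A source table whose IDs have gaps, e.g. after columns were dropped.
        List<Field> source = schemaOf(
                new Field(1, "id"), new Field(3, "payload"), new Field(5, "ts"));
        List<Field> fresh = assignFreshIds(source);

        // A file written by the source table stores "payload" under ID 3 and "ts" under ID 5.
        System.out.println(resolve(source, 3)); // payload
        System.out.println(resolve(fresh, 3));  // ts (wrong column for this file)
        System.out.println(resolve(fresh, 5));  // null (ID unknown to the new table)
    }
}
```

In this model a valid, duplicate-free input schema still ends up with different IDs after renumbering, so files written under the original IDs no longer line up with the new table. The sketch does not address the separate concern raised in the thread that two tables sharing IDs can still diverge later.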