Is the idea to keep supporting older systems that only support Hive tables? If so, I'm not sure why you'd write to a staging table. You could write directly to the final Iceberg table (or stage a commit) and then copy the data files to the Hive locations afterward. I would build this as a service that exposes a Hive layout for the table's current data files, rather than something that needs to control column IDs. Controlling column IDs seems like the harder approach.
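For illustration, the "copy the data files to Hive locations" step could look roughly like the sketch below. It rests on several assumptions: `HiveLayoutMirror`, `DataFileRef`, and `mirror` are hypothetical names, not Iceberg APIs, and the list of current data files would in practice come from the committed Iceberg snapshot (not shown here). It uses plain java.nio and assumes a filesystem where a rename within one directory is atomic.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: mirror a table's current data files into a
// Hive-style directory layout (key=value/...) after the Iceberg commit.
class HiveLayoutMirror {

    // Hypothetical stand-in for a data file entry: file path plus its
    // partition values in partition-spec order (use an ordered map).
    static final class DataFileRef {
        final Path path;
        final Map<String, String> partition;

        DataFileRef(Path path, Map<String, String> partition) {
            this.path = path;
            this.partition = partition;
        }
    }

    // Builds <hiveRoot>/<key>=<value>/.../<file name> for one data file.
    static Path hiveLocation(Path hiveRoot, DataFileRef file) {
        Path dir = hiveRoot;
        for (Map.Entry<String, String> e : file.partition.entrySet()) {
            dir = dir.resolve(e.getKey() + "=" + e.getValue());
        }
        return dir.resolve(file.path.getFileName());
    }

    // Copies each file that is not yet present. The copy goes to a
    // dot-prefixed temporary name first and is then renamed, so a reader
    // listing the directory should not see a partially written file.
    static List<Path> mirror(Path hiveRoot, List<DataFileRef> currentFiles)
            throws IOException {
        List<Path> copied = new ArrayList<>();
        for (DataFileRef file : currentFiles) {
            Path target = hiveLocation(hiveRoot, file);
            if (Files.exists(target)) {
                continue; // already mirrored by an earlier run
            }
            Files.createDirectories(target.getParent());
            Path tmp = target.resolveSibling("." + target.getFileName() + ".tmp");
            Files.copy(file.path, tmp, StandardCopyOption.REPLACE_EXISTING);
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
            copied.add(target);
        }
        return copied;
    }
}
```

Because the copy happens only after the Iceberg commit and skips files that already exist, rerunning it should be safe. Removing Hive-side files that are no longer part of the current snapshot is left out of this sketch.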

On Mon, Aug 22, 2022 at 12:14 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

> Hi Vikram,
>
> You might be able to use Hive to directly read/write the Iceberg tables.
> With the Hive 4.0.0 you can handle Iceberg tables as any other table, and
> it could work with older Hive versions with somewhat limited functionality.
>
> Could this help your use case?
>
> Thanks,
> Peter
>
> On Sun, Aug 21, 2022, 22:06 Vikram Bohra <vbo...@linkedin.com.invalid>
> wrote:
>
>> Hi Ryan
>>
>> Expanding a bit more on the use case.
>>
>> The main table (non-temporary) is dual registered as both Hive and
>> Iceberg tables. The main table location is used as the table location in
>> the Hive case. Any new files need to be atomically added to this location
>> to prevent read failures. A temporary table (with a temp location) is used
>> to write these new files which are then renamed to the main table location
>> and added to the main Iceberg table via the appendFiles API; hence we are
>> not using the SQL API, and our goal is to reuse the files for performance
>> reasons.
>>
>> Vikram
>> ------------------------------
>> *From:* Ryan Blue <b...@tabular.io>
>> *Sent:* Sunday, August 21, 2022 12:06 PM
>> *To:* Walaa Eldin Moustafa <wmoust...@linkedin.com>
>> *Cc:* dev@iceberg.apache.org <dev@iceberg.apache.org>; Vikram Bohra <
>> vbo...@linkedin.com>; Sudarshan Vasudevan <suvasude...@linkedin.com>
>> *Subject:* Re: Capability to create table without reassigning IDs
>>
>> Can you expand on that a bit more? How is a table temporary if you intend
>> to reuse its files in a different table? Is this something where you should
>> be using `REPLACE TABLE ... AS SELECT` instead?
>>
>> On Sun, Aug 21, 2022 at 10:20 AM Walaa Eldin Moustafa <
>> wmoust...@linkedin.com> wrote:
>>
>> Thanks Ryan! The use case is dropping a temporary table and reusing its
>> files in a new table. I think temporary tables could be a common use case.
>> In addition, I think reassigning field IDs makes it harder to reuse
>> schemas, but does not prevent it. I think, we can give the users the option
>> and let them reuse the IDs if they know what they are doing. Probably the
>> default behavior can be to reassign, but optionally this can be overridden?
>>
>> Thanks,
>> Walaa.
>>
>> ------------------------------
>> *From:* Ryan Blue <b...@tabular.io>
>> *Sent:* Sunday, August 21, 2022 9:38 AM
>> *To:* dev@iceberg.apache.org <dev@iceberg.apache.org>
>> *Cc:* Walaa Eldin Moustafa <wmoust...@linkedin.com>; Vikram Bohra <
>> vbo...@linkedin.com>; Sudarshan Vasudevan <suvasude...@linkedin.com>
>> *Subject:* Re: Capability to create table without reassigning IDs
>>
>> Hi Raymond,
>>
>> One of the reasons why Iceberg doesn't currently support this is that
>> it's dangerous to share files between tables. Even if you guarantee that a
>> table has the same schema at some point in time, there's nothing stopping
>> table schemas from diverging later. What are you trying to accomplish by
>> creating a table with the same IDs? Are you migrating from one metastore to
>> another? In that case, I'd recommend using `registerTable` instead.
>>
>> Ryan
>>
>> On Fri, Aug 19, 2022 at 2:22 PM Raymond Zhang
>> <razh...@linkedin.com.invalid> wrote:
>>
>> Hi there,
>>
>> I’m Raymond from LinkedIn big data platform org.
>>
>> I have a question regarding the capability to create a new table without
>> assigning new IDs in the schema. Currently, BaseMetastoreCatalog.create()
>> calls the public TableMetadata.newTableMetadata() which then calls the
>> package-private newTableMetadata() method. The package-private
>> newTableMetadata() method takes in an Iceberg schema and always reassigns
>> the ids in the schema to get a freshSchema and use that for creating the
>> new TableMetadata. This means, currently when we create a table, the IDs
>> will always be reassigned.
>>
>> I wonder if we can expose a possibility to create a table using the input
>> Iceberg schema as-is (without freshly assigning ids to it). I have the
>> following arguments to support this:
>>
>>    - It seems when an Iceberg schema is created, it’s already guaranteed
>>    that the ids are consistent from creation. I tried to create a new Schema
>>    with duplicate ids, and it fails at creation time, this means the creation
>>    already takes care of ID consistency. So, I wonder if that reassign id
>>    step really adds value to making the schema consistent.
>>    - From a user perspective, if we introduce this new capability, we
>>    will have a guaranteed way to create Iceberg tables with the ids we
>>    specify. We then will be able to create Iceberg tables with identical
>>    schema (of same ids), and thus their files can be reused between each
>>    other. A simple use case is that we can directly use AppendFiles API to
>>    add files from one table to the other without worrying about their ID
>>    discrepancies.
>>
>> Let me know how you think this might be beneficial, or I’m missing
>> anything here?
>>
>> Thanks,
>>
>> Raymond
>>
>> --
>> Ryan Blue
>> Tabular
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

--
Ryan Blue
Tabular