Re: Capability to create table without reassigning IDs

Vikram Bohra Sun, 21 Aug 2022 13:06:19 -0700

Hi Ryan

Expanding a bit more on the use case.


The main table (non-temporary) is dual registered as both Hive and Iceberg 
tables. The main table location is used as the table location in the Hive case. 
Any new files need to be atomically added to this location to prevent read 
failures. A temporary table (with a temp location) is used to write these new 
files which are then renamed to the main table location and added to the main 
Iceberg table via the appendFiles API; hence we are not using the SQL API, and 
our goal is to reuse the files for performance reasons.
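
The write-then-rename step described above could be sketched roughly as follows. This is a minimal illustration on a local filesystem using only the JDK, not our actual job code: the class name, directory names, and file name are hypothetical, and the follow-up append of the renamed file to the main Iceberg table is not shown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicPublishSketch {
    // Moves a fully written file from the temporary location into the main
    // table location with a single atomic rename, so a reader listing the
    // main location sees either the whole file or nothing.
    static Path publish(Path stagedFile, Path mainLocation) {
        try {
            Files.createDirectories(mainLocation);
            Path target = mainLocation.resolve(stagedFile.getFileName());
            return Files.move(stagedFile, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical end-to-end run in a scratch directory; returns true when
    // the file exists only under the main location afterwards.
    static boolean demo() {
        try {
            Path root = Files.createTempDirectory("table-demo");
            Path tempLocation = Files.createDirectories(root.resolve("temp_table"));
            Path mainLocation = root.resolve("main_table");
            Path staged = Files.write(
                tempLocation.resolve("part-00000.parquet"), new byte[] {1, 2, 3});
            Path published = publish(staged, mainLocation);
            return Files.exists(published) && !Files.exists(staged);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the file is moved rather than rewritten, whatever field IDs were recorded in it when it was written to the temporary table are carried over unchanged.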

Vikram
________________________________
From: Ryan Blue <b...@tabular.io>
Sent: Sunday, August 21, 2022 12:06 PM
To: Walaa Eldin Moustafa <wmoust...@linkedin.com>
Cc: dev@iceberg.apache.org <dev@iceberg.apache.org>; Vikram Bohra 
<vbo...@linkedin.com>; Sudarshan Vasudevan <suvasude...@linkedin.com>
Subject: Re: Capability to create table without reassigning IDs

Can you expand on that a bit more? How is a table temporary if you intend to 
reuse its files in a different table? Is this something where you should be 
using `REPLACE TABLE ... AS SELECT` instead?

On Sun, Aug 21, 2022 at 10:20 AM Walaa Eldin Moustafa 
<wmoust...@linkedin.com> wrote:
Thanks Ryan! The use case is dropping a temporary table and reusing its files 
in a new table. I think temporary tables could be a common use case. In 
addition, I think reassigning field IDs makes it harder to reuse schemas, but 
does not prevent it. I think we can give the users the option and let them 
reuse the IDs if they know what they are doing. Probably the default behavior 
can be to reassign, but optionally this can be overridden?

Thanks,
Walaa.

________________________________
From: Ryan Blue <b...@tabular.io>
Sent: Sunday, August 21, 2022 9:38 AM
To: dev@iceberg.apache.org <dev@iceberg.apache.org>
Cc: Walaa Eldin Moustafa <wmoust...@linkedin.com>; Vikram Bohra 
<vbo...@linkedin.com>; Sudarshan Vasudevan <suvasude...@linkedin.com>
Subject: Re: Capability to create table without reassigning IDs

Hi Raymond,

One of the reasons why Iceberg doesn't currently support this is that it's 
dangerous to share files between tables. Even if you guarantee that a table has 
the same schema at some point in time, there's nothing stopping table schemas 
from diverging later. What are you trying to accomplish by creating a table 
with the same IDs? Are you migrating from one metastore to another? In that 
case, I'd recommend using `registerTable` instead.

Ryan

On Fri, Aug 19, 2022 at 2:22 PM Raymond Zhang <razh...@linkedin.com.invalid> 
wrote:

Hi there,



I’m Raymond from LinkedIn's big data platform org.



I have a question regarding the capability to create a new table without 
assigning new IDs in the schema. Currently, BaseMetastoreCatalog.create() calls 
the public TableMetadata.newTableMetadata() which then calls the 
package-private newTableMetadata() method. The package-private 
newTableMetadata() method takes in an Iceberg schema and always reassigns the 
ids in the schema to get a freshSchema, which it then uses to create the new 
TableMetadata. This means that, currently, when we create a table, the IDs will 
always be reassigned.
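
To make the effect concrete, here is a simplified sketch of what reassignment does to a flat schema. It is a toy model using only the JDK and is not Iceberg's implementation: the class, the method, and the representation of a schema as a map from column name to field ID are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FreshIdSketch {
    // Returns a copy of a flat schema (column name -> field ID) in which the
    // IDs are renumbered 1..n in column order, ignoring the IDs passed in.
    static Map<String, Integer> assignFreshIds(Map<String, Integer> schema) {
        Map<String, Integer> fresh = new LinkedHashMap<>();
        int lastId = 0;
        for (String name : schema.keySet()) {
            fresh.put(name, ++lastId);
        }
        return fresh;
    }

    public static void main(String[] args) {
        // Hypothetical source schema whose IDs have gaps, for example
        // because columns were dropped over the table's lifetime.
        Map<String, Integer> input = new LinkedHashMap<>();
        input.put("id", 1);
        input.put("name", 5);
        input.put("ts", 6);

        System.out.println(assignFreshIds(input)); // prints {id=1, name=2, ts=3}
    }
}
```

In this model the new table ends up with IDs 1, 2, 3 even though the caller supplied 1, 5, 6, which is the behavior the question is about.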



I wonder if we can expose an option to create a table using the input 
Iceberg schema as-is (without freshly assigning ids to it). I have the 
following arguments to support this:



  *   It seems that when an Iceberg schema is created, it’s already guaranteed 
that the ids are consistent from creation. I tried to create a new Schema with 
duplicate ids, and it fails at creation time, which means that creation already 
takes care of ID consistency. So I wonder whether the reassign-id step really 
adds value in making the schema consistent.
  *   From a user perspective, if we introduce this new capability, we will 
have a guaranteed way to create Iceberg tables with the ids we specify. We will 
then be able to create Iceberg tables with identical schemas (with the same 
ids), so their files can be reused across tables. A simple use case is that we 
can directly use the AppendFiles API to add files from one table to the other 
without worrying about ID discrepancies.
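
The ID discrepancy concern in the second point can be illustrated with a small toy model. It assumes that columns in a data file are resolved by field ID and that an ID missing from the file reads as null; the class, the method, and the map-based representation of rows and schemas are hypothetical and are not Iceberg code.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldIdReadSketch {
    // Projects one stored row (field ID -> value) through a table schema
    // (column name -> field ID). A column whose ID is absent reads as null.
    static Map<String, Object> read(Map<Integer, Object> storedRow,
                                    Map<String, Integer> tableSchema) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> column : tableSchema.entrySet()) {
            row.put(column.getKey(), storedRow.get(column.getValue()));
        }
        return row;
    }

    public static void main(String[] args) {
        // A row as written by the source table, whose schema is id=1, name=5.
        Map<Integer, Object> storedRow = new HashMap<>();
        storedRow.put(1, 42L);
        storedRow.put(5, "alice");

        // Target table created with the same IDs as the source table.
        Map<String, Integer> sameIds = new LinkedHashMap<>();
        sameIds.put("id", 1);
        sameIds.put("name", 5);

        // Target table whose IDs were reassigned at creation.
        Map<String, Integer> reassignedIds = new LinkedHashMap<>();
        reassignedIds.put("id", 1);
        reassignedIds.put("name", 2);

        System.out.println(read(storedRow, sameIds));       // prints {id=42, name=alice}
        System.out.println(read(storedRow, reassignedIds)); // prints {id=42, name=null}
    }
}
```

With matching IDs the appended file reads back correctly. With reassigned IDs, the "name" column no longer lines up with the data in the file.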



Let me know whether you think this would be beneficial, or whether I’m missing 
anything here.



Thanks,

Raymond


--
Ryan Blue
Tabular

