Re: Capability to create table without reassigning IDs

Ryan Blue Sun, 21 Aug 2022 09:39:35 -0700

Hi Raymond,

One of the reasons why Iceberg doesn't currently support this is that it's
dangerous to share files between tables. Even if you guarantee that a table
has the same schema at some point in time, there's nothing stopping table
schemas from diverging later. What are you trying to accomplish by creating
a table with the same IDs? Are you migrating from one metastore to another?
In that case, I'd recommend using `registerTable` instead.
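
To make the divergence risk above concrete, here is a minimal toy sketch. It is not Iceberg code: `ToyTable` and `project` are hypothetical names invented for illustration, and the only assumptions are that columns are resolved by field ID and that each table hands out its next unused ID independently. Two tables start with identical IDs, each adds a different column, and a file written by one is then read through the other's schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SharedFileIdCollision {

    /** Hypothetical toy table: maps field ID -> column name and assigns the next unused ID. */
    static final class ToyTable {
        final Map<Integer, String> columnsById = new LinkedHashMap<>();
        int lastAssignedId = 0;

        ToyTable(String... columns) {
            for (String column : columns) {
                addColumn(column);
            }
        }

        /** Adds a column and returns the field ID this table assigned to it. */
        int addColumn(String name) {
            lastAssignedId++;
            columnsById.put(lastAssignedId, name);
            return lastAssignedId;
        }
    }

    /** Reads one toy file row (field ID -> value) through a table's schema, matching by ID. */
    static Map<String, Object> project(ToyTable table, Map<Integer, Object> fileRow) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> column : table.columnsById.entrySet()) {
            row.put(column.getValue(), fileRow.get(column.getKey()));
        }
        return row;
    }

    public static void main(String[] args) {
        // Both tables start out with identical schemas and identical IDs (1, 2).
        ToyTable source = new ToyTable("id", "name");
        ToyTable target = new ToyTable("id", "name");

        // The schemas diverge later: each table independently assigns ID 3.
        int emailId = source.addColumn("email");
        target.addColumn("age");

        // A row from a data file written by the source table, keyed by field ID.
        Map<Integer, Object> fileRow = new LinkedHashMap<>();
        fileRow.put(1, 42L);
        fileRow.put(2, "Ada");
        fileRow.put(emailId, "ada@example.com");

        // Appending that file to the target silently exposes email data as "age".
        System.out.println(project(target, fileRow));
        // prints {id=42, name=Ada, age=ada@example.com}
    }
}
```

Nothing in this sketch fails loudly; the target table simply returns the wrong data under a valid column name, which is what makes shared files risky once schemas can evolve separately.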


Ryan

On Fri, Aug 19, 2022 at 2:22 PM Raymond Zhang <razh...@linkedin.com.invalid>
wrote:

> Hi there,
>
>
>
> I’m Raymond from LinkedIn big data platform org.
>
>
>
> I have a question regarding the capability to create a new table without
> assigning new IDs in the schema. Currently, BaseMetastoreCatalog.create()
> calls the public TableMetadata.newTableMetadata() which then calls the
> package-private newTableMetadata() method. The package-private
> newTableMetadata() method takes in an Iceberg schema and always reassigns
> the IDs in the schema to get a freshSchema, then uses that to create the
> new TableMetadata. This means that, currently, whenever we create a table,
> the IDs are always reassigned.
>
>
>
> I wonder if we can expose an option to create a table using the input
> Iceberg schema as-is (without assigning fresh IDs to it). I have the
> following arguments to support this:
>
>
>
>    - It seems that when an Iceberg schema is created, its IDs are already
>    guaranteed to be consistent. I tried to create a new Schema with
>    duplicate IDs, and it fails at creation time, which means creation
>    already takes care of ID consistency. So I wonder whether the ID
>    reassignment step really adds value in making the schema consistent.
>    - From a user perspective, if we introduce this new capability, we
>    will have a guaranteed way to create Iceberg tables with the IDs we
>    specify. We will then be able to create Iceberg tables with identical
>    schemas (with the same IDs), so their files can be reused across tables.
>    A simple use case is directly using the AppendFiles API to add files
>    from one table to another without worrying about ID discrepancies.
>
>
>
> Let me know whether you think this would be beneficial, or whether I’m
> missing anything here.
>
>
>
> Thanks,
>
> Raymond
>


-- 
Ryan Blue
Tabular
