Re: Capability to create table without reassigning IDs

Ryan Blue Mon, 22 Aug 2022 12:20:51 -0700

Is the idea to be able to use older systems that only support Hive tables?
If so, I'm not sure why you'd write to a staging table. You could write
directly to the final Iceberg table (or stage a commit) and then copy the
data files to Hive locations after that. I would build this as a service
that exposes a Hive layout for the current data files, rather than
something that needs to control column IDs, which seems harder.
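
A minimal sketch of the reconciliation step such a service might run, assuming it can list the file names referenced by the table's current snapshot and those already present in the Hive location. The class and method names are hypothetical, and nothing here calls Iceberg or Hive:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: files are plain names; nothing here calls Iceberg or Hive.
class HiveLayoutPlan {

    // Files the current Iceberg snapshot references that the Hive location lacks.
    static Set<String> toCopy(Set<String> currentDataFiles, Set<String> hiveFiles) {
        Set<String> missing = new TreeSet<>(currentDataFiles);
        missing.removeAll(hiveFiles);
        return missing;
    }

    // Files in the Hive location that the current snapshot no longer references.
    static Set<String> toRemove(Set<String> currentDataFiles, Set<String> hiveFiles) {
        Set<String> stale = new TreeSet<>(hiveFiles);
        stale.removeAll(currentDataFiles);
        return stale;
    }

    public static void main(String[] args) {
        Set<String> current = Set.of("a.parquet", "b.parquet", "c.parquet");
        Set<String> hive = Set.of("a.parquet", "old.parquet");
        System.out.println("copy:   " + toCopy(current, hive));
        System.out.println("remove: " + toRemove(current, hive));
    }
}
```

Because the plan is derived only from the committed file list, the Iceberg table stays the source of truth and the Hive location is just a mirror of it.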


On Mon, Aug 22, 2022 at 12:14 PM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> Hi Vikram,
>
> You might be able to use Hive to read and write the Iceberg tables directly.
> With Hive 4.0.0 you can handle Iceberg tables like any other table, and
> older Hive versions could work with somewhat limited functionality.
>
> Could this help your use case?
>
> Thanks,
> Peter
>
> On Sun, Aug 21, 2022, 22:06 Vikram Bohra <vbo...@linkedin.com.invalid>
> wrote:
>
>> Hi Ryan
>>
>> Expanding a bit more on the use case.
>>
>> The main table (non-temporary) is dual registered as both Hive and
>> Iceberg tables. The main table location is used as the table location in
>> the Hive case. Any new files need to be atomically added to this location
>> to prevent read failures. A temporary table (with a temp location) is used
>> to write these new files which are then renamed to the main table location
>> and added to the main Iceberg table via the appendFiles API; hence we are
>> not using the SQL API, and our goal is to reuse the files for performance
>> reasons.
>>
>> Vikram
>> ------------------------------
>> *From:* Ryan Blue <b...@tabular.io>
>> *Sent:* Sunday, August 21, 2022 12:06 PM
>> *To:* Walaa Eldin Moustafa <wmoust...@linkedin.com>
>> *Cc:* dev@iceberg.apache.org <dev@iceberg.apache.org>; Vikram Bohra <
>> vbo...@linkedin.com>; Sudarshan Vasudevan <suvasude...@linkedin.com>
>> *Subject:* Re: Capability to create table without reassigning IDs
>>
>> Can you expand on that a bit more? How is a table temporary if you intend
>> to reuse its files in a different table? Is this something where you should
>> be using `REPLACE TABLE ... AS SELECT` instead?
>>
>> On Sun, Aug 21, 2022 at 10:20 AM Walaa Eldin Moustafa <
>> wmoust...@linkedin.com> wrote:
>>
>> Thanks Ryan! The use case is dropping a temporary table and reusing its
>> files in a new table. I think temporary tables could be a common use case.
>> In addition, I think reassigning field IDs makes it harder to reuse
>> schemas, but does not prevent it. I think we can give users the option
>> and let them reuse the IDs if they know what they are doing. Perhaps the
>> default behavior can be to reassign, with an option to override it?
>>
>> Thanks,
>> Walaa.
>>
>> ------------------------------
>> *From:* Ryan Blue <b...@tabular.io>
>> *Sent:* Sunday, August 21, 2022 9:38 AM
>> *To:* dev@iceberg.apache.org <dev@iceberg.apache.org>
>> *Cc:* Walaa Eldin Moustafa <wmoust...@linkedin.com>; Vikram Bohra <
>> vbo...@linkedin.com>; Sudarshan Vasudevan <suvasude...@linkedin.com>
>> *Subject:* Re: Capability to create table without reassigning IDs
>>
>> Hi Raymond,
>>
>> One of the reasons why Iceberg doesn't currently support this is that
>> it's dangerous to share files between tables. Even if you guarantee that a
>> table has the same schema at some point in time, there's nothing stopping
>> table schemas from diverging later. What are you trying to accomplish by
>> creating a table with the same IDs? Are you migrating from one metastore to
>> another? In that case, I'd recommend using `registerTable` instead.
>>
>> Ryan
>>
>> On Fri, Aug 19, 2022 at 2:22 PM Raymond Zhang
>> <razh...@linkedin.com.invalid> wrote:
>>
>> Hi there,
>>
>> I’m Raymond from the LinkedIn big data platform org.
>>
>> I have a question regarding the capability to create a new table without
>> assigning new IDs in the schema. Currently, BaseMetastoreCatalog.create()
>> calls the public TableMetadata.newTableMetadata() which then calls the
>> package-private newTableMetadata() method. The package-private
>> newTableMetadata() method takes in an Iceberg schema and always reassigns
>> the ids in the schema to get a freshSchema and use that for creating the
>> new TableMetadata. This means, currently when we create a table, the IDs
>> will always be reassigned.
>>
>> I wonder if we can expose an option to create a table using the input
>> Iceberg schema as-is (without freshly assigning IDs to it). I have the
>> following arguments to support this:
>>
>>    - It seems that when an Iceberg schema is created, its IDs are already
>>    guaranteed to be consistent. I tried to create a new Schema with
>>    duplicate IDs, and it fails at creation time, which means that creation
>>    already takes care of ID consistency. So I wonder whether the ID
>>    reassignment step really adds value in making the schema consistent.
>>    - From a user perspective, this new capability would give us a
>>    guaranteed way to create Iceberg tables with the IDs we specify. We
>>    could then create Iceberg tables with identical schemas (with the same
>>    IDs), so their files could be reused across tables. A simple use case
>>    is that we could use the AppendFiles API directly to add files from one
>>    table to another without worrying about ID discrepancies.
>>
>> Let me know whether you think this would be beneficial, or whether I’m
>> missing anything here.
>>
>> Thanks,
>> Raymond
>>
>> --
>> Ryan Blue
>> Tabular
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
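
For readers following the thread's original question, here is a toy Java sketch of why reassigned IDs matter when data files are shared between tables. It does not use the Iceberg API: a schema is modeled as an ordered map from field ID to column name, `assignFreshIds` and `resolveById` are hypothetical helpers, and renumbering from 1 in field order is an assumption of this model.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only, not the Iceberg API: a schema is an ordered map of field ID -> column name.
class FieldIdSketch {

    // Hypothetical stand-in for ID reassignment at table creation:
    // renumbers fields 1..n in their existing order (an assumption of this sketch).
    static LinkedHashMap<Integer, String> assignFreshIds(Map<Integer, String> schema) {
        LinkedHashMap<Integer, String> fresh = new LinkedHashMap<>();
        int nextId = 1;
        for (String name : schema.values()) {
            fresh.put(nextId++, name);
        }
        return fresh;
    }

    // Resolves table columns against a data file's schema by field ID.
    // Maps each table column name to the matching file column name,
    // or to null when the file has no field with that ID.
    static Map<String, String> resolveById(Map<Integer, String> tableSchema,
                                           Map<Integer, String> fileSchema) {
        Map<String, String> resolved = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> e : tableSchema.entrySet()) {
            resolved.put(e.getValue(), fileSchema.get(e.getKey()));
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Schema of the table that wrote the files; IDs 5 and 7 are arbitrary examples.
        LinkedHashMap<Integer, String> writerSchema = new LinkedHashMap<>();
        writerSchema.put(5, "id");
        writerSchema.put(7, "data");

        // New table created with reassigned IDs: files written under 5/7 no longer line up.
        System.out.println(resolveById(assignFreshIds(writerSchema), writerSchema));
        // New table created with the IDs kept as-is: every column resolves.
        System.out.println(resolveById(writerSchema, writerSchema));
    }
}
```

In this model the first lookup finds no matching file columns, while the second resolves both. The sketch covers only the moment of creation; it says nothing about the two schemas diverging later, which is the risk raised earlier in the thread.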

-- 
Ryan Blue
Tabular
