Re: [VOTE] Update row lineage spec ID assignment

Steven Wu Thu, 17 Apr 2025 11:20:26 -0700

+1 (binding)

On Thu, Apr 17, 2025 at 11:09 AM Amogh Jahagirdar <[email protected]> wrote:


> +1 (binding)
>
> On Thu, Apr 17, 2025 at 11:54 AM Szehon Ho <[email protected]>
> wrote:
>
>> +1 (binding)  Seems cleaner to me.
>>
>> Thanks
>> Szehon
>>
>> On Thu, Apr 17, 2025 at 10:31 AM Russell Spitzer <
>> [email protected]> wrote:
>>
>>> +1
>>>
>>> On Thu, Apr 17, 2025 at 12:30 PM Ryan Blue <[email protected]> wrote:
>>>
>>>> Adding my own +1.
>>>>
>>>> On Thu, Apr 17, 2025 at 10:19 AM Daniel Weeks <[email protected]>
>>>> wrote:
>>>>
>>>>> +1 (binding)
>>>>>
>>>>> I think this update really helps ensure row ids will be present and
>>>>> reliable for upgraded tables.  Thanks Ryan!
>>>>>
>>>>> On Wed, Apr 16, 2025 at 4:09 PM Ryan Blue <[email protected]> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I’d like to start a vote to incorporate the spec changes in PR 12781
>>>>>> <https://github.com/apache/iceberg/pull/12781>.
>>>>>>
>>>>>> There are two main changes. First, the current language says that
>>>>>> upgrading a table to v3 leaves all row IDs null and they are assigned 
>>>>>> when
>>>>>> the rows are rewritten for the first time (either to move or modify the
>>>>>> row). The problem with this is that row IDs are missing until the entire
>>>>>> table is rewritten, which means that the feature is unreliable. Instead, 
>>>>>> I
>>>>>> propose that row IDs are assigned in the first write after upgrading to 
>>>>>> v3.
>>>>>>
>>>>>> In addition to making row IDs more useful, the change to how we
>>>>>> upgrade tables allows us to simplify the spec with statements like “any
>>>>>> added or existing data file without first_row_id should be assigned
>>>>>> one via inheritance” and “any manifest without a first_row_id must
>>>>>> be assigned one when writing a manifest list”. I think this sets clearer
>>>>>> expectations.
>>>>>>
>>>>>> Second, I found some issues with the strict way that first_row_id is
>>>>>> inherited and assigned in the metadata tree. The current wording would
>>>>>> prevent writers from assigning row IDs to existing data files because
>>>>>> assignment was strict and only accounted for added files. Instead, I
>>>>>> propose changing the wording to “must be greater than or equal to” so 
>>>>>> that
>>>>>> there is some flexibility, and giving simple examples that are safe, 
>>>>>> like first_row_id
>>>>>> = last_assigned.first_row_id + last_assigned.added_rows_count +
>>>>>> last_assigned.existing_rows_count.
>>>>>>
>>>>>> Please take a look at the PR and vote in the next 72 hours.
>>>>>>
>>>>>> [ ] +1 Add these changes to the spec for v3 row lineage
>>>>>> [ ] +0
>>>>>> [ ] -1 I have questions and/or concerns
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>

Re: [VOTE] Update row lineage spec ID assignment

Reply via email to