+ (non-binding). On Thu, Apr 17, 2025 at 11:22 AM Steven Wu <stevenz...@gmail.com> wrote:
> +1 (binding) > > On Thu, Apr 17, 2025 at 11:09 AM Amogh Jahagirdar <2am...@gmail.com> > wrote: > >> +1 (binding) >> >> On Thu, Apr 17, 2025 at 11:54 AM Szehon Ho <szehon.apa...@gmail.com> >> wrote: >> >>> +1 (binding) Seems cleaner to me. >>> >>> Thanks >>> Szehon >>> >>> On Thu, Apr 17, 2025 at 10:31 AM Russell Spitzer < >>> russell.spit...@gmail.com> wrote: >>> >>>> +1 >>>> >>>> On Thu, Apr 17, 2025 at 12:30 PM Ryan Blue <rdb...@gmail.com> wrote: >>>> >>>>> Adding my own +1. >>>>> >>>>> On Thu, Apr 17, 2025 at 10:19 AM Daniel Weeks <dwe...@apache.org> >>>>> wrote: >>>>> >>>>>> +1 (binding) >>>>>> >>>>>> I think this update really helps ensure row ids will be present and >>>>>> reliable for upgraded tables. Thanks Ryan! >>>>>> >>>>>> On Wed, Apr 16, 2025 at 4:09 PM Ryan Blue <rdb...@gmail.com> wrote: >>>>>> >>>>>>> Hi everyone, >>>>>>> >>>>>>> I’d like to start a vote to incorporate the spec changes in PR 12781 >>>>>>> <https://github.com/apache/iceberg/pull/12781>. >>>>>>> >>>>>>> There are two main changes. First, the current language says that >>>>>>> upgrading a table to v3 leaves all row IDs null and they are assigned >>>>>>> when >>>>>>> the rows are rewritten for the first time (either to move or modify the >>>>>>> row). The problem with this is that row IDs are missing until the entire >>>>>>> table is rewritten, which means that the feature is unreliable. >>>>>>> Instead, I >>>>>>> propose that row IDs are assigned in the first write after upgrading to >>>>>>> v3. >>>>>>> >>>>>>> In addition to making row IDs more useful, the change to how we >>>>>>> upgrade tables allows us to simplify the spec with statements like “any >>>>>>> added or existing data file without first_row_id should be assigned >>>>>>> one via inheritance” and “any manifest without a first_row_id must >>>>>>> be assigned one when writing a manifest list”. I think this sets clearer >>>>>>> expectations. >>>>>>> >>>>>>> Second, I found some issues with the strict way that first_row_id >>>>>>> is inherited and assigned in the metadata tree. The current wording >>>>>>> would >>>>>>> prevent writers from assigning row IDs to existing data files because >>>>>>> assignment was strict and only accounted for added files. Instead, I >>>>>>> propose changing the wording to “must be greater than or equal to” so >>>>>>> that >>>>>>> there is some flexibility, and giving simple examples that are safe, >>>>>>> like first_row_id >>>>>>> = last_assigned.first_row_id + last_assigned.added_rows_count + >>>>>>> last_assigned.existing_rows_count. >>>>>>> >>>>>>> Please take a look at the PR and vote in the next 72 hours. >>>>>>> >>>>>>> [ ] +1 Add these changes to the spec for v3 row lineage >>>>>>> [ ] +0 >>>>>>> [ ] -1 I have questions and/or concerns >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Ryan >>>>>>> >>>>>>