Hi everyone, I’d like to start a vote to incorporate the spec changes in PR 12781 <https://github.com/apache/iceberg/pull/12781>.
There are two main changes. First, the current language says that upgrading a table to v3 leaves all row IDs null and they are assigned when the rows are rewritten for the first time (either to move or modify the row). The problem with this is that row IDs are missing until the entire table is rewritten, which means that the feature is unreliable. Instead, I propose that row IDs are assigned in the first write after upgrading to v3. In addition to making row IDs more useful, the change to how we upgrade tables allows us to simplify the spec with statements like “any added or existing data file without first_row_id should be assigned one via inheritance” and “any manifest without a first_row_id must be assigned one when writing a manifest list”. I think this sets clearer expectations. Second, I found some issues with the strict way that first_row_id is inherited and assigned in the metadata tree. The current wording would prevent writers from assigning row IDs to existing data files because assignment was strict and only accounted for added files. Instead, I propose changing the wording to “must be greater than or equal to” so that there is some flexibility, and giving simple examples that are safe, like first_row_id = last_assigned.first_row_id + last_assigned.added_rows_count + last_assigned.existing_rows_count. Please take a look at the PR and vote in the next 72 hours. [ ] +1 Add these changes to the spec for v3 row lineage [ ] +0 [ ] -1 I have questions and/or concerns Thanks, Ryan