Hi everyone,

I’d like to start a vote to incorporate the spec changes in PR 12781
<https://github.com/apache/iceberg/pull/12781>.

There are two main changes. First, the current language says that upgrading
a table to v3 leaves all row IDs null, and that they are assigned when the
rows are rewritten for the first time (either to move or modify the row).
The problem is that row IDs stay missing until the entire table has been
rewritten, which makes the feature unreliable. Instead, I propose that row
IDs are assigned in the first write after upgrading to v3.

In addition to making row IDs more useful, the change to how we upgrade
tables allows us to simplify the spec with statements like “any added or
existing data file without first_row_id should be assigned one via
inheritance” and “any manifest without a first_row_id must be assigned one
when writing a manifest list”. I think this sets clearer expectations.
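
To illustrate the first quoted statement, here is a minimal sketch of
inheritance for data files inside one manifest. The DataFile class and
inherit_first_row_ids function are hypothetical names used only for
illustration, and the rule that each file without a first_row_id takes the
manifest's first_row_id plus the record counts of the preceding files that
also lacked one is my reading of the intent, not the spec text itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataFile:
    # Hypothetical, simplified stand-in for a manifest entry.
    record_count: int
    first_row_id: Optional[int] = None


def inherit_first_row_ids(manifest_first_row_id: int,
                          files: List[DataFile]) -> List[DataFile]:
    """Assign a first_row_id to every file that does not have one yet.

    Assumption: IDs are handed out in manifest order, starting at the
    manifest's first_row_id, and files that already carry a first_row_id
    are left untouched and do not consume any of the range.
    """
    next_id = manifest_first_row_id
    for f in files:
        if f.first_row_id is None:
            f.first_row_id = next_id
            next_id += f.record_count
    return files


files = inherit_first_row_ids(
    50, [DataFile(4), DataFile(6, first_row_id=0), DataFile(2)]
)
```

In this sketch the first and third files receive 50 and 54, and the second
file keeps the ID it already had.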

Second, I found some issues with the strict way that first_row_id is
inherited and assigned in the metadata tree. The current wording would
prevent writers from assigning row IDs to existing data files because
assignment is strict and only accounts for added files. Instead, I propose
changing the wording to “must be greater than or equal to” so that there is
some flexibility, and giving simple examples that are safe, like:

  first_row_id = last_assigned.first_row_id
               + last_assigned.added_rows_count
               + last_assigned.existing_rows_count
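
As a rough sketch of how the example formula above could play out when
writing a manifest list (the Manifest class, assign_manifest_row_ids, and
the next_row_id parameter are hypothetical names for illustration, not
Iceberg APIs), each manifest without a first_row_id takes the running
value, which then advances by both its added and existing row counts:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Manifest:
    # Hypothetical, simplified stand-in for a manifest list entry.
    added_rows_count: int
    existing_rows_count: int
    first_row_id: Optional[int] = None


def assign_manifest_row_ids(manifests: List[Manifest],
                            next_row_id: int) -> int:
    """Assign first_row_id to manifests that lack one.

    Returns the next unassigned row ID. Reserving space for existing rows
    as well as added rows is what lets files carried over from before the
    upgrade receive IDs. Manifests that already have a first_row_id are
    assumed to be left unchanged.
    """
    for m in manifests:
        if m.first_row_id is None:
            m.first_row_id = next_row_id
            next_row_id += m.added_rows_count + m.existing_rows_count
    return next_row_id


manifests = [
    Manifest(added_rows_count=10, existing_rows_count=5),
    Manifest(added_rows_count=0, existing_rows_count=20, first_row_id=40),
    Manifest(added_rows_count=7, existing_rows_count=0),
]
new_next_row_id = assign_manifest_row_ids(manifests, next_row_id=100)
```

Here the third manifest gets 100 + 10 + 5 = 115, matching the formula, and
any larger value would also satisfy “must be greater than or equal to”.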

Please take a look at the PR and vote in the next 72 hours.

[ ] +1 Add these changes to the spec for v3 row lineage
[ ] +0
[ ] -1 I have questions and/or concerns

Thanks,

Ryan
