During the sync, we were mostly aligned that the row lineage semantics for updates depend on how the writer engine interprets and implements them (e.g., Flink with equality deletes).
Now, if we make it required for V3 tables, what about users who don't need the row lineage feature? Row lineage carries some overhead (although low), e.g., extra metadata columns in data files during rewrite/compaction.

On Thu, Mar 20, 2025 at 8:53 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

> I think I'm in favor of this but I would like some way of knowing whether
> or not a snapshot was produced while preserving row_ids or not. Just so we
> can make it clear on read what the row-lineage behavior of the writer was
> without knowing what system wrote the data.
>
> On Thu, Mar 20, 2025 at 10:43 AM Ryan Blue <rdb...@gmail.com> wrote:
>
>> +1 for the PR and always having the lineage metadata.
>>
>> I think that is going to make the feature much more reliable. We don't
>> gain anything from allowing the feature to be turned off for compatibility,
>> when we have reasonable ways to interpret data written by any engine.
>>
>> Ryan
>>
>> On Wed, Mar 19, 2025 at 12:37 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>> Hey everyone,
>>>
>>> When Row lineage was originally introduced, it was believed to be
>>> incompatible with equality deletes and we initially added lineage as a
>>> feature that could be turned on. Now that these features can co-exist
>>> <https://lists.apache.org/thread/vhl7p72433m904y115hmh7vnnbjdz4xn>, we
>>> would like to require lineage for v3 as there are benefits for feature
>>> enablement and adoption.
>>>
>>> We discussed this topic
>>> <https://www.youtube.com/watch?v=9BBZKTfcU0s&t=1475s> in the community
>>> sync, but I would like to raise this here on the dev list to have a broader
>>> discussion around the spec changes proposed in this PR
>>> <https://github.com/apache/iceberg/pull/12580>.
>>>
>>> The main points of discussion were around what implications this
>>> requirement has for writers, especially where tracking changes may be
>>> difficult/expensive. The proposal is to address this in the spec by
>>> clarifying the semantics of treating changes as upserts vs. delete/add is
>>> an engine implementation decision. The update to the spec would state that
>>> engines *should* track row ids through modification, but depending on
>>> the engine or the viability of tracking these changes, an engine may choose
>>> to model updates as deletes/adds.
>>>
>>> I'd love to get everyones thoughts/feedback on this proposal since we
>>> still have the opportunity to change this for v3,
>>> -Dan
>>>
>>