Hi Russell, Thanks for bringing this up! I think equality deletes are not the root of the problem here.
- If we have a positional delete, and the new row doesn't include the old rowId, then the lineage info is lost. - If we have an equality delete, and the new row contains the rowId, then we have the lineage info That said, it is unlikely to have rowId for the new rows for writers (Flink, Kafka Connect) who are currently using equality deletes for updates, because of the exact reason they're using equality deletes (has to defer the lookup for later). Maybe we can add checks for these engines to warn when they try to write a table which has row lineage enabled. If we want to keep the equality deletes, we might add a marker to the updated row (for example rowId=-1) that this is an update. The reader and the compaction could calculate the correct value. That said, I think it would be better to invest in finding a replacement for equality deletes than investing more in it. Maybe adding Iceberg indexes which would allow fast lookup and conversion at commit time, or push those indexes to the file format level? Thanks, Peter On Wed, Feb 12, 2025, 06:38 Gang Wu <ust...@gmail.com> wrote: > Thanks Steven for the explanation! Yes, you're right that solely rewriting > delete files does not help. > > IIUC, Iceberg is the only table format that does not produce changelog > files. Is there any chance to recompute the row_id of updated rows by > tracking changes of the identifier fields between snapshots during the > rewrite and produce changelog files? > > I'm not asking to add changelogs support since it is a large design > choice. Just want to brainstorm it. > > On Wed, Feb 12, 2025 at 11:55 AM Steven Wu <stevenz...@gmail.com> wrote: > >> I am fine with the proposed spec change. While it "supports/allows" >> equality deletes, row lineage semantics needn't/can't be maintained >> properly for equality deletes (compared to position deletes). Gang pointed >> out a couple issues with the implications. But we have no choice but to >> live with those implications due to how equality deletes behave. >> >> Gang, rewriting equality deletes to position deletes doesn't really help >> in this case. To have correct lineage, the row update is supposed to have >> the row_id carried over from the previous row (equality deleted row) during >> the write phase with equality deletes. Instead, this spec change now says >> the updated row is a complete new row with new row_id. >> >> On Tue, Feb 11, 2025 at 7:39 PM Gang Wu <ust...@gmail.com> wrote: >> >>> Hi Russell, >>> >>> Thanks for supporting equality deletes to row lineage! >>> >>> > accept that "updates" will be treated as "delete" and "insert" >>> >>> I would say that it has obvious drawbacks below (though it is better >>> than not supported): >>> 1) updates will be populated differently when outputting changelogs to >>> users or downstream databases >>> 2) lead to more computation for incremental processing like refreshing >>> materialized views >>> >>> At the same time, I would like to ask if it would help if we support >>> rewriting equality deletes to position deletes. >>> There was an effort but it has been closed: >>> https://github.com/apache/iceberg/pull/2216 >>> >>> Best, >>> Gang >>> >>> >>> On Wed, Feb 12, 2025 at 7:25 AM Russell Spitzer < >>> russell.spit...@gmail.com> wrote: >>> >>>> Hi Y'all, >>>> >>>> As we have been working on the row lineage implementation I've been >>>> reached out to by a few folks >>>> in the community who are interested in changing our defined behavior >>>> around equality deletes. >>>> >>>> Currently when Row Lineage is enabled, the spec says to disable >>>> equality deletes for the table. >>>> >>>> In the interest of compatibility with Flink and other Equality delete >>>> producers, I originally wrote >>>> that we would simply treat all equality delete based updates as a pure >>>> insert and >>>> delete. At the time, some folks thought this was too open and worried >>>> that it would be poor behavior which >>>> led to the current restriction. >>>> >>>> Now that we are actually implementing I think there have been some >>>> changes of heart and that we >>>> should go back to the original design. I'd like to see if we have >>>> consensus >>>> in the community to change the wording back and allow equality deletes. >>>> >>>> PR: https://github.com/apache/iceberg/pull/12230 >>>> >>>> The TLDR; >>>> >>>> Allow equality deletes with row lineage but accept that "updates" will >>>> be treated as "delete" and "insert" >>>> >>>> Thanks for your time, >>>> Russ >>>> >>>