Thanks Peter and Manu for the feedback.

@Peter:
Good point on the end goal. The end goal should be to completely remove equality deletes. While the staging branch, which contains the equality deletes, is an internal implementation detail of the Flink writer, it will still be accessible via the Iceberg reader API. For the transition period, I think this has several advantages:

1. We don't need to fundamentally change the write logic of existing writers.
2. The data can still be inspected before it is converted and merged to the main branch, which is also helpful for troubleshooting.

The staging branch solution is a first step towards removing equality deletes. In V4, we could already deprecate equality deletes. Once the spec includes indices, we can move the index into Iceberg, which should make it easier to develop an in-place resolution of equality deletes that supports multiple writers and conflict resolution. Admittedly, we haven't fully figured out the best in-place approach, so I think it is a good idea to take it one step at a time.

On row lineage: If we want to preserve the row id of updated rows, we will have to store the row id in the primary key index. Theoretically, we should then be able to add it to the corresponding new row. The question is how to do that efficiently, so that we don't have to rewrite any data files. We would need some way to map the row id of the newly inserted row to the row id of the deleted row. Do we already have such functionality in Iceberg?

On concurrent writes: For the time being, I think we should not allow concurrent maintenance tasks, including equality delete conversion. Concurrent writes are still supported, as long as they go to the staging branch.

@Manu: +1 to Peter's response. The primary key index is bounded and independent of the number of accumulated equality deletes, so memory doesn't blow up, as long as we have sufficient resources to load the index. We definitely cannot rely on the full index to fit into memory.
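To make the shape of the index concrete, here is an illustrative sketch of a bounded primary-key index that stores (file, position) per key plus the row id for lineage. This is only a sketch under assumptions: `PrimaryKeyIndex`, `upsert`, and `delete` are hypothetical names, not Iceberg or Flink APIs, and a real implementation would live in Flink state (e.g., the RocksDB state backend) rather than an in-memory Python dict:

```python
# Illustrative sketch, not Iceberg/Flink code: a minimal primary-key index
# mapping each live key to (data_file, position, row_id). It holds one entry
# per key, so its size is bounded by the number of live keys and is
# independent of how many equality deletes have accumulated.
from dataclasses import dataclass


@dataclass
class Entry:
    data_file: str
    position: int
    row_id: int  # carried across updates to preserve row lineage


class PrimaryKeyIndex:
    def __init__(self):
        self._entries = {}

    def upsert(self, key, data_file, position, new_row_id):
        """Insert or replace the row for `key`.

        Returns (dv_mark, row_id): dv_mark is the (file, position) of the
        replaced row, to be added to a deletion vector (None for a plain
        insert); row_id is the id the new row should carry, reusing the
        replaced row's id so lineage is preserved.
        """
        old = self._entries.get(key)
        if old is None:
            self._entries[key] = Entry(data_file, position, new_row_id)
            return None, new_row_id
        # Update: mark the old position deleted, keep the old row id.
        self._entries[key] = Entry(data_file, position, old.row_id)
        return (old.data_file, old.position), old.row_id

    def delete(self, key):
        """Remove `key`; returns the (file, position) to mark in a DV."""
        old = self._entries.pop(key, None)
        return (old.data_file, old.position) if old else None
```

On an update, the returned (file, position) pair is what would go into a deletion vector, while the returned row id is carried over to the newly written row, which is the open question above: doing that carry-over efficiently without rewriting data files.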
Fortunately, Flink is already prepared for this; it supports spilling to disk via its RocksDB state backend.

Cheers,
Max

On Fri, Mar 20, 2026 at 11:07 AM Péter Váry <[email protected]> wrote:
>
> Equality delete resolution could be made significantly more efficient by using an index (e.g., backed by RocksDB) to store the current mapping from primary keys to (file, position). While the memory footprint would not be small, it would be bounded and independent of the number of accumulated equality deletes. In addition, even a blocking compaction should block for a shorter period than the typical interval at which table compactions are scheduled.
>
> Manu Zhang <[email protected]> wrote (on Thu, Mar 19, 2026, 15:57):
>>
>> Thanks Max for the proposal. One question here.
>> When the convert task cannot finish in time (e.g., blocked by compactions) and equality deletes accumulate on the staging branch, will we have the same issue as loading too many equality deletes and blowing up memory?
>>
>> Regards,
>> Manu
>>
>> On Thu, Mar 19, 2026 at 2:45 PM Péter Váry <[email protected]> wrote:
>>>
>>> Thanks, Max, for continuing to push this forward.
>>>
>>> The proposal feels like a step in the right direction, but I would like to see a clearer view of the end goal. As it stands, equality deletes remain in the spec because the changes are committed to an intermediate branch. Since the long-term objective is to remove equality deletes from the specification altogether, we should be clear about the final solution that achieves this.
>>>
>>> Flink writes will also continue to have the limitation that row lineage is not maintained correctly. This is unchanged from the current situation, but I think it's important to explicitly call this out, or ideally, explore whether there's a way to address it.
>>>
>>> In addition, concurrent writes and compactions would require updating the primary key index, which could be expensive.
>>>
>>> That said, I don't see a clearly better alternative at the moment, and overall this seems like a reasonable way forward.
>>>
>>> Thanks again for continuing to drive the proposal.
>>> Peter
>>>
>>> On Wed, Mar 18, 2026, 16:47 Maximilian Michels <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'd like to discuss resolving equality deletes in the Flink write path, which will get us one step closer to removing equality deletes from the spec.
>>>>
>>>> ## tl;dr
>>>>
>>>> We're planning to add an equality delete to deletion vector (DV) conversion to Flink. Equality deletes may remain as an internal intermediary format.
>>>>
>>>> ## Background
>>>>
>>>> For deletes, Flink currently produces equality delete files.
>>>>
>>>> Equality deletes are used to support deletes in the write path, which is a requirement for many use cases like CDC [3]. They are cheap for the writer: it only notes down the to-be-deleted values of the identifier fields in so-called delete files and leaves it up to the readers to match those values to the corresponding rows. The heavy lifting has to be done by the readers, which potentially need to scan the entire table to resolve equality deletes.
>>>>
>>>> Because of this cost, equality deletes have been criticized, and there are discussions around deprecating / removing them [1].
>>>>
>>>> ## Resolving Equality Deletes
>>>>
>>>> Steven, Peter, and a few other contributors came up with a proposal to convert equality deletes into DVs [2]. The original solution was quite complex, mainly due to the conflict handling between streaming writes, table maintenance, and equality delete resolution. The proposal is also blocked on index support in the Iceberg spec [5].
>>>>
>>>> We may need to simplify further to make some progress.
>>>> The old table specs are going to be around for some time, even after we have a new spec with index support, and users have been asking for a solution to this issue for quite some time [3].
>>>>
>>>> The following is a modification of the original design document which adapts the ideas described under "use lock to avoid conflicts" [2].
>>>>
>>>> ## Proposed Solution
>>>>
>>>> The idea is to add the equality delete to deletion vector (DV) conversion as a Flink table maintenance task. After recent changes, we can now run the writer and the maintenance in the same Flink job and use a Flink-maintained lock to avoid conflicts between the maintenance tasks.
>>>>
>>>> 1. Instead of writing directly to the target branch, the writer commits data files + equality deletes to a staging branch.
>>>> 2. The new "EqualityDeleteResolver" maintenance task reads from the staging branch, converts the equality deletes to DVs using a Flink-maintained primary key index, and commits data files + DVs to the target branch.
>>>> 3. The existing Flink maintenance framework's lock mechanism ensures mutual exclusion between the convert task and table compaction to avoid conflicts.
>>>>
>>>> After conversion, the target branch contains only data files and DVs, no equality deletes.
>>>>
>>>> ## Limitations
>>>>
>>>> - Readers will only see new data once the conversion is complete. This is partially mitigated by the fact that snapshots with equality deletes cannot be read properly with Flink today [4].
>>>> - The Flink-maintained index needs to be built initially, which requires reading the entire table. We will use Flink's state backend, which, apart from heap-based storage, also supports spilling to disk via the RocksDB state backend.
>>>>
>>>> ## Wrapping up
>>>>
>>>> This solution may not be perfect because of the above limitations, but it provides a viable path to free users of the burden of equality deletes, which cannot be read efficiently by most engines today. Eventually, the Flink-maintained index can be replaced by an Iceberg index, which will allow for the index to be shared across engines.
>>>>
>>>> What does the community think?
>>>>
>>>> Thanks,
>>>> Max
>>>>
>>>> [1] Deprecate equality deletes: https://lists.apache.org/thread/z0gvco6hn2bpgngvk4h6xqrnw8b32sw6
>>>> [2] Design doc: https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/edit
>>>> [3] Upserts use case: https://lists.apache.org/thread/rt7dmg7l78xpzc9w3lwn090yzqq4fyyw
>>>> [4] Handling upserts downstream: https://lists.apache.org/thread/bx1ntfr45c0g9rh643yw7w8znv6wtrno
>>>> [5] V4 Index: https://lists.apache.org/thread/xdkzllzt4p3tvcd3ft4t7jsvyvztr41j
