Re: Re: [DISCUSS] Flink: Equality delete to DV conversion

Péter Váry Wed, 13 May 2026 08:51:26 -0700

Hi Qian,
What are the main blockers for you to move to V3?
Many engines already support V3, and migration should be straightforward.
Thanks,
Peter


Qian Su <[email protected]> ezt írta (időpont: 2026. máj. 13., Sze,
0:57):

> Thanks for driving this proposal forward! Our organization is still
> largely running on V2 tables. Because migrating our entire footprint to V3
> will take some time, having a V2-compatible conversion path would be
> incredibly valuable for our use cases. I would be very interested in
> contributing to the V2 path to help support this. Please let me know where
> it makes the most sense for me to jump in—whether collaborating directly on
> the current PR or picking it up as a follow-up once the initial V3
> foundation lands.
>
> Qian
>
> On 2026/04/17 13:58:28 Maximilian Michels wrote:
> > Hi Steven,
> >
> > Thanks for chiming in! Yes, the current solution requires V3 tables.
> > I'll probably have to make that clearer in the PR description.
> >
> > In principle, it wouldn't be hard to make it work with V2 either,
> > obviously with the tradeoffs that come along with it in terms of
> > storage / lookup efficiency.
> >
> > Cheers,
> > Max
> >
> > On Thu, Apr 16, 2026 at 5:30 PM Steven Wu <[email protected]> wrote:
> > >
> > > Sorry for the delayed response. I agree that this proposal could get
> us one step closer to removing equality deletes. It can also verify whether
> the index solution can scale for streaming CDC/Upsert writes.
> > >
> > > > For this PR, we opted to use deletion vectors over regular delete
> files due to their efficiency in terms of space and lookup.
> > >
> > > I didn't quite get this from the PR description. Does this mean it is
> limited to V3 tables?
> > >
> > > Thanks for working on this, Max!
> > >
> > > On Thu, Apr 16, 2026 at 8:03 AM Maximilian Michels <[email protected]>
> wrote:
> > >>
> > >> Thanks everyone for the discussion and support.
> > >>
> > >> I've opened a PR which implements what we discussed here:
> > >> https://github.com/apache/iceberg/pull/15996
> > >>
> > >> Just to cap:
> > >>
> > >> The EqualityDeleteConverter (EDC) will be running in-line with the
> > >> writer, which produces the equality deletes to a staging branch. We
> > >> monitor the staging branch for new commits, build a sharded primary
> > >> key index in Flink state (backed by RocksDB for large tables), resolve
> > >> equality deletes against the index, and commit back the resulting data
> > >> files + DVs to the main branch.
> > >>
> > >> The PR is split into several commits, which we can break out into
> > >> separate PRs for easier review. There are some limitations and
> > >> follow-ups listed in the PR. The biggest gaps are preserving row
> > >> lineage and lifecycle management of the staging branch. In a
> > >> follow-up, we will also add integration with the Flink IcebergSink. I
> > >> tried to keep the scope limited to the EDC maintenance task for the
> > >> first PR.
> > >>
> > >> Cheers,
> > >> Max
> > >>
> > >> On Thu, Apr 2, 2026 at 9:43 AM Márton Balassi <[email protected]>
> wrote:
> > >> >
> > >> > Thanks for raising this, Max and for the feedback Peter and Manu.
> > >> >
> > >> > I am supportive of this proposal, especially with the clearly
> defined vision of eventually completely removing the need for equality
> deletes.
> > >> >
> > >> > Lifting the reliance on equality deletes in the Flink write path
> would be a significant improvement, both in terms of read efficiency (by
> moving towards delete vectors) and in terms of new capabilities, as it
> would make writing upserts from Flink a viable path to explore going
> forward.
> > >> >
> > >> > Cheers,
> > >> > Marton
> > >> >
> > >> > On 2026/03/20 11:15:26 Maximilian Michels wrote:
> > >> > > Thanks Peter and Manu for the feedback.
> > >> > >
> > >> > > @Peter:
> > >> > >
> > >> > > Good point on the end goal. The end goal should be to completely
> > >> > > remove equality deletes.
> > >> > >
> > >> > > While the staging branch, which contains the equality deletes, is
> an
> > >> > > internal implementation detail of the Flink writer, it will still
> be
> > >> > > accessible via the Iceberg reader API. For the transition period,
> I
> > >> > > think this has several advantages:
> > >> > > 1. We don't need to fundamentally change the write logic of
> existing writers.
> > >> > > 2. We still allow for the data to be inspected before converting
> it
> > >> > > and merging it to the main branch. This is also helpful for
> > >> > > troubleshooting.
> > >> > >
> > >> > > The staging branch solution is a first step towards removing
> equality deletes.
> > >> > >
> > >> > > In V4, we could already deprecate equality deletes. Once the spec
> > >> > > includes indices, we can move the index into Iceberg, which should
> > >> > > make it easier to develop an in-place resolution of equality
> deletes
> > >> > > supporting multiple writers and conflict resolution. Admittedly,
> we
> > >> > > haven't fully figured out the best in-place approach. I think it
> is a
> > >> > > good idea to take it one step at a time.
> > >> > >
> > >> > > On row lineage: If we want to preserve the row id of updated
> rows, we
> > >> > > will have to store the row id in the primary key index.
> Theoretically,
> > >> > > we should be able to then add it to the corresponding new row. The
> > >> > > question is how to do that efficiently, such that we don't have to
> > >> > > rewrite any data files. We would need some way to map the row id
> of
> > >> > > the newly inserted row to the row id of the deleted row. Do we
> already
> > >> > > have such functionality in Iceberg?
> > >> > >
> > >> > > On concurrent writes: For the time being, I think we should not
> allow
> > >> > > concurrent maintenance tasks, including equality delete
> conversion.
> > >> > > Concurrent writes are still supported, as long as they go to the
> > >> > > staging branch.
> > >> > >
> > >> > > @Manu:
> > >> > >
> > >> > > +1 to Peter's response. The primary key index is bounded and
> > >> > > independent of the number of accumulated equality deletes, so
> memory
> > >> > > doesn't blow up, as long as we have sufficient resources to load
> the
> > >> > > index. We definitely cannot rely on the full index to fit into
> memory.
> > >> > > Fortunately, Flink is already prepared for this; it supports
> spilling
> > >> > > to disk via its RocksDB state backend.
> > >> > >
> > >> > > Cheers,
> > >> > > Max
> > >> > >
> > >> > > On Fri, Mar 20, 2026 at 11:07 AM Péter Váry <[email protected]>
> wrote:
> > >> > > >
> > >> > > > Equality delete resolution could be made significantly more
> efficient by using an index (e.g., backed by RocksDB) to store the current
> mapping from primary keys to (file, position). While the memory footprint
> would not be small, it would be bounded and independent of the number of
> accumulated equality deletes. In addition, even a blocking compaction
> should block for a shorter period than the typical interval at which table
> compactions are scheduled.
> > >> > > >
> > >> > > > Manu Zhang <[email protected]> ezt írta (időpont: 2026. márc.
> 19., Cs, 15:57):
> > >> > > >>
> > >> > > >> Thanks Max for the proposal. One question here.
> > >> > > >> When the convert task can not finish in time (e.g. blocked by
> compactions), and equality deletes accumulate on the staging branch, will
> we have the same issue as loading too many equality deletes and blowing up
> memory?
> > >> > > >>
> > >> > > >> Regards,
> > >> > > >> Manu
> > >> > > >>
> > >> > > >> On Thu, Mar 19, 2026 at 2:45 PM Péter Váry <[email protected]>
> wrote:
> > >> > > >>>
> > >> > > >>> Thanks, Max, for continuing to push this forward.
> > >> > > >>>
> > >> > > >>> The proposal feels like a step in the right direction, but I
> would like to see a clearer view of the end goal. As it stands, equality
> deletes remain in the spec because the changes are committed to an
> intermediate branch. Since the long‑term objective is to remove equality
> deletes from the specification altogether, we should be clear about the
> final solution that achieves this.
> > >> > > >>>
> > >> > > >>> Flink writes will also continue to have the limitation that
> row lineage is not maintained correctly. This is unchanged from the current
> situation, but I think it’s important to explicitly call this out, or
> ideally, explore whether there’s a way to address it.
> > >> > > >>>
> > >> > > >>> In addition, concurrent writes and compactions would require
> updating the primary key index, which could be expensive.
> > >> > > >>>
> > >> > > >>> That said, I don’t see a clearly better alternative at the
> moment, and overall this seems like a reasonable way forward.
> > >> > > >>>
> > >> > > >>> Thanks again for continuing to drive the proposal.
> > >> > > >>> Peter
> > >> > > >>>
> > >> > > >>>
> > >> > > >>> On Wed, Mar 18, 2026, 16:47 Maximilian Michels <
> [email protected]> wrote:
> > >> > > >>>>
> > >> > > >>>> Hi,
> > >> > > >>>>
> > >> > > >>>> I'd like to discuss resolving equality deletes in the Flink
> write
> > >> > > >>>> path, which will get us one step closer to removing equality
> deletes
> > >> > > >>>> from the spec.
> > >> > > >>>>
> > >> > > >>>> ## tl;dr
> > >> > > >>>>
> > >> > > >>>> We're planning to add an equality delete to deletion vector
> (DV)
> > >> > > >>>> conversion to Flink. Equality deletes may remain as an
> internal
> > >> > > >>>> intermediary format.
> > >> > > >>>>
> > >> > > >>>> ## Background
> > >> > > >>>>
> > >> > > >>>> For deletes, Flink currently produces equality delete files.
> > >> > > >>>>
> > >> > > >>>> Equality deletes are used to support deletes in the write
> path, which
> > >> > > >>>> is a requirement for many use cases like CDC [3]. They are
> cheap for
> > >> > > >>>> the writer; it only notes down the to-be-deleted values of
> the
> > >> > > >>>> identifier fields inside so-called delete files, and leaves
> it up for
> > >> > > >>>> the readers to match the values to the corresponding rows.
> The heavy
> > >> > > >>>> lifting has to be done by the readers, which potentially
> need to scan
> > >> > > >>>> the entire table to resolve equality deletes.
> > >> > > >>>>
> > >> > > >>>> Therefore, equality deletes have been criticized. There are
> > >> > > >>>> discussions around deprecating / removing them [1].
> > >> > > >>>>
> > >> > > >>>> ## Resolving Equality Deletes
> > >> > > >>>>
> > >> > > >>>> Steven, Peter, and a few other contributors came up with a
> proposal to
> > >> > > >>>> convert equality deletes into DV [2]. The original solution
> was quite
> > >> > > >>>> complex, mainly due to the conflict handling between
> streaming writes,
> > >> > > >>>> table maintenance, and equality delete resolution. The
> proposal is
> > >> > > >>>> also blocked on index support in the Iceberg spec [5].
> > >> > > >>>>
> > >> > > >>>> We may need to simplify further to make some progress. The
> old table
> > >> > > >>>> specs are going to be around for some time, even after we
> have a new
> > >> > > >>>> spec with index support. Users have been asking for a
> solution to this
> > >> > > >>>> issue for quite some time [3].
> > >> > > >>>>
> > >> > > >>>> The following is a modification of the original design
> document which
> > >> > > >>>> adapts the ideas described under "use lock to avoid
> conflicts" [2].
> > >> > > >>>>
> > >> > > >>>> ## Proposed Solution
> > >> > > >>>>
> > >> > > >>>> The idea is to add the equality delete to deletion vector
> (DV)
> > >> > > >>>> conversion as a Flink table maintenance task. After recent
> changes, we
> > >> > > >>>> can now run the writer and the maintenance in the same Flink
> job and
> > >> > > >>>> use a Flink-maintained lock to avoid conflicts between the
> maintenance
> > >> > > >>>> tasks.
> > >> > > >>>>
> > >> > > >>>> 1. Instead of writing directly to the target branch, the
> writer
> > >> > > >>>> commits data files + equality deletes to a staging branch.
> > >> > > >>>> 2. The new "EqualityDeleteResolver" maintenance task reads
> from the
> > >> > > >>>> staging branch and converts the equality deletes to DVs
> using a
> > >> > > >>>> Flink-maintained primary key index, then commits data files
> + DVs to
> > >> > > >>>> the target branch.
> > >> > > >>>> 3. The existing Flink maintenance framework's lock mechanism
> ensures
> > >> > > >>>> mutual exclusion between the convert task and table
> compaction to
> > >> > > >>>> avoid conflicts.
> > >> > > >>>>
> > >> > > >>>> After conversion, the target branch contains only data files
> and DVs,
> > >> > > >>>> no equality deletes.
> > >> > > >>>>
> > >> > > >>>> ## Limitations
> > >> > > >>>>
> > >> > > >>>> - Readers will only see new data until the conversion is
> complete.
> > >> > > >>>> This is partially mitigated by the fact that snapshots with
> equality
> > >> > > >>>> deletes cannot be read properly with Flink today [4].
> > >> > > >>>> - The Flink-maintained index needs to be built initially
> which
> > >> > > >>>> requires reading the entire table. We will use Flink's state
> backend
> > >> > > >>>> which apart from heap-based storage, also supports spilling
> to disk
> > >> > > >>>> via the RocksDB state backend.
> > >> > > >>>>
> > >> > > >>>> ## Wrapping up
> > >> > > >>>>
> > >> > > >>>> This solution may not be perfect because of the above
> limitations, but
> > >> > > >>>> it provides a viable path to free users of the burden of
> equality
> > >> > > >>>> deletes, which cannot be read efficiently by most engines
> today.
> > >> > > >>>> Eventually, the Flink-maintained index can be replaced by an
> Iceberg
> > >> > > >>>> index, which will allow for the index to be shared across
> engines.
> > >> > > >>>>
> > >> > > >>>> What does the community think?
> > >> > > >>>>
> > >> > > >>>> Thanks,
> > >> > > >>>> Max
> > >> > > >>>>
> > >> > > >>>> [1] Deprecate equality deletes:
> > >> > > >>>>
> https://lists.apache.org/thread/z0gvco6hn2bpgngvk4h6xqrnw8b32sw6
> > >> > > >>>> [2] Design doc:
> > >> > > >>>>
> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/edit
> > >> > > >>>> [3] Upserts use case:
> > >> > > >>>>
> https://lists.apache.org/thread/rt7dmg7l78xpzc9w3lwn090yzqq4fyyw
> > >> > > >>>> [4] Handling upserts downstream:
> > >> > > >>>>
> https://lists.apache.org/thread/bx1ntfr45c0g9rh643yw7w8znv6wtrno
> > >> > > >>>> [5] V4 Index:
> https://lists.apache.org/thread/xdkzllzt4p3tvcd3ft4t7jsvyvztr41j
> > >> > >
> >
>

Re: Re: [DISCUSS] Flink: Equality delete to DV conversion

Reply via email to