I am glad to see that folks are thinking about this problem. I am looking forward to a formal proposal/design doc to discuss details!
Overall, this aligns with what we discussed in the community earlier w.r.t. the future of equality deletes and streaming upserts. If I were to summarize, we have the following options: Option 1: Add an inverted index (potentially distributed) maintained by an engine that does streaming writes to always produce DVs, even in streaming use cases. Deprecate/remove equality deletes from Iceberg. Option 2: Add native indexing to Iceberg so that the lookup of positions is quick and efficient enough to be used in streaming upserts. Deprecate/remove equality deletes from Iceberg. Option 3: Rethink equality deletes, potentially by introducing more restrictions and trying to scope them to particular data files, similar to DVs. Option 4: Standardize on a view reconciliation approach that Tabular implemented for CDC. I would like to highlight that what Spark does today during MERGE is similar to a lookup in an inverted index represented by another Iceberg table. That is OK for batch jobs but not enough for streaming. - Anton чт, 8 трав. 2025 р. о 10:08 Xiaoxuan Li <xiaoxuan.li....@gmail.com> пише: > Hi Zheng, Steven, Amogh and Gyula. Thank you all for the feedback! > > I agree with everyone, we need to narrow down the scope of this > optimization. The primary issue I'm trying to address is the slow read > performance caused by the growing number of equality delete files(streaming > CDC scenarios). The other potential use cases are only mentioned to show > the extensibility of this approach. > > And both equality and positional deletes suffer from the same core > problem, records are evaluated repeatedly against multiple delete > predicates, at read time for equality deletes, and at write time for > positional deletes. This repeated evaluation is where the real bottleneck > lies. > > It’s particularly bad for equality deletes, since we constantly recompute > row positions during reads without ever materializing them. Eventually, a > maintenance job is required just to rewrite them into positional deletes. > > The idea is to eliminate this overhead by introducing an inverted index > that maps values (or value combinations) directly to row positions. This > lets us skip full predicate evaluation and jump straight to the affected > rows, similar to how Hudi uses a record-level index for fast upserts. If we > can fetch row positions from the index, we can greatly reduce overhead > during reads (for equality deletes) and writes (for positional deletes). > > In fact, if we can make positional deletes perform as well as equality > deletes using this index, we might be able to get rid of equality deletes, > but that needs to be evaluated for the upsert case. > > That’s part of the reason why other types of indexes came up, they’re > applicable beyond just primary key columns, but CDC is the only scenario > with stricter SLA requirements. So it makes sense to align on the exact > problem and use cases. For now, I think we can define our main goal as > supporting inverted indexing over primary key columns to address the > slowness of reading caused by the growing number of equality delete files. > > Thanks, Amogh, for bringing up the comparison with known alternatives. We > should include benchmarks for those as well, to illustrate the trade-offs > in read/write performance, storage usage, and overall cost. > > Really appreciate your feedback! I’ll incorporate these into the next > revision of the proposal. > > > Thanks, > > Xiaoxuan > > On Thu, May 8, 2025 at 1:25 AM Gyula Fóra <gyula.f...@gmail.com> wrote: > >> Thank you for the proposal! >> >> I agree with what had been said above that we need to narrow down the >> scope here and what is the primary target for the optimization. >> >> As Amogh has also pointed out, CDC (streaming) read performance (with >> equality deletes) would be one of the biggest beneficiaries of this at a >> first glance. >> This is especially important for Flink users where this feature is >> currently completely missing and there is a big demand for it as we rely on >> equality deletes on the write path. [1] >> >> I am not aware of alternative proposals that would solve the equality >> delete cdc read performance question, overall I think using indices is >> reasonable and a very promising approach. >> >> Looking forward to more details and discussion! >> Gyula >> >> [1] https://lists.apache.org/thread/njmxjmjjm341fp4mgynn483v15mhk7qd >> >> >> On Thu, May 8, 2025 at 9:24 AM Amogh Jahagirdar <2am...@gmail.com> wrote: >> >>> Thank you for the proposal Xiaoxuan! I think I agree with Zheng and >>> Steven's point that it'll probably be more helpful to start out with more >>> specific "what" and "why" (known areas of improvement for Iceberg and >>> driven by any use cases) before we get too deep into the "how". >>> >>> In my mind, the specific known area of improvement for Iceberg related >>> to this proposal is improving streaming upsert behavior. One area this >>> improvement is beneficial for is being able to provide better data >>> freshness for Iceberg CDC mirror tables without the heavy read + >>> maintenance cost that currently exist with Flink upserts. >>> >>> As you mentioned, equality deletes have the benefit of being very cheap >>> to write but can come at a high and unpredictable cost at read time. >>> Challenges with equality deletes have been discussed in the past [1]. >>> I'll also add that if one of the goals is to improving streaming upserts >>> (e.g. for applying CDC change streams into Iceberg mirror tables), then >>> there are alternatives that I think we should compare against to make >>> the tradeoffs clear. These alternatives include leveraging the known >>> changelog view or merge patterns [2] or improving the existing maintenance >>> procedures. >>> >>> I think the potential for being able to use a inverted index for upsert >>> cases to more directly identify positions in a file to directly write DVs >>> is very exciting, but before getting too far into the weeds, I think it'd >>> first be helpful >>> to make sure we agree on the specific problem we're trying to solve when >>> we talk about performance improvements along with any use cases, followed >>> by comparison with known alternatives (ideally we can get numbers that >>> demonstrate the read/write/storage/cost tradeoffs for the proposed inverted >>> index). >>> >>> [1]https://lists.apache.org/thread/z0gvco6hn2bpgngvk4h6xqrnw8b32sw6 >>> [2]https://www.tabular.io/blog/hello-world-of-cdc/ >>> >>> Thanks, >>> Amogh Jahagirdar >>> >>