Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

Anton Okolnychyi Thu, 08 May 2025 15:14:31 -0700

I am glad to see that folks are thinking about this problem. I am looking
forward to a formal proposal/design doc to discuss details!


Overall, this aligns with what we discussed in the community earlier w.r.t.
the future of equality deletes and streaming upserts. If I were to
summarize, we have the following options:

Option 1: Add an inverted index (potentially distributed) maintained by an
engine that does streaming writes to always produce DVs, even in streaming
use cases. Deprecate/remove equality deletes from Iceberg.
Option 2: Add native indexing to Iceberg so that the lookup of positions is
quick and efficient enough to be used in streaming upserts.
Deprecate/remove equality deletes from Iceberg.
Option 3: Rethink equality deletes, potentially by introducing more
restrictions and trying to scope them to particular data files, similar to
DVs.
Option 4: Standardize on a view reconciliation approach that Tabular
implemented for CDC.

I would like to highlight that what Spark does today during MERGE is
similar to a lookup in an inverted index represented by another Iceberg
table. That is OK for batch jobs but not enough for streaming.

- Anton

чт, 8 трав. 2025 р. о 10:08 Xiaoxuan Li <xiaoxuan.li....@gmail.com> пише:

> Hi Zheng, Steven, Amogh and Gyula. Thank you all for the feedback!
>
> I agree with everyone, we need to narrow down the scope of this
> optimization. The primary issue I'm trying to address is the slow read
> performance caused by the growing number of equality delete files(streaming
> CDC scenarios). The other potential use cases are only mentioned to show
> the extensibility of this approach.
>
> And both equality and positional deletes suffer from the same core
> problem, records are evaluated repeatedly against multiple delete
> predicates, at read time for equality deletes, and at write time for
> positional deletes. This repeated evaluation is where the real bottleneck
> lies.
>
> It’s particularly bad for equality deletes, since we constantly recompute
> row positions during reads without ever materializing them. Eventually, a
> maintenance job is required just to rewrite them into positional deletes.
>
> The idea is to eliminate this overhead by introducing an inverted index
> that maps values (or value combinations) directly to row positions. This
> lets us skip full predicate evaluation and jump straight to the affected
> rows, similar to how Hudi uses a record-level index for fast upserts. If we
> can fetch row positions from the index, we can greatly reduce overhead
> during reads (for equality deletes) and writes (for positional deletes).
>
> In fact, if we can make positional deletes perform as well as equality
> deletes using this index, we might be able to get rid of equality deletes,
> but that needs to be evaluated for the upsert case.
>
> That’s part of the reason why other types of indexes came up, they’re
> applicable beyond just primary key columns, but CDC is the only scenario
> with stricter SLA requirements. So it makes sense to align on the exact
> problem and use cases. For now, I think we can define our main goal as
> supporting inverted indexing over primary key columns to address the
> slowness of reading caused by the growing number of equality delete files.
>
> Thanks, Amogh, for bringing up the comparison with known alternatives. We
> should include benchmarks for those as well, to illustrate the trade-offs
> in read/write performance, storage usage, and overall cost.
>
> Really appreciate your feedback! I’ll incorporate these into the next
> revision of the proposal.
>
>
> Thanks,
>
> Xiaoxuan
>
> On Thu, May 8, 2025 at 1:25 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Thank you for the proposal!
>>
>> I agree with what had been said above that we need to narrow down the
>> scope here and what is the primary target for the optimization.
>>
>> As Amogh has also pointed out, CDC (streaming) read performance (with
>> equality deletes) would be one of the biggest beneficiaries of this at a
>> first glance.
>> This is especially important for Flink users where this feature is
>> currently completely missing and there is a big demand for it as we rely on
>> equality deletes on the write path. [1]
>>
>> I am not aware of alternative proposals that would solve the equality
>> delete cdc read performance question, overall I think using indices is
>> reasonable and a very promising approach.
>>
>> Looking forward to more details and discussion!
>> Gyula
>>
>> [1] https://lists.apache.org/thread/njmxjmjjm341fp4mgynn483v15mhk7qd
>>
>>
>> On Thu, May 8, 2025 at 9:24 AM Amogh Jahagirdar <2am...@gmail.com> wrote:
>>
>>> Thank you for the proposal Xiaoxuan! I think I agree with Zheng and
>>> Steven's point that it'll probably be more helpful to start out with more
>>> specific "what" and "why" (known areas of improvement for Iceberg and
>>> driven by any use cases) before we get too deep into the "how".
>>>
>>> In my mind, the specific known area of improvement for Iceberg related
>>> to this proposal is improving streaming upsert behavior. One area this
>>> improvement is beneficial for is being able to provide better data
>>> freshness for Iceberg CDC mirror tables without the heavy read +
>>> maintenance cost that currently exist with Flink upserts.
>>>
>>> As you mentioned, equality deletes have the benefit of being very cheap
>>> to write but can come at a high and unpredictable cost at read time.
>>> Challenges with equality deletes have been discussed in the past [1].
>>> I'll also add that if one of the goals is to improving streaming upserts
>>> (e.g. for applying CDC change streams into Iceberg mirror tables), then
>>> there are alternatives that I think we should compare against to make
>>> the tradeoffs clear. These alternatives include leveraging the known
>>> changelog view or merge patterns [2] or improving the existing maintenance
>>> procedures.
>>>
>>> I think the potential for being able to use a inverted index for upsert
>>> cases to more directly identify positions in a file to directly write DVs
>>> is very exciting, but before getting too far into the weeds, I think it'd
>>> first be helpful
>>> to make sure we agree on the specific problem we're trying to solve when
>>> we talk about performance improvements along with any use cases, followed
>>> by comparison with known alternatives (ideally we can get numbers that
>>> demonstrate the read/write/storage/cost tradeoffs for the proposed inverted
>>> index).
>>>
>>> [1]https://lists.apache.org/thread/z0gvco6hn2bpgngvk4h6xqrnw8b32sw6
>>> [2]https://www.tabular.io/blog/hello-world-of-cdc/
>>>
>>> Thanks,
>>> Amogh Jahagirdar
>>>
>>

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

Reply via email to