Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

Xiaoxuan Li Thu, 08 May 2025 10:16:38 -0700

Hi Zheng, Steven, Amogh and Gyula. Thank you all for the feedback!

I agree with everyone, we need to narrow down the scope of this
optimization. The primary issue I'm trying to address is the slow read
performance caused by the growing number of equality delete files(streaming
CDC scenarios). The other potential use cases are only mentioned to show
the extensibility of this approach.

And both equality and positional deletes suffer from the same core problem,
records are evaluated repeatedly against multiple delete predicates, at
read time for equality deletes, and at write time for positional deletes.
This repeated evaluation is where the real bottleneck lies.

It’s particularly bad for equality deletes, since we constantly recompute
row positions during reads without ever materializing them. Eventually, a
maintenance job is required just to rewrite them into positional deletes.

The idea is to eliminate this overhead by introducing an inverted index
that maps values (or value combinations) directly to row positions. This
lets us skip full predicate evaluation and jump straight to the affected
rows, similar to how Hudi uses a record-level index for fast upserts. If we
can fetch row positions from the index, we can greatly reduce overhead
during reads (for equality deletes) and writes (for positional deletes).

In fact, if we can make positional deletes perform as well as equality
deletes using this index, we might be able to get rid of equality deletes,
but that needs to be evaluated for the upsert case.

That’s part of the reason why other types of indexes came up, they’re
applicable beyond just primary key columns, but CDC is the only scenario
with stricter SLA requirements. So it makes sense to align on the exact
problem and use cases. For now, I think we can define our main goal as
supporting inverted indexing over primary key columns to address the
slowness of reading caused by the growing number of equality delete files.

Thanks, Amogh, for bringing up the comparison with known alternatives. We
should include benchmarks for those as well, to illustrate the trade-offs
in read/write performance, storage usage, and overall cost.

Really appreciate your feedback! I’ll incorporate these into the next
revision of the proposal.

Thanks,

Xiaoxuan

On Thu, May 8, 2025 at 1:25 AM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Thank you for the proposal!
>
> I agree with what had been said above that we need to narrow down the
> scope here and what is the primary target for the optimization.
>
> As Amogh has also pointed out, CDC (streaming) read performance (with
> equality deletes) would be one of the biggest beneficiaries of this at a
> first glance.
> This is especially important for Flink users where this feature is
> currently completely missing and there is a big demand for it as we rely on
> equality deletes on the write path. [1]
>
> I am not aware of alternative proposals that would solve the equality
> delete cdc read performance question, overall I think using indices is
> reasonable and a very promising approach.
>
> Looking forward to more details and discussion!
> Gyula
>
> [1] https://lists.apache.org/thread/njmxjmjjm341fp4mgynn483v15mhk7qd
>
>
> On Thu, May 8, 2025 at 9:24 AM Amogh Jahagirdar <2am...@gmail.com> wrote:
>
>> Thank you for the proposal Xiaoxuan! I think I agree with Zheng and
>> Steven's point that it'll probably be more helpful to start out with more
>> specific "what" and "why" (known areas of improvement for Iceberg and
>> driven by any use cases) before we get too deep into the "how".
>>
>> In my mind, the specific known area of improvement for Iceberg related to
>> this proposal is improving streaming upsert behavior. One area this
>> improvement is beneficial for is being able to provide better data
>> freshness for Iceberg CDC mirror tables without the heavy read +
>> maintenance cost that currently exist with Flink upserts.
>>
>> As you mentioned, equality deletes have the benefit of being very cheap
>> to write but can come at a high and unpredictable cost at read time.
>> Challenges with equality deletes have been discussed in the past [1].
>> I'll also add that if one of the goals is to improving streaming upserts
>> (e.g. for applying CDC change streams into Iceberg mirror tables), then
>> there are alternatives that I think we should compare against to make
>> the tradeoffs clear. These alternatives include leveraging the known
>> changelog view or merge patterns [2] or improving the existing maintenance
>> procedures.
>>
>> I think the potential for being able to use a inverted index for upsert
>> cases to more directly identify positions in a file to directly write DVs
>> is very exciting, but before getting too far into the weeds, I think it'd
>> first be helpful
>> to make sure we agree on the specific problem we're trying to solve when
>> we talk about performance improvements along with any use cases, followed
>> by comparison with known alternatives (ideally we can get numbers that
>> demonstrate the read/write/storage/cost tradeoffs for the proposed inverted
>> index).
>>
>> [1]https://lists.apache.org/thread/z0gvco6hn2bpgngvk4h6xqrnw8b32sw6
>> [2]https://www.tabular.io/blog/hello-world-of-cdc/
>>
>> Thanks,
>> Amogh Jahagirdar
>>
>

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

Reply via email to