Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

Gyula Fóra Thu, 08 May 2025 01:25:27 -0700

Thank you for the proposal!

I agree with what has been said above: we need to narrow down the scope
here and decide what the primary target for the optimization is.


As Amogh has also pointed out, CDC (streaming) read performance (with
equality deletes) would, at first glance, be one of the biggest
beneficiaries of this.
This is especially important for Flink users, where this feature is
currently completely missing and there is big demand for it, as we rely on
equality deletes on the write path. [1]

I am not aware of alternative proposals that would solve the equality
delete CDC read performance question. Overall, I think using indices is
a reasonable and very promising approach.

Looking forward to more details and discussion!
Gyula

[1] https://lists.apache.org/thread/njmxjmjjm341fp4mgynn483v15mhk7qd


On Thu, May 8, 2025 at 9:24 AM Amogh Jahagirdar <[email protected]> wrote:

> Thank you for the proposal Xiaoxuan! I think I agree with Zheng and
> Steven's point that it'll probably be more helpful to start out with more
> specific "what" and "why" (known areas of improvement for Iceberg and
> driven by any use cases) before we get too deep into the "how".
>
> In my mind, the specific known area of improvement for Iceberg related to
> this proposal is improving streaming upsert behavior. One area this
> improvement is beneficial for is being able to provide better data
> freshness for Iceberg CDC mirror tables without the heavy read +
> maintenance cost that currently exists with Flink upserts.
>
> As you mentioned, equality deletes have the benefit of being very cheap to
> write but can come at a high and unpredictable cost at read time.
> Challenges with equality deletes have been discussed in the past [1].
> I'll also add that if one of the goals is to improve streaming upserts
> (e.g. for applying CDC change streams into Iceberg mirror tables), then
> there are alternatives that I think we should compare against to make
> the tradeoffs clear. These alternatives include leveraging the known
> changelog view or merge patterns [2] or improving the existing maintenance
> procedures.
>
> I think the potential for being able to use an inverted index for upsert
> cases to more directly identify positions in a file to directly write DVs
> is very exciting, but before getting too far into the weeds, I think it'd
> first be helpful to make sure we agree on the specific problem we're trying
> to solve when we talk about performance improvements along with any use
> cases, followed by comparison with known alternatives (ideally we can get
> numbers that demonstrate the read/write/storage/cost tradeoffs for the
> proposed inverted index).
>
> [1] https://lists.apache.org/thread/z0gvco6hn2bpgngvk4h6xqrnw8b32sw6
> [2] https://www.tabular.io/blog/hello-world-of-cdc/
>
> Thanks,
> Amogh Jahagirdar
>
