Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

Péter Váry Fri, 09 May 2025 08:11:33 -0700

When going through the options mentioned by Anton, I feel that Option 1 and
4 are just pushing the responsibility of converting the equality deletes to
positional deletes to the engine side. The only difference is whether the
conversion happens on the write side or on the read side. This is a step
back, and doesn't help solving the problem for streaming engines.


Option 2 needs a bit of thought. We need to find a good indexing strategy
which allows big tables and plays nice with streaming writes too. It would
be good to allow the engines to cache a limited part of the index, and
distribute records to the writers in a way that the cache locality is
considered. Also creating an index file for every data file might be
suboptimal, as reading that many index files could tank performance.

I have yet to see a good idea for Option 3.

Another option could be to provide a way to convert the equality deletes to
positional deletes as soon as possible. Maybe in frequent compaction tasks,
or at first read? When we first apply an equality delete for a file, it is
very easy to calculate the equivalent DV for the equality delete.
Subsequent readers could depend on the DV instead of the equality delete.
This could be the least disruptive change, but I see a few issues with this
solution as well:
- First read still depends on the equality delete which in edge cases could
be very costly
- Filters don't play well with this method

All-in-all, I like the index solution (Option 2) for several reasons:
- Cuts out the need for equality deletes which would reduce the code quite
a bit
- While slows down the write path a bit, the overall workflow (write+read)
benefits from it
- The index could be reused for other use-cases as well

Thanks,
Peter

Gyula Fóra <gyula.f...@gmail.com> ezt írta (időpont: 2025. máj. 9., P,
12:16):

> Hi Anton,
>
> Thank you for summarizing the options we see at this stage in a structured
> and concise way.
>
> Based on the use-cases I see in the industry, I feel that not all of the
> highlighted options are feasible (or desirable).
>
>
> Option 4 would basically remove any possibilities for native streaming CDC
> reads on tables, severely limiting how Iceberg can be used in the future
> for real-time use-cases from Flink and other similar engines that may want
> to connect. I understand that the view reconciliation approach is
> implemented in Spark SQL already but implementing it in a proper streaming
> way would likely lead to similar problems that we are trying to solve here
> in the first place.
>
>
> Regarding Option 1, introducing engine specific/custom indexing solutions
> would go against the core design principles as it would be hard to mix
> different engines when writing / reading tables for CDC use-cases. (A
> streaming job would have a hard time writing upserts/equality deletes to
> tables that were written by a different engine). To me this sounds very
> similar to Option 4 in a way that it pushes too much logic to the engines
> in a way that would hurt the compatibility across the engines.
>
>
> Based on this and the context of the discussion, Option 2 or a combination
> of Option 2 & 3 sounds the most reasonable to me. There are still a lot of
> questions on the practical implementation of the indexes and how we can do
> this efficiently so this is only a very early feedback from my end.
>
>
> Cheers,
>
> Gyula
>
>
>
> On Fri, May 9, 2025 at 12:14 AM Anton Okolnychyi <aokolnyc...@gmail.com>
> wrote:
>
>> I am glad to see that folks are thinking about this problem. I am looking
>> forward to a formal proposal/design doc to discuss details!
>>
>> Overall, this aligns with what we discussed in the
>> community earlier w.r.t. the future of equality deletes and streaming
>> upserts. If I were to summarize, we have the following options:
>>
>> Option 1: Add an inverted index (potentially distributed) maintained by
>> an engine that does streaming writes to always produce DVs, even in
>> streaming use cases. Deprecate/remove equality deletes from Iceberg.
>> Option 2: Add native indexing to Iceberg so that the lookup of
>> positions is quick and efficient enough to be used in streaming upserts.
>> Deprecate/remove equality deletes from Iceberg.
>> Option 3: Rethink equality deletes, potentially by introducing more
>> restrictions and trying to scope them to particular data files, similar to
>> DVs.
>> Option 4: Standardize on a view reconciliation approach that Tabular
>> implemented for CDC.
>>
>> I would like to highlight that what Spark does today during MERGE is
>> similar to a lookup in an inverted index represented by another Iceberg
>> table. That is OK for batch jobs but not enough for streaming.
>>
>> - Anton
>>
>> чт, 8 трав. 2025 р. о 10:08 Xiaoxuan Li <xiaoxuan.li....@gmail.com> пише:
>>
>>> Hi Zheng, Steven, Amogh and Gyula. Thank you all for the feedback!
>>>
>>> I agree with everyone, we need to narrow down the scope of this
>>> optimization. The primary issue I'm trying to address is the slow read
>>> performance caused by the growing number of equality delete files(streaming
>>> CDC scenarios). The other potential use cases are only mentioned to show
>>> the extensibility of this approach.
>>>
>>> And both equality and positional deletes suffer from the same core
>>> problem, records are evaluated repeatedly against multiple delete
>>> predicates, at read time for equality deletes, and at write time for
>>> positional deletes. This repeated evaluation is where the real bottleneck
>>> lies.
>>>
>>> It’s particularly bad for equality deletes, since we constantly
>>> recompute row positions during reads without ever materializing them.
>>> Eventually, a maintenance job is required just to rewrite them into
>>> positional deletes.
>>>
>>> The idea is to eliminate this overhead by introducing an inverted index
>>> that maps values (or value combinations) directly to row positions. This
>>> lets us skip full predicate evaluation and jump straight to the affected
>>> rows, similar to how Hudi uses a record-level index for fast upserts. If we
>>> can fetch row positions from the index, we can greatly reduce overhead
>>> during reads (for equality deletes) and writes (for positional deletes).
>>>
>>> In fact, if we can make positional deletes perform as well as equality
>>> deletes using this index, we might be able to get rid of equality deletes,
>>> but that needs to be evaluated for the upsert case.
>>>
>>> That’s part of the reason why other types of indexes came up, they’re
>>> applicable beyond just primary key columns, but CDC is the only scenario
>>> with stricter SLA requirements. So it makes sense to align on the exact
>>> problem and use cases. For now, I think we can define our main goal as
>>> supporting inverted indexing over primary key columns to address the
>>> slowness of reading caused by the growing number of equality delete files.
>>>
>>> Thanks, Amogh, for bringing up the comparison with known alternatives.
>>> We should include benchmarks for those as well, to illustrate the
>>> trade-offs in read/write performance, storage usage, and overall cost.
>>>
>>> Really appreciate your feedback! I’ll incorporate these into the next
>>> revision of the proposal.
>>>
>>>
>>> Thanks,
>>>
>>> Xiaoxuan
>>>
>>> On Thu, May 8, 2025 at 1:25 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>
>>>> Thank you for the proposal!
>>>>
>>>> I agree with what had been said above that we need to narrow down the
>>>> scope here and what is the primary target for the optimization.
>>>>
>>>> As Amogh has also pointed out, CDC (streaming) read performance (with
>>>> equality deletes) would be one of the biggest beneficiaries of this at a
>>>> first glance.
>>>> This is especially important for Flink users where this feature is
>>>> currently completely missing and there is a big demand for it as we rely on
>>>> equality deletes on the write path. [1]
>>>>
>>>> I am not aware of alternative proposals that would solve the equality
>>>> delete cdc read performance question, overall I think using indices is
>>>> reasonable and a very promising approach.
>>>>
>>>> Looking forward to more details and discussion!
>>>> Gyula
>>>>
>>>> [1] https://lists.apache.org/thread/njmxjmjjm341fp4mgynn483v15mhk7qd
>>>>
>>>>
>>>> On Thu, May 8, 2025 at 9:24 AM Amogh Jahagirdar <2am...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thank you for the proposal Xiaoxuan! I think I agree with Zheng and
>>>>> Steven's point that it'll probably be more helpful to start out with more
>>>>> specific "what" and "why" (known areas of improvement for Iceberg and
>>>>> driven by any use cases) before we get too deep into the "how".
>>>>>
>>>>> In my mind, the specific known area of improvement for Iceberg related
>>>>> to this proposal is improving streaming upsert behavior. One area this
>>>>> improvement is beneficial for is being able to provide better data
>>>>> freshness for Iceberg CDC mirror tables without the heavy read +
>>>>> maintenance cost that currently exist with Flink upserts.
>>>>>
>>>>> As you mentioned, equality deletes have the benefit of being very
>>>>> cheap to write but can come at a high and unpredictable cost at read time.
>>>>> Challenges with equality deletes have been discussed in the past [1].
>>>>> I'll also add that if one of the goals is to improving streaming
>>>>> upserts (e.g. for applying CDC change streams into Iceberg mirror tables),
>>>>> then there are alternatives that I think we should compare against to make
>>>>> the tradeoffs clear. These alternatives include leveraging the known
>>>>> changelog view or merge patterns [2] or improving the existing maintenance
>>>>> procedures.
>>>>>
>>>>> I think the potential for being able to use a inverted index for
>>>>> upsert cases to more directly identify positions in a file to directly
>>>>> write DVs  is very exciting, but before getting too far into the weeds, I
>>>>> think it'd first be helpful
>>>>> to make sure we agree on the specific problem we're trying to solve
>>>>> when we talk about performance improvements along with any use cases,
>>>>> followed by comparison with known alternatives (ideally we can get numbers
>>>>> that demonstrate the read/write/storage/cost tradeoffs for the proposed
>>>>> inverted index).
>>>>>
>>>>> [1]https://lists.apache.org/thread/z0gvco6hn2bpgngvk4h6xqrnw8b32sw6
>>>>> [2]https://www.tabular.io/blog/hello-world-of-cdc/
>>>>>
>>>>> Thanks,
>>>>> Amogh Jahagirdar
>>>>>
>>>>

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

Reply via email to