Re: Improve Change Data Capture Use Case for Iceberg

Ryan Blue Sun, 25 Feb 2024 13:13:10 -0800

Manu,

I haven't seen much improvement to the Flink/UPSERT approach to CDC. It's
still half-finished. There are some efforts to add table maintenance to
Flink, but the main issue -- work is deferred and never completed --
hasn't been addressed. I don't recommend this approach.


The other approach has seen progress. I wrote a blog series
<https://tabular.io/blog/cdc-zen-art-of-cdc-performance/> about CDC
patterns that builds to the pattern of using separate changelog and mirror
tables to help people understand how to do it and what the benefits are. We
are also getting closer to a release with views that will allow us to use
the latest from the changelog table at read time. That's going to be the
most efficient implementation overall.
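
As a rough illustration of the changelog/mirror pattern described above,
here is a minimal in-memory sketch in Python. It is not Iceberg or Flink
API: the names Change, read_view, and sync_mirror are hypothetical, the
"tables" are plain Python collections, and the sequence number stands in
for commit order in the source database. The read-time view overlays the
newest pending change per key on the mirror, and a periodic job folds
those changes into the mirror.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    """One changelog row (hypothetical layout, for illustration only)."""
    seq: int                     # commit order in the source database
    op: str                      # "I" insert, "U" update, "D" delete
    key: int
    value: Optional[str] = None


def latest_per_key(changes):
    """Keep only the newest change for each key."""
    latest = {}
    for change in sorted(changes, key=lambda c: c.seq):
        latest[change.key] = change
    return latest


def read_view(mirror, changelog, mirror_seq):
    """Read-time view: mirror rows overlaid with changes newer than mirror_seq."""
    result = dict(mirror)
    pending = [c for c in changelog if c.seq > mirror_seq]
    for key, change in latest_per_key(pending).items():
        if change.op == "D":
            result.pop(key, None)
        else:
            result[key] = change.value
    return result


def sync_mirror(mirror, changelog, mirror_seq):
    """Periodic batch job: fold pending changes into the mirror."""
    new_mirror = read_view(mirror, changelog, mirror_seq)
    new_seq = max((c.seq for c in changelog), default=mirror_seq)
    return new_mirror, max(new_seq, mirror_seq)


if __name__ == "__main__":
    mirror = {1: "a", 2: "b"}    # mirror is current as of seq 2
    changelog = [
        Change(1, "I", 1, "a"),
        Change(2, "I", 2, "b"),
        Change(3, "U", 1, "a2"),
        Change(4, "D", 2),
        Change(5, "I", 3, "c"),
    ]
    print(read_view(mirror, changelog, mirror_seq=2))
    print(sync_mirror(mirror, changelog, mirror_seq=2))
```

In this sketch the writer only ever appends to the changelog, so no work
is deferred on the write path, while readers can choose between the
mirror alone (cheaper, slightly stale) and the view (fresh, with a small
merge cost).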

Ryan

On Tue, Feb 20, 2024 at 1:41 AM Manu Zhang <[email protected]> wrote:

> Bump up this thread again. Are we actively working on any proposed
> approaches?
>
> Manu
>
> On Fri, May 5, 2023 at 9:14 AM Ryan Blue <[email protected]> wrote:
>
>> Thanks for taking the time to write this up, Jack! It definitely overlaps
>> my own thinking, which is a good confirmation that we're on the right
>> track. There are a couple of things I want to add to the discussion.
>>
>> First, I think the doc relies fairly heavily on the Iceberg/Flink
>> approach being _the_ approach to CDC with Iceberg. That's not how I've seen
>> the pattern implemented at scale, and I generally think of the
>> Iceberg/Flink implementation as unfinished -- it is missing critical
>> components.
>>
>> As a format, Iceberg tries to be flexible and allow you to make
>> trade-offs. The main trade-off you get is to defer work until later.
>> Sorting is a good example, where Flink can't sort easily so it leaves
>> sorting for downstream systems. Another way we defer work is using delete
>> files to keep write amplification down. And yet another way is using
>> equality delete files to avoid needing to locate records. The issue with
>> the Iceberg/Flink approach is that it uses all 3 of these and defers a ton
>> of work that never gets completed. I think the Iceberg/Flink UPSERT
>> feature is incomplete and would not recommend it without something cleaning
>> up the tables.
>>
>> It seems to me that Approaches 1 and 2 are trying to fix this, but not
>> really in straightforward ways. I like that Approach 1 also preserves
>> history, but I think there's a better and more direct option in Approach
>> 3.
>>
>> Second, I think we also need to consider transactional consistency. This
>> isn't hard to achieve and is how people consuming the table think about the
>> data. I think we should always try to mirror consistency in the source
>> table downstream in the mirror table.
>>
>> In the end, I think we probably agree on the overall approach (3). Keep a
>> changelog branch or table and a mirror table, then keep them in sync. I'd
>> also add views to the mix to get the latest up-to-date information. We can
>> also make this better by adapting Approach 2 for streaming writes. I think
>> Anton has been working on an index approach that would work.
>>
>> If we're aligned, I think it should be easy to start building this
>> pattern and adding support in Iceberg for things like updating the schemas
>> at the same time.
>>
>> Ryan
>>
>> On Thu, May 4, 2023 at 3:00 PM Steven Wu <[email protected]> wrote:
>>
>>> Thanks Jack for the great write-up. Good summary of the current
>>> landscape of CDC too. Left a few comments to discuss.
>>>
>>> On Wed, Apr 26, 2023 at 11:38 AM Anton Okolnychyi
>>> <[email protected]> wrote:
>>>
>>>> Thanks for starting a thread, Jack! I have yet to go through the
>>>> proposal.
>>>>
>>>> I recently came across a similar idea in BigQuery, which relies on a
>>>> staleness threshold:
>>>>
>>>> https://cloud.google.com/blog/products/data-analytics/bigquery-gains-change-data-capture-functionality/
>>>>
>>>> It would also be nice to check if there are any applicable ideas in
>>>> Paimon:
>>>> https://github.com/apache/incubator-paimon/
>>>>
>>>> - Anton
>>>>
>>>> On Apr 26, 2023, at 11:32 AM, Jack Ye <[email protected]> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> As we discussed in the community sync, it looks like we have some
>>>> general interest in improving the CDC streaming process. Dan mentioned that
>>>> Ryan has a proposal about an alternative CDC approach that has an
>>>> accumulated changelog that is periodically synced to a target table.
>>>>
>>>> I have a very similar design doc I have been working on for quite some
>>>> time to describe a set of improvements we could make to the Iceberg CDC use
>>>> case, and it contains a very similar improvement (see improvement 3).
>>>>
>>>> I would appreciate feedback from the community about this doc, and I
>>>> can organize some meetings to discuss our thoughts about this topic
>>>> afterwards.
>>>>
>>>> Doc link:
>>>> https://docs.google.com/document/d/1kyyJp4masbd1FrIKUHF1ED_z1hTARL8bNoKCgb7fhSQ/edit#
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Ryan Blue
Tabular
