Thanks, Ryan, for the updates! I've already read your blog series and learned a lot. Our customer is using a similar Flink Append + Spark MERGE INTO approach. I'm wondering whether there's a plan to implement this pattern with two branches, as proposed by Jack in Improvement 3.
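For reference, the merge semantics we rely on look roughly like this: Flink appends raw CDC events to an append-only changelog table, and a periodic Spark job folds them into the mirror table. The sketch below simulates those semantics in plain Python rather than Spark, and the event shape and field names are made up for illustration:

```python
# Simplified sketch of the "Flink Append + Spark MERGE INTO" pattern:
# the changelog holds ordered CDC events; a periodic job applies them
# to the mirror, keyed by primary key. Event fields are hypothetical.

def merge_changelog(mirror: dict, changelog: list) -> dict:
    """Apply CDC events, in order, to a mirror keyed by primary key."""
    for event in changelog:
        key = event["id"]
        if event["op"] == "D":            # delete
            mirror.pop(key, None)
        else:                             # insert or update (upsert)
            mirror[key] = event["data"]
    return mirror

mirror = {1: {"name": "a"}, 2: {"name": "b"}}
changelog = [
    {"op": "U", "id": 1, "data": {"name": "a2"}},   # update row 1
    {"op": "D", "id": 2, "data": None},             # delete row 2
    {"op": "I", "id": 3, "data": {"name": "c"}},    # insert row 3
]
result = merge_changelog(mirror, changelog)
# result is now {1: {"name": "a2"}, 3: {"name": "c"}}
```

In Spark SQL this corresponds to a single MERGE INTO statement over the changelog: WHEN MATCHED AND op = 'D' THEN DELETE, WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT.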
Manu

On Mon, Feb 26, 2024 at 5:13 AM Ryan Blue <b...@tabular.io> wrote:

> Manu,
>
> I haven't seen much improvement to the Flink/UPSERT approach to CDC. It's
> still half-finished. There are some efforts to add table maintenance to
> Flink, but the main issue -- work is deferred and never done -- hasn't
> been addressed. I don't recommend this approach.
>
> The other approach has seen progress. I wrote a blog series
> <https://tabular.io/blog/cdc-zen-art-of-cdc-performance/> about CDC
> patterns that builds up to the pattern of using separate changelog and
> mirror tables, to help people understand how to do it and what the
> benefits are. We are also getting closer to a release with views that
> will allow us to use the latest from the changelog table at read time.
> That's going to be the most efficient implementation overall.
>
> Ryan
>
> On Tue, Feb 20, 2024 at 1:41 AM Manu Zhang <owenzhang1...@gmail.com>
> wrote:
>
>> Bumping this thread again. Are we actively working on any of the
>> proposed approaches?
>>
>> Manu
>>
>> On Fri, May 5, 2023 at 9:14 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Thanks for taking the time to write this up, Jack! It definitely
>>> overlaps my own thinking, which is good confirmation that we're on the
>>> right track. There are a couple of things I want to add to the
>>> discussion.
>>>
>>> First, I think the doc relies fairly heavily on the Iceberg/Flink
>>> approach being _the_ approach to CDC with Iceberg. That's not how I've
>>> seen the pattern implemented at scale, and I generally think of the
>>> Iceberg/Flink implementation as unfinished -- it is missing critical
>>> components.
>>>
>>> As a format, Iceberg tries to be flexible and let you make trade-offs.
>>> The main trade-off it offers is deferring work until later. Sorting is
>>> a good example: Flink can't sort easily, so it leaves sorting to
>>> downstream systems. Another way we defer work is using delete files to
>>> keep write amplification down.
>>> And yet another way is using equality delete files to avoid needing to
>>> locate records. The issue with the Iceberg/Flink approach is that it
>>> uses all three of these and defers a ton of work that never gets
>>> completed. I think the Iceberg/Flink UPSERT feature is incomplete, and
>>> I would not recommend it without something cleaning up the tables.
>>>
>>> It seems to me that Approaches 1 and 2 are trying to fix this, but not
>>> really in straightforward ways. I like that Approach 1 also preserves
>>> history, but I think there's a better and more direct option in
>>> Approach 3.
>>>
>>> Second, I think we also need to consider transactional consistency.
>>> This isn't hard to achieve, and it is how people consuming the table
>>> think about the data. I think we should always mirror the consistency
>>> of the source table downstream in the mirror table.
>>>
>>> In the end, I think we probably agree on the overall approach (3): keep
>>> a changelog branch or table and a mirror table, then keep them in sync.
>>> I'd also add views to the mix to get the latest up-to-date information.
>>> We can also make this better by adapting Approach 2 for streaming
>>> writes. I think Anton has been working on an index approach that would
>>> work.
>>>
>>> If we're aligned, I think it should be easy to start building this
>>> pattern and adding support in Iceberg for things like updating the
>>> schemas at the same time.
>>>
>>> Ryan
>>>
>>> On Thu, May 4, 2023 at 3:00 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> Thanks, Jack, for the great write-up. It's a good summary of the
>>>> current CDC landscape, too. I left a few comments to discuss.
>>>>
>>>> On Wed, Apr 26, 2023 at 11:38 AM Anton Okolnychyi
>>>> <aokolnyc...@apple.com.invalid> wrote:
>>>>
>>>>> Thanks for starting a thread, Jack! I have yet to go through the
>>>>> proposal.
>>>>>
>>>>> I recently came across a similar idea in BigQuery, which relies on a
>>>>> staleness threshold:
>>>>> https://cloud.google.com/blog/products/data-analytics/bigquery-gains-change-data-capture-functionality/
>>>>>
>>>>> It would also be nice to check whether there are any applicable ideas
>>>>> in Paimon:
>>>>> https://github.com/apache/incubator-paimon/
>>>>>
>>>>> - Anton
>>>>>
>>>>> On Apr 26, 2023, at 11:32 AM, Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> As we discussed in the community sync, it looks like we have some
>>>>> general interest in improving the CDC streaming process. Dan mentioned
>>>>> that Ryan has a proposal for an alternative CDC approach that keeps an
>>>>> accumulated changelog and periodically syncs it to a target table.
>>>>>
>>>>> I have a very similar design doc that I have been working on for quite
>>>>> some time, describing a set of improvements we could make to the
>>>>> Iceberg CDC use case, and it contains a very similar improvement (see
>>>>> Improvement 3).
>>>>>
>>>>> I would appreciate feedback from the community on this doc, and I can
>>>>> organize some meetings afterwards to discuss our thoughts on this
>>>>> topic.
>>>>>
>>>>> Doc link:
>>>>> https://docs.google.com/document/d/1kyyJp4masbd1FrIKUHF1ED_z1hTARL8bNoKCgb7fhSQ/edit#
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>
> --
> Ryan Blue
> Tabular
>
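PS: to check my understanding of the read-time view Ryan describes: I picture it as overlaying any changelog events newer than the mirror's last-synced point on top of the mirror, so readers see fresh rows between merges without the merge job having to run. A rough Python sketch of that semantic; the event shape, timestamps, and names are made up for illustration, not Iceberg's actual view definition:

```python
# Sketch of the "latest" view over a mirror + changelog pair: the mirror
# reflects the state as of the last merge (synced_upto); the view applies
# only the changelog events written after that point.

def latest_view(mirror: dict, changelog: list, synced_upto: int) -> dict:
    """Overlay unapplied changelog events (ts > synced_upto) on the mirror."""
    view = dict(mirror)  # never mutate the mirror itself
    for event in changelog:
        if event["ts"] <= synced_upto:
            continue  # already folded in by the last merge
        if event["op"] == "D":
            view.pop(event["id"], None)
        else:
            view[event["id"]] = event["data"]
    return view

mirror = {1: {"v": 10}}               # state as of the last merge (ts=100)
changelog = [
    {"ts": 90,  "op": "I", "id": 1, "data": {"v": 10}},   # already merged
    {"ts": 110, "op": "U", "id": 1, "data": {"v": 11}},   # not yet merged
    {"ts": 120, "op": "I", "id": 2, "data": {"v": 20}},   # not yet merged
]
fresh = latest_view(mirror, changelog, synced_upto=100)
# fresh is {1: {"v": 11}, 2: {"v": 20}}
```

If that matches the intent, the efficiency win is clear: the periodic MERGE amortizes the heavy rewrite work, while the view keeps reads up to date in between.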