Bumping this thread again. Are we actively working on any of the proposed approaches?
Manu

On Fri, May 5, 2023 at 9:14 AM Ryan Blue <b...@tabular.io> wrote:

> Thanks for taking the time to write this up, Jack! It definitely overlaps my own thinking, which is a good confirmation that we're on the right track. There are a couple of things I want to add to the discussion.
>
> First, I think the doc relies fairly heavily on the Iceberg/Flink approach being _the_ approach to CDC with Iceberg. That's not how I've seen the pattern implemented at scale, and I generally think of the Iceberg/Flink implementation as unfinished -- it is missing critical components.
>
> As a format, Iceberg tries to be flexible and allow you to make trade-offs. The main trade-off you get is to defer work until later. Sorting is a good example: Flink can't sort easily, so it leaves sorting to downstream systems. Another way we defer work is using delete files to keep write amplification down. And yet another is using equality delete files to avoid needing to locate records. The issue with the Iceberg/Flink approach is that it uses all 3 of these and defers a ton of work that never gets completed. I think the Iceberg/Flink UPSERT feature is incomplete, and I would not recommend it without something cleaning up the tables.
>
> It seems to me that Approaches 1 and 2 are trying to fix this, but not really in straightforward ways. I like that Approach 1 also preserves history, but I think there's a better and more direct option in Approach 3.
>
> Second, I think we also need to consider transactional consistency. This isn't hard to achieve, and it is how people consuming the table think about the data. I think we should always try to mirror the consistency of the source table downstream in the mirror table.
>
> In the end, I think we probably agree on the overall approach (3): keep a changelog branch or table and a mirror table, then keep them in sync. I'd also add views to the mix to get the latest, up-to-date information. We can also make this better by adapting Approach 2 for streaming writes. I think Anton has been working on an index approach that would work.
>
> If we're aligned, I think it should be easy to start building this pattern and adding support in Iceberg for things like updating the schemas at the same time.
>
> Ryan
>
> On Thu, May 4, 2023 at 3:00 PM Steven Wu <stevenz...@gmail.com> wrote:
>
>> Thanks Jack for the great write-up. Good summary of the current landscape of CDC too. Left a few comments to discuss.
>>
>> On Wed, Apr 26, 2023 at 11:38 AM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>>
>>> Thanks for starting a thread, Jack! I have yet to go through the proposal.
>>>
>>> I recently came across a similar idea in BigQuery, which relies on a staleness threshold:
>>> https://cloud.google.com/blog/products/data-analytics/bigquery-gains-change-data-capture-functionality/
>>>
>>> It would also be nice to check if there are any applicable ideas in Paimon:
>>> https://github.com/apache/incubator-paimon/
>>>
>>> - Anton
>>>
>>> On Apr 26, 2023, at 11:32 AM, Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>> Hi everyone,
>>>
>>> As we discussed in the community sync, it looks like we have some general interest in improving the CDC streaming process. Dan mentioned that Ryan has a proposal for an alternative CDC approach that keeps an accumulated changelog that is periodically synced to a target table.
>>>
>>> I have a very similar design doc that I have been working on for quite some time, describing a set of improvements we could make to the Iceberg CDC use case, and it contains a very similar improvement (see improvement 3).
>>>
>>> I would appreciate feedback from the community about this doc, and I can organize some meetings to discuss our thoughts on this topic afterwards.
>>>
>>> Doc link:
>>> https://docs.google.com/document/d/1kyyJp4masbd1FrIKUHF1ED_z1hTARL8bNoKCgb7fhSQ/edit#
>>>
>>> Best,
>>> Jack Ye
>
> --
> Ryan Blue
> Tabular
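[Editor's note: for concreteness, below is a minimal sketch of the changelog-plus-mirror sync pattern Ryan describes above (Approach 3), assuming PySpark with an Iceberg catalog. The catalog name ("demo"), table names (db.orders_changelog, db.orders_mirror), columns (id, data, op, change_ordinal), and the watermark value are all hypothetical; this is not the implementation discussed in the doc, just an illustration of periodically merging the accumulated changelog into the mirror table.]

# Sketch: periodically merge an accumulated changelog table into a mirror table.
# Assumes the Iceberg Spark runtime jar is on the classpath; catalog type and
# warehouse location depend on the deployment and are placeholders here.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cdc-mirror-sync")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Hypothetical watermark tracked by the sync job (e.g. last synced change ordinal).
last_synced_ordinal = 0

spark.sql(f"""
    MERGE INTO demo.db.orders_mirror AS m
    USING (
        -- keep only the latest change per key since the last sync
        SELECT id, data, op FROM (
            SELECT id, data, op,
                   row_number() OVER (PARTITION BY id
                                      ORDER BY change_ordinal DESC) AS rn
            FROM demo.db.orders_changelog
            WHERE change_ordinal > {last_synced_ordinal}
        ) t WHERE rn = 1
    ) AS c
    ON m.id = c.id
    WHEN MATCHED AND c.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET m.data = c.data
    WHEN NOT MATCHED AND c.op <> 'D' THEN INSERT (id, data) VALUES (c.id, c.data)
""")

A view that unions the mirror table with the not-yet-synced, deduplicated tail of the changelog could serve the "latest up-to-date" reads Ryan mentions, but those details depend on the approaches laid out in the doc.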