Thanks, Ryan, for the updates! I've already read your blog series and learned a lot. Our customer is using a similar Flink Append + Spark MERGE INTO approach. I'm wondering whether there's a plan to implement this pattern with two branches, as proposed by Jack in Improvement 3.
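For reference, the merge semantics we rely on look roughly like this: Flink appends raw CDC events to an append-only changelog table, and a periodic Spark job folds them into the mirror table. The sketch below simulates those semantics in plain Python rather than Spark, and the event shape and field names are made up for illustration:

```python
# Simplified sketch of the "Flink Append + Spark MERGE INTO" pattern:
# the changelog holds ordered CDC events; a periodic job applies them
# to the mirror, keyed by primary key. Event fields are hypothetical.

def merge_changelog(mirror: dict, changelog: list) -> dict:
    """Apply CDC events, in order, to a mirror keyed by primary key."""
    for event in changelog:
        key = event["id"]
        if event["op"] == "D":            # delete
            mirror.pop(key, None)
        else:                             # insert or update (upsert)
            mirror[key] = event["data"]
    return mirror

mirror = {1: {"name": "a"}, 2: {"name": "b"}}
changelog = [
    {"op": "U", "id": 1, "data": {"name": "a2"}},   # update row 1
    {"op": "D", "id": 2, "data": None},             # delete row 2
    {"op": "I", "id": 3, "data": {"name": "c"}},    # insert row 3
]
result = merge_changelog(mirror, changelog)
# result is now {1: {"name": "a2"}, 3: {"name": "c"}}
```

In Spark SQL this corresponds to a single MERGE INTO statement over the changelog: WHEN MATCHED AND op = 'D' THEN DELETE, WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT.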
Manu

On Mon, Feb 26, 2024 at 5:13 AM Ryan Blue <b...@tabular.io> wrote:

> Manu,
>
> I haven't seen much improvement to the Flink/UPSERT approach to CDC. It's
> still half-finished. There are some efforts to add table maintenance to
> Flink, but the main issue -- work is deferred and never done -- hasn't
> been addressed. I don't recommend this approach.
>
> The other approach has seen progress. I wrote a blog series
> <https://tabular.io/blog/cdc-zen-art-of-cdc-performance/> about CDC
> patterns that builds up to the pattern of using separate changelog and
> mirror tables, to help people understand how to do it and what the
> benefits are. We are also getting closer to a release with views that
> will allow us to use the latest from the changelog table at read time.
> That's going to be the most efficient implementation overall.
>
> Ryan
>
> On Tue, Feb 20, 2024 at 1:41 AM Manu Zhang <owenzhang1...@gmail.com>
> wrote:
>
>> Bumping this thread again. Are we actively working on any of the
>> proposed approaches?
>>
>> Manu
>>
>> On Fri, May 5, 2023 at 9:14 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Thanks for taking the time to write this up, Jack! It definitely
>>> overlaps my own thinking, which is good confirmation that we're on the
>>> right track. There are a couple of things I want to add to the
>>> discussion.
>>>
>>> First, I think the doc relies fairly heavily on the Iceberg/Flink
>>> approach being _the_ approach to CDC with Iceberg. That's not how I've
>>> seen the pattern implemented at scale, and I generally think of the
>>> Iceberg/Flink implementation as unfinished -- it is missing critical
>>> components.
>>>
>>> As a format, Iceberg tries to be flexible and let you make trade-offs.
>>> The main trade-off it offers is deferring work until later. Sorting is
>>> a good example: Flink can't sort easily, so it leaves sorting to
>>> downstream systems. Another way we defer work is using delete files to
>>> keep write amplification down.
>>> And yet another way is using equality delete files to avoid needing to
>>> locate records. The issue with the Iceberg/Flink approach is that it
>>> uses all three of these and defers a ton of work that never gets
>>> completed. I think the Iceberg/Flink UPSERT feature is incomplete, and
>>> I would not recommend it without something cleaning up the tables.
>>>
>>> It seems to me that Approaches 1 and 2 are trying to fix this, but not
>>> really in straightforward ways. I like that Approach 1 also preserves
>>> history, but I think there's a better and more direct option in
>>> Approach 3.
>>>
>>> Second, I think we also need to consider transactional consistency.
>>> This isn't hard to achieve, and it is how people consuming the table
>>> think about the data. I think we should always mirror the consistency
>>> of the source table downstream in the mirror table.
>>>
>>> In the end, I think we probably agree on the overall approach (3): keep
>>> a changelog branch or table and a mirror table, then keep them in sync.
>>> I'd also add views to the mix to get the latest up-to-date information.
>>> We can also make this better by adapting Approach 2 for streaming
>>> writes. I think Anton has been working on an index approach that would
>>> work.
>>>
>>> If we're aligned, I think it should be easy to start building this
>>> pattern and adding support in Iceberg for things like updating the
>>> schemas at the same time.
>>>
>>> Ryan
>>>
>>> On Thu, May 4, 2023 at 3:00 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> Thanks, Jack, for the great write-up. It's a good summary of the
>>>> current CDC landscape, too. I left a few comments to discuss.
>>>>
>>>> On Wed, Apr 26, 2023 at 11:38 AM Anton Okolnychyi
>>>> <aokolnyc...@apple.com.invalid> wrote:
>>>>
>>>>> Thanks for starting a thread, Jack! I have yet to go through the
>>>>> proposal.
>>>>>
>>>>> I recently came across a similar idea in BigQuery, which relies on a
>>>>> staleness threshold:
>>>>> https://cloud.google.com/blog/products/data-analytics/bigquery-gains-change-data-capture-functionality/
>>>>>
>>>>> It would also be nice to check whether there are any applicable ideas
>>>>> in Paimon:
>>>>> https://github.com/apache/incubator-paimon/
>>>>>
>>>>> - Anton
>>>>>
>>>>> On Apr 26, 2023, at 11:32 AM, Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> As we discussed in the community sync, it looks like we have some
>>>>> general interest in improving the CDC streaming process. Dan mentioned
>>>>> that Ryan has a proposal for an alternative CDC approach that keeps an
>>>>> accumulated changelog and periodically syncs it to a target table.
>>>>>
>>>>> I have a very similar design doc that I have been working on for quite
>>>>> some time, describing a set of improvements we could make to the
>>>>> Iceberg CDC use case, and it contains a very similar improvement (see
>>>>> Improvement 3).
>>>>>
>>>>> I would appreciate feedback from the community on this doc, and I can
>>>>> organize some meetings afterwards to discuss our thoughts on this
>>>>> topic.
>>>>>
>>>>> Doc link:
>>>>> https://docs.google.com/document/d/1kyyJp4masbd1FrIKUHF1ED_z1hTARL8bNoKCgb7fhSQ/edit#
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>
> --
> Ryan Blue
> Tabular
>
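PS: to check my understanding of the read-time view Ryan describes: I picture it as overlaying any changelog events newer than the mirror's last-synced point on top of the mirror, so readers see fresh rows between merges without the merge job having to run. A rough Python sketch of that semantic; the event shape, timestamps, and names are made up for illustration, not Iceberg's actual view definition:

```python
# Sketch of the "latest" view over a mirror + changelog pair: the mirror
# reflects the state as of the last merge (synced_upto); the view applies
# only the changelog events written after that point.

def latest_view(mirror: dict, changelog: list, synced_upto: int) -> dict:
    """Overlay unapplied changelog events (ts > synced_upto) on the mirror."""
    view = dict(mirror)  # never mutate the mirror itself
    for event in changelog:
        if event["ts"] <= synced_upto:
            continue  # already folded in by the last merge
        if event["op"] == "D":
            view.pop(event["id"], None)
        else:
            view[event["id"]] = event["data"]
    return view

mirror = {1: {"v": 10}}               # state as of the last merge (ts=100)
changelog = [
    {"ts": 90,  "op": "I", "id": 1, "data": {"v": 10}},   # already merged
    {"ts": 110, "op": "U", "id": 1, "data": {"v": 11}},   # not yet merged
    {"ts": 120, "op": "I", "id": 2, "data": {"v": 20}},   # not yet merged
]
fresh = latest_view(mirror, changelog, synced_upto=100)
# fresh is {1: {"v": 11}, 2: {"v": 20}}
```

If that matches the intent, the efficiency win is clear: the periodic MERGE amortizes the heavy rewrite work, while the view keeps reads up to date in between.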