Bumping this thread again. Are we actively working on any of the proposed approaches?
Manu

On Fri, May 5, 2023 at 9:14 AM Ryan Blue <b...@tabular.io> wrote:

> Thanks for taking the time to write this up, Jack! It definitely overlaps my own thinking, which is a good confirmation that we're on the right track. There are a couple of things I want to add to the discussion.
>
> First, I think the doc relies fairly heavily on the Iceberg/Flink approach being _the_ approach to CDC with Iceberg. That's not how I've seen the pattern implemented at scale, and I generally think of the Iceberg/Flink implementation as unfinished -- it is missing critical components.
>
> As a format, Iceberg tries to be flexible and allow you to make trade-offs. The main trade-off you get is to defer work until later. Sorting is a good example: Flink can't sort easily, so it leaves sorting to downstream systems. Another way we defer work is using delete files to keep write amplification down. And yet another is using equality delete files to avoid needing to locate records. The issue with the Iceberg/Flink approach is that it uses all 3 of these and defers a ton of work that never gets completed. I think the Iceberg/Flink UPSERT feature is incomplete, and I would not recommend it without something cleaning up the tables.
>
> It seems to me that Approaches 1 and 2 are trying to fix this, but not really in straightforward ways. I like that Approach 1 also preserves history, but I think there's a better and more direct option in Approach 3.
>
> Second, I think we also need to consider transactional consistency. This isn't hard to achieve, and it is how people consuming the table think about the data. I think we should always try to mirror the consistency of the source table downstream in the mirror table.
>
> In the end, I think we probably agree on the overall approach (3): keep a changelog branch or table and a mirror table, then keep them in sync. I'd also add views to the mix to get the latest, up-to-date information. We can also make this better by adapting Approach 2 for streaming writes. I think Anton has been working on an index approach that would work.
>
> If we're aligned, I think it should be easy to start building this pattern and adding support in Iceberg for things like updating the schemas at the same time.
>
> Ryan
>
> On Thu, May 4, 2023 at 3:00 PM Steven Wu <stevenz...@gmail.com> wrote:
>
>> Thanks Jack for the great write-up. Good summary of the current landscape of CDC too. Left a few comments to discuss.
>>
>> On Wed, Apr 26, 2023 at 11:38 AM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>>
>>> Thanks for starting a thread, Jack! I have yet to go through the proposal.
>>>
>>> I recently came across a similar idea in BigQuery, which relies on a staleness threshold:
>>> https://cloud.google.com/blog/products/data-analytics/bigquery-gains-change-data-capture-functionality/
>>>
>>> It would also be nice to check if there are any applicable ideas in Paimon:
>>> https://github.com/apache/incubator-paimon/
>>>
>>> - Anton
>>>
>>> On Apr 26, 2023, at 11:32 AM, Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>> Hi everyone,
>>>
>>> As we discussed in the community sync, it looks like we have some general interest in improving the CDC streaming process. Dan mentioned that Ryan has a proposal for an alternative CDC approach that keeps an accumulated changelog that is periodically synced to a target table.
>>>
>>> I have a very similar design doc that I have been working on for quite some time, describing a set of improvements we could make to the Iceberg CDC use case, and it contains a very similar improvement (see improvement 3).
>>>
>>> I would appreciate feedback from the community about this doc, and I can organize some meetings to discuss our thoughts on this topic afterwards.
>>>
>>> Doc link:
>>> https://docs.google.com/document/d/1kyyJp4masbd1FrIKUHF1ED_z1hTARL8bNoKCgb7fhSQ/edit#
>>>
>>> Best,
>>> Jack Ye
>
> --
> Ryan Blue
> Tabular
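[Editor's note: for concreteness, below is a minimal sketch of the changelog-plus-mirror sync pattern Ryan describes above (Approach 3), assuming PySpark with an Iceberg catalog. The catalog name ("demo"), table names (db.orders_changelog, db.orders_mirror), columns (id, data, op, change_ordinal), and the watermark value are all hypothetical; this is not the implementation discussed in the doc, just an illustration of periodically merging the accumulated changelog into the mirror table.]

# Sketch: periodically merge an accumulated changelog table into a mirror table.
# Assumes the Iceberg Spark runtime jar is on the classpath; catalog type and
# warehouse location depend on the deployment and are placeholders here.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cdc-mirror-sync")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Hypothetical watermark tracked by the sync job (e.g. last synced change ordinal).
last_synced_ordinal = 0

spark.sql(f"""
    MERGE INTO demo.db.orders_mirror AS m
    USING (
        -- keep only the latest change per key since the last sync
        SELECT id, data, op FROM (
            SELECT id, data, op,
                   row_number() OVER (PARTITION BY id
                                      ORDER BY change_ordinal DESC) AS rn
            FROM demo.db.orders_changelog
            WHERE change_ordinal > {last_synced_ordinal}
        ) t WHERE rn = 1
    ) AS c
    ON m.id = c.id
    WHEN MATCHED AND c.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET m.data = c.data
    WHEN NOT MATCHED AND c.op <> 'D' THEN INSERT (id, data) VALUES (c.id, c.data)
""")

A view that unions the mirror table with the not-yet-synced, deduplicated tail of the changelog could serve the "latest up-to-date" reads Ryan mentions, but those details depend on the approaches laid out in the doc.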