Re: Improve Change Data Capture Use Case for Iceberg

2024-02-28 Thread Péter Váry
I have been thinking about this quite a bit. Moving the temporary manifest files could work, but the prepared and not yet committed data files are also present in their final place. These data files are also not part of the table yet, and could be removed by the orphan files removal process. Movin

Re: Improve Change Data Capture Use Case for Iceberg

2024-02-28 Thread Ryan Blue
> No removed temporary files on Flink failure. (Spark orphan file removal needs to be configured to prevent removal of Flink temporary files which are needed on recovery)
This sounds like it's a larger problem. Shouldn't Flink store its state in a different prefix that won't be cleaned up by orpha

Re: Improve Change Data Capture Use Case for Iceberg

2024-02-28 Thread Péter Váry
Sorry to chime in a bit late to the conversation. I am currently working on implementing Flink in-job maintenance. The main target audience:
- Users who can't or don't want to use Spark
- Users who need frequent checkpointing (low latency in the Iceberg table) and have many small files
- CDC user

Re: Improve Change Data Capture Use Case for Iceberg

2024-02-26 Thread Renjie Liu
> The multi-table solution also fits better with materialized views, eventually.
Good point!
On Tue, Feb 27, 2024 at 6:39 AM Ryan Blue wrote:
> I can give an update on using branches vs using tables. I originally wanted to use branches, but I now prefer the approach using separate table

Re: Improve Change Data Capture Use Case for Iceberg

2024-02-26 Thread Ryan Blue
I can give an update on using branches vs using tables. I originally wanted to use branches, but I now prefer the approach using separate tables. The main reason to use a separate table is to avoid schema differences between branches. We recently looked into whether to support different schemas ac

Re: Improve Change Data Capture Use Case for Iceberg

2024-02-25 Thread Manu Zhang
Thanks Ryan for the updates! I've already read your blog series and learned a lot. Our customer is using a similar Flink Append + Spark MergeInto approach. I'm wondering whether there's a plan to implement this pattern with two branches as proposed by Jack in Improvement 3.
Manu
On Mon, Feb 26,

Re: Improve Change Data Capture Use Case for Iceberg

2024-02-25 Thread Ryan Blue
Manu, I haven't seen much improvement to the Flink/UPSERT approach to CDC. It's still half-finished. There are some efforts to add table maintenance to Flink, but the main issues -- work is being deferred and never done -- haven't been addressed. I don't recommend this approach. The other approac

Re: Improve Change Data Capture Use Case for Iceberg

2024-02-20 Thread Manu Zhang
Bump up this thread again. Are we actively working on any proposed approaches?
Manu
On Fri, May 5, 2023 at 9:14 AM Ryan Blue wrote:
> Thanks for taking the time to write this up, Jack! It definitely overlaps my own thinking, which is a good confirmation that we're on the right track. There

Re: Improve Change Data Capture Use Case for Iceberg

2023-05-04 Thread Ryan Blue
Thanks for taking the time to write this up, Jack! It definitely overlaps my own thinking, which is a good confirmation that we're on the right track. There are a couple of things I want to add to the discussion. First, I think the doc relies fairly heavily on the Iceberg/Flink approach being _the

Re: Improve Change Data Capture Use Case for Iceberg

2023-05-04 Thread Steven Wu
Thanks Jack for the great write-up. Good summary of the current landscape of CDC too. Left a few comments to discuss.
On Wed, Apr 26, 2023 at 11:38 AM Anton Okolnychyi wrote:
> Thanks for starting a thread, Jack! I am yet to go through the proposal.
>
> I recently came across a similar idea in B

Re: Improve Change Data Capture Use Case for Iceberg

2023-04-26 Thread Anton Okolnychyi
Thanks for starting a thread, Jack! I am yet to go through the proposal. I recently came across a similar idea in BigQuery, which relies on a staleness threshold: https://cloud.google.com/blog/products/data-analytics/bigquery-gains-change-data-capture-functionality/