Hi,

We have been working to add row-level update support to our data ingestion pipeline using the Spark batch API (no streaming use case for now). Currently, we are looking to adhere to the "copy on write" implementation of DELETE and MERGE INTO (WIP via #1947).
Our main use case is primary-key-based datasets (e.g., a MySQL binlog export), where DELETE and MERGE always update records based on the primary key. Since I know the primary key and updates are always keyed on it, I can reconstruct CDC from the appended/deleted data of the table: take a full outer join on the primary key between the appended data and the deleted data, and emit which rows were inserted/updated/deleted, along with the previous value in the case of updates/deletes (rough Spark sketches are at the bottom of this mail).

So, two queries/clarifications:

1. To read all the appended data with "copy on write", I should be able to query it incrementally (at least via the batch APIs), but currently Iceberg doesn't support incremental reads over "overwrite" snapshots. I logged https://github.com/apache/iceberg/issues/1949 to at least support that by passing additional read options. Additionally, there should be support to expose *only* the deleted data incrementally, so that I can do the above join between appended and deleted data over the batch API (see the second sketch below).

2. Does anyone envision some generic support for CDC with the "copy on write" implementation? The approach above is quite specific to my use case.

Let me know your thoughts. I was thinking of picking up https://github.com/apache/iceberg/issues/1949 and raising a PR if people think this is useful and it doesn't break any existing constructs.

Thanks,
Ashish
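
To make the join part concrete, here is a minimal sketch of what I have in mind (Spark/Scala). All names are illustrative, not an existing Iceberg API: appendedDf/deletedDf are the appended and deleted rows of one copy-on-write commit, "id" is the primary key, and "value" stands in for the payload columns.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def deriveCdc(appendedDf: DataFrame, deletedDf: DataFrame): DataFrame = {
  val a = appendedDf.alias("a")
  val d = deletedDf.alias("d")

  a.join(d, col("a.id") === col("d.id"), "full_outer")
    .select(
      coalesce(col("a.id"), col("d.id")).as("id"),
      when(col("d.id").isNull, "INSERT")      // row only in appended data
        .when(col("a.id").isNull, "DELETE")   // row only in deleted data
        .otherwise("UPDATE")                  // row present on both sides
        .as("change_type"),
      col("d.value").as("before_value"),      // previous value (null for inserts)
      col("a.value").as("after_value"))       // new value (null for deletes)
    // Copy-on-write rewrites whole data files, so unchanged rows can show up on
    // both sides of the join; drop rows whose payload did not actually change.
    .filter(col("change_type") =!= "UPDATE" || col("before_value") =!= col("after_value"))
}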
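
And for query 1, this is roughly the read path I would like to end up with. I'm assuming the existing start-snapshot-id / end-snapshot-id incremental-read options; the other two option names, the table name, and the snapshot id variables are purely hypothetical, just to illustrate what the additional read options from #1949 (plus a deleted-data variant) could look like.

// Assumes `spark` is an active SparkSession and fromSnapshotId / toSnapshotId
// are two snapshot ids taken from the table's history.
val appendedDf = spark.read
  .format("iceberg")
  .option("start-snapshot-id", fromSnapshotId)   // existing incremental-read options
  .option("end-snapshot-id", toSnapshotId)
  .option("allow-overwrite-snapshots", "true")   // hypothetical option (#1949)
  .load("db.events")                             // illustrative table name

val deletedDf = spark.read
  .format("iceberg")
  .option("start-snapshot-id", fromSnapshotId)
  .option("end-snapshot-id", toSnapshotId)
  .option("read-deleted-data-only", "true")      // hypothetical option
  .load("db.events")

val cdcDf = deriveCdc(appendedDf, deletedDf)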