Hi,

We have been working to support row-level updates in our data ingestion
pipeline using Spark's batch API (no streaming use case for now).
Currently, we are looking to adopt the "copy on write" implementation
of DELETE and MERGE INTO (WIP via #1947).

Our main use case is primary key-based datasets (like a MySQL binlog
export) where DELETE and MERGE always update records based on the
primary key. Since I know the primary key, and updates are always keyed
on it, I can reconstruct CDC from the data appended to and deleted from
the table: take a full outer join on the primary key between the appended
data and the deleted data, and expose which rows were
updated/inserted/deleted, along with the previous value in the case of
updates/deletes.
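
For illustration, here is a minimal Spark (Scala) sketch of that join.
appendedDF/deletedDF (the appended and deleted rows between two snapshots)
and the "pk" column are assumptions for the example, not an existing API:

    import org.apache.spark.sql.functions._

    // Full outer join on the primary key classifies every changed row:
    //   present on both sides  -> update (previous value comes from deletedDF)
    //   only on appended side  -> insert
    //   only on deleted side   -> delete
    val cdc = appendedDF.alias("a")
      .join(deletedDF.alias("d"), col("a.pk") === col("d.pk"), "full_outer")
      .withColumn("op",
        when(col("a.pk").isNotNull && col("d.pk").isNotNull, "U")
          .when(col("a.pk").isNotNull, "I")
          .otherwise("D"))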

So, two queries/clarifications:

   1. In order to read all the appended data with "copy on write", I
   should be able to query it incrementally (at least via batch APIs), but
   currently Iceberg doesn't support incremental reads on "overwrite"
   snapshots. I logged https://github.com/apache/iceberg/issues/1949 to at
   least support that by passing additional read options.
   Additionally, there should be support for exposing only the deleted
   data incrementally, so that I can achieve the above join between
   appended and deleted data over the batch API (see the sketch after
   this list).
   2. Does anyone envision generic support for CDC with the "copy on
   write" implementation? The approach above is quite specific to my use
   case.
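
To make (1) concrete, here is roughly what the reads could look like.
The start-snapshot-id/end-snapshot-id options already exist for
incremental append scans (startId/endId are just snapshot ids); the last
option is purely a placeholder for whatever #1949 ends up adding, not a
real Iceberg option:

    // Existing incremental scan: appended rows between two snapshots.
    val appendedDF = spark.read.format("iceberg")
      .option("start-snapshot-id", startId)
      .option("end-snapshot-id", endId)
      .load("db.table")

    // Hypothetical: same range, but covering "overwrite" snapshots and
    // returning only the rows that were deleted (placeholder option name).
    val deletedDF = spark.read.format("iceberg")
      .option("start-snapshot-id", startId)
      .option("end-snapshot-id", endId)
      .option("scan-deleted-rows", "true") // not a real option today
      .load("db.table")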


Let me know your thoughts. I was thinking of picking up
https://github.com/apache/iceberg/issues/1949 and raising a PR, if people
think this is useful and it doesn't break any existing constructs.

Thanks,
Ashish
