Hi Ashish,

I think it makes sense for your use case (#1949) to expose a way to read overwrites using incremental scans, but I'm not sure how best to expose it. This is safe for your case because you know which records were deleted based on the ID, so you're essentially replaying the data as incremental upserts rather than appends.

The simplest way is to allow configuring the incremental scan so that it ignores deletes, but that doesn't seem like a good thing to expose more generally. You know it is safe and that these are upserts, but other people might just see an option to stop failing on overwrites and use it without thinking. The only solution I can think of is to make this a scan option string, so it is available to you but not prominently visible when calling the API. That's not very clean, but it would work.
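To make the trade-off concrete, here is a minimal sketch of the "scan option string" idea in plain Python. None of these names are real Iceberg APIs; the snapshot list, the `incremental_scan` function, and the `ignore-overwrites` option string are all hypothetical, chosen only to illustrate how a hidden opt-in would behave.

```python
# Hypothetical sketch: an incremental scan fails on overwrite snapshots by
# default, but a string-keyed option lets a caller who knows the data is
# keyed opt into replaying overwrites as upserts. Not a real Iceberg API.

SNAPSHOTS = [
    {"id": 1, "operation": "append"},
    {"id": 2, "operation": "overwrite"},  # a keyed upsert, safe for this caller
    {"id": 3, "operation": "append"},
]

def incremental_scan(snapshots, options=None):
    """Yield snapshots in order, honoring a hidden opt-in option."""
    options = options or {}
    for snap in snapshots:
        if snap["operation"] == "overwrite":
            # Default behavior: refuse, because an arbitrary overwrite may
            # not be a simple upsert, and replaying it as appends is wrong.
            if options.get("ignore-overwrites", "false") != "true":
                raise ValueError(
                    f"Found overwrite snapshot {snap['id']}; "
                    "incremental scans only support appends"
                )
        yield snap

# A caller who knows the overwrites are keyed upserts opts in explicitly;
# everyone else hits the failure and has to think about why.
replayed = list(incremental_scan(SNAPSHOTS, {"ignore-overwrites": "true"}))
```

Because the opt-in is a string key rather than a first-class method, it stays out of autocomplete and casual discovery, which is the whole point of the compromise.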
For your second question, I think there are similar use cases. We could make this easier by extending table metadata so the case is easy to detect. For example, we could keep the key used for the MATCH clause of a MERGE INTO command in the commit metadata, so we can detect that the overwrite was keyed and reconstruct the CDC events. For now, I think the best option is for you to explore the space a bit more; when other people want to do something similar, we can work out what Iceberg should do at that point.

On Fri, Dec 18, 2020 at 7:16 AM Ashish Mehta <mehta.ashis...@gmail.com> wrote:

> Hi,
>
> We have been working to support row-level updates in our data ingestion
> pipeline using Spark on the batch API (no streaming use case for now).
> Currently, we are looking to adhere to the "copy on write" implementation
> of DELETE and MERGE INTO (WIP via #1947).
>
> Our main use case is primary-key-based datasets (like a MySQL binlog
> export), where DELETE and MERGE always update records based on the
> primary key. Since I know the primary key and this fixed pattern of
> key-based updates, I can easily reconstruct CDC from the appended and
> deleted data in the table: take a full outer join on the primary key
> between the appended data and the deleted data, and expose which rows
> were updated/inserted/deleted, along with the previous value in the case
> of updates and deletes.
>
> So, two questions/clarifications:
>
> 1. To read all the appended data with "copy on write", I should be able
> to query it incrementally (at least via batch APIs), but currently
> Iceberg doesn't support incremental reads of "overwrite" snapshots. I
> logged https://github.com/apache/iceberg/issues/1949 to at least support
> that by passing additional read options. Additionally, there should be
> support to expose *only* deleted data incrementally, so that I can
> perform the join above between appended and deleted data over the batch
> API.
> 2. Does anyone envision generic support for CDC in the "copy on write"
> implementation? The approach above is quite limited to my use case.
>
> Let me know your thoughts. I was thinking of picking up
> https://github.com/apache/iceberg/issues/1949 and raising a PR if people
> think this is useful and doesn't break any constructs.
>
> Thanks,
> Ashish

-- 
Ryan Blue
Software Engineer
Netflix
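As an aside, the full outer join Ashish describes in point 1 can be sketched with plain Python dicts standing in for the two datasets. In practice this would run as a Spark join over the appended and deleted files of an overwrite snapshot; the function name, field names, and sample rows here are purely illustrative.

```python
# Sketch of key-based CDC reconstruction: full outer join on the primary
# key between rows appended and rows deleted by a copy-on-write overwrite.
# A key on both sides is an update (deleted side holds the old value),
# appended-only is an insert, deleted-only is a delete.

def reconstruct_cdc(appended, deleted, key="id"):
    """Return (event, old_row, new_row) tuples ordered by primary key."""
    added = {row[key]: row for row in appended}
    removed = {row[key]: row for row in deleted}
    events = []
    for k in sorted(added.keys() | removed.keys()):  # full outer join on key
        if k in added and k in removed:
            events.append(("update", removed[k], added[k]))
        elif k in added:
            events.append(("insert", None, added[k]))
        else:
            events.append(("delete", removed[k], None))
    return events

appended = [{"id": 1, "v": "a"}, {"id": 2, "v": "b2"}]
deleted = [{"id": 2, "v": "b1"}, {"id": 3, "v": "c"}]
events = reconstruct_cdc(appended, deleted)
# id 1 -> insert, id 2 -> update (old value b1 preserved), id 3 -> delete
```

This only works because updates are guaranteed to be keyed; for an arbitrary overwrite the join would misclassify rewritten-but-unchanged rows, which is exactly why detecting "overwrite by key" from commit metadata matters.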