From a correctness perspective, I think only option (a) is valid. The
difference between snapshot 2 and snapshot 3 is one delete and one insert.
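
To make the two interpretations concrete, here is a minimal sketch in Python
of how the changelog for one snapshot could be derived from the files in
Wing Yew's example. It is only a toy model, not the code in the PR: the
names DataFile, EqDelete, is_deleted and changelog are made up for
illustration, and it assumes that each snapshot's ID equals the sequence
number of the files it adds, and that an equality delete applies only to
data files with a strictly smaller sequence number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    seq: int      # data sequence number of the file
    rows: tuple   # tuple of (pk, value) pairs


@dataclass(frozen=True)
class EqDelete:
    seq: int        # data sequence number of the delete file
    pks: frozenset  # primary-key values removed by this delete


def is_deleted(pk, data_seq, deletes):
    """True if any delete with a strictly larger sequence number matches pk."""
    return any(d.seq > data_seq and pk in d.pks for d in deletes)


def changelog(snapshot_seq, data_files, deletes, net=True):
    """Changelog rows for one snapshot.

    net=True  models option (a): rows already removed by existing deletes
              are not emitted again.
    net=False models option (b): added deletes are applied to every older
              data file, ignoring existing deletes.
    """
    existing = [d for d in deletes if d.seq < snapshot_seq]
    added = [d for d in deletes if d.seq == snapshot_seq]
    out = []
    # DELETED rows: rows in older data files matched by the added deletes.
    for df in data_files:
        if df.seq >= snapshot_seq:
            continue
        for pk, value in df.rows:
            if net and is_deleted(pk, df.seq, existing):
                continue  # the row was already gone before this snapshot
            if is_deleted(pk, df.seq, added):
                out.append((pk, value, "DELETED"))
    # INSERTED rows: every row in the data files added by this snapshot.
    for df in data_files:
        if df.seq == snapshot_seq:
            out.extend((pk, value, "INSERTED") for pk, value in df.rows)
    return out


# The files from the example in the thread.
DATA_FILES = [
    DataFile(1, (("PK1", "V1"), ("PK2", "V2"))),
    DataFile(2, (("PK1", "V1b"),)),
    DataFile(3, (("PK1", "V1c"),)),
]
DELETES = [
    EqDelete(2, frozenset({"PK1"})),
    EqDelete(3, frozenset({"PK1"})),
]

for row in changelog(3, DATA_FILES, DELETES, net=True):
    print(row)
```

With net=True the sketch emits the two rows of option (a) for snapshot 3;
with net=False it also emits PK1,V1 as DELETED a second time, which is
option (b).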

Jason Fine <ja...@upsolver.com.invalid> wrote (on Wed, Aug 21, 2024,
15:26):

> Great to see someone is working on this feature!
>
> IMHO option (a) is preferred. My (impulsive) reasoning is as follows:
>
>    1. I think in CDC you shouldn't be skipping snapshots, so you would get
>    the delete event while processing snapshot 2 anyway.
>    2. If you consider delete files with a strictly smaller sequence
>    number than the current data-sequence-number, it's not clear when they
>    stop being relevant. In a table scan they are no longer returned when
>    there are no data files with a sequence number small enough, but in CDC
>    you are not looking at all the files.
>    3. If the table contains many update commits, there could be many
>    active delete files at any given time, so you would re-publish every
>    event many times (once for every subsequent snapshot after the initial
>    delete).
>
>
> On Wed, Aug 21, 2024 at 4:07 AM Wing Yew Poon <wyp...@cloudera.com.invalid>
> wrote:
>
>> Hi,
>>
>> I have a PR open to add changelog support for the case where delete files
>> are present (https://github.com/apache/iceberg/pull/10935). I have a
>> question about what the changelog should emit in the following scenario:
>>
>> The table has a schema with a primary key/identifier column PK and
>> additional column V.
>> In snapshot 1, we write a data file DF1 with rows
>> PK1, V1
>> PK2, V2
>> etc.
>> In snapshot 2, we write an equality delete file ED1 with PK=PK1, and new
>> data file DF2 with rows
>> PK1, V1b
>> (possibly other rows)
>> In snapshot 3, we write an equality delete file ED2 with PK=PK1, and new
>> data file DF3 with rows
>> PK1, V1c
>> (possibly other rows)
>>
>> Thus, in snapshot 2 and snapshot 3, we update the row identified by PK1
>> with new values by using an equality delete and writing new data for the
>> row.
>> These are the files present in snapshot 3:
>> DF1 (sequence number 1)
>> DF2 (sequence number 2)
>> DF3 (sequence number 3)
>> ED1 (sequence number 2)
>> ED2 (sequence number 3)
>>
>> The question I have is what should the changelog emit for snapshot 3?
>> For snapshot 1, the changelog should emit a row for each row in DF1 as
>> INSERTED.
>> For snapshot 2, it should emit a row for PK1, V1 as DELETED; and a row
>> for PK1, V1b as INSERTED.
>> For snapshot 3, I see two possibilities:
>> (a)
>> PK1,V1b,DELETED
>> PK1,V1c,INSERTED
>>
>> (b)
>> PK1,V1,DELETED
>> PK1,V1b,DELETED
>> PK1,V1c,INSERTED
>>
>> The interpretation for (b) is that both ED1 and ED2 apply to DF1, with
>> ED1 being an existing delete file and ED2 being an added delete file for
>> it. We discount ED1 and apply ED2 and get a DELETED row for PK1,V1.
>> ED2 also applies to DF2, from which we get a DELETED row for PK1, V1b.
>>
>> The interpretation for (a) is that ED1 is an existing delete file for DF1
>> and in snapshot 3, the row PK1,V1 already does not exist before the
>> snapshot. Thus we do not emit a row for it. (We can think of it as ED1
>> being already applied to DF1, and we only consider any additional rows
>> that get deleted when ED2 is applied.)
>>
>> I lean towards (a), as I think it is more reflective of net changes.
>> I am interested to hear what folks think.
>>
>> Thank you,
>> Wing Yew
>>
>>
>>
>
> --
>
> *Jason Fine*
> Chief Software Architect
> ja...@upsolver.com  | www.upsolver.com
>
