+1 for option (a). 

Shani.

On 21 Aug 2024, at 19:07, Péter Váry <peter.vary.apa...@gmail.com> wrote:


I think that, from a correctness perspective, only option (a) is valid. The difference between snapshot 2 and snapshot 3 is one delete and one insertion.

Jason Fine <ja...@upsolver.com.invalid> wrote (on Wed, Aug 21, 2024, 15:26):
Great to see someone is working on this feature!

IMHO option (a) is preferred. My (impulsive) reasoning is as follows:
  1. I think in CDC you shouldn't be skipping snapshots, so you would get the delete event while processing snapshot 2 anyway.
  2. If you consider delete files with a strictly smaller sequence number than the current data sequence number, it's not clear when they stop being relevant. In a table scan they are no longer returned once there are no data files with a sequence number small enough, but in CDC you are not looking at all the files.
  3. If the table contains many update commits, there could be many active delete files at any given time, so you would re-publish every event many times (once for every subsequent snapshot after the initial delete).
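
To put a rough number on point 3, here is a minimal sketch that counts DELETED rows for a single key that is updated in every snapshot (equality delete plus new data file, as in the scenario quoted below). The function name and the flag are hypothetical, and this is plain arithmetic, not Iceberg code. It assumes that under option (b) every older version of the row is re-emitted as DELETED in each later snapshot, and that under option (a) only the latest live version is.

```python
def deleted_rows_emitted(num_snapshots, replay_existing_deletes):
    """Count DELETED changelog rows for one key updated in every snapshot.

    Hypothetical helper: snapshot 1 inserts the row, and each snapshot
    2..num_snapshots adds an equality delete for the key plus a new version.
    """
    total = 0
    for snap in range(2, num_snapshots + 1):
        # Older versions of the row sit in data files with sequence
        # numbers 1..snap-1; the new equality delete applies to all of them.
        older_versions = snap - 1
        if replay_existing_deletes:
            total += older_versions  # option (b): every old version again
        else:
            total += 1               # option (a): only the last live version
    return total


option_a_total = deleted_rows_emitted(10, replay_existing_deletes=False)
option_b_total = deleted_rows_emitted(10, replay_existing_deletes=True)
```

For ten snapshots this gives 9 DELETED rows under the option (a) assumption and 45 under the option (b) assumption, so the gap grows quadratically with the number of update commits.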

On Wed, Aug 21, 2024 at 4:07 AM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
Hi,

I have a PR open to add changelog support for the case where delete files are present (https://github.com/apache/iceberg/pull/10935). I have a question about what the changelog should emit in the following scenario:

The table has a schema with a primary key/identifier column PK and additional column V.
In snapshot 1, we write a data file DF1 with rows
PK1, V1
PK2, V2
etc.
In snapshot 2, we write an equality delete file ED1 with PK=PK1, and new data file DF2 with rows
PK1, V1b
(possibly other rows)
In snapshot 3, we write an equality delete file ED2 with PK=PK1, and new data file DF3 with rows
PK1, V1c
(possibly other rows)

Thus, in snapshot 2 and snapshot 3, we update the row identified by PK1 with new values by using an equality delete and writing new data for the row.
These are the files present in snapshot 3:
DF1 (sequence number 1)
DF2 (sequence number 2)
DF3 (sequence number 3)
ED1 (sequence number 2)
ED2 (sequence number 3)

The question I have is what should the changelog emit for snapshot 3?
For snapshot 1, the changelog should emit a row for each row in DF1 as INSERTED.
For snapshot 2, it should emit a row for PK1, V1 as DELETED; and a row for PK1, V1b as INSERTED.
For snapshot 3, I see two possibilities:
(a)
PK1,V1b,DELETED
PK1,V1c,INSERTED

(b)
PK1,V1,DELETED
PK1,V1b,DELETED
PK1,V1c,INSERTED

The interpretation for (b) is that both ED1 and ED2 apply to DF1, with ED1 being an existing delete file and ED2 being an added delete file for it. We discount ED1 and apply ED2 and get a DELETED row for PK1,V1.
ED2 also applies to DF2, from which we get a DELETED row for PK1, V1b.

The interpretation for (a) is that ED1 is an existing delete file for DF1, so in snapshot 3 the row PK1,V1 already does not exist before the snapshot. Thus we do not emit a row for it. (We can think of it as ED1 having already been applied to DF1, so we only consider any additional rows that get deleted when ED2 is applied.)
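
The two interpretations can be sketched as a toy model. This is only an illustration under stated assumptions, not the code in the PR: `DataFile`, `EqDelete`, `applies` and `changelog` are hypothetical names, the sequence number of each snapshot is assumed to equal its snapshot number as in the listing above, and an equality delete is taken to apply to data files with a strictly smaller data sequence number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    name: str
    seq: int      # data sequence number
    rows: tuple   # tuple of (pk, value) pairs


@dataclass(frozen=True)
class EqDelete:
    name: str
    seq: int      # data sequence number
    pks: frozenset


def applies(delete, data_file):
    # An equality delete applies to data files with a strictly smaller
    # data sequence number.
    return data_file.seq < delete.seq


def changelog(snapshot_seq, data_files, deletes, option):
    """Toy changelog for one snapshot; option is "a" or "b"."""
    added = [d for d in deletes if d.seq == snapshot_seq]
    existing = [d for d in deletes if d.seq < snapshot_seq]
    out = []
    for df in data_files:
        if df.seq >= snapshot_seq:
            continue  # only data files that existed before this snapshot
        for pk, value in df.rows:
            hit = any(applies(d, df) and pk in d.pks for d in added)
            gone = any(applies(d, df) and pk in d.pks for d in existing)
            # (a) skips rows already removed by an existing delete file;
            # (b) discounts existing delete files and emits them again.
            if hit and (option == "b" or not gone):
                out.append((pk, value, "DELETED"))
    for df in data_files:
        if df.seq == snapshot_seq:
            out.extend((pk, value, "INSERTED") for pk, value in df.rows)
    return out


data_files = [
    DataFile("DF1", 1, (("PK1", "V1"), ("PK2", "V2"))),
    DataFile("DF2", 2, (("PK1", "V1b"),)),
    DataFile("DF3", 3, (("PK1", "V1c"),)),
]
deletes = [
    EqDelete("ED1", 2, frozenset({"PK1"})),
    EqDelete("ED2", 3, frozenset({"PK1"})),
]

option_a = changelog(3, data_files, deletes, "a")
option_b = changelog(3, data_files, deletes, "b")
```

With these inputs, `option_a` holds the two rows listed under (a) and `option_b` holds the three rows listed under (b). For snapshot 2 both options give the same result, since there is no existing delete file yet.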

I lean towards (a), as I think it is more reflective of net changes.
I am interested to hear what folks think.

Thank you,
Wing Yew




--

Jason Fine
Chief Software Architect
