Technically different, but somewhat similar question:

What is the expected behaviour when the `IncrementalScan` is created not for a single snapshot, but for multiple snapshots?

S1 added PK1-V1
S2 updated PK1-V1 to PK1-V1b (removed PK1-V1 and added PK1-V1b)
S3 updated PK1-V1b to PK1-V1c (removed PK1-V1b and added PK1-V1c)

Let's say we have *IncrementalScan.fromSnapshotInclusive(S1).toSnapshot(S3)*.
Do we need to return:
(a)
- PK1,V1c,INSERTED

Or is it OK to return:
(b)
- PK1,V1,INSERTED
- PK1,V1,DELETED
- PK1,V1b,INSERTED
- PK1,V1b,DELETED
- PK1,V1c,INSERTED

I think (a) is the correct behaviour.

Thanks,
Peter

Steven Wu <stevenz...@gmail.com> wrote (on Wed, Aug 21, 2024, 22:27):

> Agree with everyone that option (a) is the correct behavior.
>
> On Wed, Aug 21, 2024 at 11:57 AM Steve Zhang
> <hongyue_zh...@apple.com.invalid> wrote:
>
>> I agree that option (a) is what user expects for row level changes.
>>
>> I feel the added deletes in given snapshots provides a PK of DELETED
>> entry, existing deletes are used to read together with data files to find
>> DELETED value (V1b) and result of columns.
>>
>> Thanks,
>> Steve Zhang
>>
>> On Aug 20, 2024, at 6:06 PM, Wing Yew Poon <wyp...@cloudera.com.INVALID>
>> wrote:
>>
>> Hi,
>>
>> I have a PR open to add changelog support for the case where delete files
>> are present (https://github.com/apache/iceberg/pull/10935). I have a
>> question about what the changelog should emit in the following scenario:
>>
>> The table has a schema with a primary key/identifier column PK and
>> additional column V.
>> In snapshot 1, we write a data file DF1 with rows
>> PK1, V1
>> PK2, V2
>> etc.
>> In snapshot 2, we write an equality delete file ED1 with PK=PK1, and new
>> data file DF2 with rows
>> PK1, V1b
>> (possibly other rows)
>> In snapshot 3, we write an equality delete file ED2 with PK=PK1, and new
>> data file DF3 with rows
>> PK1, V1c
>> (possibly other rows)
>>
>> Thus, in snapshot 2 and snapshot 3, we update the row identified by PK1
>> with new values by using an equality delete and writing new data for the
>> row.
>>
>> These are the files present in snapshot 3:
>> DF1 (sequence number 1)
>> DF2 (sequence number 2)
>> DF3 (sequence number 3)
>> ED1 (sequence number 2)
>> ED2 (sequence number 3)
>>
>> The question I have is what should the changelog emit for snapshot 3?
>> For snapshot 1, the changelog should emit a row for each row in DF1 as
>> INSERTED.
>> For snapshot 2, it should emit a row for PK1, V1 as DELETED; and a row
>> for PK1, V1b as INSERTED.
>> For snapshot 3, I see two possibilities:
>> (a)
>> PK1,V1b,DELETED
>> PK1,V1c,INSERTED
>>
>> (b)
>> PK1,V1,DELETED
>> PK1,V1b,DELETED
>> PK1,V1c,INSERTED
>>
>> The interpretation for (b) is that both ED1 and ED2 apply to DF1, with
>> ED1 being an existing delete file and ED2 being an added delete file for
>> it. We discount ED1 and apply ED2 and get a DELETED row for PK1,V1.
>> ED2 also applies to DF2, from which we get a DELETED row for PK1, V1b.
>>
>> The interpretation for (a) is that ED1 is an existing delete file for DF1
>> and in snapshot 3, the row PK1,V1 already does not exist before the
>> snapshot. Thus we do not emit a row for it. (We can think of it as ED1 is
>> already applied to DF1, and we only consider any additional rows that get
>> deleted when ED2 is applied.)
>>
>> I lean towards (a), as I think it is more reflective of net changes.
>> I am interested to hear what folks think.
>>
>> Thank you,
>> Wing Yew
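
To make the two questions in this thread concrete, here is a rough Python sketch of interpretation (a). It is only a toy model, not Iceberg code and not the implementation in the PR: `DataFile`, `EqDelete`, `live_rows`, `changelog_for_snapshot` and `net_changes` are hypothetical names, and it assumes that each snapshot's number equals the sequence number of the files it adds. The one rule taken from the table spec is that an equality delete applies only to data files with a strictly lower sequence number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    seq: int      # data sequence number (assumed equal to the snapshot number)
    rows: tuple   # (pk, value) pairs


@dataclass(frozen=True)
class EqDelete:
    seq: int         # sequence number of the equality delete file
    pks: frozenset   # primary keys removed by this delete file


def live_rows(data_files, deletes):
    """Rows that survive the given equality deletes.

    A delete applies only to data files with a strictly lower sequence number.
    """
    return [
        (pk, v)
        for f in data_files
        for pk, v in f.rows
        if not any(d.seq > f.seq and pk in d.pks for d in deletes)
    ]


def changelog_for_snapshot(n, data_files, deletes):
    """Changelog of snapshot n under interpretation (a).

    Existing deletes are applied first; only rows that additionally
    disappear because of the deletes added in snapshot n are DELETED.
    """
    existing_files = [f for f in data_files if f.seq < n]
    existing_deletes = [d for d in deletes if d.seq < n]
    added_deletes = [d for d in deletes if d.seq == n]
    before = live_rows(existing_files, existing_deletes)
    after = live_rows(existing_files, existing_deletes + added_deletes)
    deleted = [r for r in before if r not in after]
    inserted = [r for f in data_files if f.seq == n for r in f.rows]
    return ([(pk, v, "DELETED") for pk, v in deleted]
            + [(pk, v, "INSERTED") for pk, v in inserted])


def net_changes(first, last, data_files, deletes):
    """Net changes over snapshots first..last (inclusive), intermediate states dropped."""
    before = set(live_rows([f for f in data_files if f.seq < first],
                           [d for d in deletes if d.seq < first]))
    after = set(live_rows([f for f in data_files if f.seq <= last],
                          [d for d in deletes if d.seq <= last]))
    return sorted([(pk, v, "DELETED") for pk, v in before - after]
                  + [(pk, v, "INSERTED") for pk, v in after - before])


# The files from the scenario above.
DATA_FILES = [
    DataFile(1, (("PK1", "V1"), ("PK2", "V2"))),  # DF1
    DataFile(2, (("PK1", "V1b"),)),               # DF2
    DataFile(3, (("PK1", "V1c"),)),               # DF3
]
DELETES = [
    EqDelete(2, frozenset({"PK1"})),  # ED1
    EqDelete(3, frozenset({"PK1"})),  # ED2
]

print(changelog_for_snapshot(3, DATA_FILES, DELETES))
print(net_changes(1, 3, DATA_FILES, DELETES))
```

For snapshot 3 this yields PK1,V1b,DELETED and PK1,V1c,INSERTED, i.e. option (a) from the original question. For the S1 to S3 range it yields only PK1,V1c,INSERTED plus PK2,V2,INSERTED (the latter because DF1 also contains PK2 in this scenario), which corresponds to option (a) of the multi-snapshot question.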