Peter, good question. In this case, (b) is the complete change history. (a)
is the squashed version.

I would probably check how other changelog systems deal with this scenario.
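
To make the distinction concrete, here is a minimal Python sketch of the two readings for the S1..S3 example quoted below. It is only a toy model, not Iceberg code: `SNAPSHOTS`, `complete_history`, and `squashed` are hypothetical names, and the per-snapshot changes are taken from Peter's description.

```python
# Per-snapshot row changes from the quoted example: (operation, pk, value).
SNAPSHOTS = {
    "S1": [("INSERT", "PK1", "V1")],
    "S2": [("DELETE", "PK1", "V1"), ("INSERT", "PK1", "V1b")],
    "S3": [("DELETE", "PK1", "V1b"), ("INSERT", "PK1", "V1c")],
}


def complete_history(snapshot_ids):
    """Return every change of every snapshot in the range, in commit order."""
    return [(sid, op, pk, value)
            for sid in snapshot_ids
            for op, pk, value in SNAPSHOTS[sid]]


def squashed(snapshot_ids):
    """Return the net changes over the range.

    An INSERT that is followed by a DELETE of the same row inside the
    range cancels out, so intermediate values never surface.
    """
    net = []
    for sid in snapshot_ids:
        for op, pk, value in SNAPSHOTS[sid]:
            if op == "DELETE" and ("INSERT", pk, value) in net:
                net.remove(("INSERT", pk, value))  # insert + delete cancel
            else:
                net.append((op, pk, value))
    return net


if __name__ == "__main__":
    scan_range = ["S1", "S2", "S3"]
    print(complete_history(scan_range))  # five changes, one per step
    print(squashed(scan_range))          # only the final insert survives
```

In this model the complete history over S1..S3 carries all five changes, while the squashed view only carries the insert of PK1-V1c. Which of the two a consumer wants probably depends on whether it needs intermediate states or only the end result.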

On Thu, Aug 22, 2024 at 3:49 AM Péter Váry <> wrote:

> Technically different, but somewhat similar question:
> What is the expected behaviour when the `IncrementalScan` is created for
> a range of snapshots rather than a single snapshot?
> S1 added PK1-V1
> S2 updated PK1-V1 to PK1-V1b (removed PK1-V1 and added PK1-V1b)
> S3 updated PK1-V1b to PK1-V1c (removed PK1-V1b and added PK1-V1c)
> Let's say we have
> *IncrementalScan.fromSnapshotInclusive(S1).toSnapshot(S3)*.
> Do we need to return:
> (a)
> Or is it OK to return:
> (b)
> I think (a) is the correct behaviour.
> Thanks,
> Peter
> Steven Wu <> wrote (on Wed, Aug 21, 2024, 22:27):
>> Agree with everyone that option (a) is the correct behavior.
>> On Wed, Aug 21, 2024 at 11:57 AM Steve Zhang
>> <> wrote:
>>> I agree that option (a) is what users expect for row-level changes.
>>> I feel the added deletes in a given snapshot provide the PK of the DELETED
>>> entry; the existing deletes are then read together with the data files to
>>> find the DELETED value (V1b) and the rest of the columns.
>>> Thanks,
>>> Steve Zhang
>>> On Aug 20, 2024, at 6:06 PM, Wing Yew Poon <>
>>> wrote:
>>> Hi,
>>> I have a PR open to add changelog support for the case where delete
>>> files are present ( I
>>> have a question about what the changelog should emit in the following
>>> scenario:
>>> The table has a schema with a primary key/identifier column PK and
>>> additional column V.
>>> In snapshot 1, we write a data file DF1 with rows
>>> PK1, V1
>>> PK2, V2
>>> etc.
>>> In snapshot 2, we write an equality delete file ED1 with PK=PK1, and new
>>> data file DF2 with rows
>>> PK1, V1b
>>> (possibly other rows)
>>> In snapshot 3, we write an equality delete file ED2 with PK=PK1, and new
>>> data file DF3 with rows
>>> PK1, V1c
>>> (possibly other rows)
>>> Thus, in snapshot 2 and snapshot 3, we update the row identified by PK1
>>> with new values by using an equality delete and writing new data for the
>>> row.
>>> These are the files present in snapshot 3:
>>> DF1 (sequence number 1)
>>> DF2 (sequence number 2)
>>> DF3 (sequence number 3)
>>> ED1 (sequence number 2)
>>> ED2 (sequence number 3)
>>> The question I have is what should the changelog emit for snapshot 3?
>>> For snapshot 1, the changelog should emit a row for each row in DF1 as
>>> INSERTED.
>>> For snapshot 2, it should emit a row for PK1, V1 as DELETED; and a row
>>> for PK1, V1b as INSERTED.
>>> For snapshot 3, I see two possibilities:
>>> (a)
>>> (b)
>>> The interpretation for (b) is that both ED1 and ED2 apply to DF1, with
>>> ED1 being an existing delete file and ED2 being an added delete file for
>>> it. We discount ED1 and apply ED2 and get a DELETED row for PK1,V1.
>>> ED2 also applies to DF2, from which we get a DELETED row for PK1, V1b.
>>> The interpretation for (a) is that ED1 is an existing delete file for
>>> DF1 and in snapshot 3, the row PK1,V1 already does not exist before the
>>> snapshot. Thus we do not emit a row for it. (We can think of ED1 as
>>> already applied to DF1, and we only consider any additional rows that get
>>> deleted when ED2 is applied.)
>>> I lean towards (a), as I think it is more reflective of net changes.
>>> I am interested to hear what folks think.
>>> Thank you,
>>> Wing Yew
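
For reference, the two interpretations in Wing Yew's equality-delete scenario can be sketched as a small Python model. This is an illustration under stated assumptions, not the logic of the PR: `DATA_FILES`, `EQ_DELETES`, `changelog`, and the `discount_existing_deletes` flag are hypothetical names, and the model assumes an equality delete applies only to data files with a strictly lower sequence number.

```python
# Files present in snapshot 3: name -> (sequence number, contents).
DATA_FILES = {
    "DF1": (1, [("PK1", "V1"), ("PK2", "V2")]),
    "DF2": (2, [("PK1", "V1b")]),
    "DF3": (3, [("PK1", "V1c")]),
}
EQ_DELETES = {
    "ED1": (2, {"PK1"}),
    "ED2": (3, {"PK1"}),
}


def changelog(snapshot_seq, discount_existing_deletes=False):
    """Emit changelog rows for the snapshot with the given sequence number.

    discount_existing_deletes=False models interpretation (a): rows already
    removed by an existing delete file are not emitted again.
    discount_existing_deletes=True models interpretation (b): existing
    delete files are ignored when applying the added ones.
    """
    added = [pks for seq, pks in EQ_DELETES.values() if seq == snapshot_seq]
    existing = [(seq, pks) for seq, pks in EQ_DELETES.values()
                if seq < snapshot_seq]
    rows = []
    for data_seq, data in DATA_FILES.values():
        if data_seq == snapshot_seq:
            # Data file added in this snapshot: all of its rows are inserts.
            rows += [("INSERTED", pk, value) for pk, value in data]
        elif data_seq < snapshot_seq:
            for pk, value in data:
                already_deleted = any(data_seq < del_seq and pk in pks
                                      for del_seq, pks in existing)
                if already_deleted and not discount_existing_deletes:
                    continue  # row did not exist before this snapshot
                if any(pk in pks for pks in added):
                    rows.append(("DELETED", pk, value))
    return rows


if __name__ == "__main__":
    print(changelog(3))                                  # interpretation (a)
    print(changelog(3, discount_existing_deletes=True))  # interpretation (b)
```

Under these assumptions, interpretation (a) yields a DELETED row for PK1,V1b and an INSERTED row for PK1,V1c in snapshot 3, while interpretation (b) additionally yields a DELETED row for PK1,V1.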
