Agree with everyone that option (a) is the correct behavior.

On Wed, Aug 21, 2024 at 11:57 AM Steve Zhang
<hongyue_zh...@apple.com.invalid> wrote:

> I agree that option (a) is what users expect for row-level changes.
>
> I think the added deletes in a given snapshot provide the PK of the DELETED
> entry, while the existing deletes are read together with the data files to
> find the DELETED value (V1b) and the rest of the columns.
>
> Thanks,
> Steve Zhang
>
>
>
> On Aug 20, 2024, at 6:06 PM, Wing Yew Poon <wyp...@cloudera.com.INVALID>
> wrote:
>
> Hi,
>
> I have a PR open to add changelog support for the case where delete files
> are present (https://github.com/apache/iceberg/pull/10935). I have a
> question about what the changelog should emit in the following scenario:
>
> The table has a schema with a primary key/identifier column PK and
> additional column V.
> In snapshot 1, we write a data file DF1 with rows
> PK1, V1
> PK2, V2
> etc.
> In snapshot 2, we write an equality delete file ED1 with PK=PK1, and new
> data file DF2 with rows
> PK1, V1b
> (possibly other rows)
> In snapshot 3, we write an equality delete file ED2 with PK=PK1, and new
> data file DF3 with rows
> PK1, V1c
> (possibly other rows)
>
> Thus, in snapshot 2 and snapshot 3, we update the row identified by PK1
> with new values by using an equality delete and writing new data for the
> row.
> These are the files present in snapshot 3:
> DF1 (sequence number 1)
> DF2 (sequence number 2)
> DF3 (sequence number 3)
> ED1 (sequence number 2)
> ED2 (sequence number 3)
>
> The question I have is what should the changelog emit for snapshot 3?
> For snapshot 1, the changelog should emit a row for each row in DF1 as
> INSERTED.
> For snapshot 2, it should emit a row for PK1, V1 as DELETED; and a row for
> PK1, V1b as INSERTED.
> For snapshot 3, I see two possibilities:
> (a)
> PK1,V1b,DELETED
> PK1,V1c,INSERTED
>
> (b)
> PK1,V1,DELETED
> PK1,V1b,DELETED
> PK1,V1c,INSERTED
>
> The interpretation for (b) is that both ED1 and ED2 apply to DF1, with ED1
> being an existing delete file and ED2 being an added delete file for it. We
> discount ED1 and apply ED2 and get a DELETED row for PK1,V1.
> ED2 also applies to DF2, from which we get a DELETED row for PK1, V1b.
>
> The interpretation for (a) is that ED1 is an existing delete file for DF1
> and in snapshot 3, the row PK1,V1 already does not exist before the
> snapshot. Thus we do not emit a row for it. (We can think of it as ED1 is
> already applied to DF1, and we only consider any additional rows that get
> deleted when ED2 is applied.)
>
> I lean towards (a), as I think it is more reflective of net changes.
> I am interested to hear what folks think.
>
> Thank you,
> Wing Yew
>
>
>
>
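
For reference, here is a minimal sketch of interpretation (a) applied to the scenario quoted above. It is an illustration only, not the code in the PR: the names DataFile, EqDeleteFile, and changelog are hypothetical and are not Iceberg APIs. It assumes that each snapshot's number equals the sequence number of the files it adds, and it relies on the rule that an equality delete applies only to data files with a strictly smaller sequence number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:  # hypothetical model, not an Iceberg class
    name: str
    seq: int
    rows: tuple  # tuple of (pk, value) pairs


@dataclass(frozen=True)
class EqDeleteFile:  # hypothetical model, not an Iceberg class
    name: str
    seq: int
    pks: frozenset


def applies(delete, data):
    # An equality delete applies only to data files written
    # with a strictly smaller sequence number.
    return data.seq < delete.seq


def changelog(snapshot_seq, data_files, delete_files):
    """Net changes for one snapshot under interpretation (a)."""
    added = [d for d in delete_files if d.seq == snapshot_seq]
    existing = [d for d in delete_files if d.seq < snapshot_seq]
    changes = []
    for df in data_files:
        if df.seq > snapshot_seq:
            continue  # file does not exist yet in this snapshot
        for pk, value in df.rows:
            if df.seq == snapshot_seq:
                # Rows in a data file added by this snapshot are inserts.
                changes.append((pk, value, "INSERTED"))
                continue
            # Rows already removed by an existing delete were gone
            # before this snapshot, so they produce no change.
            gone_before = any(applies(d, df) and pk in d.pks for d in existing)
            gone_now = any(applies(d, df) and pk in d.pks for d in added)
            if gone_now and not gone_before:
                changes.append((pk, value, "DELETED"))
    return changes


# The files from the scenario in the quoted message.
DATA_FILES = [
    DataFile("DF1", 1, (("PK1", "V1"), ("PK2", "V2"))),
    DataFile("DF2", 2, (("PK1", "V1b"),)),
    DataFile("DF3", 3, (("PK1", "V1c"),)),
]
DELETE_FILES = [
    EqDeleteFile("ED1", 2, frozenset({"PK1"})),
    EqDeleteFile("ED2", 3, frozenset({"PK1"})),
]

if __name__ == "__main__":
    for seq in (1, 2, 3):
        print(seq, changelog(seq, DATA_FILES, DELETE_FILES))
```

For snapshot 3 this yields only PK1,V1b,DELETED and PK1,V1c,INSERTED: the row PK1,V1 in DF1 is skipped because ED1 had already removed it, which is the difference between (a) and (b).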
