Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Péter Váry
Thanks for the details! I agree that the first iteration of the planning doesn't need to contain all of the options. In the long run it would be nice to provide the same configuration options for the BaseIncrementalChangelogScan that we have for the create_changelog_view, so we could rebase the i

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Wing Yew Poon
Just a note that the functionality to compute net changes was added by Yufei only in Iceberg 1.4.0, in #7326. On Thu, Aug 22, 2024 at 12:48 PM Wing Yew Poon wrote: > Peter, > > The Spark procedure is implemented by CreateChangelogViewProcedure.java >

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Wing Yew Poon
Peter, The Spark procedure is implemented by CreateChangelogViewProcedure.java. This was already added by Yufei in Iceberg 1.2.0. ChangelogIterator

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Péter Váry
That's good info. I didn't know that we already have the Spark procedure at hand. How does Spark calculate the `changelog_view`? Do we already have an implementation at hand somewhere? Could it be reused? Anyways, if we want to reuse the new changelogscan for the changelog_view as well, then I agr

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Steve Zhang
Yeah, agreed. I think that for the changelog scan to convert per-snapshot scans into tasks, option (b) with the complete history is the right way, while there should be an option to configure whether net/squashed changes are desired. Also, in Spark's create_changelog_view, the net changes and compute update cannot

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Steven Wu
> It should emit changes for each snapshot in the requested range. Wing Yew has a good point here. +1 On Thu, Aug 22, 2024 at 8:46 AM Wing Yew Poon wrote: > First, thank you all for your responses to my question. > > For Peter's question, I believe that (b) is the correct behavior. It is > als

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Wing Yew Poon
First, thank you all for your responses to my question. For Peter's question, I believe that (b) is the correct behavior. It is also the current behavior when using copy-on-write (deletes and updates are still supported but not using delete files). A changelog scan is an incremental scan over mult

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Steven Wu
Peter, good question. In this case, (b) is the complete change history. (a) is the squashed version. I would probably check how other changelog systems deal with this scenario. On Thu, Aug 22, 2024 at 3:49 AM Péter Váry wrote: > Technically different, but somewhat similar question: > > What is

Re: clarification on changelog behavior for equality deletes

2024-08-22 Thread Péter Váry
Technically different, but somewhat similar question: What is the expected behaviour when the `IncrementalScan` is created not for a single snapshot, but for multiple snapshots?
S1 added PK1-V1
S2 updated PK1-V1 to PK1-V1b (removed PK1-V1 and added PK1-V1b)
S3 updated PK1-V1b to PK1-V1c (removed P
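For readers following the thread, here is a minimal sketch of the two readings of this scenario, using the labels from Steven Wu's reply: complete change history (b) versus squashed/net changes (a). It is a plain in-memory model, not Iceberg code; the tuple layout and the function names `complete_history` and `net_changes` are hypothetical and chosen only for illustration.

```python
# Hypothetical in-memory model of the S1..S3 scenario; this is not Iceberg API.
# Each snapshot carries the row-level changes it introduced as (op, pk, value).
snapshots = [
    ("S1", [("INSERT", "PK1", "V1")]),
    ("S2", [("DELETE", "PK1", "V1"), ("INSERT", "PK1", "V1b")]),
    ("S3", [("DELETE", "PK1", "V1b"), ("INSERT", "PK1", "V1c")]),
]


def complete_history(snaps):
    """Emit every change of every snapshot in the range, in order."""
    return [(sid, op, pk, val) for sid, changes in snaps for op, pk, val in changes]


def net_changes(snaps):
    """Squash the range: a DELETE cancels an earlier INSERT of the same row
    made within the range; everything else is kept in order."""
    pending = []
    for _, changes in snaps:
        for op, pk, val in changes:
            if op == "DELETE" and ("INSERT", pk, val) in pending:
                pending.remove(("INSERT", pk, val))
            else:
                pending.append((op, pk, val))
    return pending


if __name__ == "__main__":
    for record in complete_history(snapshots):
        print("complete:", record)
    for record in net_changes(snapshots):
        print("net:", record)
```

Under this model, scanning S1 through S3 gives five records in the complete history and a single net insert of PK1-V1c; scanning only S2 through S3 gives a net delete of PK1-V1 followed by a net insert of PK1-V1c.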

Re: clarification on changelog behavior for equality deletes

2024-08-21 Thread Steven Wu
Agree with everyone that option (a) is the correct behavior. On Wed, Aug 21, 2024 at 11:57 AM Steve Zhang wrote: > I agree that option (a) is what user expects for row level changes. > > I feel the added deletes in given snapshots provides a PK of DELETED > entry, existing deletes are used to re

Re: clarification on changelog behavior for equality deletes

2024-08-21 Thread Steve Zhang
I agree that option (a) is what the user expects for row-level changes. I feel the added deletes in the given snapshots provide a PK of the DELETED entry, and existing deletes are read together with the data files to find the DELETED value (V1b) and the rest of the columns. Thanks, Steve Zhang > On Aug 20, 2024

Re: clarification on changelog behavior for equality deletes

2024-08-21 Thread Shani Elharrar
+1 for option (a). Shani. On 21 Aug 2024, at 19:07, Péter Váry wrote: I think from the correctness perspective only option (a) is valid. The difference between snapshot2 and snapshot3 is one delete and one insertion. Jason Fine wrote (on Wed, Aug 21, 2024, 15:26): Great to see someone

Re: clarification on changelog behavior for equality deletes

2024-08-21 Thread Péter Váry
I think from the correctness perspective only option (a) is valid. The difference between snapshot2 and snapshot3 is one delete and one insertion. Jason Fine wrote (on Wed, Aug 21, 2024, 15:26): > Great to see someone is working on this feature! > > IMHO option (a) is preferred. M

Re: clarification on changelog behavior for equality deletes

2024-08-21 Thread Jason Fine
Great to see someone is working on this feature! IMHO option (a) is preferred. My (impulsive) reasoning is as follows: 1. I think in CDC you shouldn't be skipping snapshots so you would get the deleted event while processing snapshot 2 anyway. 2. If you consider d

clarification on changelog behavior for equality deletes

2024-08-20 Thread Wing Yew Poon
Hi, I have a PR open to add changelog support for the case where delete files are present (https://github.com/apache/iceberg/pull/10935). I have a question about what the changelog should emit in the following scenario: The table has a schema with a primary key/identifier column PK and additional