Yeah, agree on this. I think for the changelog scan to convert per-snapshot scans into tasks, option (b) with the complete history is the right way. That said, there should be an option to configure whether net/squashed changes are desired.
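For context, here is a minimal sketch of what planning a multi-snapshot changelog range looks like against the existing core API. The `table` reference and snapshot ids are placeholders, and this is the current API surface, not the PR's code:

    import java.io.IOException;
    import org.apache.iceberg.ChangelogScanTask;
    import org.apache.iceberg.IncrementalChangelogScan;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.io.CloseableIterable;

    public class ChangelogPlanExample {
      // Plans changelog tasks over [s1Id, s3Id]. With option (b), every snapshot
      // in the range contributes its own tasks, so the complete per-snapshot
      // history is preserved; squashing to net changes would be an opt-in
      // post-processing step.
      static void printChanges(Table table, long s1Id, long s3Id) throws IOException {
        IncrementalChangelogScan scan = table.newIncrementalChangelogScan()
            .fromSnapshotInclusive(s1Id)
            .toSnapshot(s3Id);
        try (CloseableIterable<ChangelogScanTask> tasks = scan.planFiles()) {
          for (ChangelogScanTask task : tasks) {
            // changeOrdinal() orders changes within the range; commitSnapshotId()
            // ties each task back to the snapshot that produced it.
            System.out.printf("ordinal=%d snapshot=%d op=%s%n",
                task.changeOrdinal(), task.commitSnapshotId(), task.operation());
          }
        }
      }
    }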
Also, in Spark's create_changelog_view procedure, net_changes and compute_updates cannot be used together.
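For reference, a minimal sketch of that procedure call, following the spark-procedures doc linked below (the catalog name, `db.tbl`, and the snapshot ids are placeholders):

    import org.apache.spark.sql.SparkSession;

    public class NetChangesViewExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().getOrCreate();

        // Build a changelog view over a placeholder snapshot range; per the
        // docs, net_changes => true cannot be combined with compute_updates => true.
        spark.sql(
            "CALL spark_catalog.system.create_changelog_view("
                + "table => 'db.tbl', "
                + "options => map('start-snapshot-id', '1', 'end-snapshot-id', '3'), "
                + "net_changes => true)");

        // The procedure registers a temp view, named <table>_changes by default.
        spark.sql("SELECT * FROM tbl_changes ORDER BY _change_ordinal").show();
      }
    }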
Thanks,
Steve Zhang

> On Aug 22, 2024, at 8:50 AM, Steven Wu <stevenz...@gmail.com> wrote:
>
>> It should emit changes for each snapshot in the requested range.
>
> Wing Yew has a good point here. +1
>
> On Thu, Aug 22, 2024 at 8:46 AM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>> First, thank you all for your responses to my question.
>>
>> For Peter's question, I believe that (b) is the correct behavior. It is also the current behavior when using copy-on-write (deletes and updates are still supported, just not using delete files). A changelog scan is an incremental scan over multiple snapshots. It should emit changes for each snapshot in the requested range. Spark provides additional functionality on top of the changelog scan to produce net changes for the requested range; see https://iceberg.apache.org/docs/latest/spark-procedures/#create_changelog_view. Basically, the create_changelog_view procedure uses a changelog scan (it reads the changelog table, i.e., <table>.changes) to get a DataFrame, which is saved to a temporary Spark view that can then be queried; if net_changes is true, only the net changes are produced for this temporary view. This functionality uses ChangelogIterator.removeNetCarryovers (which is in Spark).
>>
>> On Thu, Aug 22, 2024 at 7:51 AM Steven Wu <stevenz...@gmail.com> wrote:
>>> Peter, good question. In this case, (b) is the complete change history and (a) is the squashed version.
>>>
>>> I would probably check how other changelog systems deal with this scenario.
>>>
>>> On Thu, Aug 22, 2024 at 3:49 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>> A technically different but somewhat similar question:
>>>>
>>>> What is the expected behaviour when the `IncrementalScan` is created not for a single snapshot, but for multiple snapshots?
>>>> S1 added PK1-V1
>>>> S2 updated PK1-V1 to PK1-V1b (removed PK1-V1 and added PK1-V1b)
>>>> S3 updated PK1-V1b to PK1-V1c (removed PK1-V1b and added PK1-V1c)
>>>>
>>>> Let's say we have IncrementalScan.fromSnapshotInclusive(S1).toSnapshot(S3).
>>>> Do we need to return:
>>>> (a)
>>>> - PK1,V1c,INSERTED
>>>>
>>>> Or is it OK to return:
>>>> (b)
>>>> - PK1,V1,INSERTED
>>>> - PK1,V1,DELETED
>>>> - PK1,V1b,INSERTED
>>>> - PK1,V1b,DELETED
>>>> - PK1,V1c,INSERTED
>>>>
>>>> I think (a) is the correct behaviour.
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> On Wed, Aug 21, 2024, at 22:27, Steven Wu <stevenz...@gmail.com> wrote:
>>>>> Agree with everyone that option (a) is the correct behavior.
>>>>>
>>>>> On Wed, Aug 21, 2024 at 11:57 AM Steve Zhang <hongyue_zh...@apple.com.invalid> wrote:
>>>>>> I agree that option (a) is what the user expects for row-level changes.
>>>>>>
>>>>>> I feel the added deletes in a given snapshot provide the PK of the DELETED entry, while the existing deletes are read together with the data files to find the DELETED value (V1b) and the rest of the columns.
>>>>>>
>>>>>> Thanks,
>>>>>> Steve Zhang
>>>>>>
>>>>>>> On Aug 20, 2024, at 6:06 PM, Wing Yew Poon <wyp...@cloudera.com.INVALID> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a PR open to add changelog support for the case where delete files are present (https://github.com/apache/iceberg/pull/10935). I have a question about what the changelog should emit in the following scenario:
>>>>>>>
>>>>>>> The table has a schema with a primary key/identifier column PK and an additional column V.
>>>>>>> In snapshot 1, we write a data file DF1 with rows
>>>>>>> PK1, V1
>>>>>>> PK2, V2
>>>>>>> etc.
>>>>>>> In snapshot 2, we write an equality delete file ED1 with PK=PK1, and a new data file DF2 with rows
>>>>>>> PK1, V1b
>>>>>>> (possibly other rows)
>>>>>>> In snapshot 3, we write an equality delete file ED2 with PK=PK1, and a new data file DF3 with rows
>>>>>>> PK1, V1c
>>>>>>> (possibly other rows)
>>>>>>>
>>>>>>> Thus, in snapshots 2 and 3, we update the row identified by PK1 with new values by using an equality delete and writing new data for the row.
>>>>>>> These are the files present in snapshot 3:
>>>>>>> DF1 (sequence number 1)
>>>>>>> DF2 (sequence number 2)
>>>>>>> DF3 (sequence number 3)
>>>>>>> ED1 (sequence number 2)
>>>>>>> ED2 (sequence number 3)
>>>>>>>
>>>>>>> The question I have is: what should the changelog emit for snapshot 3?
>>>>>>> For snapshot 1, the changelog should emit a row for each row in DF1 as INSERTED.
>>>>>>> For snapshot 2, it should emit a row for PK1, V1 as DELETED, and a row for PK1, V1b as INSERTED.
>>>>>>> For snapshot 3, I see two possibilities:
>>>>>>> (a)
>>>>>>> PK1,V1b,DELETED
>>>>>>> PK1,V1c,INSERTED
>>>>>>>
>>>>>>> (b)
>>>>>>> PK1,V1,DELETED
>>>>>>> PK1,V1b,DELETED
>>>>>>> PK1,V1c,INSERTED
>>>>>>>
>>>>>>> The interpretation for (b) is that both ED1 and ED2 apply to DF1, with ED1 being an existing delete file and ED2 being an added delete file for it. We discount ED1, apply ED2, and get a DELETED row for PK1,V1. ED2 also applies to DF2, from which we get a DELETED row for PK1,V1b.
>>>>>>>
>>>>>>> The interpretation for (a) is that ED1 is an existing delete file for DF1, so in snapshot 3 the row PK1,V1 already does not exist before the snapshot. Thus we do not emit a row for it. (We can think of it as: ED1 is already applied to DF1, and we only consider the additional rows that get deleted when ED2 is applied.)
>>>>>>>
>>>>>>> I lean towards (a), as I think it is more reflective of net changes.
>>>>>>> I am interested to hear what folks think.
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Wing Yew
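PS: to make interpretation (a) from Wing Yew's scenario above concrete, here is a hypothetical sketch of the predicate that decides whether a row is emitted as DELETED. The Row type and the appliesTo helper are made up for illustration; this is not the code in the PR:

    import java.util.List;
    import org.apache.iceberg.DeleteFile;

    public class EmitDeletedSketch {
      /** Placeholder row type, for illustration only. */
      interface Row {}

      /**
       * Interpretation (a): when scanning the changelog of a snapshot (at
       * snapshotSeq), emit a DELETED record for a row of a data file (written
       * at dataSeq) only if no "existing" delete already removed the row and
       * a delete file added in this snapshot matches it.
       */
      static boolean emitAsDeleted(Row row, long dataSeq, long snapshotSeq, List<DeleteFile> deletes) {
        boolean removedByExisting = deletes.stream()
            // existing deletes: committed after the data file but before this snapshot
            .filter(d -> d.dataSequenceNumber() > dataSeq && d.dataSequenceNumber() < snapshotSeq)
            .anyMatch(d -> appliesTo(d, row));
        boolean removedByAdded = deletes.stream()
            // added deletes: committed in the snapshot being scanned
            .filter(d -> d.dataSequenceNumber() == snapshotSeq)
            .anyMatch(d -> appliesTo(d, row));
        return !removedByExisting && removedByAdded;
      }

      /** Hypothetical: whether an equality delete file's key matches the row. */
      static boolean appliesTo(DeleteFile delete, Row row) {
        throw new UnsupportedOperationException("illustrative placeholder");
      }
    }

In the scenario above, when scanning snapshot 3, ED1 (seq 2) is an existing delete for DF1 (seq 1), so PK1,V1 is suppressed, while ED2 (seq 3) is the added delete that yields PK1,V1b as DELETED, which is exactly output (a).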