Hi,

Sorry, I re-read the thread and Peter's question more closely. I want to make sure that we are not precluding something unnecessarily, and to see whether we can solve the code problem in other ways.
The concern is that in the 'undeleted' row, the row_id and last_updated_seq_number are wrong:

- If 'row-id' is not set, it inherits a row-id that is changed, which is wrong
- If 'last_updated_sequence_number' is set, then it is wrong because it should refer to the snapshot that undeleted it.

Is that correct? But if every row in a data file has 'row-id' set and 'last_updated_sequence_number' unset, then technically this could be a valid undelete. Is that right?

Thanks
Szehon

On Mon, Dec 1, 2025 at 11:08 AM Steven Wu <[email protected]> wrote:

> > _row_id a unique long identifier for every row within the table. The
> > value is assigned via inheritance when a row is first added to the table.
>
> Actually, the current spec doesn't allow explicitly assigning row-id for
> new rows.
>
> So currently we don't need to worry about the question of whether it is
> allowed to have *new* rows with explicitly assigned row-id values lower
> than the snapshot's first-row-id.
>
> On Mon, Dec 1, 2025 at 9:50 AM Steven Wu <[email protected]> wrote:
>
>> Here is the spec PR to clarify undelete is not allowed. Will start a
>> vote thread for that.
>> https://github.com/apache/iceberg/pull/14731
>>
>> Let me start a new discussion thread for the first-row-id and row-id
>> question for row lineage to get more attention and input.
>>
>> On Sat, Nov 22, 2025 at 7:02 AM Péter Váry <[email protected]> wrote:
>>
>>> Apologies if I was unclear. As Steven also mentioned, I wanted to
>>> confirm whether we agree on the clarification regarding the `row-id`
>>> and `first-row-id`.
>>>
>>> On Sat, Nov 22, 2025 at 15:28, Steven Wu <[email protected]> wrote:
>>>
>>>> Just to clarify, I was asking a question.
>>>>
>>>> Is it valid to add a new data file with a row?
>>>>
>>>> - whose persisted row-id value is lower than the snapshot's
>>>>   first-row-id
>>>> - whose last-updated-seq-number is not set and is inherited from the
>>>>   snapshot sequence number
>>>>
>>>> Thanks,
>>>> Steven
>>>>
>>>> On Fri, Nov 21, 2025 at 11:25 PM Péter Váry <[email protected]> wrote:
>>>>
>>>>> +1 for this proposal
>>>>>
>>>>> Slightly related, but we can move this to a separate thread if it
>>>>> needs independent discussion: We should clarify the relationship
>>>>> between `row-id` and `first-row-id`. This has come up several times
>>>>> in our discussions about the equality delete removal proposal, where
>>>>> we considered generating `row-ids` manually instead of relying on the
>>>>> auto-assignment feature.
>>>>>
>>>>> As discussed with Steven:
>>>>>
>>>>>> It is valid to add a new data file with a row:
>>>>>>
>>>>>> - whose persisted row-id value is lower than the snapshot's
>>>>>>   first-row-id
>>>>>> - whose last-updated-seq-number is not set and is inherited from the
>>>>>>   snapshot sequence number
>>>>>
>>>>> On Sat, Nov 22, 2025 at 5:29, Prashant Singh <[email protected]> wrote:
>>>>>
>>>>>> +1 for making it explicit that an *undelete* of a row can't be done
>>>>>> by unsetting the corresponding bit in the DV.
>>>>>>
>>>>>> *Rows should only be added via new data files* sounds reasonable to
>>>>>> me!
>>>>>>
>>>>>> Apart from row lineage, it also complicates operation type inference
>>>>>> like here [1], as we would now have to inspect the contents of these
>>>>>> DVs to see if it's an insert?
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/apache/iceberg/pull/14581#discussion_r2533057189
>>>>>>
>>>>>> On Sat, Nov 22, 2025 at 4:48 AM Szehon Ho <[email protected]> wrote:
>>>>>>
>>>>>>> It makes sense to me, it sounds like a minor clarification. For v2
>>>>>>> position deletes, code like rewrite_position_deletes may have made
>>>>>>> some assumptions like this and would not work well if they were
>>>>>>> violated, and maybe other code as well.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Szehon
>>>>>>>
>>>>>>> On Fri, Nov 21, 2025 at 3:03 PM Steven Wu <[email protected]> wrote:
>>>>>>>
>>>>>>>> Similar weird behavior can also happen for V2 position delete
>>>>>>>> files with `undelete`.
>>>>>>>>
>>>>>>>> In V2, there could be multiple position delete files (say pd1,
>>>>>>>> pd2) associated with the same data file (say f1). Let's say pd1
>>>>>>>> deletes rows 5 and 10 and pd2 deletes row 15.
>>>>>>>> 1. a new snapshot is committed with pd1 (DELETED), pd2 (EXISTING),
>>>>>>>>    and pd3 (ADDED). pd3 deletes only row 10 (undeleted row 5)
>>>>>>>> 2. a new snapshot is committed with pd1 (DELETED) and pd2
>>>>>>>>    (EXISTING)
>>>>>>>>
>>>>>>>> In either case, essentially some rows are added (back) to the
>>>>>>>> table with a lower sequence number than the new snapshot's
>>>>>>>> sequence number.
>>>>>>>>
>>>>>>>> Just to recap the question: should the spec (v2 and v3) spell out
>>>>>>>> that `undelete row` is not allowed? Rows should only be added via
>>>>>>>> new data files.
>>>>>>>>
>>>>>>>> On Fri, Nov 21, 2025 at 1:09 PM Steven Wu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> > Are we specifically stating somewhere that all row-ids should
>>>>>>>>> > be higher than or equal to the snapshot's `first-row-id`?
>>>>>>>>> > In my mental model the `first-row-id` is only applicable for
>>>>>>>>> > rows that don't have a specific row-id assigned.
>>>>>>>>>
>>>>>>>>> I meant an ADDED row should have a `row-id` higher than or equal
>>>>>>>>> to the snapshot's `first-row-id`. An EXISTING or UPDATED row can
>>>>>>>>> have a lower row id.
>>>>>>>>>
>>>>>>>>> On Fri, Nov 21, 2025 at 1:04 PM Steven Wu <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> > Can we create a validator to prevent this from happening?
>>>>>>>>>>
>>>>>>>>>> We don't have this problem with the Java implementation.
>>>>>>>>>> `BaseDVFileWriter` merges the previous DV with the new delta DV,
>>>>>>>>>> so there is no `undelete` behavior. I am not aware of any Java
>>>>>>>>>> API that allows "undelete". So we probably don't need to add any
>>>>>>>>>> validation code in the Java impl.
>>>>>>>>>>
>>>>>>>>>> Just thought it is good to spell it out in the spec so that
>>>>>>>>>> clients/engines can be clear about the expected behavior.
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 21, 2025 at 12:18 PM Péter Váry <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Are we specifically stating somewhere that all row-ids should
>>>>>>>>>>> be higher than or equal to the snapshot's `first-row-id`?
>>>>>>>>>>> In my mental model the `first-row-id` is only applicable for
>>>>>>>>>>> rows that don't have a specific row-id assigned.
>>>>>>>>>>>
>>>>>>>>>>> Nonetheless, I agree that the `row-id` and the
>>>>>>>>>>> `last-updated-seq-num` should have changed to new values, so we
>>>>>>>>>>> can say that undeleting a row is not allowed because of this.
>>>>>>>>>>>
>>>>>>>>>>> Can we create a validator to prevent this from happening?
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 21, 2025 at 21:11, Steven Wu <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The undeleted row would have an invalid `row-id` and
>>>>>>>>>>>> `last-updated-seq-num`. Since it is a new row (added back), it
>>>>>>>>>>>> should have a `row-id` higher than or equal to the snapshot's
>>>>>>>>>>>> `first-row-id`, and the `last-updated-seq-number` should
>>>>>>>>>>>> inherit/have the new snapshot's sequence number.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 21, 2025 at 11:48 AM Steven Wu <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Should we clarify the V3 spec to explicitly forbid
>>>>>>>>>>>>> "*undelete*" of a row by unsetting the DV bit? Unsetting a DV
>>>>>>>>>>>>> bit essentially adds a row with a lower row-id than the
>>>>>>>>>>>>> snapshot's first-row-id, which would violate the row lineage
>>>>>>>>>>>>> spec. With the restriction, DV cardinality should be
>>>>>>>>>>>>> monotonically increasing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Steven
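
For readers following the thread, the "no undelete" rule can be sketched as a superset check: the set of deleted positions for a data file after a commit must contain every position that was deleted before it. The sketch below is only an illustration under that assumption. The function names are hypothetical and are not part of any Iceberg API, and plain Python sets stand in for DVs or merged position delete files. The sample data is the pd1/pd2/pd3 example from Steven's message.

```python
def undeleted_positions(previous_deleted: set, new_deleted: set) -> set:
    """Return positions that were deleted before the commit but are live after it."""
    return previous_deleted - new_deleted


def validate_no_undelete(previous_deleted: set, new_deleted: set) -> None:
    """Hypothetical validator: reject a commit that brings deleted rows back."""
    revived = undeleted_positions(previous_deleted, new_deleted)
    if revived:
        raise ValueError(f"undelete is not allowed, revived positions: {sorted(revived)}")


# Position deletes for data file f1, as in the V2 example in the thread.
pd1 = {5, 10}
pd2 = {15}
pd3 = {10}

before = pd1 | pd2   # rows 5, 10 and 15 are deleted
case_1 = pd2 | pd3   # pd1 DELETED, pd2 EXISTING, pd3 ADDED
case_2 = pd2         # pd1 DELETED, pd2 EXISTING

print(undeleted_positions(before, case_1))  # row 5 comes back
print(undeleted_positions(before, case_2))  # rows 5 and 10 come back
```

A commit that only adds deletes passes the check, which matches the statement that DV cardinality should be monotonically increasing.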
