Re: [DISCUSS] V3 spec: add monotonic requirement to data DV

Steven Wu Sat, 22 Nov 2025 06:27:59 -0800

Just to clarify, I was asking a question.

Is it valid to add a new data file with a row?


   - whose persisted row-id value is lower than the snapshot's first-row-id
   - whose last-updated-seq-number is not set and is inherited from the
   snapshot sequence number

Thanks,
Steven

On Fri, Nov 21, 2025 at 11:25 PM Péter Váry <[email protected]>
wrote:

> +1 for this proposal
>
> Slightly related, but we can move this to a separate thread if it needs
> independent discussion: We should clarify the relationship between `row-id`
> and `first-row-id`. This has come up several times in our discussions about
> the equality delete removal proposal, where we considered generating
> `row-ids` manually instead of relying on the auto-assignment feature.
>
> As discussed with Steven:
>
>> It is valid to add a new data file with a row:
>>
>>    - whose persisted row-id value is lower than the snapshot's
>>    first-row-id
>>    - whose last-updated-seq-number is not set and is inherited from the
>>    snapshot sequence number
>>
>>
> On Sat, Nov 22, 2025 at 5:29 AM Prashant Singh <[email protected]>
> wrote:
>
>> +1 for making it explicit that an *undelete* of a row can't be done by
>> unsetting the corresponding bit in the DV.
>>
>> *Rows should only be added via new data files* sounds reasonable to me!
>>
>> Apart from row lineage, allowing undelete also complicates operation type
>> inference, like here [1], since we would now have to inspect the contents
>> of these DVs to see whether the change is an insert.
>>
>> [1] https://github.com/apache/iceberg/pull/14581#discussion_r2533057189
>>
>> On Sat, Nov 22, 2025 at 4:48 AM Szehon Ho <[email protected]>
>> wrote:
>>
>>> It makes sense to me; it sounds like a minor clarification. For v2
>>> position deletes, code like rewrite_position_deletes may already assume
>>> this and would not work well if the assumption were violated, and the
>>> same may be true of other code.
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Fri, Nov 21, 2025 at 3:03 PM Steven Wu <[email protected]> wrote:
>>>
>>>> Similar weird behavior can also happen for V2 position delete files
>>>> with `undelete`.
>>>>
>>>> In V2, there could be multiple position delete files (say pd1, pd2)
>>>> associated with the same data file (say f1). Let's say pd1 deletes rows 5
>>>> and 10, and pd2 deletes row 15. Consider two possible commits:
>>>> 1. a new snapshot is committed with pd1 (DELETED), pd2 (EXISTING), and
>>>> pd3 (ADDED). pd3 deletes only row 10, so row 5 is undeleted
>>>> 2. a new snapshot is committed with pd1 (DELETED) and pd2 (EXISTING),
>>>> so rows 5 and 10 are both undeleted
>>>>
>>>> In either case, some rows are effectively added (back) to the table
>>>> with a lower sequence number than the new snapshot's sequence number.
>>>>
>>>>
>>>>
>>>> Just to recap the question: should the spec (v2 and v3) spell out that
>>>> `undelete row` is not allowed? Rows should only be added via new data
>>>> files.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Nov 21, 2025 at 1:09 PM Steven Wu <[email protected]> wrote:
>>>>
>>>>> >Are we specifically stating somewhere that all row-ids should be
>>>>> higher than or equal to the snapshot's `first-row-id`?
>>>>> In my mental model the `first-row-id` is only applicable for rows that
>>>>> don't have a specific row-id assigned.
>>>>>
>>>>> I meant that an ADDED row should have a `row-id` higher than or equal to
>>>>> the snapshot's `first-row-id`. An EXISTING or UPDATED row can have a
>>>>> lower row-id.
>>>>>
>>>>> On Fri, Nov 21, 2025 at 1:04 PM Steven Wu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> > Can we create a validator to prevent this from happening?
>>>>>>
>>>>>> We don't have this problem with the Java implementation.
>>>>>> `BaseDVFileWriter` merges the previous DV with the new delta DV, so
>>>>>> there is no `undelete` behavior. I am not aware of any Java API that
>>>>>> allows "undelete", so we probably don't need to add any validation code
>>>>>> to the Java impl.
>>>>>>
>>>>>> Just thought it is good to spell it out in the spec so that
>>>>>> clients/engines can be clear about the expected behavior.
>>>>>>
>>>>>> On Fri, Nov 21, 2025 at 12:18 PM Péter Váry <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Are we specifically stating somewhere that all row-ids should be
>>>>>>> higher than or equal to the snapshot's `first-row-id`?
>>>>>>> In my mental model the `first-row-id` is only applicable for rows
>>>>>>> that don't have a specific row-id assigned.
>>>>>>>
>>>>>>> Nonetheless, I agree that the `row-id` and the
>>>>>>> `last-updated-seq-num` should have changed to new values, so we can
>>>>>>> say that undeleting a row is not allowed because of this.
>>>>>>>
>>>>>>> Can we create a validator to prevent this from happening?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Nov 21, 2025 at 9:11 PM Steven Wu <[email protected]> wrote:
>>>>>>>
>>>>>>>> The undeleted row would have an invalid `row-id` and
>>>>>>>> `last-updated-seq-num`. Since it is a new row (added back), it should
>>>>>>>> have a `row-id` higher than or equal to the snapshot's `first-row-id`,
>>>>>>>> and its `last-updated-seq-number` should inherit/have the new
>>>>>>>> snapshot's sequence number.
>>>>>>>>
>>>>>>>> On Fri, Nov 21, 2025 at 11:48 AM Steven Wu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Should we clarify the V3 spec to explicitly forbid "*undelete*"
>>>>>>>>> of a row by unsetting the DV bit? Unsetting a DV bit essentially adds
>>>>>>>>> a row with a lower row-id than the snapshot's first-row-id, which
>>>>>>>>> would violate the row lineage spec. With the restriction, DV
>>>>>>>>> cardinality should be monotonically increasing.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Steven
>>>>>>>>>
>>>>>>>>
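
For illustration only, the restriction discussed in this thread can be
sketched as a set check: the positions deleted for a data file after a
commit must be a superset of the positions deleted before it. The sketch
below models a DV (or the union of V2 position delete files) as a plain
Python set of row positions. The function names `undeleted_positions` and
`is_monotonic` are hypothetical and are not part of any Iceberg API; the
sample values are the pd1/pd2/pd3 example from the thread.

```python
def undeleted_positions(previous_deletes, new_deletes):
    """Return positions that were deleted before but are live again."""
    return set(previous_deletes) - set(new_deletes)


def is_monotonic(previous_deletes, new_deletes):
    """True if no previously deleted position has been undeleted."""
    return not undeleted_positions(previous_deletes, new_deletes)


# V2 example from the thread: pd1 deletes rows 5 and 10, pd2 deletes row 15.
pd1, pd2 = {5, 10}, {15}
before = pd1 | pd2

# Case 1: pd1 is removed, pd2 is kept, pd3 (deleting only row 10) is added.
pd3 = {10}
case1 = pd2 | pd3

# Case 2: pd1 is removed and pd2 is kept.
case2 = set(pd2)

# A merge that keeps every earlier delete and adds row 20 (assumed value).
merged = before | {20}

print(sorted(undeleted_positions(before, case1)))  # [5]
print(sorted(undeleted_positions(before, case2)))  # [5, 10]
print(is_monotonic(before, merged))                # True
```

A writer that always merges the previous DV with the new delta, as described
for `BaseDVFileWriter` above, would pass this check by construction.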
