During the sync, we were mostly aligned that the row lineage semantics for
updates depend on how the writer engine interprets and implements them
(e.g., Flink with equality deletes).

Now, if we make it required for V3 tables, what about users who don't need
the row lineage feature? Row lineage adds a bit of overhead (although low),
e.g., extra metadata columns in data files during rewrite/compaction.

On Thu, Mar 20, 2025 at 8:53 AM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> I think I'm in favor of this but I would like some way of knowing whether
> a snapshot was produced while preserving row_ids. Just so we
> can make it clear on read what the row-lineage behavior of the writer was
> without knowing what system wrote the data.
>
> On Thu, Mar 20, 2025 at 10:43 AM Ryan Blue <rdb...@gmail.com> wrote:
>
>> +1 for the PR and always having the lineage metadata.
>>
>> I think that is going to make the feature much more reliable. We don't
>> gain anything from allowing the feature to be turned off for compatibility,
>> when we have reasonable ways to interpret data written by any engine.
>>
>> Ryan
>>
>> On Wed, Mar 19, 2025 at 12:37 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>> Hey everyone,
>>>
>>> When Row lineage was originally introduced, it was believed to be
>>> incompatible with equality deletes and we initially added lineage as a
>>> feature that could be turned on.  Now that these features can co-exist
>>> <https://lists.apache.org/thread/vhl7p72433m904y115hmh7vnnbjdz4xn>, we
>>> would like to require lineage for v3 as there are benefits for feature
>>> enablement and adoption.
>>>
>>> We discussed this topic
>>> <https://www.youtube.com/watch?v=9BBZKTfcU0s&t=1475s> in the community
>>> sync, but I would like to raise this here on the dev list to have a broader
>>> discussion around the spec changes proposed in this PR
>>> <https://github.com/apache/iceberg/pull/12580>.
>>>
>>> The main points of discussion were around what implications this
>>> requirement has for writers, especially where tracking changes may be
>>> difficult/expensive.  The proposal is to address this in the spec by
>>> clarifying the semantics of treating changes as upserts vs. delete/add is
>>> an engine implementation decision. The update to the spec would state that
>>> engines *should* track row ids through modification, but depending on
>>> the engine or the viability of tracking these changes, an engine may choose
>>> to model updates as deletes/adds.
>>>
>>> I'd love to get everyone's thoughts/feedback on this proposal since we
>>> still have the opportunity to change this for v3.
>>> -Dan
>>>
>>
