Re: [DISCUSS] Row lineage required for v3

Ryan Blue Thu, 20 Mar 2025 17:44:09 -0700

> Now, if we make it required for V3 tables, what if users don’t need the row
> lineage feature. There is a bit overhead (although low) for row lineage.
> E.g., extra metadata columns in data files during rewrite/compaction.

The majority of the work for row lineage is done behind the Iceberg API.
The only overhead users would see is two additional columns in the data,
for row ID and last updated sequence number. Those columns compress quite
well and if an engine chooses not to preserve row IDs, then they can even
be omitted. So I don’t think that the overhead is something to worry about.

I think the bigger problem for users is what to do when they don’t have
lineage information but need it. They have to figure out how to turn on row
lineage and then it only helps for future issues. If they want to debug a
MERGE statement that did something strange, it won’t help until the next
time there’s a problem. Similarly, when a downstream consumer wants to
consume the table with a streaming engine, they need to get the table owner
to turn on the lineage feature. To me, these headaches seem like a much
worse user experience than making lineage always-on.

The drawback of always-on lineage is the minor storage overhead and the
fact that some engines don’t support it and you get lineage that looks like
inserts and deletes. But lineage that looks like inserts and deletes is all
you get if lineage is off by default, so I strongly prefer allowing engines
that do support it to produce the data correctly so that users can take
advantage of it.

> I would like some way of knowing whether or not a snapshot was produced
> while preserving row_ids or not. Just so we can make it clear on read what
> the row-lineage behavior of the writer was without knowing what system
> wrote the data.

I think that always-on lineage means engines should always preserve
row_ids. As with any feature, we know what will happen when support is not
implemented — operations look like deletes and inserts. But I think that we
will have far more adherence to the spec if we make it a real requirement.
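
To make the difference concrete, here is a minimal sketch (not Iceberg code)
of how a reader might classify changes between two snapshots by row ID. The
function name classify_changes and the dict-of-rows layout are assumptions
made only for illustration; in a real table the row ID is a metadata column.

```python
def classify_changes(before, after):
    """Classify changes between two snapshots.

    Both arguments are hypothetical dicts mapping row ID -> row value.
    """
    changes = {"insert": [], "delete": [], "update": []}
    for row_id, row in after.items():
        if row_id not in before:
            changes["insert"].append(row_id)   # ID never seen before
        elif row != before[row_id]:
            changes["update"].append(row_id)   # same ID, new value
    for row_id in before:
        if row_id not in after:
            changes["delete"].append(row_id)   # ID no longer present
    return changes


before = {0: "a", 1: "b"}

# Writer that preserves row IDs: the modified row keeps ID 1.
preserved = {0: "a", 1: "B"}

# Writer that does not preserve row IDs: the rewritten row gets a fresh ID.
not_preserved = {0: "a", 2: "B"}

print(classify_changes(before, preserved))
print(classify_changes(before, not_preserved))
```

With preserved IDs the change is reported as an update to row 1; without
them the same logical change shows up as a delete of row 1 plus an insert of
row 2, which is the degraded but still interpretable lineage described above.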

Otherwise, the feature is optional. Engines may choose not to implement it,
which is a big problem. If an engine doesn’t implement row ID preservation
(which, by the way, is not hard!) do we really expect that engine to not
allow writes to tables with row lineage enabled? I think that we would have
engines not implementing the feature, but writing to row lineage tables
anyway because the result is just degraded lineage information. So it’s
better just to require it all the time to make sure engines do implement
the proper row ID handling and so that the lineage metadata is reliable.

I think this is similar to the decision we made with v2 deletes. We made it
possible to have more than one position delete file to have easier
requirements for writers and assumed that tables would be maintained. That
backfired and we ended up needing to fix it in v3. I think we should apply
the same lesson here. Because row lineage is useful and is a reasonable
requirement for writers, we should make sure it is always on.

Ryan

On Thu, Mar 20, 2025 at 9:03 AM Steven Wu <stevenz...@gmail.com> wrote:

> During the sync, we were mostly aligned that the row lineage semantics for
> updates depends on how the writer engine interprets/implements (e.g. Flink
> with equality deletes).
>
> Now, if we make it required for V3 tables, what if users don't need the
> row lineage feature. There is a bit overhead (although low) for row
> lineage. E.g., extra metadata columns in data files during
> rewrite/compaction.
>
> On Thu, Mar 20, 2025 at 8:53 AM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> I think I'm in favor of this but I would like some way of knowing whether
>> or not a snapshot was produced while preserving row_ids or not. Just so we
>> can make it clear on read what the row-lineage behavior of the writer was
>> without knowing what system wrote the data.
>>
>> On Thu, Mar 20, 2025 at 10:43 AM Ryan Blue <rdb...@gmail.com> wrote:
>>
>>> +1 for the PR and always having the lineage metadata.
>>>
>>> I think that is going to make the feature much more reliable. We don't
>>> gain anything from allowing the feature to be turned off for compatibility,
>>> when we have reasonable ways to interpret data written by any engine.
>>>
>>> Ryan
>>>
>>> On Wed, Mar 19, 2025 at 12:37 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>>> Hey everyone,
>>>>
>>>> When Row lineage was originally introduced, it was believed to be
>>>> incompatible with equality deletes and we initially added lineage as a
>>>> feature that could be turned on.  Now that these features can co-exist
>>>> <https://lists.apache.org/thread/vhl7p72433m904y115hmh7vnnbjdz4xn>, we
>>>> would like to require lineage for v3 as there are benefits for feature
>>>> enablement and adoption.
>>>>
>>>> We discussed this topic
>>>> <https://www.youtube.com/watch?v=9BBZKTfcU0s&t=1475s> in the community
>>>> sync, but I would like to raise this here on the dev list to have a broader
>>>> discussion around the spec changes proposed in this PR
>>>> <https://github.com/apache/iceberg/pull/12580>.
>>>>
>>>> The main points of discussion were around what implications this
>>>> requirement has for writers, especially where tracking changes may be
>>>> difficult/expensive.  The proposal is to address this in the spec by
>>>> clarifying that treating changes as upserts vs. delete/add is
>>>> an engine implementation decision. The update to the spec would state that
>>>> engines *should* track row ids through modification, but depending on
>>>> the engine or the viability of tracking these changes, an engine may choose
>>>> to model updates as deletes/adds.
>>>>
>>>> I'd love to get everyone's thoughts/feedback on this proposal since we
>>>> still have the opportunity to change this for v3,
>>>> -Dan
>>>>
>>>
