Re: [DISCUSS] Row lineage required for v3

Eduard Tudenhöfner Thu, 20 Mar 2025 23:16:58 -0700

I'm convinced that always having lineage metadata is the right call, so
I'm +1 here.


On Thu, Mar 20, 2025 at 11:15 PM Ryan Blue <rdb...@gmail.com> wrote:

> > Now, if we make it required for V3 tables, what if users don’t need the
> > row lineage feature? There is a bit of overhead (although low) for row
> > lineage. E.g., extra metadata columns in data files during
> > rewrite/compaction.
>
> The majority of the work for row lineage is done behind the Iceberg API.
> The only overhead users would see is two additional columns in the data,
> for row ID and last updated sequence number. Those columns compress quite
> well and if an engine chooses not to preserve row IDs, then they can even
> be omitted. So I don’t think that the overhead is something to worry about.
>
> I think the bigger problem for users is what to do when they don’t have
> lineage information but need it. They have to figure out how to turn on row
> lineage and then it only helps for future issues. If they want to debug a
> MERGE statement that did something strange, it won’t help until the next
> time there’s a problem. Similarly, when a downstream consumer wants to
> consume the table with a streaming engine, they need to get the table owner
> to turn on the lineage feature. To me, these headaches seem like a much
> worse user experience than making lineage always-on.
>
> The drawback of always-on lineage is the minor storage overhead and the
> fact that some engines don’t support it and you get lineage that looks like
> inserts and deletes. But lineage that looks like inserts and deletes is all
> you get if lineage is off by default, so I strongly prefer allowing engines
> that do support it to produce the data correctly so that users can take
> advantage of it.
>
> > I would like some way of knowing whether or not a snapshot was produced
> > while preserving row_ids or not. Just so we can make it clear on read what
> > the row-lineage behavior of the writer was without knowing what system
> > wrote the data.
>
> I think that always-on lineage means engines should always preserve
> row_ids. As with any feature, we know what will happen when support is not
> implemented — operations look like deletes and inserts. But I think that we
> will have far more adherence to the spec if we make it a real requirement.
>
> Otherwise, the feature is optional. Engines may choose not to implement
> it, which is a big problem. If an engine doesn’t implement row ID
> preservation (which, by the way, is not hard!) do we really expect that
> engine to not allow writes to tables with row lineage enabled? I think that
> we would have engines not implementing the feature, but writing to row
> lineage tables anyway because the result is just degraded lineage
> information. So it’s better just to require it all the time to make sure
> engines do implement the proper row ID handling and so that the lineage
> metadata is reliable.
>
> I think this is similar to the decision we made with v2 deletes. We made
> it possible to have more than one position delete file to have easier
> requirements for writers and assumed that tables would be maintained. That
> backfired and we ended up needing to fix it in v3. I think we should do the
> same thing here. Because row lineage is useful and is a reasonable
> requirement for writers, we should make sure it is always on.
>
> Ryan
>
> On Thu, Mar 20, 2025 at 9:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>
>> During the sync, we were mostly aligned that the row lineage semantics
>> for updates depend on how the writer engine interprets/implements them
>> (e.g. Flink with equality deletes).
>>
>> Now, if we make it required for V3 tables, what if users don't need the
>> row lineage feature? There is a bit of overhead (although low) for row
>> lineage. E.g., extra metadata columns in data files during
>> rewrite/compaction.
>>
>> On Thu, Mar 20, 2025 at 8:53 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I think I'm in favor of this but I would like some way of knowing
>>> whether or not a snapshot was produced while preserving row_ids or not.
>>> Just so we can make it clear on read what the row-lineage behavior of the
>>> writer was without knowing what system wrote the data.
>>>
>>> On Thu, Mar 20, 2025 at 10:43 AM Ryan Blue <rdb...@gmail.com> wrote:
>>>
>>>> +1 for the PR and always having the lineage metadata.
>>>>
>>>> I think that is going to make the feature much more reliable. We don't
>>>> gain anything from allowing the feature to be turned off for compatibility,
>>>> when we have reasonable ways to interpret data written by any engine.
>>>>
>>>> Ryan
>>>>
>>>> On Wed, Mar 19, 2025 at 12:37 PM Daniel Weeks <dwe...@apache.org>
>>>> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> When Row lineage was originally introduced, it was believed to be
>>>>> incompatible with equality deletes and we initially added lineage as a
>>>>> feature that could be turned on.  Now that these features can co-exist
>>>>> <https://lists.apache.org/thread/vhl7p72433m904y115hmh7vnnbjdz4xn>,
>>>>> we would like to require lineage for v3 as there are benefits for feature
>>>>> enablement and adoption.
>>>>>
>>>>> We discussed this topic
>>>>> <https://www.youtube.com/watch?v=9BBZKTfcU0s&t=1475s> in the
>>>>> community sync, but I would like to raise this here on the dev list to
>>>>> have a broader discussion around the spec changes proposed in this PR
>>>>> <https://github.com/apache/iceberg/pull/12580>.
>>>>>
>>>>> The main points of discussion were around what implications this
>>>>> requirement has for writers, especially where tracking changes may be
>>>>> difficult/expensive.  The proposal is to address this in the spec by
>>>>> clarifying that treating changes as upserts vs. delete/add is
>>>>> an engine implementation decision. The update to the spec would state that
>>>>> engines *should* track row ids through modification, but depending on
>>>>> the engine or the viability of tracking these changes, an engine may
>>>>> choose to model updates as deletes/adds.
>>>>>
>>>>> I'd love to get everyone's thoughts/feedback on this proposal since we
>>>>> still have the opportunity to change this for v3.
>>>>> -Dan
>>>>>
>>>>
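
For readers following the thread, the distinction between an engine that
preserves row IDs on update and one that models an update as a delete plus
an insert can be sketched as below. This is a toy model, not the Iceberg
API: the Row class, its field names (row_id, last_updated_seq), both update
functions, and the looks_like_update helper are hypothetical names chosen
for illustration. Only the two metadata columns themselves (a row ID and a
last updated sequence number) come from the discussion above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Row:
    # Hypothetical stand-ins for the two lineage columns discussed above.
    row_id: int
    last_updated_seq: int
    value: str


def update_preserving(row: Row, new_value: str, seq: int) -> Row:
    """Engine tracks the row through the update: the row ID is kept and
    only the last-updated sequence number advances."""
    return replace(row, value=new_value, last_updated_seq=seq)


def update_as_delete_insert(row: Row, new_value: str, seq: int,
                            fresh_row_id: int) -> Row:
    """Engine does not track the row: the old row is dropped and a new
    row with a fresh ID is written, so the change looks like an insert."""
    return Row(row_id=fresh_row_id, last_updated_seq=seq, value=new_value)


def looks_like_update(before: Row, after: Row) -> bool:
    """A reader can link the two versions only if the row ID survived."""
    return before.row_id == after.row_id


original = Row(row_id=7, last_updated_seq=1, value="a")
kept = update_preserving(original, "b", seq=2)
replaced = update_as_delete_insert(original, "b", seq=2, fresh_row_id=100)
```

In this sketch, looks_like_update(original, kept) is true while
looks_like_update(original, replaced) is false, which is the "lineage that
looks like inserts and deletes" outcome described in the thread.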
