Re: [DISCUSS] Row lineage required for v3

Ryan Blue Fri, 21 Mar 2025 09:43:22 -0700

> For streaming applications (Flink, Kafka Connect) this is a non-trivial
> task. The engine either has to keep everything in memory or do continuous
> lookups.


Sorry, I should be more clear. I was referring to the work of propagating
row ID and last updated sequence number, *assuming that the engine is
already reading existing rows*. As we’ve said, equality deletes that allow
you to avoid reading the existing data are used for streaming applications
where reading the existing data is hard, as Peter notes.

I do agree that streaming is an important part of the Iceberg ecosystem.
That’s partly why it’s important to have good lineage data to support
streaming *from* tables. Streaming engines writing to tables still have the
ability to trade off by using equality deletes (which show up as delete and
insert) or by doing some work to use a positional delete and recover the
row ID.
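
As a rough illustration of that trade-off, here is a minimal Python sketch of
how the two lineage fields behave under each strategy. The `Row` class and both
helper functions are hypothetical names made up for this example, not Iceberg
APIs, and fresh row ID assignment is simplified to a caller-supplied value.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Row:
    # Hypothetical stand-ins for the two lineage columns discussed above.
    row_id: int
    last_updated_seq: int
    value: str


def update_preserving_lineage(existing: Row, new_value: str, commit_seq: int) -> Row:
    # The writer has read the existing row, so it can carry the row ID
    # forward and only bump the last updated sequence number.
    return replace(existing, value=new_value, last_updated_seq=commit_seq)


def update_as_delete_and_insert(new_value: str, commit_seq: int, next_row_id: int) -> Row:
    # The writer never read the old row (equality delete), so the new row
    # gets a fresh ID and the change looks like a delete plus an insert.
    return Row(row_id=next_row_id, last_updated_seq=commit_seq, value=new_value)


old = Row(row_id=7, last_updated_seq=3, value="a")
preserved = update_preserving_lineage(old, "b", commit_seq=4)
replaced = update_as_delete_and_insert("b", commit_seq=4, next_row_id=100)
```

In the first case a downstream reader can match the new row to the old one by
row ID; in the second it cannot, which is the degraded lineage described above.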

> There should be a table property which allows “updates as insert and delete”
> which defaults to false.

Would this property cause streaming writes using equality deletes to fail
until the table is updated? I’m open to this solution since I think people
should definitely be aware of the trade-offs they’re making in their tables.
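
To make the question concrete, a minimal sketch of the check a writer might
run before committing, under the behavior Peter describes. The property name
`updates-as-insert-and-delete` and the function `can_write` are assumptions
for illustration only; no such property exists in the spec today.

```python
# Hypothetical property name; the thread only describes it informally.
UPDATES_AS_INSERT_AND_DELETE = "updates-as-insert-and-delete"


def can_write(table_properties: dict, engine_preserves_row_ids: bool) -> bool:
    # The property defaults to false, so a missing key means only
    # lineage-preserving writers may update the table.
    allowed = table_properties.get(UPDATES_AS_INSERT_AND_DELETE, "false").lower() == "true"
    return engine_preserves_row_ids or allowed


# A streaming writer using equality deletes against a default table is rejected.
streaming_on_default = can_write({}, engine_preserves_row_ids=False)
# The same writer is accepted once the table owner opts in.
streaming_after_opt_in = can_write(
    {UPDATES_AS_INSERT_AND_DELETE: "true"}, engine_preserves_row_ids=False
)
```

Under these assumptions, existing equality-delete writers would fail until the
table owner sets the property.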

Ryan

On Thu, Mar 20, 2025 at 11:34 PM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> I agree with most of what Ryan said with a single important exception:
>
> > If an engine doesn’t implement row ID preservation (which, by the way,
> is not hard!) [..]
>
> For streaming applications (Flink, Kafka Connect) this is a non-trivial
> task. The engine either has to keep everything in memory or do continuous
> lookups.
>
> I see streaming applications as an important part of the Iceberg ecosystem
> (might be a bit biased 😉), and I think there will be a need for a version
> of the connectors which is less resource intensive than the solutions
> mentioned above.
>
> I also think it is *important to have row lineage information and the
> user should be able to trust the data*, so I'm leaning towards:
> - RowId should always be turned on (don't want to see "if"-s everywhere in
> the code for saving a few bytes)
> - There should be a table property which allows "updates as insert and
> delete" which defaults to false.
>
> Engines which don't implement updates preserving rowIds should only write
> to tables where the property is true. This way the user could be confident
> that the row lineage information is correct when the property is false.
>
> Thanks, Peter
>
> On Thu, Mar 20, 2025, 23:15 Ryan Blue <rdb...@gmail.com> wrote:
>
>> > Now, if we make it required for V3 tables, what if users don’t need the
>> > row lineage feature? There is a bit of overhead (although low) for row
>> > lineage. E.g., extra metadata columns in data files during
>> > rewrite/compaction.
>>
>> The majority of the work for row lineage is done behind the Iceberg API.
>> The only overhead users would see is two additional columns in the data,
>> for row ID and last updated sequence number. Those columns compress quite
>> well and if an engine chooses not to preserve row IDs, then they can even
>> be omitted. So I don’t think that the overhead is something to worry about.
>>
>> I think the bigger problem for users is what to do when they don’t have
>> lineage information but need it. They have to figure out how to turn on row
>> lineage and then it only helps for future issues. If they want to debug a
>> MERGE statement that did something strange, it won’t help until the next
>> time there’s a problem. Similarly, when a downstream consumer wants to
>> consume the table with a streaming engine, they need to get the table owner
>> to turn on the lineage feature. To me, these headaches seem like a much
>> worse user experience than making lineage always-on.
>>
>> The drawback of always-on lineage is the minor storage overhead and the
>> fact that some engines don’t support it and you get lineage that looks like
>> inserts and deletes. But lineage that looks like inserts and deletes is all
>> you get if lineage is off by default, so I strongly prefer allowing engines
>> that do support it to produce the data correctly so that users can take
>> advantage of it.
>>
>> > I would like some way of knowing whether or not a snapshot was produced
>> > while preserving row_ids. Just so we can make it clear on read what
>> > the row-lineage behavior of the writer was without knowing what system
>> > wrote the data.
>>
>> I think that always-on lineage means engines should always preserve
>> row_ids. As with any feature, we know what will happen when support is not
>> implemented — operations look like deletes and inserts. But I think that we
>> will have far more adherence to the spec if we make it a real requirement.
>>
>> Otherwise, the feature is optional. Engines may choose not to implement
>> it, which is a big problem. If an engine doesn’t implement row ID
>> preservation (which, by the way, is not hard!) do we really expect that
>> engine to not allow writes to tables with row lineage enabled? I think that
>> we would have engines not implementing the feature, but writing to row
>> lineage tables anyway because the result is just degraded lineage
>> information. So it’s better just to require it all the time to make sure
>> engines do implement the proper row ID handling and so that the lineage
>> metadata is reliable.
>>
>> I think this is similar to the decision we made with v2 deletes. We made
>> it possible to have more than one position delete file to have easier
>> requirements for writers and assumed that tables would be maintained. That
>> backfired and we ended up needing to fix it in v3. I think we should do the
>> same thing here. Because row lineage is useful and is a reasonable
>> requirement for writers, we should make sure it is always on.
>>
>> Ryan
>>
>> On Thu, Mar 20, 2025 at 9:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> During the sync, we were mostly aligned that the row lineage semantics
>>> for updates depend on how the writer engine interprets/implements them
>>> (e.g. Flink with equality deletes).
>>>
>>> Now, if we make it required for V3 tables, what if users don't need the
>>> row lineage feature? There is a bit of overhead (although low) for row
>>> lineage. E.g., extra metadata columns in data files during
>>> rewrite/compaction.
>>>
>>> On Thu, Mar 20, 2025 at 8:53 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> I think I'm in favor of this but I would like some way of knowing
>>>> whether or not a snapshot was produced while preserving row_ids.
>>>> Just so we can make it clear on read what the row-lineage behavior of the
>>>> writer was without knowing what system wrote the data.
>>>>
>>>> On Thu, Mar 20, 2025 at 10:43 AM Ryan Blue <rdb...@gmail.com> wrote:
>>>>
>>>>> +1 for the PR and always having the lineage metadata.
>>>>>
>>>>> I think that is going to make the feature much more reliable. We don't
>>>>> gain anything from allowing the feature to be turned off for compatibility,
>>>>> when we have reasonable ways to interpret data written by any engine.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Wed, Mar 19, 2025 at 12:37 PM Daniel Weeks <dwe...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> When Row lineage was originally introduced, it was believed to be
>>>>>> incompatible with equality deletes and we initially added lineage as a
>>>>>> feature that could be turned on.  Now that these features can
>>>>>> co-exist
>>>>>> <https://lists.apache.org/thread/vhl7p72433m904y115hmh7vnnbjdz4xn>,
>>>>>> we would like to require lineage for v3 as there are benefits for feature
>>>>>> enablement and adoption.
>>>>>>
>>>>>> We discussed this topic
>>>>>> <https://www.youtube.com/watch?v=9BBZKTfcU0s&t=1475s> in the
>>>>>> community sync, but I would like to raise this here on the dev list to have
>>>>>> a broader discussion around the spec changes proposed in this PR
>>>>>> <https://github.com/apache/iceberg/pull/12580>.
>>>>>>
>>>>>> The main points of discussion were around what implications this
>>>>>> requirement has for writers, especially where tracking changes may be
>>>>>> difficult/expensive.  The proposal is to address this in the spec by
>>>>>> clarifying that treating changes as upserts vs. delete/add is an engine
>>>>>> implementation decision. The update to the spec would state that
>>>>>> engines *should* track row ids through modification, but depending
>>>>>> on the engine or the viability of tracking these changes, an engine may
>>>>>> choose to model updates as deletes/adds.
>>>>>>
>>>>>> I'd love to get everyone's thoughts/feedback on this proposal since we
>>>>>> still have the opportunity to change this for v3,
>>>>>> -Dan
>>>>>>
>>>>>
