Re: [DISCUSS] Row lineage required for v3

2025-04-05 Thread Steven Wu
During the sync, we were mostly aligned that the row lineage semantics for updates depend on how the writer engine interprets/implements them (e.g. Flink with equality deletes). Now, if we make it required for V3 tables, what if users don't need the row lineage feature? There is a bit of overhead (althou

Re: [DISCUSS] Row lineage required for v3

2025-03-25 Thread Ryan Blue
Okay, it sounds like we have consensus that it's a good idea to make row lineage required in v3 and that it's a good idea to signal to engines when they can write delete-and-insert changes. I think we need a bit more discussion on how to signal to engines, but in the meantime we can move forward wi

Re: [DISCUSS] Row lineage required for v3

2025-03-24 Thread Péter Váry
> Would this property cause streaming writes using equality deletes to fail until the table is updated? I’m open to this solution since I think people should definitely be aware of the trade-offs they’re making in their tables.
I don't think we can do such a check on the Iceberg side. As discussed

Re: [DISCUSS] Row lineage required for v3

2025-03-21 Thread Amogh Jahagirdar
I support enabling row lineage by default primarily because of the ecosystem benefit that enables engines to rely on lineage without requiring users to opt in explicitly. This should generally apply to most engines and integrations. However, as we know there are specific cases in the ecosystem—su

Re: [DISCUSS] Row lineage required for v3

2025-03-21 Thread Ryan Blue
> For streaming applications (Flink, Kafka Connect) this is a non-trivial task. The engine either has to keep everything in memory or do continuous lookups.
Sorry, I should be more clear. I was referring to the work of propagating row ID and last updated sequence number, *assuming that the engine is

Re: [DISCUSS] Row lineage required for v3

2025-03-20 Thread Péter Váry
I agree with most of what Ryan said, with a single important exception:
> If an engine doesn’t implement row ID preservation (which, by the way, is not hard!) [..]
For streaming applications (Flink, Kafka Connect) this is a non-trivial task. The engine either has to keep everything in memory or do co

Re: [DISCUSS] Row lineage required for v3

2025-03-20 Thread Eduard Tudenhöfner
I'm convinced that always having lineage metadata is the right call, so I'm +1 here.
On Thu, Mar 20, 2025 at 11:15 PM Ryan Blue wrote:
> Now, if we make it required for V3 tables, what if users don’t need the row lineage feature? There is a bit of overhead (although low) for row lineage. E.g.,

Re: [DISCUSS] Row lineage required for v3

2025-03-20 Thread Ryan Blue
> Now, if we make it required for V3 tables, what if users don’t need the row lineage feature? There is a bit of overhead (although low) for row lineage. E.g., extra metadata columns in data files during rewrite/compaction.
The majority of the work for row lineage is done behind the Iceberg API. The on

Re: [DISCUSS] Row lineage required for v3

2025-03-20 Thread Ryan Blue
+1 for the PR and always having the lineage metadata. I think that is going to make the feature much more reliable. We don't gain anything from allowing the feature to be turned off for compatibility, when we have reasonable ways to interpret data written by any engine.
Ryan
On Wed, Mar 19, 2025

Re: [DISCUSS] Row lineage required for v3

2025-03-20 Thread Russell Spitzer
I think I'm in favor of this, but I would like some way of knowing whether a snapshot was produced while preserving row_ids, just so we can make it clear on read what the row-lineage behavior of the writer was without knowing what system wrote the data.
On Thu, Mar 20, 2025 at 10:43 A

[DISCUSS] Row lineage required for v3

2025-03-19 Thread Daniel Weeks
Hey everyone,
When row lineage was originally introduced, it was believed to be incompatible with equality deletes, and we initially added lineage as a feature that could be turned on. Now that these features can co-exist, we would