> Now, if we make it required for V3 tables, what if users don’t need the row lineage feature? There is a bit of overhead (although low) for row lineage. E.g., extra metadata columns in data files during rewrite/compaction.

The majority of the work for row lineage is done behind the Iceberg API. The only overhead users would see is two additional columns in the data, for row ID and last updated sequence number. Those columns compress quite well and if an engine chooses not to preserve row IDs, then they can even be omitted. So I don’t think that the overhead is something to worry about.

I think the bigger problem for users is what to do when they don’t have lineage information but need it. They have to figure out how to turn on row lineage and then it only helps for future issues. If they want to debug a MERGE statement that did something strange, it won’t help until the next time there’s a problem. Similarly, when a downstream consumer wants to consume the table with a streaming engine, they need to get the table owner to turn on the lineage feature. To me, these headaches seem like a much worse user experience than making lineage always-on.

The drawback of always-on lineage is the minor storage overhead and the fact that some engines don’t support it and you get lineage that looks like inserts and deletes. But lineage that looks like inserts and deletes is all you get if lineage is off by default, so I strongly prefer allowing engines that do support it to produce the data correctly so that users can take advantage of it.

> I would like some way of knowing whether or not a snapshot was produced while preserving row_ids or not. Just so we can make it clear on read what the row-lineage behavior of the writer was without knowing what system wrote the data.

I think that always-on lineage means engines should always preserve row_ids. As with any feature, we know what will happen when support is not implemented: operations look like deletes and inserts. But I think that we will have far more adherence to the spec if we make it a real requirement. Otherwise, the feature is optional. Engines may choose not to implement it, which is a big problem.

If an engine doesn’t implement row ID preservation (which, by the way, is not hard!) do we really expect that engine to not allow writes to tables with row lineage enabled? I think that we would have engines not implementing the feature, but writing to row lineage tables anyway because the result is just degraded lineage information. So it’s better just to require it all the time to make sure engines do implement the proper row ID handling and so that the lineage metadata is reliable.

I think this is similar to the decision we made with v2 deletes. We made it possible to have more than one position delete file to have easier requirements for writers and assumed that tables would be maintained. That backfired and we ended up needing to fix it in v3. I think we should do the same thing here. Because row lineage is useful and is a reasonable requirement for writers, we should make sure it is always on.

Ryan

On Thu, Mar 20, 2025 at 9:03 AM Steven Wu <stevenz...@gmail.com> wrote:

> During the sync, we were mostly aligned that the row lineage semantics for
> updates depends on how the writer engine interprets/implements (e.g. Flink
> with equality deletes).
>
> Now, if we make it required for V3 tables, what if users don't need the
> row lineage feature. There is a bit overhead (although low) for row
> lineage. E.g., extra metadata columns in data files during
> rewrite/compaction.
>
> On Thu, Mar 20, 2025 at 8:53 AM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> I think I'm in favor of this but I would like some way of knowing whether
>> or not a snapshot was produced while preserving row_ids or not. Just so we
>> can make it clear on read what the row-lineage behavior of the writer was
>> without knowing what system wrote the data.
>>
>> On Thu, Mar 20, 2025 at 10:43 AM Ryan Blue <rdb...@gmail.com> wrote:
>>
>>> +1 for the PR and always having the lineage metadata.
>>>
>>> I think that is going to make the feature much more reliable. We don't
>>> gain anything from allowing the feature to be turned off for compatibility,
>>> when we have reasonable ways to interpret data written by any engine.
>>>
>>> Ryan
>>>
>>> On Wed, Mar 19, 2025 at 12:37 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>>> Hey everyone,
>>>>
>>>> When Row lineage was originally introduced, it was believed to be
>>>> incompatible with equality deletes and we initially added lineage as a
>>>> feature that could be turned on. Now that these features can co-exist
>>>> <https://lists.apache.org/thread/vhl7p72433m904y115hmh7vnnbjdz4xn>, we
>>>> would like to require lineage for v3 as there are benefits for feature
>>>> enablement and adoption.
>>>>
>>>> We discussed this topic
>>>> <https://www.youtube.com/watch?v=9BBZKTfcU0s&t=1475s> in the community
>>>> sync, but I would like to raise this here on the dev list to have a broader
>>>> discussion around the spec changes proposed in this PR
>>>> <https://github.com/apache/iceberg/pull/12580>.
>>>>
>>>> The main points of discussion were around what implications this
>>>> requirement has for writers, especially where tracking changes may be
>>>> difficult/expensive. The proposal is to address this in the spec by
>>>> clarifying the semantics of treating changes as upserts vs. delete/add is
>>>> an engine implementation decision. The update to the spec would state that
>>>> engines *should* track row ids through modification, but depending on
>>>> the engine or the viability of tracking these changes, an engine may choose
>>>> to model updates as deletes/adds.
>>>>
>>>> I'd love to get everyones thoughts/feedback on this proposal since we
>>>> still have the opportunity to change this for v3,
>>>> -Dan
>>>>
>>>