I'm convinced that always having lineage metadata is the right call, so I'm +1 here.
On Thu, Mar 20, 2025 at 11:15 PM Ryan Blue <rdb...@gmail.com> wrote:
> Now, if we make it required for V3 tables, what if users don’t need the
> row lineage feature? There is a bit of overhead (although low) for row
> lineage. E.g., extra metadata columns in data files during
> rewrite/compaction.
>
> The majority of the work for row lineage is done behind the Iceberg API.
> The only overhead users would see is two additional columns in the data,
> for row ID and last updated sequence number. Those columns compress quite
> well and if an engine chooses not to preserve row IDs, then they can even
> be omitted. So I don’t think that the overhead is something to worry about.
>
> I think the bigger problem for users is what to do when they don’t have
> lineage information but need it. They have to figure out how to turn on row
> lineage and then it only helps for future issues. If they want to debug a
> MERGE statement that did something strange, it won’t help until the next
> time there’s a problem. Similarly, when a downstream consumer wants to
> consume the table with a streaming engine, they need to get the table owner
> to turn on the lineage feature. To me, these headaches seem like a much
> worse user experience than making lineage always-on.
>
> The drawback of always-on lineage is the minor storage overhead and the
> fact that some engines don’t support it and you get lineage that looks like
> inserts and deletes. But lineage that looks like inserts and deletes is all
> you get if lineage is off by default, so I strongly prefer allowing engines
> that do support it to produce the data correctly so that users can take
> advantage of it.
>
> I would like some way of knowing whether or not a snapshot was produced
> while preserving row_ids. Just so we can make it clear on read what
> the row-lineage behavior of the writer was without knowing what system
> wrote the data.
>
> I think that always-on lineage means engines should always preserve
> row_ids.
> As with any feature, we know what will happen when support is not
> implemented: operations look like deletes and inserts. But I think that we
> will have far more adherence to the spec if we make it a real requirement.
>
> Otherwise, the feature is optional. Engines may choose not to implement
> it, which is a big problem. If an engine doesn’t implement row ID
> preservation (which, by the way, is not hard!) do we really expect that
> engine to not allow writes to tables with row lineage enabled? I think that
> we would have engines not implementing the feature, but writing to row
> lineage tables anyway because the result is just degraded lineage
> information. So it’s better just to require it all the time to make sure
> engines do implement the proper row ID handling and so that the lineage
> metadata is reliable.
>
> I think this is similar to the decision we made with v2 deletes. We made
> it possible to have more than one position delete file to have easier
> requirements for writers and assumed that tables would be maintained. That
> backfired and we ended up needing to fix it in v3. I think we should do the
> same thing here. Because row lineage is useful and is a reasonable
> requirement for writers, we should make sure it is always on.
>
> Ryan
>
> On Thu, Mar 20, 2025 at 9:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>
>> During the sync, we were mostly aligned that the row lineage semantics
>> for updates depend on how the writer engine interprets/implements them
>> (e.g. Flink with equality deletes).
>>
>> Now, if we make it required for V3 tables, what if users don't need the
>> row lineage feature? There is a bit of overhead (although low) for row
>> lineage. E.g., extra metadata columns in data files during
>> rewrite/compaction.
>>
>> On Thu, Mar 20, 2025 at 8:53 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> I think I'm in favor of this but I would like some way of knowing
>>> whether or not a snapshot was produced while preserving row_ids.
>>> Just so we can make it clear on read what the row-lineage behavior of the
>>> writer was without knowing what system wrote the data.
>>>
>>> On Thu, Mar 20, 2025 at 10:43 AM Ryan Blue <rdb...@gmail.com> wrote:
>>>
>>>> +1 for the PR and always having the lineage metadata.
>>>>
>>>> I think that is going to make the feature much more reliable. We don't
>>>> gain anything from allowing the feature to be turned off for compatibility,
>>>> when we have reasonable ways to interpret data written by any engine.
>>>>
>>>> Ryan
>>>>
>>>> On Wed, Mar 19, 2025 at 12:37 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> When row lineage was originally introduced, it was believed to be
>>>>> incompatible with equality deletes and we initially added lineage as a
>>>>> feature that could be turned on. Now that these features can co-exist
>>>>> <https://lists.apache.org/thread/vhl7p72433m904y115hmh7vnnbjdz4xn>,
>>>>> we would like to require lineage for v3 as there are benefits for feature
>>>>> enablement and adoption.
>>>>>
>>>>> We discussed this topic
>>>>> <https://www.youtube.com/watch?v=9BBZKTfcU0s&t=1475s> in the
>>>>> community sync, but I would like to raise this here on the dev list to
>>>>> have a broader discussion around the spec changes proposed in this PR
>>>>> <https://github.com/apache/iceberg/pull/12580>.
>>>>>
>>>>> The main points of discussion were around what implications this
>>>>> requirement has for writers, especially where tracking changes may be
>>>>> difficult/expensive. The proposal is to address this in the spec by
>>>>> clarifying that treating changes as upserts vs. delete/add is
>>>>> an engine implementation decision.
>>>>> The update to the spec would state that
>>>>> engines *should* track row ids through modification, but depending on
>>>>> the engine or the viability of tracking these changes, an engine may
>>>>> choose to model updates as deletes/adds.
>>>>>
>>>>> I'd love to get everyone's thoughts/feedback on this proposal since we
>>>>> still have the opportunity to change this for v3.
>>>>> -Dan
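
For readers following the thread, here is a minimal sketch of the difference between an engine that preserves row IDs on update and one that models updates as deletes/adds. It is a toy in-memory model, not Iceberg code: the `Row` class, its field names, and the three helper functions are hypothetical and only stand in for the two lineage columns mentioned above (row ID and last updated sequence number).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Row:
    # Hypothetical stand-ins for the two lineage columns discussed in the thread.
    row_id: int
    last_updated_seq: int
    value: str


def update_preserving(rows, row_id, value, seq):
    """Engine that tracks row IDs: same row_id, newer sequence number."""
    return [
        replace(r, value=value, last_updated_seq=seq) if r.row_id == row_id else r
        for r in rows
    ]


def update_as_delete_insert(rows, row_id, value, seq, next_row_id):
    """Engine that does not track row IDs: old row removed, new row gets a fresh ID."""
    kept = [r for r in rows if r.row_id != row_id]
    return kept + [Row(next_row_id, seq, value)]


def changelog(before, after):
    """Derive a change list by comparing two snapshots on row_id."""
    old = {r.row_id: r for r in before}
    new = {r.row_id: r for r in after}
    changes = []
    for rid in sorted(old.keys() | new.keys()):
        if rid not in new:
            changes.append(("delete", rid))
        elif rid not in old:
            changes.append(("insert", rid))
        elif new[rid].last_updated_seq != old[rid].last_updated_seq:
            changes.append(("update", rid))
    return changes


snapshot = [Row(0, 1, "a"), Row(1, 1, "b")]
preserved = update_preserving(snapshot, 1, "B", seq=2)
rewritten = update_as_delete_insert(snapshot, 1, "B", seq=2, next_row_id=2)

print(changelog(snapshot, preserved))  # the update is visible as an update
print(changelog(snapshot, rewritten))  # the same change looks like delete + insert
```

Both writers leave the table with the same data, but only the first lets a downstream reader see the change as an update, which is the degraded-lineage case the thread describes.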