Pull request is available; please focus any remaining comments there and we
can wrap this one up:

https://github.com/apache/iceberg/pull/11130
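
If we go with making row lineage and equality deletes mutually exclusive, as
discussed below, the write-side guard would look roughly like the following.
This is only a sketch: the property key, class, and method names here are
made up for illustration, and the real spec wording is in the PR.

import java.util.Map;

public class RowLineageGuard {
  // Hypothetical property key; the real flag is defined in the spec PR.
  static final String ROW_LINEAGE = "row-lineage";

  static void validateWrite(Map<String, String> tableProperties,
                            boolean writeHasEqualityDeletes) {
    boolean lineageEnabled = Boolean.parseBoolean(
        tableProperties.getOrDefault(ROW_LINEAGE, "false"));
    if (lineageEnabled && writeHasEqualityDeletes) {
      // An equality delete never reads the row it removes, so there is
      // no way to carry its lineage id forward; reject the write.
      throw new IllegalArgumentException(
          "Equality deletes cannot be committed to a table with row lineage enabled");
    }
  }

  public static void main(String[] args) {
    validateWrite(Map.of(ROW_LINEAGE, "true"), false); // plain append: ok
    validateWrite(Map.of(ROW_LINEAGE, "true"), true);  // throws
  }
}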

On Thu, Aug 29, 2024 at 11:20 AM rdb...@gmail.com <rdb...@gmail.com> wrote:

> +1 for making row lineage and equality deletes mutually exclusive.
>
> The idea behind equality deletes is to avoid needing to read existing data
> in order to delete records. That doesn't fit with row lineage, because the
> purpose of lineage is to be able to identify when a row changes by
> maintaining an identifier that would have to be read.
>
> On Wed, Aug 28, 2024 at 4:16 PM Anton Okolnychyi <aokolnyc...@gmail.com>
> wrote:
>
>> I went through the proposal and left comments as well. Thanks for working
>> on it, Russell!
>>
>> I don't see a good solution to how row lineage can work with equality
>> deletes. In that case, I would be in favor of not allowing equality
>> deletes at all when row lineage is enabled, as opposed to treating all
>> added data records as new. I will spend more time thinking about whether
>> we can make it work.
>>
>> - Anton
>>
>> On Wed, Aug 28, 2024 at 12:41 Ryan Blue <b...@databricks.com.invalid>
>> wrote:
>>
>>> Sounds good to me. Thanks for pushing this forward, Russell!
>>>
>>> On Tue, Aug 27, 2024 at 7:17 PM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> Folks have had a lot of good comments, and since there haven't been
>>>> many strong opinions, I'm going to take what I think are the least
>>>> interesting options and move them into the "discarded" section. Please
>>>> continue to comment, and let's make sure anything folks consider a
>>>> blocker for a spec PR is resolved. If we have general consensus at a
>>>> high level, I think we can move to discussing the actual spec changes
>>>> on a spec change PR.
>>>>
>>>> I'm going to keep the proposals for:
>>>>
>>>> Global Identifier as the Identifier
>>>> and
>>>> Last Updated Sequence Number as the Version
>>>>
>>>> On Tue, Aug 20, 2024 at 3:21 AM Ryan Blue <b...@databricks.com.invalid>
>>>> wrote:
>>>>
>>>>> The situation in which you would use equality deletes is when you do
>>>>> not want to read the existing table data. That seems at odds with a
>>>>> feature like row-level tracking, where you want to keep track. To me,
>>>>> it would be a reasonable solution to just say that equality deletes
>>>>> can't be used in tables where row-level tracking is enabled.
>>>>>
>>>>> On Mon, Aug 19, 2024 at 11:34 AM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> As far as I know, Flink is the only engine we have at the moment
>>>>>> that can produce equality deletes, and only equality deletes have
>>>>>> this specific problem. Since an equality delete can be written
>>>>>> without actually knowing whether rows are being updated or not, it
>>>>>> is always ambiguous whether a new row is an updated row, a brand new
>>>>>> row, or a new row that was appended after a matching row was
>>>>>> deleted.
>>>>>>
>>>>>> I think in this case we need to ignore row versioning and just give
>>>>>> every new row a brand new identifier. For a reader this means all
>>>>>> updates look like a "delete" and an "add", never an "update". For
>>>>>> other write paths (copy-on-write and position deletes) we only mark
>>>>>> records as deleted or updated after finding them first, which makes
>>>>>> it easy to take the lineage identifier from the source record and
>>>>>> change it.
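>>>>>>
>>>>>> As a toy sketch of the difference (all names below are made up for
>>>>>> illustration; nothing here is an Iceberg API or a spec name):
>>>>>>
>>>>>> import java.util.HashMap;
>>>>>> import java.util.Map;
>>>>>>
>>>>>> public class LineageSketch {
>>>>>>   record Row(long rowId, long lastUpdatedSeq, String val) {}
>>>>>>
>>>>>>   static final Map<String, Row> table = new HashMap<>();
>>>>>>   static long nextRowId = 0;
>>>>>>
>>>>>>   // Copy-on-write / position-delete style: the writer found the
>>>>>>   // old row first, so it can carry the lineage id forward and
>>>>>>   // just bump the version.
>>>>>>   static void updateAfterRead(String key, String val, long seq) {
>>>>>>     Row old = table.get(key); // the read equality deletes avoid
>>>>>>     table.put(key, new Row(old.rowId(), seq, val));
>>>>>>   }
>>>>>>
>>>>>>   // Equality-delete style: the old row is never read, so the new
>>>>>>   // row can only get a brand new identifier. Readers see a delete
>>>>>>   // plus an add, never an update.
>>>>>>   static void updateViaEqualityDelete(String key, String val,
>>>>>>                                       long seq) {
>>>>>>     table.remove(key); // delete "where key = ..." without reading
>>>>>>     table.put(key, new Row(nextRowId++, seq, val));
>>>>>>   }
>>>>>>
>>>>>>   public static void main(String[] args) {
>>>>>>     table.put("a", new Row(nextRowId++, 1, "v1"));
>>>>>>     updateAfterRead("a", "v2", 2);
>>>>>>     System.out.println(table.get("a")); // row_id 0 kept, version 2
>>>>>>     updateViaEqualityDelete("a", "v3", 3);
>>>>>>     System.out.println(table.get("a")); // new row_id: lineage lost
>>>>>>   }
>>>>>> }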
>>>>>>
>>>>>> For Spark, we just kept working on engine improvements (like SPJ and
>>>>>> dynamic partition pushdown) to make that scan and join faster, but it
>>>>>> probably still comes with somewhat higher latency.
>>>>>>
>>>>>> I think we could theoretically resolve equality deletes into updates
>>>>>> at compaction time, but only if the user first defines accurate "row
>>>>>> identity" columns, because otherwise we have no way of determining
>>>>>> whether rows were updated or not. This is basically the issue we
>>>>>> have now in the CDC procedures. Ideally, I think we need to find a
>>>>>> way to have Flink locate updated rows at runtime using some better
>>>>>> indexing structure, or something like that, as you suggested.
>>>>>>
>>>>>> On Sat, Aug 17, 2024 at 1:07 AM Péter Váry <
>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Russell,
>>>>>>>
>>>>>>> As discussed offline, this would be very hard to implement with the
>>>>>>> current Flink CDC write strategies. I think this is true for all
>>>>>>> streaming writers.
>>>>>>>
>>>>>>> To track the previous version of a row, the streaming writer would
>>>>>>> need to scan the table, and it would need to do so for every record
>>>>>>> to find the previous version. This could work if the data were
>>>>>>> stored in a way that supports fast queries on the primary key, like
>>>>>>> an LSM tree (see Paimon [1]); otherwise it would be prohibitively
>>>>>>> costly and infeasible at higher loads. So adding a new storage
>>>>>>> strategy could be one solution.
>>>>>>>
>>>>>>> Alternatively, we might find a way for compaction to update the
>>>>>>> lineage fields. We could provide a way to link the equality deletes
>>>>>>> to the new rows that updated them during the write, and then on
>>>>>>> compaction we could update the lineage fields based on this info.
>>>>>>>
>>>>>>> Are there any better ideas from Spark streaming that we could adopt?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Peter
>>>>>>>
>>>>>>> [1] - https://paimon.apache.org/docs/0.8/
>>>>>>>
>>>>>>> On Sat, Aug 17, 2024, 01:06 Russell Spitzer <
>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Y'all,
>>>>>>>>
>>>>>>>> We've been working on a new proposal to add Row Lineage to Iceberg
>>>>>>>> in the V3 spec. The general idea is to give every row a unique
>>>>>>>> identifier as well as a marker of which version of the row it is.
>>>>>>>> This should let us build a variety of features related to CDC,
>>>>>>>> incremental processing, and audit logging. If you are interested,
>>>>>>>> please check out the proposal linked below. This will require
>>>>>>>> compliance from all engines to be really useful, so it's important
>>>>>>>> that we come to consensus on whether or not this is possible.
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing
>>>>>>>>
>>>>>>>> Thank you for your consideration,
>>>>>>>> Russ
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Databricks
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>>