Thanks Russell. Not a question on the proposal itself, but I find it a bit hard to follow and maintain all three specs in one place. We are also publishing an unfinalized spec to the website. Would it be better to maintain the specs in a "copy-on-write" style, i.e. with each spec version having its own file?
Sorry to go off topic; I can start a separate thread if you think this concern is valid.

On Sat, Sep 14, 2024 at 6:33 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

> Pull request available, please focus any remaining comments there and we can wrap this one up:
>
> https://github.com/apache/iceberg/pull/11130
>
> On Thu, Aug 29, 2024 at 11:20 AM rdb...@gmail.com <rdb...@gmail.com> wrote:
>
>> +1 for making row lineage and equality deletes mutually exclusive.
>>
>> The idea behind equality deletes is to avoid needing to read existing data in order to delete records. That doesn't fit with row lineage, because the purpose of lineage is to be able to identify when a row changes by maintaining an identifier that would have to be read.
>>
>> On Wed, Aug 28, 2024 at 4:16 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>>
>>> I went through the proposal and left comments as well. Thanks for working on it, Russell!
>>>
>>> I don't see a good solution for how row lineage can work with equality deletes. If so, I would be in favor of not allowing equality deletes at all when row lineage is enabled, as opposed to treating all added data records as new. I will spend more time thinking about whether we can make it work.
>>>
>>> - Anton
>>>
>>> On Wed, Aug 28, 2024 at 12:41 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>
>>>> Sounds good to me. Thanks for pushing this forward, Russell!
>>>>
>>>> On Tue, Aug 27, 2024 at 7:17 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> I think folks have had a lot of good comments, and since there haven't been many strong opinions, I'm going to take what I think are the least interesting options and move them into the "discarded" section. Please continue to comment, and let's make sure anything folks think is a blocker for a spec PR is eliminated.
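The mutual exclusion that Ryan +1s above could be enforced with a simple writer-side check. The sketch below is purely illustrative Python (not Iceberg code); the `"row-lineage"` property name and the delete-file content labels are assumptions for the example, not the actual spec names.

```python
# Hypothetical sketch of the mutual-exclusion rule: reject equality
# deletes when row lineage is enabled on the table. Property and
# content names here are illustrative, not the real Iceberg spec names.

def validate_delete_write(table_properties: dict, delete_content: str) -> None:
    """Reject equality deletes on tables with row lineage enabled."""
    row_lineage_enabled = table_properties.get("row-lineage", "false") == "true"
    if row_lineage_enabled and delete_content == "equality-deletes":
        raise ValueError(
            "Equality deletes cannot be written to a table with row lineage "
            "enabled: they delete rows without reading them, so the lineage "
            "fields of replaced rows cannot be carried forward."
        )

# Position deletes remain allowed, since the writer has located the row:
validate_delete_write({"row-lineage": "true"}, "position-deletes")
```

A real implementation would hook this into commit validation rather than the write path, but the rule itself is this small.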
>>>>> If we have general consensus at a high level, I think we can move to discussing the actual spec changes on a spec change PR.
>>>>>
>>>>> I'm going to be keeping the proposals for:
>>>>>
>>>>> - Global Identifier as the identifier
>>>>> - Last Updated Sequence Number as the version
>>>>>
>>>>> On Tue, Aug 20, 2024 at 3:21 AM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>>>
>>>>>> The situation in which you would use equality deletes is when you do not want to read the existing table data. That seems at odds with a feature like row-level tracking, where you want to keep track. To me, it would be a reasonable solution to just say that equality deletes can't be used in tables where row-level tracking is enabled.
>>>>>>
>>>>>> On Mon, Aug 19, 2024 at 11:34 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> As far as I know, Flink is actually the only engine we have at the moment that can produce equality deletes, and only equality deletes have this specific problem. Since an equality delete can be written without actually knowing whether rows are being updated or not, it is always ambiguous whether a new row is an updated row, a newly added row, or a row that was deleted while a newly added row was also appended.
>>>>>>>
>>>>>>> I think in this case we need to ignore row versioning and just give every new row a brand new identifier. For a reader this means all updates look like a "delete" plus an "add", with no "updates". For the other approaches (copy-on-write and position deletes), we only mark records as deleted or updated after finding them first, which makes it easy to take the lineage identifier from the source record and change it.
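The contrast Russell draws above can be shown with a toy simulation. This is not Iceberg code; the `_row_id` and `_seq` fields are stand-ins for the two lineage fields kept in the proposal (a global identifier and a last-updated sequence number).

```python
# Toy simulation of why equality deletes break row lineage.
# _row_id stands in for the stable global identifier, _seq for the
# last-updated sequence number; both names are illustrative.

next_row_id = 0
sequence_number = 0

def new_row(data):
    global next_row_id
    row = {"data": data, "_row_id": next_row_id, "_seq": sequence_number}
    next_row_id += 1
    return row

def cow_update(row, data):
    # Copy-on-write / position deletes: the writer has read the old row,
    # so it carries the _row_id forward and only bumps the version.
    return {"data": data, "_row_id": row["_row_id"], "_seq": sequence_number}

def equality_delete_update(data):
    # Equality-delete upsert: the old row was never read, so the writer
    # cannot know which _row_id it replaced. The new row gets a fresh
    # identifier and the "update" degrades to a delete plus an add.
    return new_row(data)

table = [new_row("a")]
sequence_number = 1
cow = cow_update(table[0], "a2")    # same _row_id, new version
eq = equality_delete_update("a2")   # brand-new _row_id, lineage lost
```

Here `cow["_row_id"]` still matches the original row, while `eq["_row_id"]` does not, which is exactly the "treat every new row as brand new" behavior described above.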
>>>>>>> For Spark, we just kept working on engine improvements (like SPJ and dynamic partition pushdown) to make that scan-and-join faster, but we probably still pay somewhat higher latency.
>>>>>>>
>>>>>>> I think we could theoretically resolve equality deletes into updates at compaction time, but only if the user first defines accurate "row identity" columns, because otherwise we have no way of determining whether rows were updated or not. This is basically the issue we have now in the CDC procedures. Ideally, I think we need to find a way to have Flink locate updated rows at runtime using some better indexing structure or something like that, as you suggested.
>>>>>>>
>>>>>>> On Sat, Aug 17, 2024 at 1:07 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Russell,
>>>>>>>>
>>>>>>>> As discussed offline, this would be very hard to implement with the current Flink CDC write strategies. I think this is true for every streaming writer.
>>>>>>>>
>>>>>>>> For tracking the previous version of a row, the streaming writer would need to scan the table, and it would need to do so for every record to find the previous version. This could be feasible if the data were stored in a way that supports fast queries on the primary key, like an LSM tree (see: Paimon [1]); otherwise it would be prohibitively costly and unfeasible for higher loads. So adding a new storage strategy could be one solution.
>>>>>>>>
>>>>>>>> Alternatively, we might find a way for compaction to update the lineage fields. We could provide a way to link the equality deletes to the new rows that updated them during write; then on compaction we could update the lineage fields based on this info.
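The fast primary-key lookup Péter describes could let a streaming writer stamp lineage fields without scanning the table per record. The sketch below is a hypothetical in-memory stand-in for such an index (a real one would be a persistent structure like an LSM tree, as in Paimon); class and field names are invented for illustration.

```python
# Rough sketch of a lineage-aware streaming upsert writer that keeps a
# primary-key index, so each incoming record can find the previous
# version of its row instead of scanning the table. In production the
# dict would have to be a persistent, fast structure (e.g. an LSM tree).

class LineageAwareUpsertWriter:
    def __init__(self):
        self._pk_index = {}   # primary key -> (_row_id, last sequence number)
        self._next_row_id = 0

    def upsert(self, key, data, sequence_number):
        if key in self._pk_index:
            row_id, _ = self._pk_index[key]   # previous version found: keep id
        else:
            row_id = self._next_row_id        # genuinely new row: fresh id
            self._next_row_id += 1
        self._pk_index[key] = (row_id, sequence_number)
        return {"key": key, "data": data,
                "_row_id": row_id, "_seq": sequence_number}

writer = LineageAwareUpsertWriter()
r1 = writer.upsert("k1", "v1", sequence_number=1)
r2 = writer.upsert("k1", "v2", sequence_number=2)  # same _row_id, new _seq
```

The upsert of `"k1"` at sequence number 2 reuses the `_row_id` assigned at sequence number 1, which is the behavior equality deletes alone cannot provide.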
>>>>>>>>
>>>>>>>> Are there any better ideas from Spark streaming that we can adopt?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> [1] - https://paimon.apache.org/docs/0.8/
>>>>>>>>
>>>>>>>> On Sat, Aug 17, 2024, 01:06 Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Y'all,
>>>>>>>>>
>>>>>>>>> We've been working on a new proposal to add row lineage to Iceberg in the V3 spec. The general idea is to give every row a unique identifier as well as a marker of what version of the row it is. This should let us build a variety of features related to CDC, incremental processing, and audit logging. If you are interested, please check out the linked proposal below. This will require compliance from all engines to be really useful, so it's important we come to consensus on whether or not this is possible.
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing
>>>>>>>>>
>>>>>>>>> Thank you for your consideration,
>>>>>>>>> Russ
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Databricks
>>>>
>>>> --
>>>> Ryan Blue
>>>> Databricks