Sounds good to me. Thanks for pushing this forward, Russell!

On Tue, Aug 27, 2024 at 7:17 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
> I think folks have had a lot of good comments, and since there haven't
> been a lot of strong opinions, I'm going to take what I think are the
> least interesting options and move them into the "discarded" section.
> Please continue to comment, and let's please make sure anything folks
> think is a blocker for a spec PR gets resolved. If we have general
> consensus at a high level, I think we can move to discussing the actual
> spec changes on a spec change PR.
>
> I'm going to be keeping the proposals for:
>
> Global Identifier as the Identifier
> and
> Last Updated Sequence Number as the Version
>
>
> On Tue, Aug 20, 2024 at 3:21 AM Ryan Blue <b...@databricks.com.invalid> wrote:
>
>> The situation in which you would use equality deletes is when you do not
>> want to read the existing table data. That seems at odds with a feature
>> like row-level tracking, where you want to keep track. To me, a
>> reasonable solution would be to say that equality deletes can't be used
>> in tables where row-level tracking is enabled.
>>
>> On Mon, Aug 19, 2024 at 11:34 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> As far as I know, Flink is the only engine we have at the moment that
>>> can produce equality deletes, and only equality deletes have this
>>> specific problem. Since an equality delete can be written without
>>> actually knowing whether rows are being updated, it is always ambiguous
>>> whether a new row is an updated row, a newly added row, or a new row
>>> that replaced one that was deleted.
>>>
>>> I think in this case we need to ignore row versioning and just give
>>> every new row a brand-new identifier. For a reader this means every
>>> update looks like a "delete" plus an "add", never an "update". For the
>>> other write paths (copy-on-write and position deletes) we only mark
>>> records as deleted or updated after finding them first, which makes it
>>> easy to take the lineage identifier from the source record and carry it
>>> forward. For Spark, we just kept working on engine improvements (like
>>> SPJ and dynamic partition pruning) to make that scan-and-join faster,
>>> but it still comes at the cost of some latency.
>>>
>>> I think we could theoretically resolve equality deletes into updates at
>>> compaction time, but only if the user first defines accurate "row
>>> identity" columns, because otherwise we have no way of determining
>>> whether rows were updated or not. This is basically the issue we have
>>> now in the CDC procedures. Ideally, I think we need to find a way to
>>> have Flink locate updated rows at runtime using some better indexing
>>> structure, as you suggested.
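>>>
>>> To make that concrete, here's a rough sketch of the assignment rule I'm
>>> describing. The names (RowLineage, nextRowId, onWrite) are made up for
>>> illustration; they aren't spec fields or Iceberg APIs:
>>>
>>>   class LineageAssigner {
>>>     // Identifier + version, per the two proposals being kept above.
>>>     record RowLineage(long rowId, long lastUpdatedSeq) {}
>>>
>>>     private long nextRowId = 0; // stand-in for a real id allocator
>>>
>>>     RowLineage onWrite(RowLineage source, long commitSeq) {
>>>       if (source == null) {
>>>         // Equality-delete path: we never located the prior row, so the
>>>         // written row gets a brand-new identifier. Readers see a
>>>         // "delete" plus an "add".
>>>         return new RowLineage(nextRowId++, commitSeq);
>>>       }
>>>       // Copy-on-write / position-delete path: we found the source
>>>       // record first, so carry its identifier forward and just bump
>>>       // the version to this commit.
>>>       return new RowLineage(source.rowId(), commitSeq);
>>>     }
>>>   }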
>>>
>>> On Sat, Aug 17, 2024 at 1:07 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> Hi Russell,
>>>>
>>>> As discussed offline, this would be very hard to implement with the
>>>> current Flink CDC write strategies, and I think the same is true for
>>>> every streaming writer.
>>>>
>>>> To track the previous version of a row, the streaming writer would
>>>> need to scan the table, and it would need to do so for every record to
>>>> find that record's previous version. This could be possible if the
>>>> data were stored in a way that supports fast queries on the primary
>>>> key, like an LSM tree (see: Paimon [1]); otherwise it would be
>>>> prohibitively costly and unfeasible for higher loads. So adding a new
>>>> storage strategy could be one solution.
>>>>
>>>> Alternatively, we might find a way for compaction to update the
>>>> lineage fields. We could provide a way to link the equality deletes to
>>>> the new rows that updated them during the write, and then at
>>>> compaction time we could update the lineage fields based on this
>>>> information. (A rough sketch of this idea follows the thread below.)
>>>>
>>>> Are there any better ideas from Spark streaming that we could adopt?
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> [1] - https://paimon.apache.org/docs/0.8/
>>>>
>>>> On Sat, Aug 17, 2024, 01:06 Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> Hi Y'all,
>>>>>
>>>>> We've been working on a new proposal to add Row Lineage to Iceberg in
>>>>> the V3 spec. The general idea is to give every row a unique
>>>>> identifier as well as a marker of which version of the row it is.
>>>>> This should let us build a variety of features related to CDC,
>>>>> incremental processing, and audit logging. If you are interested,
>>>>> please check out the linked proposal below. This will require
>>>>> compliance from all engines to be really useful, so it's important
>>>>> that we come to consensus on whether or not this is possible.
>>>>>
>>>>> https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing
>>>>>
>>>>> Thank you for your consideration,
>>>>> Russ
>>>>
>>
>> --
>> Ryan Blue
>> Databricks
>

-- 
Ryan Blue
Databricks
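[Editor's note: to illustrate Péter's compaction idea above, here is a
minimal sketch of how compaction could stitch lineage back together once
equality deletes are linked to the rows that replaced them. All names
(IdentityKey, Lineage, repairLineage) are hypothetical, not Iceberg APIs.]

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  class LineageRepair {
    // The user-defined "row identity" columns an equality delete matches on.
    record IdentityKey(List<Object> values) {}

    // The two lineage fields from the proposal: identifier + version.
    record Lineage(long rowId, long lastUpdatedSeq) {}

    // If a newly written row shares an identity key with an
    // equality-deleted row, treat it as an update: keep the old row id
    // and bump the version to the compaction's sequence number.
    // Otherwise the row is genuinely new and keeps its fresh identifier.
    static Map<IdentityKey, Lineage> repairLineage(
        Map<IdentityKey, Lineage> equalityDeleted,
        Map<IdentityKey, Lineage> newlyWritten,
        long compactionSeq) {
      Map<IdentityKey, Lineage> repaired = new HashMap<>();
      newlyWritten.forEach((key, fresh) -> {
        Lineage prior = equalityDeleted.get(key);
        repaired.put(key, prior == null
            ? fresh
            : new Lineage(prior.rowId(), compactionSeq));
      });
      return repaired;
    }
  }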