Pull request is available; please focus any remaining comments there and we
can wrap this one up:

https://github.com/apache/iceberg/pull/11130
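
If we go with making row lineage and equality deletes mutually exclusive, as
discussed below, the write-side guard would look roughly like the following.
This is only a sketch: the property key, class, and method names here are
made up for illustration, and the real spec wording is in the PR.

import java.util.Map;

public class RowLineageGuard {
  // Hypothetical property key; the real flag is defined in the spec PR.
  static final String ROW_LINEAGE = "row-lineage";

  static void validateWrite(Map<String, String> tableProperties,
                            boolean writeHasEqualityDeletes) {
    boolean lineageEnabled = Boolean.parseBoolean(
        tableProperties.getOrDefault(ROW_LINEAGE, "false"));
    if (lineageEnabled && writeHasEqualityDeletes) {
      // An equality delete never reads the row it removes, so there is
      // no way to carry its lineage id forward; reject the write.
      throw new IllegalArgumentException(
          "Equality deletes cannot be committed to a table with row lineage enabled");
    }
  }

  public static void main(String[] args) {
    validateWrite(Map.of(ROW_LINEAGE, "true"), false); // plain append: ok
    validateWrite(Map.of(ROW_LINEAGE, "true"), true);  // throws
  }
}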

On Thu, Aug 29, 2024 at 11:20 AM rdb...@gmail.com <rdb...@gmail.com> wrote:

> +1 for making row lineage and equality deletes mutually exclusive.
>
> The idea behind equality deletes is to avoid needing to read existing data
> in order to delete records. That doesn't fit with row lineage, because the
> purpose of lineage is to be able to identify when a row changes by
> maintaining an identifier that would have to be read.
>
> On Wed, Aug 28, 2024 at 4:16 PM Anton Okolnychyi <aokolnyc...@gmail.com>
> wrote:
>
>> I went through the proposal and left comments as well. Thanks for working
>> on it, Russell!
>>
>> I don't see a good solution to how row lineage can work with equality
>> deletes. In that case, I would be in favor of not allowing equality
>> deletes at all when row lineage is enabled, as opposed to treating all
>> added data records as new. I will spend more time thinking about whether
>> we can make it work.
>>
>> - Anton
>>
>> On Wed, Aug 28, 2024 at 12:41 Ryan Blue <b...@databricks.com.invalid>
>> wrote:
>>
>>> Sounds good to me. Thanks for pushing this forward, Russell!
>>>
>>> On Tue, Aug 27, 2024 at 7:17 PM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> Folks have had a lot of good comments, and since there haven't been
>>>> many strong opinions, I'm going to take what I think are the least
>>>> interesting options and move them into the "discarded" section. Please
>>>> continue to comment, and let's make sure anything folks consider a
>>>> blocker for a spec PR is resolved. If we have general consensus at a
>>>> high level, I think we can move to discussing the actual spec changes
>>>> on a spec change PR.
>>>>
>>>> I'm going to keep the proposals for:
>>>>
>>>> Global Identifier as the Identifier
>>>> and
>>>> Last Updated Sequence Number as the Version
>>>>
>>>> On Tue, Aug 20, 2024 at 3:21 AM Ryan Blue <b...@databricks.com.invalid>
>>>> wrote:
>>>>
>>>>> The situation in which you would use equality deletes is when you do
>>>>> not want to read the existing table data. That seems at odds with a
>>>>> feature like row-level tracking, where you want to keep track. To me,
>>>>> it would be a reasonable solution to just say that equality deletes
>>>>> can't be used in tables where row-level tracking is enabled.
>>>>>
>>>>> On Mon, Aug 19, 2024 at 11:34 AM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> As far as I know, Flink is the only engine we have at the moment
>>>>>> that can produce equality deletes, and only equality deletes have
>>>>>> this specific problem. Since an equality delete can be written
>>>>>> without actually knowing whether rows are being updated or not, it
>>>>>> is always ambiguous whether a new row is an updated row, a brand new
>>>>>> row, or a new row that was appended after a matching row was
>>>>>> deleted.
>>>>>>
>>>>>> I think in this case we need to ignore row versioning and just give
>>>>>> every new row a brand new identifier. For a reader this means all
>>>>>> updates look like a "delete" and an "add", never an "update". For
>>>>>> other write paths (copy-on-write and position deletes) we only mark
>>>>>> records as deleted or updated after finding them first, which makes
>>>>>> it easy to take the lineage identifier from the source record and
>>>>>> change it.
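>>>>>>
>>>>>> As a toy sketch of the difference (all names below are made up for
>>>>>> illustration; nothing here is an Iceberg API or a spec name):
>>>>>>
>>>>>> import java.util.HashMap;
>>>>>> import java.util.Map;
>>>>>>
>>>>>> public class LineageSketch {
>>>>>>   record Row(long rowId, long lastUpdatedSeq, String val) {}
>>>>>>
>>>>>>   static final Map<String, Row> table = new HashMap<>();
>>>>>>   static long nextRowId = 0;
>>>>>>
>>>>>>   // Copy-on-write / position-delete style: the writer found the
>>>>>>   // old row first, so it can carry the lineage id forward and
>>>>>>   // just bump the version.
>>>>>>   static void updateAfterRead(String key, String val, long seq) {
>>>>>>     Row old = table.get(key); // the read equality deletes avoid
>>>>>>     table.put(key, new Row(old.rowId(), seq, val));
>>>>>>   }
>>>>>>
>>>>>>   // Equality-delete style: the old row is never read, so the new
>>>>>>   // row can only get a brand new identifier. Readers see a delete
>>>>>>   // plus an add, never an update.
>>>>>>   static void updateViaEqualityDelete(String key, String val,
>>>>>>                                       long seq) {
>>>>>>     table.remove(key); // delete "where key = ..." without reading
>>>>>>     table.put(key, new Row(nextRowId++, seq, val));
>>>>>>   }
>>>>>>
>>>>>>   public static void main(String[] args) {
>>>>>>     table.put("a", new Row(nextRowId++, 1, "v1"));
>>>>>>     updateAfterRead("a", "v2", 2);
>>>>>>     System.out.println(table.get("a")); // row_id 0 kept, version 2
>>>>>>     updateViaEqualityDelete("a", "v3", 3);
>>>>>>     System.out.println(table.get("a")); // new row_id: lineage lost
>>>>>>   }
>>>>>> }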
>>>>>>
>>>>>> For Spark, we just kept working on engine improvements (like SPJ and
>>>>>> dynamic partition pushdown) to make that scan and join faster, but it
>>>>>> probably still comes with somewhat higher latency.
>>>>>>
>>>>>> I think we could theoretically resolve equality deletes into updates
>>>>>> at compaction time, but only if the user first defines accurate "row
>>>>>> identity" columns, because otherwise we have no way of determining
>>>>>> whether rows were updated or not. This is basically the issue we
>>>>>> have now in the CDC procedures. Ideally, I think we need to find a
>>>>>> way to have Flink locate updated rows at runtime using some better
>>>>>> indexing structure, or something like that, as you suggested.
>>>>>>
>>>>>> On Sat, Aug 17, 2024 at 1:07 AM Péter Váry <
>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Russell,
>>>>>>>
>>>>>>> As discussed offline, this would be very hard to implement with the
>>>>>>> current Flink CDC write strategies. I think this is true for all
>>>>>>> streaming writers.
>>>>>>>
>>>>>>> To track the previous version of a row, the streaming writer would
>>>>>>> need to scan the table, and it would need to do so for every record
>>>>>>> to find the previous version. This could work if the data were
>>>>>>> stored in a way that supports fast queries on the primary key, like
>>>>>>> an LSM tree (see Paimon [1]); otherwise it would be prohibitively
>>>>>>> costly and infeasible at higher loads. So adding a new storage
>>>>>>> strategy could be one solution.
>>>>>>>
>>>>>>> Alternatively, we might find a way for compaction to update the
>>>>>>> lineage fields. We could provide a way to link the equality deletes
>>>>>>> to the new rows that updated them during the write, and then on
>>>>>>> compaction we could update the lineage fields based on this info.
>>>>>>>
>>>>>>> Are there any better ideas from Spark streaming that we could adopt?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Peter
>>>>>>>
>>>>>>> [1] - https://paimon.apache.org/docs/0.8/
>>>>>>>
>>>>>>> On Sat, Aug 17, 2024, 01:06 Russell Spitzer <
>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Y'all,
>>>>>>>>
>>>>>>>> We've been working on a new proposal to add Row Lineage to Iceberg
>>>>>>>> in the V3 spec. The general idea is to give every row a unique
>>>>>>>> identifier as well as a marker of which version of the row it is.
>>>>>>>> This should let us build a variety of features related to CDC,
>>>>>>>> incremental processing, and audit logging. If you are interested,
>>>>>>>> please check out the proposal linked below. This will require
>>>>>>>> compliance from all engines to be really useful, so it's important
>>>>>>>> that we come to consensus on whether or not this is possible.
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing
>>>>>>>>
>>>>>>>> Thank you for your consideration,
>>>>>>>> Russ
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Databricks
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>>