Thanks Russell. Not a question on the proposal itself, but I find it a bit hard to follow and maintain all three specs in one place. We are also publishing an unfinalized spec to the website. Would it be better to maintain the specs in a "copy-on-write" style, i.e. with each spec version having its own file?
Sorry to go off topic; I can start a separate thread if you think this concern is valid.

On Sat, Sep 14, 2024 at 6:33 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

> Pull request available, please focus any remaining comments there and we can wrap this one up:
>
> https://github.com/apache/iceberg/pull/11130
>
> On Thu, Aug 29, 2024 at 11:20 AM rdb...@gmail.com <rdb...@gmail.com> wrote:
>
>> +1 for making row lineage and equality deletes mutually exclusive.
>>
>> The idea behind equality deletes is to avoid needing to read existing data in order to delete records. That doesn't fit with row lineage, because the purpose of lineage is to be able to identify when a row changes by maintaining an identifier that would have to be read.
>>
>> On Wed, Aug 28, 2024 at 4:16 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>>
>>> I went through the proposal and left comments as well. Thanks for working on it, Russell!
>>>
>>> I don't see a good solution for how row lineage can work with equality deletes. If so, I would be in favor of not allowing equality deletes at all when row lineage is enabled, as opposed to treating all added data records as new. I will spend more time thinking about whether we can make it work.
>>>
>>> - Anton
>>>
>>> On Wed, Aug 28, 2024 at 12:41 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>
>>>> Sounds good to me. Thanks for pushing this forward, Russell!
>>>>
>>>> On Tue, Aug 27, 2024 at 7:17 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> I think folks have had a lot of good comments, and since there haven't been many strong opinions, I'm going to take what I think are the least interesting options and move them into the "discarded" section. Please continue to comment, and let's make sure anything folks think is a blocker for a spec PR is eliminated.
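The mutual exclusion that Ryan +1s above could be enforced with a simple writer-side check. The sketch below is purely illustrative Python (not Iceberg code); the `"row-lineage"` property name and the delete-file content labels are assumptions for the example, not the actual spec names.

```python
# Hypothetical sketch of the mutual-exclusion rule: reject equality
# deletes when row lineage is enabled on the table. Property and
# content names here are illustrative, not the real Iceberg spec names.

def validate_delete_write(table_properties: dict, delete_content: str) -> None:
    """Reject equality deletes on tables with row lineage enabled."""
    row_lineage_enabled = table_properties.get("row-lineage", "false") == "true"
    if row_lineage_enabled and delete_content == "equality-deletes":
        raise ValueError(
            "Equality deletes cannot be written to a table with row lineage "
            "enabled: they delete rows without reading them, so the lineage "
            "fields of replaced rows cannot be carried forward."
        )

# Position deletes remain allowed, since the writer has located the row:
validate_delete_write({"row-lineage": "true"}, "position-deletes")
```

A real implementation would hook this into commit validation rather than the write path, but the rule itself is this small.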
>>>>> If we have general consensus at a high level, I think we can move to discussing the actual spec changes on a spec change PR.
>>>>>
>>>>> I'm going to be keeping the proposals for:
>>>>>
>>>>> - Global Identifier as the identifier
>>>>> - Last Updated Sequence Number as the version
>>>>>
>>>>> On Tue, Aug 20, 2024 at 3:21 AM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>>>
>>>>>> The situation in which you would use equality deletes is when you do not want to read the existing table data. That seems at odds with a feature like row-level tracking, where you want to keep track. To me, it would be a reasonable solution to just say that equality deletes can't be used in tables where row-level tracking is enabled.
>>>>>>
>>>>>> On Mon, Aug 19, 2024 at 11:34 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> As far as I know, Flink is actually the only engine we have at the moment that can produce equality deletes, and only equality deletes have this specific problem. Since an equality delete can be written without actually knowing whether rows are being updated or not, it is always ambiguous whether a new row is an updated row, a newly added row, or a row that was deleted while a newly added row was also appended.
>>>>>>>
>>>>>>> I think in this case we need to ignore row versioning and just give every new row a brand new identifier. For a reader this means all updates look like a "delete" plus an "add", with no "updates". For the other approaches (copy-on-write and position deletes), we only mark records as deleted or updated after finding them first, which makes it easy to take the lineage identifier from the source record and change it.
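The contrast Russell draws above can be shown with a toy simulation. This is not Iceberg code; the `_row_id` and `_seq` fields are stand-ins for the two lineage fields kept in the proposal (a global identifier and a last-updated sequence number).

```python
# Toy simulation of why equality deletes break row lineage.
# _row_id stands in for the stable global identifier, _seq for the
# last-updated sequence number; both names are illustrative.

next_row_id = 0
sequence_number = 0

def new_row(data):
    global next_row_id
    row = {"data": data, "_row_id": next_row_id, "_seq": sequence_number}
    next_row_id += 1
    return row

def cow_update(row, data):
    # Copy-on-write / position deletes: the writer has read the old row,
    # so it carries the _row_id forward and only bumps the version.
    return {"data": data, "_row_id": row["_row_id"], "_seq": sequence_number}

def equality_delete_update(data):
    # Equality-delete upsert: the old row was never read, so the writer
    # cannot know which _row_id it replaced. The new row gets a fresh
    # identifier and the "update" degrades to a delete plus an add.
    return new_row(data)

table = [new_row("a")]
sequence_number = 1
cow = cow_update(table[0], "a2")    # same _row_id, new version
eq = equality_delete_update("a2")   # brand-new _row_id, lineage lost
```

Here `cow["_row_id"]` still matches the original row, while `eq["_row_id"]` does not, which is exactly the "treat every new row as brand new" behavior described above.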
>>>>>>> For Spark, we just kept working on engine improvements (like SPJ and dynamic partition pushdown) to make that scan-and-join faster, but we probably still pay somewhat higher latency.
>>>>>>>
>>>>>>> I think we could theoretically resolve equality deletes into updates at compaction time, but only if the user first defines accurate "row identity" columns, because otherwise we have no way of determining whether rows were updated or not. This is basically the issue we have now in the CDC procedures. Ideally, I think we need to find a way to have Flink locate updated rows at runtime using some better indexing structure or something like that, as you suggested.
>>>>>>>
>>>>>>> On Sat, Aug 17, 2024 at 1:07 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Russell,
>>>>>>>>
>>>>>>>> As discussed offline, this would be very hard to implement with the current Flink CDC write strategies. I think this is true for every streaming writer.
>>>>>>>>
>>>>>>>> For tracking the previous version of a row, the streaming writer would need to scan the table, and it would need to do so for every record to find the previous version. This could be feasible if the data were stored in a way that supports fast queries on the primary key, like an LSM tree (see: Paimon [1]); otherwise it would be prohibitively costly and unfeasible for higher loads. So adding a new storage strategy could be one solution.
>>>>>>>>
>>>>>>>> Alternatively, we might find a way for compaction to update the lineage fields. We could provide a way to link the equality deletes to the new rows that updated them during write; then on compaction we could update the lineage fields based on this info.
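The fast primary-key lookup Péter describes could let a streaming writer stamp lineage fields without scanning the table per record. The sketch below is a hypothetical in-memory stand-in for such an index (a real one would be a persistent structure like an LSM tree, as in Paimon); class and field names are invented for illustration.

```python
# Rough sketch of a lineage-aware streaming upsert writer that keeps a
# primary-key index, so each incoming record can find the previous
# version of its row instead of scanning the table. In production the
# dict would have to be a persistent, fast structure (e.g. an LSM tree).

class LineageAwareUpsertWriter:
    def __init__(self):
        self._pk_index = {}   # primary key -> (_row_id, last sequence number)
        self._next_row_id = 0

    def upsert(self, key, data, sequence_number):
        if key in self._pk_index:
            row_id, _ = self._pk_index[key]   # previous version found: keep id
        else:
            row_id = self._next_row_id        # genuinely new row: fresh id
            self._next_row_id += 1
        self._pk_index[key] = (row_id, sequence_number)
        return {"key": key, "data": data,
                "_row_id": row_id, "_seq": sequence_number}

writer = LineageAwareUpsertWriter()
r1 = writer.upsert("k1", "v1", sequence_number=1)
r2 = writer.upsert("k1", "v2", sequence_number=2)  # same _row_id, new _seq
```

The upsert of `"k1"` at sequence number 2 reuses the `_row_id` assigned at sequence number 1, which is the behavior equality deletes alone cannot provide.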
>>>>>>>>
>>>>>>>> Are there any better ideas from Spark streaming that we can adopt?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> [1] - https://paimon.apache.org/docs/0.8/
>>>>>>>>
>>>>>>>> On Sat, Aug 17, 2024, 01:06 Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Y'all,
>>>>>>>>>
>>>>>>>>> We've been working on a new proposal to add row lineage to Iceberg in the V3 spec. The general idea is to give every row a unique identifier as well as a marker of what version of the row it is. This should let us build a variety of features related to CDC, incremental processing, and audit logging. If you are interested, please check out the linked proposal below. This will require compliance from all engines to be really useful, so it's important we come to consensus on whether or not this is possible.
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing
>>>>>>>>>
>>>>>>>>> Thank you for your consideration,
>>>>>>>>> Russ
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Databricks
>>>>
>>>> --
>>>> Ryan Blue
>>>> Databricks