Sounds good to me. Thanks for pushing this forward, Russell!

On Tue, Aug 27, 2024 at 7:17 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
> I think folks have had a lot of good comments, and since there haven't
> been a lot of strong opinions, I'm going to take what I think are the
> least interesting options and move them into the "discarded" section.
> Please continue to comment, and let's please make sure anything folks
> think is a blocker for a spec PR gets resolved. If we have general
> consensus at a high level, I think we can move to discussing the actual
> spec changes on a spec change PR.
>
> I'm going to be keeping the proposals for:
>
> Global Identifier as the Identifier
> and
> Last Updated Sequence Number as the Version
>
>
> On Tue, Aug 20, 2024 at 3:21 AM Ryan Blue <b...@databricks.com.invalid> wrote:
>
>> The situation in which you would use equality deletes is when you do not
>> want to read the existing table data. That seems at odds with a feature
>> like row-level tracking, where you want to keep track. To me, a
>> reasonable solution would be to say that equality deletes can't be used
>> in tables where row-level tracking is enabled.
>>
>> On Mon, Aug 19, 2024 at 11:34 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> As far as I know, Flink is the only engine we have at the moment that
>>> can produce equality deletes, and only equality deletes have this
>>> specific problem. Since an equality delete can be written without
>>> actually knowing whether rows are being updated, it is always ambiguous
>>> whether a new row is an updated row, a newly added row, or a new row
>>> that replaced one that was deleted.
>>>
>>> I think in this case we need to ignore row versioning and just give
>>> every new row a brand-new identifier. For a reader this means every
>>> update looks like a "delete" plus an "add", never an "update". For the
>>> other write paths (copy-on-write and position deletes) we only mark
>>> records as deleted or updated after finding them first, which makes it
>>> easy to take the lineage identifier from the source record and carry it
>>> forward. For Spark, we just kept working on engine improvements (like
>>> SPJ and dynamic partition pruning) to make that scan-and-join faster,
>>> but it still comes at the cost of some latency.
>>>
>>> I think we could theoretically resolve equality deletes into updates at
>>> compaction time, but only if the user first defines accurate "row
>>> identity" columns, because otherwise we have no way of determining
>>> whether rows were updated or not. This is basically the issue we have
>>> now in the CDC procedures. Ideally, I think we need to find a way to
>>> have Flink locate updated rows at runtime using some better indexing
>>> structure, as you suggested.
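>>>
>>> To make that concrete, here's a rough sketch of the assignment rule I'm
>>> describing. The names (RowLineage, nextRowId, onWrite) are made up for
>>> illustration; they aren't spec fields or Iceberg APIs:
>>>
>>>   class LineageAssigner {
>>>     // Identifier + version, per the two proposals being kept above.
>>>     record RowLineage(long rowId, long lastUpdatedSeq) {}
>>>
>>>     private long nextRowId = 0; // stand-in for a real id allocator
>>>
>>>     RowLineage onWrite(RowLineage source, long commitSeq) {
>>>       if (source == null) {
>>>         // Equality-delete path: we never located the prior row, so the
>>>         // written row gets a brand-new identifier. Readers see a
>>>         // "delete" plus an "add".
>>>         return new RowLineage(nextRowId++, commitSeq);
>>>       }
>>>       // Copy-on-write / position-delete path: we found the source
>>>       // record first, so carry its identifier forward and just bump
>>>       // the version to this commit.
>>>       return new RowLineage(source.rowId(), commitSeq);
>>>     }
>>>   }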
>>>
>>> On Sat, Aug 17, 2024 at 1:07 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> Hi Russell,
>>>>
>>>> As discussed offline, this would be very hard to implement with the
>>>> current Flink CDC write strategies, and I think the same is true for
>>>> every streaming writer.
>>>>
>>>> To track the previous version of a row, the streaming writer would
>>>> need to scan the table, and it would need to do so for every record to
>>>> find that record's previous version. This could be possible if the
>>>> data were stored in a way that supports fast queries on the primary
>>>> key, like an LSM tree (see: Paimon [1]); otherwise it would be
>>>> prohibitively costly and unfeasible for higher loads. So adding a new
>>>> storage strategy could be one solution.
>>>>
>>>> Alternatively, we might find a way for compaction to update the
>>>> lineage fields. We could provide a way to link the equality deletes to
>>>> the new rows that updated them during the write, and then at
>>>> compaction time we could update the lineage fields based on this
>>>> information. (A rough sketch of this idea follows the thread below.)
>>>>
>>>> Are there any better ideas from Spark streaming that we could adopt?
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> [1] - https://paimon.apache.org/docs/0.8/
>>>>
>>>> On Sat, Aug 17, 2024, 01:06 Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> Hi Y'all,
>>>>>
>>>>> We've been working on a new proposal to add Row Lineage to Iceberg in
>>>>> the V3 spec. The general idea is to give every row a unique
>>>>> identifier as well as a marker of which version of the row it is.
>>>>> This should let us build a variety of features related to CDC,
>>>>> incremental processing, and audit logging. If you are interested,
>>>>> please check out the linked proposal below. This will require
>>>>> compliance from all engines to be really useful, so it's important
>>>>> that we come to consensus on whether or not this is possible.
>>>>>
>>>>> https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing
>>>>>
>>>>> Thank you for your consideration,
>>>>> Russ
>>>>
>>
>> --
>> Ryan Blue
>> Databricks
>

-- 
Ryan Blue
Databricks
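[Editor's note: to illustrate Péter's compaction idea above, here is a
minimal sketch of how compaction could stitch lineage back together once
equality deletes are linked to the rows that replaced them. All names
(IdentityKey, Lineage, repairLineage) are hypothetical, not Iceberg APIs.]

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  class LineageRepair {
    // The user-defined "row identity" columns an equality delete matches on.
    record IdentityKey(List<Object> values) {}

    // The two lineage fields from the proposal: identifier + version.
    record Lineage(long rowId, long lastUpdatedSeq) {}

    // If a newly written row shares an identity key with an
    // equality-deleted row, treat it as an update: keep the old row id
    // and bump the version to the compaction's sequence number.
    // Otherwise the row is genuinely new and keeps its fresh identifier.
    static Map<IdentityKey, Lineage> repairLineage(
        Map<IdentityKey, Lineage> equalityDeleted,
        Map<IdentityKey, Lineage> newlyWritten,
        long compactionSeq) {
      Map<IdentityKey, Lineage> repaired = new HashMap<>();
      newlyWritten.forEach((key, fresh) -> {
        Lineage prior = equalityDeleted.get(key);
        repaired.put(key, prior == null
            ? fresh
            : new Lineage(prior.rowId(), compactionSeq));
      });
      return repaired;
    }
  }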