Re: [Discussion] Spec change for Row Lineage - Allow Equality Deletes

Russell Spitzer Wed, 12 Feb 2025 10:06:09 -0800

I'm not sure I follow how one could figure out the equality delete row ID
after the fact. Won't I need to use some other primary key identifier and
do a shuffle join to line it up
with existing records?


On Wed, Feb 12, 2025 at 8:57 AM Péter Váry <[email protected]>
wrote:

> In Flink there are 2 types of CDC streams:
>
>    - Upsert stream - in this case the sink receives only -D (delete), +I
>    (insert) records - In this case we can't differentiate
>    - Retract stream - in this case the sink receives -D (delete), +I
>    (insert), -U (removed by update), +U (added by update) records - In this
>    case we know if a record is updated, so we can assume that there were a
>    previous version for the record
>
> I'm not sure about Kafka though
>
> Steven Wu <[email protected]> ezt írta (időpont: 2025. febr. 12., Sze,
> 15:42):
>
>> > If we want to keep the equality deletes, we might add a marker to the
>> updated row (for example rowId=-1) that this is an update. The reader and
>> the compaction could calculate the correct value.
>>
>> Peter, this is probably not going to work with equality deletes. The
>> writer doesn't really know if it is an insert or update, since it didn't
>> really check if existing data files have the same row key (equality
>> fields). Otherwise, writers can produce position deletes already. Equality
>> deletes are cheap for writers, because there are no need to do any
>> check/join/correlation at all. Equality deletes just blindly claim deletion
>> of any previous rows ( *if there are any*) marked by the equality key.
>>
>>
>> On Tue, Feb 11, 2025 at 10:11 PM Péter Váry <[email protected]>
>> wrote:
>>
>>> Hi Russell,
>>>
>>> Thanks for bringing this up!
>>> I think equality deletes are not the root of the problem here.
>>>
>>> - If we have a positional delete, and the new row doesn't include the
>>> old rowId, then the lineage info is lost.
>>> - If we have an equality delete, and the new row contains the rowId,
>>> then we have the lineage info
>>>
>>> That said, it is unlikely to have rowId for the new rows for writers
>>> (Flink, Kafka Connect) who are currently using equality deletes for
>>> updates, because of the exact reason they're using equality deletes (has to
>>> defer the lookup for later). Maybe we can add checks for these engines to
>>> warn when they try to write a table which has row lineage enabled.
>>>
>>> If we want to keep the equality deletes, we might add a marker to the
>>> updated row (for example rowId=-1) that this is an update. The reader and
>>> the compaction could calculate the correct value. That said, I think it
>>> would be better to invest in finding a replacement for equality deletes
>>> than investing more in it. Maybe adding Iceberg indexes which would allow
>>> fast lookup and conversion at commit time, or push those indexes to the
>>> file format level?
>>>
>>> Thanks, Peter
>>>
>>>
>>> On Wed, Feb 12, 2025, 06:38 Gang Wu <[email protected]> wrote:
>>>
>>>> Thanks Steven for the explanation! Yes, you're right that solely
>>>> rewriting delete files does not help.
>>>>
>>>> IIUC, Iceberg is the only table format that does not produce changelog
>>>> files. Is there any chance to recompute the row_id of updated rows by
>>>> tracking changes of the identifier fields between snapshots during the
>>>> rewrite and produce changelog files?
>>>>
>>>> I'm not asking to add changelogs support since it is a large design
>>>> choice. Just want to brainstorm it.
>>>>
>>>> On Wed, Feb 12, 2025 at 11:55 AM Steven Wu <[email protected]>
>>>> wrote:
>>>>
>>>>> I am fine with the proposed spec change. While it "supports/allows"
>>>>> equality deletes, row lineage semantics needn't/can't be maintained
>>>>> properly for equality deletes (compared to position deletes). Gang pointed
>>>>> out a couple issues with the implications. But we have no choice but to
>>>>> live with those implications due to how equality deletes behave.
>>>>>
>>>>> Gang, rewriting equality deletes to position deletes doesn't really
>>>>> help in this case. To have correct lineage, the row update is supposed to
>>>>> have the row_id carried over from the previous row (equality deleted row)
>>>>> during the write phase with equality deletes. Instead, this spec change 
>>>>> now
>>>>> says the updated row is a complete new row with new row_id.
>>>>>
>>>>> On Tue, Feb 11, 2025 at 7:39 PM Gang Wu <[email protected]> wrote:
>>>>>
>>>>>> Hi Russell,
>>>>>>
>>>>>> Thanks for supporting equality deletes to row lineage!
>>>>>>
>>>>>> > accept that "updates" will be treated as "delete" and "insert"
>>>>>>
>>>>>> I would say that it has obvious drawbacks below (though it is better
>>>>>> than not supported):
>>>>>> 1) updates will be populated differently when outputting changelogs
>>>>>> to users or downstream databases
>>>>>> 2) lead to more computation for incremental processing like
>>>>>> refreshing materialized views
>>>>>>
>>>>>> At the same time, I would like to ask if it would help if we support
>>>>>> rewriting equality deletes to position deletes.
>>>>>> There was an effort but it has been closed:
>>>>>> https://github.com/apache/iceberg/pull/2216
>>>>>>
>>>>>> Best,
>>>>>> Gang
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 12, 2025 at 7:25 AM Russell Spitzer <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi Y'all,
>>>>>>>
>>>>>>> As we have been working on the row lineage implementation I've been
>>>>>>> reached out to by a few folks
>>>>>>>  in the community who are interested in changing our defined
>>>>>>> behavior around equality deletes.
>>>>>>>
>>>>>>> Currently when Row Lineage is enabled, the spec says to disable
>>>>>>> equality deletes for the table.
>>>>>>>
>>>>>>> In the interest of compatibility with Flink and other Equality
>>>>>>> delete producers, I originally wrote
>>>>>>> that we would simply treat all equality delete based updates as a
>>>>>>> pure insert and
>>>>>>> delete. At the time, some folks thought this was too open and
>>>>>>> worried that it would be poor behavior which
>>>>>>> led to the current restriction.
>>>>>>>
>>>>>>> Now that we are actually implementing I think there have been some
>>>>>>> changes of heart and that we
>>>>>>> should go back to the original design.  I'd like to see if we have
>>>>>>> consensus
>>>>>>> in the community to change the wording back and allow equality
>>>>>>> deletes.
>>>>>>>
>>>>>>> PR: https://github.com/apache/iceberg/pull/12230
>>>>>>>
>>>>>>> The TLDR;
>>>>>>>
>>>>>>> Allow equality deletes with row lineage but accept that "updates"
>>>>>>> will be treated as "delete" and "insert"
>>>>>>>
>>>>>>> Thanks for your time,
>>>>>>> Russ
>>>>>>>
>>>>>>

Re: [Discussion] Spec change for Row Lineage - Allow Equality Deletes

Reply via email to