I have no other questions. +1 for it.
--

Best!
Xuyang


At 2025-03-07 19:37:09, "Dawid Wysakowicz" <dwysakow...@apache.org> wrote:
>> From my understanding, for a sink, if its schema includes a primary key,
>> we can assume it has the ability to process delete messages (with '-D')
>> and perform deletions by key (PK). If it does not include a PK, we would
>> implicitly treat it as a log-structured table that supports full row
>> deletions.
>
>I am afraid this assumption goes too far. PK is information about column
>uniqueness and that's it. It does not tell us what is required to perform
>a DELETE operation. I agree the assumption would most often hold, but I am
>afraid it is not guaranteed. E.g. in log-based systems one may just want
>to have the full information encoded in the DELETE messages (e.g. in a
>Debezium message).
>
>The same holds for sources. Even though, theoretically, if there is a PK,
>deletes could contain only the key information, the source may just as
>well produce DELETEs with all fields set.
>
>> Given that you mentioned `PARTIAL_DELETE`, should I interpret this as
>> referring to a scenario similar to wide tables, where if the sink has a
>> PK, some columns are deleted (set to null or through other operations)
>> while others remain unchanged?
>
>No. The effect is the same: the ROW is deleted/disappears. The difference
>is what is required to perform the deletion. In some cases it may be
>enough to have the PK to perform the deletion, and then we don't need the
>information about the other columns, but there may be systems that require
>all columns to be set.
>
>By the way, since the flag applies to both sources and sinks to tell what
>the expected format of produced/consumed DELETE records is, I renamed the
>flag in the FLIP:
>supportsDeleteByKey -> deletesByKeyOnly.
>
>Let me know if there are other questions. If there are none, I'd like to
>start a vote in the upcoming days.
>
>Best,
>Dawid
>
>
>On Mon, 3 Mar 2025 at 07:29, Xuyang <xyzhong...@163.com> wrote:
>
>> Hi, Dawid.
>>
>> Thanks for your response. I believe I've identified a key point, but I'm
>> a bit unclear about the following statement of yours. Could you please
>> provide an example for clarification?
>>
>> ```
>> The only missing information is if the external sink can consume deletes
>> by key and if a source produces full deletes or deletes by key.
>> ```
>>
>> From my understanding, for a sink, if its schema includes a primary key,
>> we can assume it has the ability to process delete messages (with '-D')
>> and perform deletions by key (PK). If it does not include a PK, we would
>> implicitly treat it as a log-structured table that supports full row
>> deletions.
>>
>> Given that you mentioned `PARTIAL_DELETE`, should I interpret this as
>> referring to a scenario similar to wide tables, where if the sink has a
>> PK, some columns are deleted (set to null or through other operations)
>> while others remain unchanged?
>>
>> Looking forward to your reply.
>>
>> --
>>
>> Best!
>> Xuyang
>>
>>
>> At 2025-02-28 19:16:12, "Dawid Wysakowicz" <wysakowicz.da...@gmail.com>
>> wrote:
>> >Hey Xuyang,
>> >
>> >Ad. 1
>> >Yes, you're right, but we already do that for determining if we need
>> >UPDATE_BEFORE or not. FlinkChangelogModeInferenceProgram already deals
>> >with that.
>> >
>> >Ad. 2
>> >Unfortunately it is. This is also the only reason I need a FLIP. We can
>> >determine internally for every internal operator if we can work with
>> >partial deletes or if we need full deletes.
>> >The only missing information is if the external sink can consume deletes
>> >by key and if a source produces full deletes or deletes by key.
>> >Unfortunately this is information that comes from a connector
>> >implementation and thus needs to be provided via a public API.
>> >
>> >Ad. 3
>> >With ChangelogMode#kinds -> to some degree, yes. We theoretically could
>> >split RowKind#DELETE into RowKind#DELETE_BY_KEY and RowKind#FULL_DELETE.
>> >However, that change would 1) be much more involved and 2) require us to
>> >encode that information in every single message, which I think is not
>> >necessary. I don't think it has much to do with PK.
>> >
>> >Ad. 4
>> >I don't think so. PK information is part of the Schema, not of the kind
>> >of messages. We don't have PK information for UPDATE_BEFORE/UPDATE_AFTER
>> >and they also apply per key. If the name containing `DELETE_BY_KEY` is
>> >confusing I am happy to rename it to e.g. PARTIAL_DELETE, therefore I'd
>> >add `supportsPartialDeletes`.
>> >
>> >Best,
>> >Dawid
>> >
>> >On Fri, 28 Feb 2025 at 04:43, Xuyang <xyzhong...@163.com> wrote:
>> >
>> >> Hi Dawid.
>> >>
>> >> Big +1 for this FLIP. After reading through it, I have a few questions
>> >> and would appreciate your responses:
>> >>
>> >> 1. IIUC, we only need to provide additional information in the
>> >> `FlinkChangelogModeInferenceProgram` to enable the inference program
>> >> to determine whether it is safe to remove `ChangelogNormalize`. My
>> >> first instinct is that we need to know whether all subsequent
>> >> output-side nodes consuming Upsert Keys include the Upsert Keys
>> >> provided by the input-side operator (source). If this condition is
>> >> met, we can safely eliminate `ChangelogNormalize`. Perhaps I have
>> >> missed some important points, so please feel free to correct me if
>> >> necessary.
>> >>
>> >> 2. The introduction of `supportsDeleteByKey` in ChangelogMode seems to
>> >> exist solely as auxiliary information for the
>> >> `FlinkChangelogModeInferenceProgram`. If that's the case, it doesn't
>> >> seem necessary to expose it in the public API, does it?
>> >>
>> >> 3. If the purpose of introducing `supportsDeleteByKey` in
>> >> ChangelogMode is to facilitate support for `#fromChangelogStream` and
>> >> `#toChangelogStream`, it appears that `supportsDeleteByKey` might
>> >> overlap with ChangelogMode#kinds and Schema#PK to some extent, right?
>> >>
>> >> 4. Regarding `supportsDeleteByKey`, as part of a complete
>> >> ChangelogMode entity, should we also store the specific key
>> >> information?
>> >>
>> >> --
>> >>
>> >> Best!
>> >> Xuyang
>> >>
>> >>
>> >> At 2025-02-28 04:27:19, "Martijn Visser" <martijnvis...@apache.org>
>> >> wrote:
>> >> >Hi Dawid,
>> >> >
>> >> >Thanks for the FLIP, looks like a good improvement to me that will
>> >> >bring a lot of benefits. +1
>> >> >
>> >> >Best regards,
>> >> >
>> >> >Martijn
>> >> >
>> >> >On Tue, Feb 25, 2025 at 6:51 AM Sergey Nuyanzin <snuyan...@gmail.com>
>> >> >wrote:
>> >> >
>> >> >> +1 for such improvement
>> >> >>
>> >> >> On Mon, Feb 24, 2025 at 12:01 PM Dawid Wysakowicz
>> >> >> <wysakowicz.da...@gmail.com> wrote:
>> >> >> >
>> >> >> > Hi everyone,
>> >> >> >
>> >> >> > I would like to initiate a discussion for FLIP-510[1] below,
>> >> >> > which aims at optimising certain use cases in SQL which at the
>> >> >> > moment add ChangelogNormalize, but don't necessarily need to do
>> >> >> > it.
>> >> >> >
>> >> >> > Looking forward to hearing from you.
>> >> >> >
>> >> >> > [1] https://cwiki.apache.org/confluence/x/7o5EF
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Best regards,
>> >> >> Sergey
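
To make the discussion above concrete, here is a minimal Java sketch of how a connector could describe its DELETE semantics through ChangelogMode. ChangelogMode, its builder, addContainedKind(...) and RowKind are existing Flink Table API classes; the deletesByKeyOnly(...) builder method is only an assumption that mirrors the renamed flag mentioned in the thread and may differ from the final FLIP-510 API.

```java
// Sketch only: deletesByKeyOnly(...) is hypothetical, named after the flag
// discussed in this thread; the final FLIP-510 API may look different.
import org.apache.flink.table.connector.ChangelogMode;
import org.apache.flink.types.RowKind;

public final class ChangelogModeSketch {

    /** A source/sink whose DELETE records carry only the (primary) key columns. */
    static ChangelogMode upsertWithKeyOnlyDeletes() {
        return ChangelogMode.newBuilder()
                .addContainedKind(RowKind.INSERT)
                .addContainedKind(RowKind.UPDATE_AFTER)
                .addContainedKind(RowKind.DELETE)
                .deletesByKeyOnly(true) // hypothetical builder method
                .build();
    }

    /**
     * A source/sink that produces or requires full rows in DELETE records,
     * e.g. a log-based system emitting complete Debezium-style messages.
     */
    static ChangelogMode upsertWithFullDeletes() {
        return ChangelogMode.newBuilder()
                .addContainedKind(RowKind.INSERT)
                .addContainedKind(RowKind.UPDATE_AFTER)
                .addContainedKind(RowKind.DELETE)
                .build(); // without the flag, DELETEs are expected to carry all fields
    }
}
```

With such a hint, the planner could, as discussed in the thread, skip ChangelogNormalize when the downstream operators can work with key-only deletes, instead of materializing full rows just to emit full DELETE records.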