Re: [DISCUSS] FLIP-506: Support Reuse Multiple Table Sinks in Planner

xiangyu feng Wed, 19 Feb 2025 01:15:40 -0800

Hi Ron,

Thx for ur advice.  I've made the changes to current FLIP[1] including
renaming the interface and remove the default implementation. As we have
discussed, the target columns will be compared in sink reuse if the sink
has not implemented the `SupportsTargetColumnWriting` ability. This will
make sure the sink reuse feature can still be safely enabled by default.


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-506%3A+Support+Reuse+Multiple+Table+Sinks+in+Planner

Best,
Xiangyu Feng

Ron Liu <ron9....@gmail.com> 于2025年2月19日周三 10:25写道：

> Hi Xiangyu,
>
> Thanks for your reply, the updates LGTM  overall.
>
> 1. Regarding the naming of the interface, what do you think about calling
> it SupportsTargetColumnWriting? Here I would like to emphasize the support
> for partial column writing, and I personally think the naming can be
> aligned with SupportsWritingMetadata.
>
> 2. Regarding the interface methods, is it necessary to provide a default
> implementation, do most of the stores support partial column writing?
>
>
> Best,
> Ron
>
> Cong Cheng <congchengch...@gmail.com> 于2025年2月18日周二 16:12写道：
>
>> Hi Xiangyu,
>>
>> Introduce a new sink ability interface named `SupportsTargetColumnUpdate`,
>> > this interface will tell the planner if the sink has considered the
>> target
>> > columns information in its implementation;
>>
>>
>> I think it makes a lot of sense, +1 for this ability.
>>
>> Sorry to all that I sended the draft of the content twice, something
>> wrong with the enter of my keyboard.
>>
>> Best,
>> Cong Cheng
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> xiangyu feng <xiangyu...@gmail.com> 于2025年2月18日周二 15:06写道：
>>
>> > Hi Kevin,
>> >
>> > Thx for ur valuable suggestion. I've made a few changes to current
>> FLIP[1].
>> >
>> > 1, Introduce a new sink ability interface named
>> > `SupportsTargetColumnUpdate`, this interface will tell the planner if
>> the
>> > sink has considered the target columns information in its
>> implementation;
>> >
>> > 2, This ability will return true by default for safety consideration;
>> >
>> > 3, When generating node digest for sink reuse, the digest will only
>> ignore
>> > the target column infos when this ability return false. This will
>> require
>> > extra work for specific sink.
>> >
>> > By applying these changes, we can safely enable the sink reuse feature
>> by
>> > default even for sinks like JDBC . And for sinks like Paimon, we can
>> also
>> > reuse the sink node across multiple partial-update streams with
>> different
>> > target columns by revising paimon sink to implement this interface.
>> >
>> > Glad to hear you back for these updates.
>> >
>> > [1]
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-506%3A+Support+Reuse+Multiple+Table+Sinks+in+Planner
>> >
>> >
>> >
>> > Kevin Cheng <congchengch...@gmail.com> 于2025年2月14日周五 16:13写道：
>> >
>> >> Hi Xiangyu and Ron,
>> >>
>> >> I agree that sink reuse can be enabled by default from Flink Planner
>> >> perspective. But the planner should be informed by Sink Connector that
>> >> whether the planner can reuse different sink with different target
>> >> columns.
>> >>
>> >> Take JBDC sink as an example, under partial update circumstances, the
>> JDBC
>> >> needs to know the target sink or update columns of every record.
>> However,
>> >> the target columns info is discarded in current FLIP design.
>> >>
>> >> Best,
>> >> Xiangyu
>> >>
>> >> xiangyu feng <xiangyu...@gmail.com> 于2025年2月14日周五 13:51写道：
>> >>
>> >> > Hi Ron,
>> >> >
>> >> > After second thought, taking sink reuse as a long awaited feature
>> >> > with significant benefits for users, I agree to enable this in the
>> first
>> >> > place.  Similar features like
>> `table.optimizer.reuse-sub-plan-enabled`
>> >> and
>> >> > `table.optimizer.reuse-source-enabled` are also enabled by default.
>> From
>> >> > this point of view, sink reuse should be the same.
>> >> >
>> >> > The Flip[1] has been revised accordingly. Thx for suggestion.
>> >> >
>> >> > [1]
>> >> >
>> >> >
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-506%3A+Support+Reuse+Multiple+Table+Sinks+in+Planner
>> >> >
>> >> > Regards,
>> >> > Xiangyu
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > Ron Liu <ron9....@gmail.com> 于2025年2月14日周五 12:10写道：
>> >> >
>> >> > > Hi, Xiangyu
>> >> > >
>> >> > > >>> I prefer to set the default value of this option as'false in
>> the
>> >> > first
>> >> > > place.  Setting as true might introduce unexpected behavior for
>> users
>> >> > when
>> >> > > operating existing jobs. Maybe we should introduce this feature at
>> >> first
>> >> > > and discuss enabling this feature as default in a separated thread.
>> >> WDYT?
>> >> > >
>> >> > > 1. What unexpected behaviors do you think this might introduce?
>> For
>> >> Sink
>> >> > > nodes, which are generally stateless, I intuitively understand
>> that no
>> >> > > state compatibility issues will be introduced after Sink reuse.
>> >> > >
>> >> > > 2. Since Sink reuse benefits users, why not enable this feature by
>> >> > default
>> >> > > on the first day it is introduced? If your concern is potential
>> >> unhandled
>> >> > > corner cases in the implementation, I consider those to be bugs. We
>> >> > should
>> >> > > prioritize fixing them rather than blocking the default enablement
>> of
>> >> > this
>> >> > > optimization.
>> >> > >
>> >> > > 3. If we don't enable it by default now, when should we? What
>> specific
>> >> > > milestones or actions are needed during the waiting period?  Your
>> >> > concerns
>> >> > > about unintended behaviors would still exist even if we enable it
>> >> later.
>> >> > > Why delay resolving this in a separate discussion instead of
>> >> finalizing
>> >> > it
>> >> > > here?
>> >> > >
>> >> > > 4. From our internal practice, users still want to enjoy the
>> benefits
>> >> of
>> >> > > this feature by default.
>> >> > >
>> >> > >
>> >> > > Best,
>> >> > > Ron
>> >> > >
>> >> > > xiangyu feng <xiangyu...@gmail.com> 于2025年2月13日周四 15:57写道：
>> >> > >
>> >> > > >  Hi Ron,
>> >> > > >
>> >> > > > Thx for quick response.
>> >> > > >
>> >> > > > - should the default value be true for the newly introduced
>> option
>> >> > > > `table.optimizer.reuse-sink-enabled`?
>> >> > > >
>> >> > > > I prefer to set the default value of this option as'false in the
>> >> first
>> >> > > > place.  Setting as true might introduce unexpected behavior for
>> >> users
>> >> > > when
>> >> > > > operating existing jobs. Maybe we should introduce this feature
>> at
>> >> > first
>> >> > > > and discuss enabling this feature as default in a separated
>> thread.
>> >> > WDYT?
>> >> > > >
>> >> > > > - have you considered the technical implementation options and
>> are
>> >> they
>> >> > > > feasible?
>> >> > > >
>> >> > > > Yes, we have already implemented the POC internally. It works
>> well.
>> >> > > >
>> >> > > > Looking forward for your feedback.
>> >> > > >
>> >> > > > Best,
>> >> > > > Xiangyu
>> >> > > >
>> >> > > > Ron Liu <ron9....@gmail.com> 于2025年2月13日周四 14:55写道：
>> >> > > >
>> >> > > > > Hi, Xiangyu
>> >> > > > >
>> >> > > > > Thank you for proposing this FLIP, it's great work and looks
>> very
>> >> > > useful
>> >> > > > > for users.
>> >> > > > >
>> >> > > > > I have the following two questions regarding the content of the
>> >> FLIP:
>> >> > > > > 1. Since sink reuse is very useful, should the default value be
>> >> true
>> >> > > for
>> >> > > > > the newly introduced option
>> `table.optimizer.reuse-sink-enabled`,
>> >> and
>> >> > > > > should the engine enable this optimization by default.
>> Currently
>> >> for
>> >> > > > source
>> >> > > > > reuse, the default value of
>> >> > `sql.optimizer.reuse.table-source.enabled`
>> >> > > > > option is also true, which does not require user access by
>> >> default,
>> >> > so
>> >> > > I
>> >> > > > > think the engine should turn on Sink reuse optimization by
>> >> default.
>> >> > > > > 2. Regarding Sink Digest, you mentioned disregarding the sink
>> >> target
>> >> > > > > column, which I think is a very good suggestion, and very
>> useful
>> >> if
>> >> > it
>> >> > > > can
>> >> > > > > be done. I have a question: have you considered the technical
>> >> > > > > implementation options and are they feasible?
>> >> > > > >
>> >> > > > > Best,
>> >> > > > > Ron
>> >> > > > >
>> >> > > > > xiangyu feng <xiangyu...@gmail.com> 于2025年2月13日周四 12:56写道：
>> >> > > > >
>> >> > > > > > Hi all,
>> >> > > > > >
>> >> > > > > > Thank you all for the comments.
>> >> > > > > >
>> >> > > > > > If there is no further comment, I will open the voting thread
>> >> in 3
>> >> > > > days.
>> >> > > > > >
>> >> > > > > > Regards,
>> >> > > > > > Xiangyu
>> >> > > > > >
>> >> > > > > > xiangyu feng <xiangyu...@gmail.com> 于2025年2月11日周二 14:17写道：
>> >> > > > > >
>> >> > > > > > > Link for Paimon LocalMerge Operator[1]
>> >> > > > > > >
>> >> > > > > > > [1]
>> >> > > > > > >
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> https://paimon.apache.org/docs/master/maintenance/write-performance/#local-merging
>> >> > > > > > >
>> >> > > > > > > xiangyu feng <xiangyu...@gmail.com> 于2025年2月11日周二 14:03写道：
>> >> > > > > > >
>> >> > > > > > >> Follow the above,
>> >> > > > > > >>
>> >> > > > > > >> "And for SinkWriter, the data structure to be processed
>> >> should
>> >> > be
>> >> > > > > > fixed."
>> >> > > > > > >>
>> >> > > > > > >> I'm not very sure why the data structure of SinkWriter
>> >> should be
>> >> > > > > fixed.
>> >> > > > > > >> Can you elaborate the scenario here?
>> >> > > > > > >>
>> >> > > > > > >>  "Is there a node or an operator to fill in the
>> inconsistent
>> >> > field
>> >> > > > of
>> >> > > > > > >> Rowdata that passed from different Sources?"
>> >> > > > > > >>
>> >> > > > > > >> By `filling in the inconsistent field from different
>> >> sources`,
>> >> > do
>> >> > > > you
>> >> > > > > > >> refer to implementations like the LocalMerge Operator [1]
>> for
>> >> > > > Paimon?
>> >> > > > > > IMHO,
>> >> > > > > > >> this should not be included in the Sink Reuse. The merging
>> >> > > behavior
>> >> > > > of
>> >> > > > > > >> multiple sources should be considered inside of the sink.
>> >> > > > > > >>
>> >> > > > > > >> Regards,
>> >> > > > > > >> Xiangyu Feng
>> >> > > > > > >>
>> >> > > > > > >> xiangyu feng <xiangyu...@gmail.com> 于2025年2月11日周二
>> 13:46写道：
>> >> > > > > > >>
>> >> > > > > > >>> Hi Yanquan,
>> >> > > > > > >>>
>> >> > > > > > >>> Thx for reply. IIUC, the schema of CatalogTable should
>> >> contain
>> >> > > all
>> >> > > > > > >>> target columns for sources. If not, a SQL validation
>> >> exception
>> >> > > > should
>> >> > > > > > be
>> >> > > > > > >>> raised for planner.
>> >> > > > > > >>>
>> >> > > > > > >>> Regards,
>> >> > > > > > >>> Xiangyu Feng
>> >> > > > > > >>>
>> >> > > > > > >>>
>> >> > > > > > >>>
>> >> > > > > > >>> Yanquan Lv <decq12y...@gmail.com> 于2025年2月10日周一 16:25写道：
>> >> > > > > > >>>
>> >> > > > > > >>>> Hi, Xiangyu. Thanks for driving this.
>> >> > > > > > >>>>
>> >> > > > > > >>>> I have a question to confirm:
>> >> > > > > > >>>> Considering the case that different Sources use
>> different
>> >> > > > > columns[1],
>> >> > > > > > >>>> will the Schema of CatalogTable[2] contain all target
>> >> columns
>> >> > > for
>> >> > > > > > Sources?
>> >> > > > > > >>>> And for SinkWriter, the data structure to be processed
>> >> should
>> >> > be
>> >> > > > > > fixed.
>> >> > > > > > >>>> Is there a node or an operator to fill in the
>> inconsistent
>> >> > field
>> >> > > > of
>> >> > > > > > Rowdata
>> >> > > > > > >>>> that passed from different Sources?
>> >> > > > > > >>>>
>> >> > > > > > >>>> [1]
>> >> > > > > > >>>>
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-506%3A+Support+Reuse+Multiple+Table+Sinks+in+Planner
>> >> > > > > > >>>> [2]
>> >> > > > > > >>>>
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sourcessinks/#planning
>> >> > > > > > >>>>
>> >> > > > > > >>>>
>> >> > > > > > >>>>
>> >> > > > > > >>>> > 2025年2月6日 17:06，xiangyu feng <xiangyu...@gmail.com>
>> 写道：
>> >> > > > > > >>>> >
>> >> > > > > > >>>> > Hi devs,
>> >> > > > > > >>>> >
>> >> > > > > > >>>> > I'm opening this thread to discuss FLIP-506: Support
>> >> Reuse
>> >> > > > > Multiple
>> >> > > > > > >>>> Table
>> >> > > > > > >>>> > Sinks in Planner[1].
>> >> > > > > > >>>> >
>> >> > > > > > >>>> > Currently if users want to partial-update a downstream
>> >> table
>> >> > > > from
>> >> > > > > > >>>> multiple
>> >> > > > > > >>>> > source tables in one datastream, they would have to
>> >> manually
>> >> > > > union
>> >> > > > > > all
>> >> > > > > > >>>> > source tables and add lots of "cast(null as string) as
>> >> xxx"
>> >> > in
>> >> > > > > Flink
>> >> > > > > > >>>> SQL.
>> >> > > > > > >>>> > This will make the SQL here hard to use and maintain.
>> >> > > > > > >>>> >
>> >> > > > > > >>>> > After discussing with Weijie Guo, we think that by
>> >> > supporting
>> >> > > > > reuse
>> >> > > > > > >>>> sink
>> >> > > > > > >>>> > nodes in planner, the usability can be greatly
>> improved
>> >> in
>> >> > > this
>> >> > > > > > case.
>> >> > > > > > >>>> >
>> >> > > > > > >>>> > Therefore, we propose to add a new option
>> >> > > > > > >>>> > *`table.optimizer.reuse-sink-enabled`* here to support
>> >> this
>> >> > > > > feature.
>> >> > > > > > >>>> More
>> >> > > > > > >>>> > details can be found in the FLIP.
>> >> > > > > > >>>> >
>> >> > > > > > >>>> > Looking forward to your feedback, thanks.
>> >> > > > > > >>>> >
>> >> > > > > > >>>> > [1]
>> >> > > > > > >>>> >
>> >> > > > > > >>>>
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-506%3A+Support+Reuse+Multiple+Table+Sinks+in+Planner
>> >> > > > > > >>>> >
>> >> > > > > > >>>> > Best regards,
>> >> > > > > > >>>> > Xiangyu Feng
>> >> > > > > > >>>>
>> >> > > > > > >>>>
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> >
>>
>

Re: [DISCUSS] FLIP-506: Support Reuse Multiple Table Sinks in Planner

Reply via email to