Re: [DISCUSS] FLIP-530: Dynamic job configuration

Hangxiang Yu Tue, 20 May 2025 19:57:22 -0700

Thanks everyone for sharing thoughts.
Sorry for the late reply.
I'm also +1 for the limited white list from the start.
Also, thanks Rui for sharing the extra information.
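
To make the allow-list behavior discussed below concrete, here is a minimal sketch of how I read it: options outside the allow-list get 403, and 400 is kept for bodies that are not valid JSON. The class and method names are hypothetical and not taken from the FLIP. JSON parsing is assumed to happen earlier, with a null map standing in for a body that failed to parse.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical validator; names are illustrative and not part of the FLIP. */
final class DynamicConfigValidator {
    static final int OK = 200;
    static final int BAD_REQUEST = 400;
    static final int FORBIDDEN = 403;

    // Default allow-list entry mentioned in the thread.
    static final Set<String> DEFAULT_ALLOW_LIST =
            Set.of("execution.checkpointing.interval");

    /**
     * Returns the HTTP status a request would get under the discussed rules.
     *
     * @param requested parsed request body, or null if the body was not valid JSON
     *                  (assumption: parsing is done by the caller)
     * @param allowList option keys that may be changed at runtime
     */
    static int validate(Map<String, String> requested, Set<String> allowList) {
        if (requested == null) {
            return BAD_REQUEST; // only malformed JSON maps to 400
        }
        for (String key : requested.keySet()) {
            if (!allowList.contains(key)) {
                return FORBIDDEN; // option is not allow-listed
            }
        }
        return OK;
    }

    public static void main(String[] args) {
        System.out.println(validate(
                Map.of("execution.checkpointing.interval", "30s"), DEFAULT_ALLOW_LIST));
        System.out.println(validate(
                Map.of("pipeline.operator-chaining.enabled", "false"), DEFAULT_ALLOW_LIST));
        System.out.println(validate(null, DEFAULT_ALLOW_LIST));
    }
}
```

Note that this only checks option keys. As Roman points out below, a value can pass validation and still break the job (e.g. a 1 second checkpoint timeout), so a follow-up PUT to revert would still be needed in that case.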


On Tue, May 20, 2025 at 8:59 PM Gustavo de Morais <[email protected]>
wrote:

> Hi Roman,
>
> This is a great and important improvement. +1 for the FLIP and to start
> voting.
>
> Best,
> Gustavo
>
> On Mon, 19 May 2025 at 20:28, Roman Khachatryan <[email protected]> wrote:
>
> > Thanks everyone for the discussion!
> >
> > I'm going to start a voting thread soon unless there are other
> > suggestions or objections.
> >
> > Regards,
> > Roman
> >
> >
> > On Sat, May 17, 2025 at 2:01 PM Roman Khachatryan <[email protected]>
> > wrote:
> >
> > > Thanks Chesnay, I like your idea of returning 403 for non-white-listed
> > > options. Updated the FLIP accordingly. Also, specified
> > > 'execution.checkpointing.interval' as a default value for the
> > > allow-list.
> > >
> > > Kartikey Pant, that's a good question, and your understanding is
> > > correct. There's a possibility of breaking the job via this API after
> > > passing the validation.
> > > For example, a checkpoint timeout of 1 second would be valid, but might
> > > cause the checkpoints to fail. In such a case, the configuration change
> > > should be reverted via a new PUT request.
> > >
> > > Regards,
> > > Roman
> > >
> > >
> > > On Thu, May 15, 2025 at 3:45 PM Chesnay Schepler <[email protected]>
> > > wrote:
> > >
> > >> Documenting the supported options is a fair concern, but at the same
> > >> time also a mountain of work as it would require going through all
> > >> options and creating well-defined rules for what is a job setting and
> > >> what isn't, enforcing that and possibly also changing a whole bunch of
> > >> code to make that remotely consistent.
> > >>
> > >> I would say just documenting a few use-cases, like changing the
> > >> checkpoint interval for example, would already be good enough.
> > >> Changing the checkpointing interval on its own would justify this
> > >> entire effort; anything else that happens to work without explicit
> > >> documentation could then just be a bonus for power users.
> > >>
> > >> I'd suggest returning FORBIDDEN if the request contains an option
> > >> that is not allow-listed for changes, and limiting bad request to
> > >> invalid JSON.
> > >>
> > >> But as-is already +1 from my side.
> > >>
> > >> On 12/05/2025 07:33, Junrui Lee wrote:
> > >> > Hi Roman
> > >> >
> > >> > Thanks for driving this feature. +1 for this proposal.
> > >> >
> > >> > I also agree with the suggestion made by Feifan.
> > >> >
> > >> > Currently, not all configuration items are job-level configurations
> > >> > [1]. Even for those that are, not all job-level config options can be
> > >> > updated at runtime through the Adaptive Scheduler. For instance,
> > >> > certain config options related to job plan compilation, such as
> > >> > pipeline.operator-chaining.enabled and nearly all of the table.*
> > >> > settings, are not eligible for runtime updates.
> > >> >
> > >> > From a user perspective, it would be beneficial to clearly describe
> > >> > which config options can be dynamically updated, allowing users to
> > >> > take better advantage of this feature.
> > >> >
> > >> > Best,
> > >> > Junrui
> > >> >
> > >> > [1]
> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-478+Introduce+Config+Option+Scope
> > >> >
> > >> > Feifan Wang <[email protected]> wrote on Mon, May 12, 2025 at 11:27:
> > >> >
> > >> >> Thanks Roman for driving this useful improvement, +1 for this
> > >> >> proposal.
> > >> >>
> > >> >> Also thanks to Hangxiang and Rui Fan for the discussion. Regarding
> > >> >> question 1, I have some ideas for discussion:
> > >> >>
> > >> >> To give users stable expectations, I think we should perform
> > >> >> configuration checks in a whitelist manner, ensuring that the
> > >> >> configurations allowed to be modified through this API can actually
> > >> >> take effect.
> > >> >>
> > >> >> In the initial version, we can provide a very small whitelist, even
> > >> >> if it only contains the few configurations that we most want to use
> > >> >> and that have been confirmed to be effective. This list can be
> > >> >> extended continuously later.
> > >> >>
> > >> >>
> > >> >> ——————————————
> > >> >>
> > >> >> Best regards,
> > >> >> Feifan Wang
> > >> >>
> > >> >>
> > >> >>
> > >> >> ---- Replied Message ----
> > >> >> | From | Rui Fan<[email protected]> |
> > >> >> | Date | 05/11/2025 16:36 |
> > >> >> | To | <[email protected]> |
> > >> >> | Subject | Re: [DISCUSS] FLIP-530: Dynamic job configuration |
> > >> >> Thanks Roman for driving this valuable proposal, it uses the
> > >> >> Adaptive Scheduler to greatly reduce the downtime of configuration
> > >> >> updates, so +1 for this proposal!
> > >> >>
> > >> >> Overall LGTM, thanks to Hangxiang for the questions, and I have the
> > >> >> same questions as Hangxiang. I'd like to share my thoughts:
> > >> >>
> > >> >>
> > >> >> For question 1 about validation:
> > >> >>
> > >> >> I think validation is necessary, but both the list of valid
> > >> >> configurations and the list of invalid configurations have
> > >> >> limitations.
> > >> >>
> > >> >> For valid configurations: IIUC, almost all job-level configurations
> > >> >> are valid after the adaptive scheduler restarts the job. This means
> > >> >> lots of new configurations need to be added to the list if we list
> > >> >> valid configurations. If other developers miss one, the new
> > >> >> configuration will fail validation (even though it works).
> > >> >>
> > >> >> For invalid configurations: I encountered a problem before, where a
> > >> >> user added a non-existent Flink configuration, but Flink could not
> > >> >> detect it. It may have been caused by a typo. Therefore, even if we
> > >> >> list some Flink configurations that do not support dynamic
> > >> >> modification, we still cannot guarantee that the configurations
> > >> >> outside the list will take effect.
> > >> >>
> > >> >> Even so, I prefer to do limited validation, for example: not through
> > >> >> a list, but by hard-coding a few rules (e.g. table.* doesn't work).
> > >> >>
> > >> >>
> > >> >> For question 2 about configuration change history:
> > >> >>
> > >> >> Logging configuration change history in the first version is fine.
> > >> >>
> > >> >> As I understand, both configuration changes and resource
> > >> >> requirements changes could trigger a rescale for the Adaptive
> > >> >> Scheduler, so the rescale history can probably include both. If we
> > >> >> want to show the configuration change history, it might be more
> > >> >> appropriate to put it in FLIP-487 [1] and FLIP-495 [2].
> > >> >>
> > >> >> For question 3 about working together with other dynamic requests:
> > >> >>
> > >> >> "Configuration changes are applied immediately; resource
> > >> >> requirements changes are applied with some delay"
> > >> >>
> > >> >> Yes, rescaling after some delay could reduce the rescale frequency
> > >> >> to avoid some invalid restarts. So I'm curious why configuration
> > >> >> changes don't respect the delay mechanism?
> > >> >>
> > >> >> Please correct me if anything is wrong, thanks!
> > >> >>
> > >> >> [1]
> > >> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
> > >> >> [2]
> > >> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
> > >> >>
> > >> >> Best,
> > >> >> Rui
> > >> >>
> > >> >>
> > >> >> On Sat, May 10, 2025 at 11:57 AM Roman Khachatryan <[email protected]>
> > >> >> wrote:
> > >> >>
> > >> >> Thanks Hangxiang Yu,
> > >> >>
> > >> >> Please find the answers below
> > >> >>
> > >> >> 1. Yes, we should perform validation before trying to update the
> > >> >> configuration. I'd rather validate some specific options that are
> > >> >> known to not work, though. Finding and hard-coding all the valid
> > >> >> options might be impractical and non-trivial, since they can change.
> > >> >>
> > >> >> 2. That would be great, but we'd have to store the history of such
> > >> >> updates somewhere. For debugging purposes, logs should suffice, I
> > >> >> think.
> > >> >>
> > >> >> 3. That's a great question! Configuration changes are applied
> > >> >> immediately; resource requirements changes are applied with some
> > >> >> delay; and both are stored in HA immediately. So a configuration
> > >> >> change request also results in restarting and applying any pending
> > >> >> resource requirements changes.
> > >> >>
> > >> >>
> > >> >> Regards,
> > >> >> Roman
> > >> >>
> > >> >> On Fri, May 9, 2025, 05:10 Hangxiang Yu <[email protected]>
> > >> >> wrote:
> > >> >>
> > >> >> Hi, Roman.
> > >> >>
> > >> >> Thanks for the FLIP.
> > >> >> +1 for supporting dynamic configuration to reduce manual restart.
> > >> >>
> > >> >>
> > >> >> I just have the questions below:
> > >> >>
> > >> >> 1. Do we need a list of working configurations, so that unsupported
> > >> >> configurations could be rejected in advance?
> > >> >>
> > >> >> 2. Could we show the change history in the Web UI, so that more
> > >> >> details about changes could be tracked?
> > >> >>
> > >> >> 3. How does it work together with other dynamic requests? For
> > >> >> example, when parallelism is modified at the same time via
> > >> >> '/jobs/:jobid/resource-requirements'.
> > >> >>
> > >> >> On Fri, May 9, 2025 at 5:00 AM Roman Khachatryan <[email protected]>
> > >> >> wrote:
> > >> >>
> > >> >> Hi everyone,
> > >> >>
> > >> >> I would like to start a discussion about FLIP-530: Dynamic job
> > >> >> configuration [1].
> > >> >>
> > >> >> In some cases, it is desirable to change the Flink job configuration
> > >> >> after it was submitted to Flink, for example:
> > >> >> - Troubleshooting (e.g. increasing the checkpoint timeout or failure
> > >> >> threshold)
> > >> >> - Performance optimization (e.g. tuning state backend parameters)
> > >> >> - Enabling new features after testing them in a non-production
> > >> >> environment. This allows decoupling the upgrade to a newer Flink
> > >> >> version from actually enabling the features.
> > >> >> To support such use cases, we propose to enhance the Flink job
> > >> >> configuration REST endpoint with support for reading the full job
> > >> >> configuration and updating it.
> > >> >>
> > >> >> Looking forward to feedback.
> > >> >>
> > >> >> [1]
> > >> >> https://cwiki.apache.org/confluence/x/uglKFQ
> > >> >>
> > >> >> Regards,
> > >> >> Roman
> > >> >>
> > >> >>
> > >> >>
> > >> >> --
> > >> >> Best,
> > >> >> Hangxiang.
> > >> >>
> > >> >>
> > >> >>
> > >>
> > >>
> >
>


-- 
Best,
Hangxiang.
