Re: [DISCUSS] FLIP-530: Dynamic job configuration

Roman Khachatryan Mon, 19 May 2025 11:28:49 -0700

Thanks everyone for the discussion!

I'm going to start a voting thread soon unless there are other suggestions
or objections.


Regards,
Roman


On Sat, May 17, 2025 at 2:01 PM Roman Khachatryan <ro...@apache.org> wrote:

> Thanks Chesnay, I like your idea of returning 403 for non-allow-listed
> options. I updated the FLIP accordingly. I also specified
> 'execution.checkpointing.interval' as the default value for the allow-list.
>
> Kartikey Pant, that's a good question, and your understanding is correct.
> There's a possibility of breaking the job via this API after passing the
> validation.
> For example, a checkpoint timeout of 1 second would be valid, but might
> cause the checkpoints to fail. In such a case, the configuration change
> should be reverted via a new PUT request.
>
> Regards,
> Roman
>
>
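A minimal sketch of the status-code split discussed above, not the FLIP's actual implementation. It assumes the request body is a flat JSON object mapping option keys to values; `validate_update` and `DEFAULT_ALLOW_LIST` are hypothetical names, and treating a non-object body as a bad request is also an assumption.

```python
import json
from http import HTTPStatus

# Assumed default allow-list: the only option the thread names for it.
DEFAULT_ALLOW_LIST = frozenset({"execution.checkpointing.interval"})


def validate_update(raw_body, allow_list=DEFAULT_ALLOW_LIST):
    """Return (status, detail) for a hypothetical configuration update body."""
    try:
        options = json.loads(raw_body)
    except ValueError as err:  # json.JSONDecodeError is a ValueError
        return HTTPStatus.BAD_REQUEST, f"invalid JSON: {err}"
    if not isinstance(options, dict):
        # Assumption: a body that is not a JSON object is also a bad request.
        return HTTPStatus.BAD_REQUEST, "expected a JSON object"
    rejected = sorted(key for key in options if key not in allow_list)
    if rejected:
        # 403 FORBIDDEN for any option that is not allow-listed.
        return HTTPStatus.FORBIDDEN, "not allow-listed: " + ", ".join(rejected)
    return HTTPStatus.OK, "accepted"
```

Note that a check like this only looks at option keys. A value such as a very short checkpoint timeout would still pass it, which is the case described above where the change has to be reverted with another PUT.
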
> On Thu, May 15, 2025 at 3:45 PM Chesnay Schepler <ches...@apache.org>
> wrote:
>
>> Documenting the supported options is a fair concern, but at the same
>> time also a mountain of work, as it would require going through all
>> options, creating well-defined rules for what is a job setting and
>> what isn't, enforcing that, and possibly also changing a whole bunch of
>> code to make that remotely consistent.
>>
>> I would say just documenting a few use-cases, like changing the
>> checkpoint interval for example, would already be good enough.
>> Changing the checkpointing interval on its own would justify this
>> entire effort; anything else that happens to work without explicit
>> documentation could then just be a bonus for power users.
>>
>> I'd maybe suggest returning FORBIDDEN if the request contains an option
>> that is not allow-listed to be changed, and limiting BAD REQUEST to
>> invalid JSON.
>>
>> But as-is already +1 from my side.
>>
>> On 12/05/2025 07:33, Junrui Lee wrote:
>> > Hi Roman
>> >
>> > Thanks for driving this feature. +1 for this proposal.
>> >
>> > I also agree with the suggestion made by Feifan.
>> >
>> > Currently, not all configuration items are job-level configurations [1].
>> > Even for those that are, not all job-level config options can be updated
>> > at runtime through the Adaptive Scheduler. For instance, certain config
>> > options related to job plan compilation, such as
>> > pipeline.operator-chaining.enabled and nearly all of the table.*
>> > settings, are not eligible for runtime updates.
>> >
>> > From a user perspective, it would be beneficial to clearly describe
>> > which config options can be dynamically updated, allowing users to take
>> > better advantage of this feature.
>> >
>> > Best,
>> > Junrui
>> >
>> > [1]
>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-478+Introduce+Config+Option+Scope
>> >
>> > Feifan Wang <zoltar9...@163.com> wrote on Mon, May 12, 2025 at 11:27:
>> >
>> >> Thanks Roman for driving this useful improvement, +1 for this proposal.
>> >>
>> >> Thanks also to Hangxiang and Rui Fan for the discussion. Regarding
>> >> question 1, I have some ideas for discussion:
>> >>
>> >> To give users stable expectations, I think we should perform
>> >> configuration checks in a whitelist manner, ensuring that the
>> >> configurations allowed to be modified through this API can actually
>> >> take effect.
>> >>
>> >> In the initial version, we can provide a very small whitelist, even if
>> >> it only contains the few configurations that we most want to use and
>> >> that have been confirmed to be effective. This list can be continuously
>> >> extended later.
>> >>
>> >>
>> >> ——————————————
>> >>
>> >> Best regards,
>> >> Feifan Wang
>> >>
>> >>
>> >>
>> >> ---- Replied Message ----
>> >> | From | Rui Fan<1996fan...@gmail.com> |
>> >> | Date | 05/11/2025 16:36 |
>> >> | To | <dev@flink.apache.org> |
>> >> | Subject | Re: [DISCUSS] FLIP-530: Dynamic job configuration |
>> >> Thanks Roman for driving this valuable proposal, it uses the Adaptive
>> >> Scheduler to greatly reduce the downtime of configuration updates,
>> >> so +1 for this proposal!
>> >>
>> >> Overall LGTM. Thanks to Hangxiang for the questions; I have the
>> >> same questions as Hangxiang. I'd like to share my thoughts:
>> >>
>> >>
>> >> For question 1 about validation:
>> >>
>> >> I think validation is necessary, but both the list of valid
>> >> configurations and the list of invalid configurations have limitations.
>> >>
>> >> For valid configurations: IIUC, almost all job-level configurations are
>> >> valid after the job is restarted by the adaptive scheduler. It means
>> >> lots of new configurations need to be added to the list if we list
>> >> valid configurations. If other developers miss one, the new
>> >> configuration will fail validation (even though it works).
>> >>
>> >> For invalid configurations: I encountered a problem before, where the
>> >> user added a non-existent Flink configuration, but Flink could not
>> >> detect it. It may have been caused by a typo. Therefore, even if we list
>> >> some Flink configurations that do not support dynamic modification, we
>> >> still cannot guarantee that the configurations outside the list will
>> >> take effect.
>> >>
>> >> Even so, I prefer to do limited validation, for example: not through a
>> >> list, but by hard-coding a few rules (e.g. table.* doesn't work).
>> >>
>> >>
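A rough sketch of the rule-based validation suggested above, as opposed to an allow-list. The rule set is an assumption: only the `table.` prefix is proposed here, and `pipeline.operator-chaining.enabled` is taken from another message in this thread. `find_rejected`, `DENIED_PREFIXES` and `DENIED_KEYS` are hypothetical names.

```python
# Hypothetical deny rules: options assumed not to take effect at runtime.
DENIED_PREFIXES = ("table.",)
DENIED_KEYS = frozenset({"pipeline.operator-chaining.enabled"})


def find_rejected(options):
    """Return the sorted option keys that match a hard-coded deny rule."""
    return sorted(
        key
        for key in options
        if key in DENIED_KEYS or key.startswith(DENIED_PREFIXES)
    )


if __name__ == "__main__":
    update = {
        "table.exec.mini-batch.enabled": "true",
        "execution.checkpointing.interval": "30s",
    }
    print(find_rejected(update))
```

This also shows the limitation described above: a misspelled key matches no deny rule, so it would pass validation without ever taking effect.
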
>> >> For question 2 about configuration change history:
>> >>
>> >> Logging configuration change history in the first version is fine.
>> >>
>> >> As I understand it, both configuration changes and resource requirements
>> >> changes could trigger a rescale for the Adaptive Scheduler. So the
>> >> rescale history can probably include both. If we want to show the
>> >> configuration change history, it might be more appropriate to put it in
>> >> FLIP-487 [1] and FLIP-495 [2].
>> >>
>> >> For question 3 about working together with other dynamic requests:
>> >>
>> >> "Configuration changes are applied immediately; resource requirements
>> >> changes are applied with some delay"
>> >>
>> >> Yes, rescaling after some delay could reduce the rescale frequency and
>> >> avoid some invalid restarts. So I'm curious why configuration changes
>> >> don't respect the delay mechanism?
>> >>
>> >> Please correct me if anything is wrong, thanks!
>> >>
>> >> [1]
>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
>> >> [2]
>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
>> >>
>> >> Best,
>> >> Rui
>> >>
>> >>
>> >> On Sat, May 10, 2025 at 11:57 AM Roman Khachatryan <ro...@apache.org>
>> >> wrote:
>> >>
>> >> Thanks Hangxiang Yu,
>> >>
>> >> Please find the answers below
>> >>
>> >> 1. Yes, we should perform validation before trying to update the
>> >> configuration. I'd rather validate some specific options that are known
>> >> to not work, though. Finding and hard-coding all the valid options might
>> >> be impractical and non-trivial, since they can change.
>> >>
>> >> 2. That would be great, but we'd have to store the history of such
>> >> updates somewhere. For debugging purposes, logs should suffice, I think.
>> >>
>> >> 3. That's a great question! Configuration changes are applied
>> >> immediately; resource requirements changes are applied with some delay;
>> >> and both are stored in HA immediately. So a configuration change request
>> >> also results in restarting and applying any pending resource
>> >> requirements changes.
>> >>
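A toy model of the interaction described in answer 3, under the assumption that a restart always picks up whatever resource requirements are pending. `ToyScheduler` and its methods are hypothetical and do not mirror the Adaptive Scheduler's real classes; storing to HA is left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyScheduler:
    config: dict = field(default_factory=dict)
    parallelism: int = 1
    pending_parallelism: Optional[int] = None
    restarts: int = 0

    def request_parallelism(self, value):
        # Resource requirements are recorded now but applied only later.
        self.pending_parallelism = value

    def on_delay_elapsed(self):
        # The delayed rescale fires only if something is still pending.
        if self.pending_parallelism is not None:
            self._restart()

    def update_config(self, changes):
        # Configuration changes are applied immediately via a restart.
        self.config.update(changes)
        self._restart()

    def _restart(self):
        # Any restart also applies the pending resource requirements.
        if self.pending_parallelism is not None:
            self.parallelism = self.pending_parallelism
            self.pending_parallelism = None
        self.restarts += 1
```

In this model, a parallelism request followed by a configuration update leads to a single restart, and the later delay expiry finds nothing left to apply.
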
>> >>
>> >> Regards,
>> >> Roman
>> >>
>> >> On Fri, May 9, 2025, 05:10 Hangxiang Yu <master...@gmail.com> wrote:
>> >>
>> >> Hi, Roman.
>> >>
>> >> Thanks for the FLIP.
>> >> +1 for supporting dynamic configuration to reduce manual restart.
>> >>
>> >>
>> >> I just have the questions below:
>> >>
>> >> 1. Do we need a list of working configurations, so that unsupported
>> >> configurations can be rejected in advance?
>> >>
>> >> 2. Could we show the change history in the Web UI, so that more details
>> >> about the changes can be tracked?
>> >>
>> >> 3. How does it work together with other dynamic requests? For example,
>> >> when the parallelism is modified at the same time via
>> >> '/jobs/:jobid/resource-requirements'.
>> >>
>> >> On Fri, May 9, 2025 at 5:00 AM Roman Khachatryan <ro...@apache.org>
>> >> wrote:
>> >>
>> >> Hi everyone,
>> >>
>> >> I would like to start a discussion about FLIP-530: Dynamic job
>> >> configuration [1].
>> >>
>> >> In some cases, it is desirable to change the Flink job configuration
>> >> after it was submitted to Flink, for example:
>> >> - Troubleshooting (e.g. increasing the checkpoint timeout or failure
>> >>   threshold)
>> >> - Performance optimization (e.g. tuning state backend parameters)
>> >> - Enabling new features after testing them in a non-production
>> >>   environment. This allows decoupling the upgrade to a newer Flink
>> >>   version from actually enabling the features.
>> >> To support such use cases, we propose to enhance the Flink job
>> >> configuration REST endpoint with support for reading the full job
>> >> configuration and updating it.
>> >>
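To make the read and update calls concrete, here is a hedged client-side sketch that only builds the requests and does not send them. The `/jobs/<jobid>/config` path, the flat JSON payload and the helper names are assumptions for illustration; the FLIP page defines the real contract. The thread itself only confirms that updates go through a PUT request.

```python
import json
from urllib.request import Request


def build_read_request(base_url, job_id):
    """Build (but do not send) a GET for the job configuration."""
    # Assumed path; check the FLIP for the actual endpoint.
    return Request(f"{base_url}/jobs/{job_id}/config", method="GET")


def build_update_request(base_url, job_id, options):
    """Build (but do not send) a PUT carrying the options to change."""
    # Assumed payload shape: a flat JSON object of option key to value.
    body = json.dumps(options).encode("utf-8")
    return Request(
        f"{base_url}/jobs/{job_id}/config",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_update_request(
        "http://localhost:8081",
        "<jobid>",  # placeholder, not a real job id
        {"execution.checkpointing.interval": "30s"},
    )
    print(req.get_method(), req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it. Reverting a bad change would then be a second PUT carrying the previous value.
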
>> >> Looking forward to feedback.
>> >>
>> >> [1]
>> >> https://cwiki.apache.org/confluence/x/uglKFQ
>> >>
>> >> Regards,
>> >> Roman
>> >>
>> >>
>> >>
>> >> --
>> >> Best,
>> >> Hangxiang.
>> >>
>> >>
>> >>
>>
>>
