Re: [DISCUSS] FLIP-530: Dynamic job configuration

Junrui Lee Sun, 11 May 2025 22:34:23 -0700

Hi Roman,

Thanks for driving this feature. +1 for this proposal.


I also agree with the suggestion made by Feifan.

Currently, not all configuration items are job-level configurations [1].
Even for those that are, not all job-level config options can be updated at
runtime through the Adaptive Scheduler. For instance, certain config options
related to job plan compilation, such as pipeline.operator-chaining.enabled
and nearly all of the table.* settings, are not eligible for runtime
updates.

From a user perspective, it would be beneficial to clearly describe which
config options can be dynamically updated, allowing users to take better
advantage of this feature.
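
As an illustration of the distinction above, here is a minimal sketch of a
rule-based check. Everything in it is an assumption made for illustration:
the names NON_UPDATABLE_PATTERNS, is_dynamically_updatable and split_update
are hypothetical, the rules are only the two examples mentioned above, and
none of this is a Flink API or the validation defined by FLIP-530.

```python
from fnmatch import fnmatchcase

# Hypothetical deny rules, taken only from the examples named in this thread.
# This is not a Flink API and not the rule set proposed by FLIP-530.
NON_UPDATABLE_PATTERNS = (
    "pipeline.operator-chaining.enabled",  # related to job plan compilation
    "table.*",                             # nearly all table.* settings
)


def is_dynamically_updatable(key: str) -> bool:
    """Return False if the key matches any hard-coded deny rule."""
    return not any(fnmatchcase(key, p) for p in NON_UPDATABLE_PATTERNS)


def split_update(requested: dict) -> tuple:
    """Split a requested update into (accepted, rejected) option maps."""
    accepted = {k: v for k, v in requested.items() if is_dynamically_updatable(k)}
    rejected = {k: v for k, v in requested.items() if not is_dynamically_updatable(k)}
    return accepted, rejected


if __name__ == "__main__":
    accepted, rejected = split_update({
        "execution.checkpointing.timeout": "20 min",
        "table.exec.mini-batch.enabled": "true",
        "pipeline.operator-chaining.enabled": "false",
    })
    print("accepted:", sorted(accepted))
    print("rejected:", sorted(rejected))
```

Passing such a check only means that no known rule rejected the option; it
does not prove the option exists or will take effect after the restart.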

Best,
Junrui

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-478+Introduce+Config+Option+Scope

Feifan Wang <zoltar9...@163.com> wrote on Mon, May 12, 2025 at 11:27:

> Thanks Roman for driving this useful improvement, +1 for this proposal.
>
> Thanks also to Hangxiang and Rui Fan for the discussion. Regarding
> question 1, I have some ideas for discussion:
>
> To give users stable expectations, I think we should validate
> configurations against a whitelist, ensuring that every configuration
> allowed to be modified through this API can actually take effect.
>
> In the initial version, we can provide a very small whitelist, even if it
> only contains the few configurations that we most want to use and that have
> been confirmed to be effective. This list can be extended later.
>
>
> ——————————————
>
> Best regards,
> Feifan Wang
>
>
>
> ---- Replied Message ----
> | From | Rui Fan<1996fan...@gmail.com> |
> | Date | 05/11/2025 16:36 |
> | To | <dev@flink.apache.org> |
> | Subject | Re: [DISCUSS] FLIP-530: Dynamic job configuration |
> Thanks Roman for driving this valuable proposal, it uses the Adaptive
> Scheduler to greatly reduce the downtime of configuration updates,
> so +1 for this proposal!
>
> Overall LGTM, thanks to Hangxiang for the questions, and I have the
> same questions as Hangxiang. I'd like to share my thoughts:
>
>
> For question 1 about validation:
>
> I think validation is necessary, but both a list of valid configurations
> and a list of invalid configurations have limitations.
>
> For valid configurations: IIUC, almost all job-level configurations are
> valid after the job is restarted by the Adaptive Scheduler. This means lots
> of configurations would need to be added to the list if we list valid
> configurations. If other developers forget to add a new configuration to
> the list, it will fail validation (even though it works).
>
> For invalid configurations: I encountered a problem before where a user
> added a non-existent Flink configuration, but Flink could not detect it.
> It may have been caused by a typo. Therefore, even if we list some Flink
> configurations that do not support dynamic modification, we still cannot
> guarantee that configurations outside the list will take effect.
>
> Even so, I prefer to do limited validation, for example: not through a
> list, but by hard-coding a few rules (e.g. table.* doesn't work).
>
>
> For question 2 about configuration change history:
>
> Logging configuration change history in the first version is fine.
>
> As I understand, both configuration changes and resource requirements
> changes could trigger a rescale in the Adaptive Scheduler, so the rescale
> history can probably include both. If we want to show the configuration
> change history, it might be more appropriate to put it in FLIP-487[1] and
> FLIP-495[2].
>
> For question 3 about how it works together with other dynamic requests:
>
> Configuration changes are applied immediately; resource requirements
> changes are applied with some delay
>
> Yes, rescaling after some delay could reduce the rescale frequency and avoid
> some unnecessary restarts. So I'm curious why configuration changes don't
> respect the delay mechanism?
>
> Please correct me if anything is wrong, thanks!
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
> [2]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
>
> Best,
> Rui
>
>
> On Sat, May 10, 2025 at 11:57 AM Roman Khachatryan <ro...@apache.org>
> wrote:
>
> Thanks Hangxiang Yu,
>
> Please find the answers below
>
> 1. Yes, we should perform validation before trying to update the
> configuration. I'd rather validate some specific options that are known to
> not work though. Finding and hard-coding all the valid options might be
> impractical and non-trivial, since they can change.
>
> 2. That would be great, but we'd have to store the history of such updates
> somewhere. For debugging purposes, logs should suffice, I think.
>
> 3. That's a great question! Configuration changes are applied immediately;
> resource requirements changes are applied with some delay; and both are
> stored in HA immediately. So a configuration change request also results in
> restarting and applying any pending resource requirements changes.
>
>
> Regards,
> Roman
>
> On Fri, May 9, 2025, 05:10 Hangxiang Yu <master...@gmail.com> wrote:
>
> Hi, Roman.
>
> Thanks for the FLIP.
> +1 for supporting dynamic configuration to reduce manual restart.
>
>
> I just have the questions below:
>
> 1. Do we need a list of working configurations, so that unsupported
> configurations can be rejected in advance?
>
> 2. Could we show the change history in the Web UI, so that more details of
> the changes can be tracked?
>
> 3. How does it work together with other dynamic requests? For example, when
> it modifies the parallelisms together with
> '/jobs/:jobid/resource-requirements'.
>
> On Fri, May 9, 2025 at 5:00 AM Roman Khachatryan <ro...@apache.org>
> wrote:
>
> Hi everyone,
>
> I would like to start a discussion about FLIP-530: Dynamic job
> configuration [1].
>
> In some cases, it is desirable to change Flink job configuration after the
> job was submitted to Flink, for example:
> - Troubleshooting (e.g. increase checkpoint timeout or failure threshold)
> - Performance optimization (e.g. tuning state backend parameters)
> - Enabling new features after testing them in a non-Production
> environment. This allows decoupling upgrades to newer Flink versions from
> actually enabling the features.
> To support such use cases, we propose to enhance the Flink job
> configuration REST endpoint with support for reading the full job
> configuration and updating it.
>
> Looking forward to feedback.
>
> [1]
> https://cwiki.apache.org/confluence/x/uglKFQ
>
> Regards,
> Roman
>
>
>
> --
> Best,
> Hangxiang.
>
>
>
