Thanks everyone for the discussion! I'm going to start a voting thread soon unless there are other suggestions or objections.
Regards,
Roman

On Sat, May 17, 2025 at 2:01 PM Roman Khachatryan <ro...@apache.org> wrote:

> Thanks Chesnay, I like your idea of returning 403 for non-white-listed
> options. I updated the FLIP accordingly and also specified
> 'execution.checkpointing.interval' as the default value for the allow-list.
>
> Kartikey Pant, that's a good question, and your understanding is correct.
> There's a possibility of breaking the job via this API after passing the
> validation.
> For example, a checkpoint timeout of 1 second would be valid, but might
> cause the checkpoints to fail. In such a case, the configuration change
> should be reverted via a new PUT request.
>
> Regards,
> Roman
>
> On Thu, May 15, 2025 at 3:45 PM Chesnay Schepler <ches...@apache.org>
> wrote:
>
>> Documenting the supported options is a fair concern, but at the same
>> time also a mountain of work, as it would require going through all
>> options and creating well-defined rules for what is a job setting and
>> what isn't, enforcing that, and possibly also changing a whole bunch of
>> code to make that remotely consistent.
>>
>> I would say just documenting a few use-cases, like changing the
>> checkpoint interval for example, would already be good enough.
>> Changing the checkpointing interval on its own would justify this
>> entire effort; anything else that happens to work without explicit
>> documentation could then just be a bonus for power users.
>>
>> I'd suggest returning FORBIDDEN if an option is provided in the
>> request that's not allow-listed to be changed, and limiting bad request
>> to invalid JSON.
>>
>> But as-is already +1 from my side.
>>
>> On 12/05/2025 07:33, Junrui Lee wrote:
>> > Hi Roman
>> >
>> > Thanks for driving this feature. +1 for this proposal.
>> >
>> > I also agree with the suggestion made by Feifan.
>> >
>> > Currently, not all configuration items are job-level configurations [1].
>> > Even for those that are, not all job-level config options can be
>> > updated at runtime through the Adaptive Scheduler. For instance,
>> > certain config options related to job plan compilation, such as
>> > pipeline.operator-chaining.enabled and nearly all of the table.*
>> > settings, are not eligible for runtime updates.
>> >
>> > From a user perspective, it would be beneficial to clearly describe
>> > which config options can be dynamically updated, allowing users to
>> > take better advantage of this feature.
>> >
>> > Best,
>> > Junrui
>> >
>> > [1]
>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-478+Introduce+Config+Option+Scope
>> >
>> > On Mon, May 12, 2025 at 11:27, Feifan Wang <zoltar9...@163.com> wrote:
>> >
>> >> Thanks Roman for driving this useful improvement, +1 for this proposal.
>> >>
>> >> Thanks also to Hangxiang and Rui Fan for the discussion. Regarding
>> >> question 1, I have some ideas for discussion:
>> >>
>> >> To give users stable expectations, I think we should perform
>> >> configuration checks in a whitelist manner, ensuring that the
>> >> configurations allowed to be modified through this API can actually
>> >> take effect.
>> >>
>> >> In the initial version, we can provide a very small whitelist, even if
>> >> it only contains a few configurations that we most want to use and
>> >> that have been confirmed to be effective. This list can be extended
>> >> later.
>> >>
>> >> --------------
>> >>
>> >> Best regards,
>> >> Feifan Wang
>> >>
>> >> ---- Replied Message ----
>> >> | From | Rui Fan<1996fan...@gmail.com> |
>> >> | Date | 05/11/2025 16:36 |
>> >> | To | <dev@flink.apache.org> |
>> >> | Subject | Re: [DISCUSS] FLIP-530: Dynamic job configuration |
>> >> Thanks Roman for driving this valuable proposal, it uses the Adaptive
>> >> Scheduler to greatly reduce the downtime of configuration updates,
>> >> so +1 for this proposal!
>> >>
>> >> Overall LGTM, thanks to Hangxiang for the questions, and I have the
>> >> same questions as Hangxiang. I'd like to share my thoughts:
>> >>
>> >> For question 1 about validation:
>> >>
>> >> I think validation is necessary, but both the list of valid
>> >> configurations and the list of invalid configurations have limitations.
>> >>
>> >> For valid configurations: IIUC, almost all job-level configurations are
>> >> valid after restarting the job by the adaptive scheduler. It means lots
>> >> of new configurations need to be added to the list if we list valid
>> >> configurations.
>> >> If other developers miss it, the new configuration will fail validation
>> >> (but it works).
>> >>
>> >> For invalid configurations: I encountered a problem before, where the
>> >> user added a non-existent Flink configuration, but Flink could not
>> >> detect it. It may have been caused by a typo. Therefore, even if we
>> >> list some Flink configurations that do not support dynamic
>> >> modification, we still cannot guarantee that the configurations outside
>> >> the list will take effect.
>> >>
>> >> Even so, I prefer to do limited validation, for example: not through a
>> >> list, but by hard-coding a few rules (e.g. table.* doesn't work).
>> >>
>> >> For question 2 about configuration change history:
>> >>
>> >> Logging configuration change history in the first version is fine.
>> >>
>> >> As I understand, both configuration changes and resource requirements
>> >> changes could trigger a rescale for the Adaptive Scheduler. So the
>> >> rescale history can probably include both. If we want to show the
>> >> configuration change history, it might be more appropriate to put it in
>> >> FLIP-487[1] and FLIP-495[2].
>> >>
>> >> For question 3 about working together with other dynamic requests:
>> >>
>> >> Configuration changes are applied immediately; resource requirements
>> >> changes are applied with some delay
>> >>
>> >> Yes, rescaling after some delay could reduce the rescale frequency to
>> >> avoid some invalid restarts. So I'm curious why configuration changes
>> >> don't respect the delay mechanism?
>> >>
>> >> Please correct me if anything is wrong, thanks!
>> >>
>> >> [1]
>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
>> >> [2]
>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
>> >>
>> >> Best,
>> >> Rui
>> >>
>> >> On Sat, May 10, 2025 at 11:57 AM Roman Khachatryan <ro...@apache.org>
>> >> wrote:
>> >>
>> >> Thanks Hangxiang Yu,
>> >>
>> >> Please find the answers below.
>> >>
>> >> 1. Yes, we should perform validation before trying to update the
>> >> configuration. I'd rather validate some specific options that are
>> >> known not to work, though. Finding and hard-coding all the valid
>> >> options might be impractical, since they can change, and non-trivial.
>> >>
>> >> 2. That would be great, but we'd have to store the history of such
>> >> updates somewhere. For debugging purposes, logs should suffice, I think.
>> >>
>> >> 3. That's a great question! Configuration changes are applied
>> >> immediately; resource requirements changes are applied with some delay;
>> >> and both are stored in HA immediately. So a configuration change
>> >> request also results in restarting and applying any pending resource
>> >> requirements changes.
>> >>
>> >> Regards,
>> >> Roman
>> >>
>> >> On Fri, May 9, 2025, 05:10 Hangxiang Yu <master...@gmail.com> wrote:
>> >>
>> >> Hi, Roman.
>> >>
>> >> Thanks for the FLIP.
>> >> +1 for supporting dynamic configuration to reduce manual restart.
>> >>
>> >> I just have the questions below:
>> >>
>> >> 1. Do we need a working configuration list? That way, some unsupported
>> >> configurations could be rejected in advance.
>> >>
>> >> 2. Could we show the change history in the Web UI? That way, more
>> >> details of the changes could be tracked.
>> >>
>> >> 3. How does it work together with other dynamic requests? For example,
>> >> it modifies the parallelisms together with
>> >> '/jobs/:jobid/resource-requirements'.
>> >>
>> >> On Fri, May 9, 2025 at 5:00 AM Roman Khachatryan <ro...@apache.org>
>> >> wrote:
>> >>
>> >> Hi everyone,
>> >>
>> >> I would like to start a discussion about FLIP-530: Dynamic job
>> >> configuration [1].
>> >>
>> >> In some cases, it is desirable to change Flink job configuration after
>> >> it was submitted to Flink, for example:
>> >> - Troubleshooting (e.g. increase checkpoint timeout or failure
>> >>   threshold)
>> >> - Performance optimization (e.g. tuning state backend parameters)
>> >> - Enabling new features after testing them in a non-Production
>> >>   environment. This allows de-coupling upgrading to newer Flink
>> >>   versions from actually enabling the features.
>> >> To support such use-cases, we propose to enhance the Flink job
>> >> configuration REST endpoint with support for reading the full job
>> >> configuration and updating it.
>> >>
>> >> Looking forward to feedback.
>> >>
>> >> [1]
>> >> https://cwiki.apache.org/confluence/x/uglKFQ
>> >>
>> >> Regards,
>> >> Roman
>> >>
>> >> --
>> >> Best,
>> >> Hangxiang.
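
For readers skimming the thread, the response-code rule it converges on (403 for options outside the allow-list, 400 only for invalid JSON) can be sketched roughly as below. This is an illustration in Python, not Flink code: the function name `validate_update`, the flat JSON-object request body, the 200 success code, and the handling of non-object JSON are all assumptions made for the example. Only the status codes 403/400 and the default allow-list entry 'execution.checkpointing.interval' come from the thread itself.

```python
import json

# Default allow-list entry named in the thread; the constant name is hypothetical.
DEFAULT_ALLOW_LIST = frozenset({"execution.checkpointing.interval"})


def validate_update(body, allow_list=DEFAULT_ALLOW_LIST):
    """Return the HTTP status a config-update request would get in this sketch."""
    try:
        options = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return 400  # Bad Request: the body is not valid JSON
    if not isinstance(options, dict):
        return 400  # assumption: a non-object body is also treated as malformed
    if any(key not in allow_list for key in options):
        return 403  # Forbidden: at least one option is not allow-listed
    return 200  # accepted; says nothing about whether the value is sensible


print(validate_update(b'{"execution.checkpointing.interval": "30s"}'))
print(validate_update(b'{"pipeline.operator-chaining.enabled": "false"}'))
print(validate_update(b"{not json"))
```

Note that, as pointed out in the thread, passing this kind of check does not protect the job: an allow-listed option with a poor value could still cause checkpoints to fail, and the remedy discussed is to revert it with a new PUT request.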