Thanks everyone for sharing your thoughts, and sorry for the late reply. I'm also +1 for starting with a limited whitelist. Thanks also to Rui for sharing the extra information.
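
To make the allow-list behaviour discussed in the thread below a bit more concrete, here is a rough sketch in Python. It is not the FLIP's implementation: the function name `check_update_request`, the request body shape (a flat JSON object of option keys to values), and the messages are made up for illustration. Only the status codes (403 for options that are not allow-listed, 400 for invalid JSON) and the default entry 'execution.checkpointing.interval' come from the discussion. Treating a non-object body as a bad request is my own assumption.

```python
import json

# Default allow-list as mentioned in the thread; everything else here is hypothetical.
ALLOWED_OPTIONS = frozenset({"execution.checkpointing.interval"})


def check_update_request(body: str, allowed=ALLOWED_OPTIONS):
    """Return (status_code, message) for a hypothetical config update body."""
    try:
        options = json.loads(body)
    except json.JSONDecodeError as err:
        # Bad request is limited to malformed JSON.
        return 400, f"invalid JSON: {err.msg}"
    if not isinstance(options, dict):
        # Assumption: a body that is not a JSON object is also treated as malformed.
        return 400, "expected a JSON object of option keys to values"
    rejected = sorted(key for key in options if key not in allowed)
    if rejected:
        # Any option outside the allow-list rejects the whole request.
        return 403, "options not on the allow-list: " + ", ".join(rejected)
    return 200, "accepted"
```

Note that a request passing this check can still break the job (for example a too-short checkpoint timeout, if that option were allow-listed), so reverting would still need another update request.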
On Tue, May 20, 2025 at 8:59 PM Gustavo de Morais <gustavopg...@gmail.com> wrote:

> Hi Roman,
>
> This is a great and important improvement. +1 for the FLIP and to start
> voting.
>
> Best,
> Gustavo
>
> On Mon, 19 May 2025 at 20:28, Roman Khachatryan <ro...@apache.org> wrote:
>
> > Thanks everyone for the discussion!
> >
> > I'm going to start a voting thread soon unless there are other suggestions
> > or objections.
> >
> > Regards,
> > Roman
> >
> > On Sat, May 17, 2025 at 2:01 PM Roman Khachatryan <ro...@apache.org>
> > wrote:
> >
> > > Thanks Chesnay, I like your idea of returning 403 for non-white-listed
> > > options. Updated the FLIP accordingly. Also, specified
> > > 'execution.checkpointing.interval' as a default value for the allow-list.
> > >
> > > Kartikey Pant, that's a good question, and your understanding is correct.
> > > There's a possibility of breaking the job via this API after passing the
> > > validation.
> > > For example, a checkpoint timeout of 1 second would be valid, but might
> > > cause the checkpoints to fail. In such a case, the configuration change
> > > should be reverted via a new PUT request.
> > >
> > > Regards,
> > > Roman
> > >
> > > On Thu, May 15, 2025 at 3:45 PM Chesnay Schepler <ches...@apache.org>
> > > wrote:
> > >
> > >> Documenting the supported options is a fair concern, but at the same
> > >> time also a mountain of work as it would require going through all
> > >> options and creating well-defined rules for what is a job setting and
> > >> what isn't, enforcing that and possibly also changing a whole bunch of
> > >> code to make that remotely consistent.
> > >>
> > >> I would say just documenting a few use-cases, like changing the
> > >> checkpoint interval for example, would already be good enough.
> > >> Changing the checkpointing interval on its own would justify this
> > >> entire effort; anything else that happens to work without explicit
> > >> documentation could then just be a bonus for power users.
> > >>
> > >> I'd suggest returning FORBIDDEN if the request contains an option
> > >> that is not allow-listed for changes, and limiting bad request to
> > >> invalid JSON.
> > >>
> > >> But as-is already +1 from my side.
> > >>
> > >> On 12/05/2025 07:33, Junrui Lee wrote:
> > >> > Hi Roman
> > >> >
> > >> > Thanks for driving this feature. +1 for this proposal.
> > >> >
> > >> > I also agree with the suggestion made by Feifan.
> > >> >
> > >> > Currently, not all configuration items are job-level configurations [1].
> > >> > Even for those that are, not all job-level config options can be
> > >> > updated at runtime through the Adaptive Scheduler. For instance, certain
> > >> > config options related to job plan compilation, such as
> > >> > pipeline.operator-chaining.enabled and nearly all of the table.*
> > >> > settings, are not eligible for runtime updates.
> > >> >
> > >> > From a user perspective, it would be beneficial to clearly describe
> > >> > which config options can be dynamically updated, allowing users to take
> > >> > better advantage of this feature.
> > >> >
> > >> > Best,
> > >> > Junrui
> > >> >
> > >> > [1]
> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-478+Introduce+Config+Option+Scope
> > >> >
> > >> > Feifan Wang <zoltar9...@163.com> wrote on Mon, May 12, 2025 at 11:27:
> > >> >
> > >> >> Thanks Roman for driving this useful improvement, +1 for this proposal.
> > >> >>
> > >> >> Also thanks to Hangxiang and Rui Fan for the discussion.
> > >> >> Regarding question 1, I have some ideas for discussion:
> > >> >>
> > >> >> To give users stable expectations, I think we should check
> > >> >> configurations against a whitelist, ensuring that the configurations
> > >> >> allowed to be modified through this API can actually take effect.
> > >> >>
> > >> >> In the initial version, we can provide a very small whitelist, even if
> > >> >> it only contains a few configurations that we most want to use and that
> > >> >> have been confirmed to be effective. This list can be extended later.
> > >> >>
> > >> >> --------------
> > >> >>
> > >> >> Best regards,
> > >> >> Feifan Wang
> > >> >>
> > >> >> ---- Replied Message ----
> > >> >> | From | Rui Fan<1996fan...@gmail.com> |
> > >> >> | Date | 05/11/2025 16:36 |
> > >> >> | To | <dev@flink.apache.org> |
> > >> >> | Subject | Re: [DISCUSS] FLIP-530: Dynamic job configuration |
> > >> >> Thanks Roman for driving this valuable proposal, it uses the Adaptive
> > >> >> Scheduler to greatly reduce the downtime of configuration updates,
> > >> >> so +1 for this proposal!
> > >> >>
> > >> >> Overall LGTM, thanks to Hangxiang for the questions, and I have the
> > >> >> same questions as Hangxiang. I'd like to share my thoughts:
> > >> >>
> > >> >> For question 1 about validation:
> > >> >>
> > >> >> I think validation is necessary, but both the list of valid
> > >> >> configurations and the list of invalid configurations have limitations.
> > >> >>
> > >> >> For valid configurations: IIUC, almost all job level configurations are
> > >> >> valid after restarting the job by the adaptive scheduler. It means lots
> > >> >> of new configurations need to be added to the list if we list valid
> > >> >> configurations.
> > >> >> If other developers miss it, the new configuration will fail
> > >> >> validation (but it works).
> > >> >>
> > >> >> For invalid configurations: I encountered a problem before, where the
> > >> >> user added a non-existent Flink configuration, but Flink could not
> > >> >> detect it. It may have been caused by a typo. Therefore, even if we list
> > >> >> some Flink configurations that do not support dynamic modification, we
> > >> >> still cannot guarantee that the configurations outside the list will
> > >> >> take effect.
> > >> >>
> > >> >> Even so, I prefer to do limited validation, for example: not through a
> > >> >> list, but hard code a few rules (e.g. table.* doesn't work).
> > >> >>
> > >> >> For question 2 about configuration change history:
> > >> >>
> > >> >> Logging configuration change history in the first version is fine.
> > >> >>
> > >> >> As I understand, both configuration changes and resource requirements
> > >> >> changes could trigger a rescale for Adaptive Scheduler. So rescale
> > >> >> history can probably include both. If we want to show the configuration
> > >> >> change history, it might be more appropriate to put it in FLIP-487[1]
> > >> >> and FLIP-495[2].
> > >> >>
> > >> >> For question 3 about co-works with other dynamic requests:
> > >> >>
> > >> >> Configuration changes are applied immediately; resource requirements
> > >> >> changes are applied with some delay
> > >> >>
> > >> >> Yes, rescale after some delay could reduce the rescale frequency to
> > >> >> avoid some invalid restarts. So I'm curious why configuration changes
> > >> >> don't respect the delay mechanism?
> > >> >>
> > >> >> Please correct me if anything is wrong, thanks!
> > >> >>
> > >> >> [1]
> > >> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
> > >> >> [2]
> > >> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
> > >> >>
> > >> >> Best,
> > >> >> Rui
> > >> >>
> > >> >> On Sat, May 10, 2025 at 11:57 AM Roman Khachatryan <ro...@apache.org>
> > >> >> wrote:
> > >> >>
> > >> >> Thanks Hangxiang Yu,
> > >> >>
> > >> >> Please find the answers below
> > >> >>
> > >> >> 1. Yes, we should perform validation before trying to update the
> > >> >> configuration. I'd rather validate some specific options that are known
> > >> >> to not work though. Finding and hard-coding all the valid options might
> > >> >> be impractical and non-trivial, since they can change.
> > >> >>
> > >> >> 2. That would be great, but we'd have to store the history of such
> > >> >> updates somewhere. For debugging purposes, logs should suffice I think
> > >> >>
> > >> >> 3. That's a great question! Configuration changes are applied
> > >> >> immediately; resource requirements changes are applied with some delay;
> > >> >> and both are stored in HA immediately. So a configuration change request
> > >> >> also results in restarting and applying any pending resource
> > >> >> requirements changes
> > >> >>
> > >> >> Regards,
> > >> >> Roman
> > >> >>
> > >> >> On Fri, May 9, 2025, 05:10 Hangxiang Yu <master...@gmail.com> wrote:
> > >> >>
> > >> >> Hi, Roman.
> > >> >>
> > >> >> Thanks for the FLIP.
> > >> >> +1 for supporting dynamic configuration to reduce manual restart.
> > >> >>
> > >> >> I just have the questions below:
> > >> >>
> > >> >> 1. Do we need a working configuration list? So some unsupported
> > >> >> configurations could be rejected in advance.
> > >> >>
> > >> >> 2. Could we show the change history in the Web UI? So more changed
> > >> >> details could be tracked.
> > >> >>
> > >> >> 3. How does it work together with other dynamic requests? For example,
> > >> >> it modifies the parallelisms together with
> > >> >> '/jobs/:jobid/resource-requirements'.
> > >> >>
> > >> >> On Fri, May 9, 2025 at 5:00 AM Roman Khachatryan <ro...@apache.org>
> > >> >> wrote:
> > >> >>
> > >> >> Hi everyone,
> > >> >>
> > >> >> I would like to start a discussion about FLIP-530: Dynamic job
> > >> >> configuration [1].
> > >> >>
> > >> >> In some cases, it is desirable to change Flink job configuration after
> > >> >> it was submitted to Flink, for example:
> > >> >> - Troubleshooting (e.g. increase checkpoint timeout or failure
> > >> >>   threshold)
> > >> >> - Performance optimization (e.g. tuning state backend parameters)
> > >> >> - Enabling new features after testing them in a non-Production
> > >> >>   environment.
> > >> >> This allows de-coupling upgrading to newer Flink versions from
> > >> >> actually enabling the features.
> > >> >> To support such use-cases, we propose to enhance Flink job
> > >> >> configuration REST-endpoint with the support to read full job
> > >> >> configuration; and update it.
> > >> >>
> > >> >> Looking forward to feedback.
> > >> >>
> > >> >> [1]
> > >> >> https://cwiki.apache.org/confluence/x/uglKFQ
> > >> >>
> > >> >> Regards,
> > >> >> Roman
> > >> >>
> > >> >> --
> > >> >> Best,
> > >> >> Hangxiang.

--
Best,
Hangxiang.