Thanks, xiongraorao. This is a valid theoretical concern, but in practice the risk is mitigated by the existing design:
1. The PATCH handler forwards updates to the CheckpointCoordinator on
the *JobMaster main thread*. Both fields are written in a single method
invocation inside synchronized(lock), so any reader that also holds the
lock sees both updates atomically.

2. The volatile keyword is primarily a safety net for unsynchronized
reads (e.g., metrics reporting or logging). All of the critical
scheduling and canceller logic operates inside synchronized(lock).

3. Even in the worst case of a transient inconsistent read, the next
periodic trigger cycle (seconds later) will observe both correct
values. There is no persistent corruption.

Verne Deng <[email protected]> wrote on Wed, Apr 15, 2026 at 15:14:

> Thanks for the design. I think it is a common requirement in our
> production. I have one question: could dynamic timeout extension mask
> genuine checkpoint hangs, making problems harder to detect?
>
> The primary motivation is allowing SREs to extend `checkpointTimeout`
> to save near-complete checkpoints. However, this introduces an
> operational anti-pattern risk: operators might habitually extend
> timeouts instead of investigating root causes (e.g., state backend
> degradation, skewed key distribution, slow sinks). A checkpoint
> "stuck" at 95% might actually indicate a genuine hang in one subtask,
> and extending the timeout only delays the inevitable failure while
> consuming additional resources (holding barriers, buffering data).
>
> This could turn a clear, fast-failing signal into a slow, ambiguous
> one, exactly the opposite of what good observability requires.
>
> 熊饶饶 <[email protected]> wrote on Wed, Apr 15, 2026 at 15:03:
>
> > It is a very useful feature in production. Once a checkpoint fails,
> > the job may get stuck, and the next checkpoint may fail as well
> > unless the configuration is updated. The only thing I care about is
> > thread safety: will `volatile` fields cause consistency issues
> > between `checkpointInterval` and `checkpointTimeout`?
> >
> > The FLIP proposes changing `checkpointInterval` and
> > `checkpointTimeout` from `final` to `volatile` in
> > `CheckpointCoordinator`. While `volatile` guarantees visibility, it
> > does not guarantee atomicity across multiple fields. If a user
> > updates both values simultaneously via a single PATCH request, there
> > is a window where `CheckpointCoordinator` could observe the new
> > `checkpointInterval` but the old `checkpointTimeout` (or vice
> > versa). This partial-update visibility could lead to unexpected
> > behavior, for example a shorter interval combined with the old
> > (shorter) timeout, causing checkpoints to be triggered more
> > frequently and then immediately time out.
> >
> > > On Mar 24, 2026, at 16:29, Jiangang Liu <[email protected]> wrote:
> > >
> > > Hi everyone,
> > >
> > > I would like to start a discussion on FLIP-571: Support Dynamically
> > > Updating Checkpoint Configuration at Runtime via REST API [1].
> > >
> > > Currently, checkpoint configuration (checkpointInterval,
> > > checkpointTimeout) is immutable after job submission. This creates
> > > significant operational challenges for long-running streaming jobs:
> > >
> > > 1. Cascading checkpoint failures cannot be resolved without
> > >    restarting the job, causing data reprocessing delays.
> > > 2. Near-complete checkpoints (e.g., 95% persisted) are entirely
> > >    discarded on timeout, wasting all I/O work and potentially
> > >    creating a failure loop for large-state jobs.
> > > 3. Static configuration cannot adapt to variable workloads at
> > >    runtime.
> > >
> > > FLIP-571 proposes a new REST API endpoint:
> > >
> > >     PATCH /jobs/:jobid/checkpoints/configuration
> > >
> > > Key design points:
> > >
> > > - Timeout changes apply immediately to in-flight checkpoints by
> > >   rescheduling their canceller timers, saving near-complete
> > >   checkpoints from being discarded.
> > > - Interval changes take effect on the next checkpoint trigger
> > >   cycle.
> > > - Configuration overrides are persisted to ExecutionPlanStore
> > >   (following the JobResourceRequirements pattern) and
> > >   automatically restored after failover.
> > >
> > > For more details, please refer to the FLIP [1].
> > >
> > > Looking forward to your feedback and suggestions!
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-571%3A+Support+Dynamically+Updating+Checkpoint+Configuration+at+Runtime+via+REST+API
> > >
> > > Best regards,
> > > Jiangang Liu
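P.S. To make points 1 and 2 of my reply concrete, here is a minimal,
self-contained sketch of the locking discipline being described. This is
NOT the actual CheckpointCoordinator code; the class, method names, and
values are purely illustrative:

```java
// Hypothetical sketch: two volatile fields whose *pairwise* consistency
// is guaranteed by writing and reading them under the same lock.
public class CheckpointConfigSketch {
    private final Object lock = new Object();

    // volatile keeps unsynchronized readers (metrics, logging) from
    // seeing stale values; atomicity of the pair comes from `lock`.
    private volatile long checkpointInterval;
    private volatile long checkpointTimeout;

    CheckpointConfigSketch(long intervalMs, long timeoutMs) {
        this.checkpointInterval = intervalMs;
        this.checkpointTimeout = timeoutMs;
    }

    // A PATCH-style update: both fields are written inside one
    // synchronized block, so any reader holding the same lock observes
    // either the old pair or the new pair, never a mix.
    void updateConfiguration(long newIntervalMs, long newTimeoutMs) {
        synchronized (lock) {
            checkpointInterval = newIntervalMs;
            checkpointTimeout = newTimeoutMs;
        }
    }

    // Scheduling/canceller logic reads a consistent snapshot under the
    // same lock.
    long[] snapshotForScheduling() {
        synchronized (lock) {
            return new long[] {checkpointInterval, checkpointTimeout};
        }
    }

    public static void main(String[] args) {
        CheckpointConfigSketch coordinator =
                new CheckpointConfigSketch(60_000L, 600_000L);
        coordinator.updateConfiguration(30_000L, 1_200_000L);
        long[] snapshot = coordinator.snapshotForScheduling();
        // A lock-holding reader always sees the updated pair together.
        System.out.println(snapshot[0] + "," + snapshot[1]);
    }
}
```

An unsynchronized reader could still see one new and one old value in
between the two writes, which is exactly why point 2 confines the
interval/timeout-sensitive logic to lock-holding code paths.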

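For the "rescheduling their canceller timers" design point quoted above,
the following standalone sketch shows the general idea: cancel the old
expiry timer and schedule a new one with the extended deadline, so an
in-flight checkpoint that finishes before the new deadline survives. The
ScheduledExecutorService usage and all timings here are illustrative
assumptions, not the FLIP's actual implementation:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class CancellerRescheduleSketch {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();
        // First completion wins: either the checkpoint finishes or a
        // canceller expires it.
        CompletableFuture<String> outcome = new CompletableFuture<>();

        // Original canceller: would expire the checkpoint after 100 ms.
        ScheduledFuture<?> canceller = timer.schedule(
                () -> outcome.complete("expired"),
                100, TimeUnit.MILLISECONDS);

        // A PATCH extends the timeout: cancel the old timer and arm a
        // new one with the extended deadline (300 ms).
        canceller.cancel(false);
        timer.schedule(
                () -> outcome.complete("expired-after-extension"),
                300, TimeUnit.MILLISECONDS);

        // The checkpoint completes at ~200 ms, inside the extended
        // window, so it is saved instead of being discarded.
        timer.schedule(
                () -> outcome.complete("completed"),
                200, TimeUnit.MILLISECONDS);

        System.out.println(outcome.get());
        timer.shutdown();
    }
}
```

Without the extension, the 100 ms canceller would have fired first and
the near-complete checkpoint's work would have been thrown away.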