This is a very useful feature in production. Once a checkpoint fails, the job 
may get stuck, and the next checkpoint may fail as well unless the config is 
updated. My only concern is thread safety: will `volatile` fields cause 
consistency issues between `checkpointInterval` and `checkpointTimeout`?

The FLIP proposes changing `checkpointInterval` and `checkpointTimeout` from 
`final` to `volatile` in `CheckpointCoordinator`. While `volatile` guarantees 
visibility, it does not guarantee atomicity across multiple fields. If a user 
updates both values simultaneously via a single PATCH request, there is a 
window where `CheckpointCoordinator` could observe the new `checkpointInterval` 
but the old `checkpointTimeout` (or vice versa). This partial-update visibility 
could lead to unexpected behavior. For example, a new, shorter interval could 
be combined with the old (shorter) timeout, so that checkpoints are triggered 
more frequently and time out almost immediately.
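
To make the concern concrete, here is a minimal sketch of one common way to 
avoid the partial-update window: keep both values in one immutable object and 
swap a single reference. This is not what the FLIP proposes and not existing 
Flink code; `CheckpointSettingsSketch`, `Settings`, `update` and `snapshot` 
are hypothetical names used only for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not Flink code: both values live in one immutable
// object, so a reader can never see a new interval with an old timeout.
final class CheckpointSettingsSketch {

    /** Immutable pair of the two settings. */
    static final class Settings {
        final long intervalMs;
        final long timeoutMs;

        Settings(long intervalMs, long timeoutMs) {
            this.intervalMs = intervalMs;
            this.timeoutMs = timeoutMs;
        }
    }

    // A single reference is swapped atomically on every update.
    private final AtomicReference<Settings> current;

    CheckpointSettingsSketch(long intervalMs, long timeoutMs) {
        this.current = new AtomicReference<>(new Settings(intervalMs, timeoutMs));
    }

    /** Applies a partial or full update; a null argument keeps the old value. */
    void update(Long newIntervalMs, Long newTimeoutMs) {
        current.updateAndGet(old -> new Settings(
                newIntervalMs != null ? newIntervalMs : old.intervalMs,
                newTimeoutMs != null ? newTimeoutMs : old.timeoutMs));
    }

    /** Readers take one snapshot and use both fields from it. */
    Settings snapshot() {
        return current.get();
    }
}
```

A reader would call `snapshot()` once per trigger cycle and read both fields 
from the returned object, instead of reading two `volatile` fields separately.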

> On March 24, 2026, at 16:29, Jiangang Liu <[email protected]> wrote:
> 
> Hi everyone,
> 
> I would like to start a discussion on FLIP-571: Support Dynamically
> Updating Checkpoint Configuration at Runtime via REST API [1].
> 
> Currently, checkpoint configuration (checkpointInterval, checkpointTimeout)
> is immutable after job submission. This creates significant operational
> challenges for long-running streaming jobs:
> 
>   1. Cascading checkpoint failures cannot be resolved without restarting
>   the job, causing data reprocessing delays.
>   2. Near-complete checkpoints (e.g., 95% persisted) are entirely discarded
>   on timeout — wasting all I/O work and potentially creating a failure
>   loop for large-state jobs.
>   3. Static configuration cannot adapt to variable workloads at runtime.
> 
> FLIP-571 proposes a new REST API endpoint:
> 
> PATCH /jobs/:jobid/checkpoints/configuration
> 
> Key design points:
> 
>   - Timeout changes apply immediately to in-flight checkpoints by
>   rescheduling their canceller timers, saving near-complete checkpoints
>   from being discarded.
>   - Interval changes take effect on the next checkpoint trigger cycle.
>   - Configuration overrides are persisted to ExecutionPlanStore (following
>   the JobResourceRequirements pattern) and automatically restored after
>   failover.
> 
> For more details, please refer to the FLIP [1].
> 
> Looking forward to your feedback and suggestions!
> 
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-571%3A+Support+Dynamically+Updating+Checkpoint+Configuration+at+Runtime+via+REST+API
> 
> Best regards,
> Jiangang Liu
