Thanks for the design. I think it is a common requirement in our
production environment. I have one question: could dynamic timeout
extension mask genuine checkpoint hangs, making problems harder to
detect?

The primary motivation is allowing SREs to extend `checkpointTimeout`
to save near-complete checkpoints. However, this introduces an
operational anti-pattern risk: operators might habitually extend
timeouts instead of investigating root causes (e.g., state backend
degradation, skewed key distribution, slow sinks). A checkpoint
"stuck" at 95% might actually indicate a genuine hang in one subtask,
and extending the timeout only delays the inevitable failure while
consuming additional resources (holding barriers, buffering data).

This could turn a clear, fast-failing signal into a slow, ambiguous
one — exactly the opposite of what good observability requires.

熊饶饶 <[email protected]> 于2026年4月15日周三 15:03写道:
>
>
> It is a very useful feature in production. Once a checkpoint fails, the 
> job may get stuck, and the next checkpoint may fail unless the config is 
> updated. My only concern is thread safety: will `volatile` fields cause 
> consistency issues between `checkpointInterval` and `checkpointTimeout`?
>
> The FLIP proposes changing `checkpointInterval` and `checkpointTimeout` from 
> `final` to `volatile` in `CheckpointCoordinator`. While `volatile` guarantees 
> visibility, it does not guarantee atomicity across multiple fields. If a user 
> updates both values simultaneously via a single PATCH request, there is a 
> window where `CheckpointCoordinator` could observe the new 
> `checkpointInterval` but the old `checkpointTimeout` (or vice versa). This 
> partial-update visibility could lead to unexpected behavior — for example, a 
> shorter interval combined with the old (shorter) timeout, causing checkpoints 
> to be triggered more frequently and immediately timeout.
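(Replying inline.) One common way to close that window is to publish both settings as a single immutable snapshot behind one reference, so readers always observe a consistent pair. A rough sketch of the idea; the class and field names below are illustrative, not the FLIP's actual API:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder: bundles interval and timeout into one immutable
// object so a PATCH handler can swap both atomically.
public class CheckpointConfigHolder {

    /** Immutable snapshot: readers always see a consistent pair. */
    public static final class Snapshot {
        public final long intervalMillis;
        public final long timeoutMillis;

        public Snapshot(long intervalMillis, long timeoutMillis) {
            this.intervalMillis = intervalMillis;
            this.timeoutMillis = timeoutMillis;
        }
    }

    private final AtomicReference<Snapshot> current;

    public CheckpointConfigHolder(long intervalMillis, long timeoutMillis) {
        this.current = new AtomicReference<>(new Snapshot(intervalMillis, timeoutMillis));
    }

    /** A PATCH handler publishes both values in a single reference swap. */
    public void update(long intervalMillis, long timeoutMillis) {
        current.set(new Snapshot(intervalMillis, timeoutMillis));
    }

    /** The coordinator reads one snapshot and uses it for a whole trigger cycle. */
    public Snapshot get() {
        return current.get();
    }
}
```

The coordinator would call `get()` once per trigger cycle and use that snapshot throughout, rather than reading two separate volatile fields at different times.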
>
> > On Mar 24, 2026, at 16:29, Jiangang Liu <[email protected]> wrote:
> >
> > Hi everyone,
> >
> > I would like to start a discussion on FLIP-571: Support Dynamically
> > Updating Checkpoint Configuration at Runtime via REST API [1].
> >
> > Currently, checkpoint configuration (checkpointInterval, checkpointTimeout)
> > is immutable after job submission. This creates significant operational
> > challenges for long-running streaming jobs:
> >
> >   1. Cascading checkpoint failures cannot be resolved without restarting
> >   the job, causing data reprocessing delays.
> >   2. Near-complete checkpoints (e.g., 95% persisted) are entirely discarded
> >   on timeout — wasting all I/O work and potentially creating a failure
> >   loop for large-state jobs.
> >   3. Static configuration cannot adapt to variable workloads at runtime.
> >
> > FLIP-571 proposes a new REST API endpoint:
> >
> > PATCH /jobs/:jobid/checkpoints/configuration
> >
> > Key design points:
> >
> >   - Timeout changes apply immediately to in-flight checkpoints by
> >   rescheduling their canceller timers, saving near-complete checkpoints
> >   from being discarded.
> >   - Interval changes take effect on the next checkpoint trigger cycle.
> >   - Configuration overrides are persisted to ExecutionPlanStore (following
> >   the JobResourceRequirements pattern) and automatically restored after
> >   failover.
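(Inline note.) For readers less familiar with the mechanism being discussed, rescheduling an in-flight checkpoint's canceller might look roughly like the sketch below: cancel the pending timeout task and re-arm it for the time remaining relative to the checkpoint's original trigger timestamp. This is a simplified illustration, not Flink's actual coordinator code; all names are hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: re-arms the timeout of one in-flight checkpoint
// when the configured checkpointTimeout is updated at runtime.
public class CancellerRescheduler {

    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    private final long triggerTimestampMillis; // when the checkpoint was triggered
    private ScheduledFuture<?> cancellerFuture;

    public CancellerRescheduler(long triggerTimestampMillis) {
        this.triggerTimestampMillis = triggerTimestampMillis;
    }

    /**
     * (Re)arm the timeout so the checkpoint is aborted {@code timeoutMillis}
     * after its original trigger time, not after the config update.
     */
    public synchronized void armTimeout(long timeoutMillis, Runnable abortAction) {
        if (cancellerFuture != null) {
            cancellerFuture.cancel(false); // drop the old deadline
        }
        long elapsed = System.currentTimeMillis() - triggerTimestampMillis;
        long remaining = Math.max(0, timeoutMillis - elapsed);
        cancellerFuture = timer.schedule(abortAction, remaining, TimeUnit.MILLISECONDS);
    }

    /** On completion, no abort should fire; release the timer thread. */
    public synchronized void onCheckpointComplete() {
        if (cancellerFuture != null) {
            cancellerFuture.cancel(false);
        }
        timer.shutdown();
    }
}
```

The key point the FLIP relies on is that extending the timeout only moves the existing deadline; a checkpoint that already exceeded the new timeout would still be aborted immediately (the `Math.max(0, ...)` case).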
> >
> > For more details, please refer to the FLIP [1].
> >
> > Looking forward to your feedback and suggestions!
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-571%3A+Support+Dynamically+Updating+Checkpoint+Configuration+at+Runtime+via+REST+API
> >
> > Best regards,
> > Jiangang Liu
>
