Thanks to Verne Deng for the valuable question. This is a legitimate
operational concern, but it is an inherent trade-off of any runtime tuning
capability, not a flaw specific to this design:

   1. *The status quo is worse.* Today, the only option is to restart the
   entire job, which causes data reprocessing and downstream impact. Extending
   the timeout is strictly less disruptive than a restart, even if the
   checkpoint ultimately fails.
   2. *Observability is preserved.* The existing checkpoint metrics (
   checkpointDuration, checkpointSize, per-subtask completion times) remain
   fully available. The FLIP does not suppress any signals — it only gives
   operators more time.
   3. *Guardrails can be added incrementally.* Future iterations can
   introduce maximum timeout bounds or automatic alerting when dynamic
   overrides are active. This FLIP explicitly scopes Phase 1 to the mechanism;
   policy is a separate concern.
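
As a concrete illustration of what "extending the timeout" does mechanically
for an in-flight checkpoint, here is a minimal sketch of rescheduling a
canceller timer for the remaining time; the class and method names are
illustrative, not the actual Flink implementation:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch only: models one in-flight checkpoint whose timeout canceller
// can be rescheduled. Names are hypothetical, not CheckpointCoordinator API.
public class CancellerRescheduler {
    private final ScheduledExecutorService timerService =
            Executors.newSingleThreadScheduledExecutor();
    private final long triggerTimestamp = System.currentTimeMillis();
    private ScheduledFuture<?> canceller;

    // Apply a (new) timeout: cancel the pending canceller and schedule a
    // fresh one for the time remaining under the new timeout, measured
    // from the original trigger timestamp.
    public synchronized void applyTimeout(long timeoutMillis, Runnable onTimeout) {
        if (canceller != null) {
            canceller.cancel(false); // the checkpoint itself stays in flight
        }
        long elapsed = System.currentTimeMillis() - triggerTimestamp;
        long remaining = Math.max(0L, timeoutMillis - elapsed);
        canceller = timerService.schedule(onTimeout, remaining, TimeUnit.MILLISECONDS);
    }

    public synchronized long remainingDelayMillis() {
        return canceller.getDelay(TimeUnit.MILLISECONDS);
    }

    public void shutdown() {
        timerService.shutdownNow();
    }
}
```

Extending the timeout never discards work already done; it only pushes the
abort deadline out, which is why it is less disruptive than a restart even
when the checkpoint ultimately fails.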

The documentation and release notes should include best-practice guidance:
use dynamic timeout extension as a *temporary bridge*, not a permanent
workaround.
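
On the thread-safety question discussed earlier in the thread, the write
pattern can be sketched as follows; the class and field names are
illustrative, not the actual CheckpointCoordinator code:

```java
// Sketch only: two volatile fields updated together under a shared lock.
// Readers that hold the same lock observe both updates atomically; plain
// volatile reads (e.g., for metrics) get visibility but may transiently
// pair a new interval with an old timeout.
public class CheckpointConfigHolder {
    private final Object lock = new Object();
    private volatile long checkpointInterval = 60_000L;
    private volatile long checkpointTimeout = 600_000L;

    // Writer path (e.g., the PATCH handler): both assignments happen in
    // one synchronized block.
    public void update(long intervalMillis, long timeoutMillis) {
        synchronized (lock) {
            this.checkpointInterval = intervalMillis;
            this.checkpointTimeout = timeoutMillis;
        }
    }

    // Consistent read path, as used by scheduling and canceller logic.
    public long[] readConsistent() {
        synchronized (lock) {
            return new long[] {checkpointInterval, checkpointTimeout};
        }
    }

    // Unsynchronized read: visibility only, no cross-field atomicity.
    public long metricsTimeout() {
        return checkpointTimeout;
    }
}
```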

Jiangang Liu <[email protected]> wrote on Wed, Apr 15, 2026 at 15:20:

> Thanks, xiongraorao. This is a valid theoretical concern, but in practice
> the risk is mitigated by the existing design:
>
>    1. The PATCH handler forwards updates to CheckpointCoordinator on the
>    *JobMaster main thread*. Both fields are written in a single method
>    invocation within synchronized(lock), so any reader that also holds
>    the lock sees both updates atomically.
>    2. The volatile keyword is primarily a safety net for unsynchronized
>    reads (e.g., metrics reporting or logging). The critical scheduling and
>    canceller logic all operates within synchronized(lock).
>    3. Even in the worst case of a transient inconsistent read, the next
>    periodic trigger cycle (seconds later) will observe both correct values.
>    There is no persistent corruption.
>
>
> Verne Deng <[email protected]> wrote on Wed, Apr 15, 2026 at 15:14:
>
>> Thanks for the design. I think it is a common requirement in our
>> production environment. I have one question: could dynamic timeout
>> extension mask genuine checkpoint hangs, making problems harder to
>> detect?
>>
>> The primary motivation is allowing SREs to extend `checkpointTimeout`
>> to save near-complete checkpoints. However, this introduces an
>> operational anti-pattern risk: operators might habitually extend
>> timeouts instead of investigating root causes (e.g., state backend
>> degradation, skewed key distribution, slow sinks). A checkpoint
>> "stuck" at 95% might actually indicate a genuine hang in one subtask,
>> and extending the timeout only delays the inevitable failure while
>> consuming additional resources (holding barriers, buffering data).
>>
>> This could turn a clear, fast-failing signal into a slow, ambiguous
>> one — exactly the opposite of what good observability requires.
>>
>> 熊饶饶 <[email protected]> wrote on Wed, Apr 15, 2026 at 15:03:
>> >
>> >
>> > It is a very useful feature in production. Once a checkpoint fails,
>> the job may get stuck and the next checkpoint may fail without updating
>> the config. The only thing I care about is thread safety: will
>> `volatile` fields cause consistency issues between `checkpointInterval`
>> and `checkpointTimeout`?
>> >
>> > The FLIP proposes changing `checkpointInterval` and `checkpointTimeout`
>> from `final` to `volatile` in `CheckpointCoordinator`. While `volatile`
>> guarantees visibility, it does not guarantee atomicity across multiple
>> fields. If a user updates both values simultaneously via a single PATCH
>> request, there is a window where `CheckpointCoordinator` could observe the
>> new `checkpointInterval` but the old `checkpointTimeout` (or vice versa).
>> This partial-update visibility could lead to unexpected behavior: for
>> example, a shorter interval combined with the old (shorter) timeout,
>> causing checkpoints to be triggered more frequently and to time out
>> immediately.
>> >
>> > > On Mar 24, 2026, at 16:29, Jiangang Liu <[email protected]> wrote:
>> > >
>> > > Hi everyone,
>> > >
>> > > I would like to start a discussion on FLIP-571: Support Dynamically
>> > > Updating Checkpoint Configuration at Runtime via REST API [1].
>> > >
>> > > Currently, checkpoint configuration (checkpointInterval,
>> > > checkpointTimeout) is immutable after job submission. This creates
>> > > significant operational challenges for long-running streaming jobs:
>> > >
>> > >   1. Cascading checkpoint failures cannot be resolved without
>> > >   restarting the job, causing data reprocessing delays.
>> > >   2. Near-complete checkpoints (e.g., 95% persisted) are entirely
>> > >   discarded on timeout, wasting all I/O work and potentially creating
>> > >   a failure loop for large-state jobs.
>> > >   3. Static configuration cannot adapt to variable workloads at
>> > >   runtime.
>> > >
>> > > FLIP-571 proposes a new REST API endpoint:
>> > >
>> > > PATCH /jobs/:jobid/checkpoints/configuration
>> > >
>> > > Key design points:
>> > >
>> > >   - Timeout changes apply immediately to in-flight checkpoints by
>> > >   rescheduling their canceller timers, saving near-complete
>> > >   checkpoints from being discarded.
>> > >   - Interval changes take effect on the next checkpoint trigger cycle.
>> > >   - Configuration overrides are persisted to ExecutionPlanStore
>> > >   (following the JobResourceRequirements pattern) and automatically
>> > >   restored after failover.
>> > >
>> > > For more details, please refer to the FLIP [1].
>> > >
>> > > Looking forward to your feedback and suggestions!
>> > >
>> > > [1]
>> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-571%3A+Support+Dynamically+Updating+Checkpoint+Configuration+at+Runtime+via+REST+API
>> > >
>> > > Best regards,
>> > > Jiangang Liu
>> >
>>
>
