Thanks for the FLIP. I like the dynamic way to control Flink, but I am confused about why the FLIP reuses `ExecutionPlan.getJobConfiguration()` instead of a dedicated storage path. Also, what about the storage bloat risk?
The FLIP proposes persisting `JobCheckpointingOverrides` inside `ExecutionPlan.getJobConfiguration()` using the key `$internal.job-checkpoint-overrides`. This piggybacks on the existing `ExecutionPlan` blob in ZooKeeper or the Kubernetes ConfigMap. Two concerns arise:

1. Coupling risk. Embedding runtime overrides inside the ExecutionPlan blurs the boundary between the job definition (immutable after submission) and runtime state (mutable). This could cause confusion when debugging: the ExecutionPlan retrieved from the store may differ from the originally submitted plan.
2. Size and write amplification. Every dynamic update triggers a full `ExecutionPlan` re-serialization and write. For jobs with large execution plans (thousands of operators), this is a heavyweight operation for changing two numbers.

> On Apr 15, 2026, at 15:32, Jiangang Liu <[email protected]> wrote:
>
> Thanks for Verne Deng's valuable question. This is a legitimate operational
> concern, but it is an inherent trade-off of any runtime tuning capability,
> not a flaw specific to this design:
>
> 1. *The status quo is worse.* Today, the only option is to restart the
> entire job, which causes data reprocessing and downstream impact. Extending
> the timeout is strictly less disruptive than a restart, even if the
> checkpoint ultimately fails.
> 2. *Observability is preserved.* The existing checkpoint metrics
> (checkpointDuration, checkpointSize, per-subtask completion times) remain
> fully available. The FLIP does not suppress any signals; it only gives
> operators more time.
> 3. *Guardrails can be added incrementally.* Future iterations can
> introduce maximum timeout bounds or automatic alerting when dynamic
> overrides are active. This FLIP explicitly scopes Phase 1 to the mechanism;
> policy is a separate concern.
>
> The documentation and release notes should include best-practice guidance:
> use dynamic timeout extension as a *temporary bridge*, not a permanent
> workaround.
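The write-amplification concern raised above can be illustrated with a minimal, self-contained Java sketch. Only the key name `$internal.job-checkpoint-overrides` comes from the FLIP; the map-based stand-in for the serialized plan and the helper method are hypothetical, not Flink's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only (hypothetical names, not Flink internals):
// the override is two numbers, but because it lives inside the job
// configuration that travels with the serialized ExecutionPlan, every
// update rewrites the whole blob in ZooKeeper / the ConfigMap.
public class OverridePersistenceSketch {
    // Key name taken from the FLIP text.
    static final String OVERRIDES_KEY = "$internal.job-checkpoint-overrides";

    // Hypothetical helper: attach the override to the job configuration.
    static Map<String, String> withOverride(
            Map<String, String> jobConfiguration, long intervalMs, long timeoutMs) {
        jobConfiguration.put(OVERRIDES_KEY,
                "{\"checkpointInterval\":" + intervalMs
                        + ",\"checkpointTimeout\":" + timeoutMs + "}");
        return jobConfiguration;
    }

    public static void main(String[] args) {
        Map<String, String> jobConfiguration = new HashMap<>();
        // Stand-in for the rest of a large serialized ExecutionPlan.
        jobConfiguration.put("execution-plan-payload", "x".repeat(100_000));

        withOverride(jobConfiguration, 30_000L, 900_000L);

        // The override itself is tiny, but persisting it means writing
        // the full blob back to the high-availability store.
        int overrideSize = jobConfiguration.get(OVERRIDES_KEY).length();
        int blobSize = jobConfiguration.toString().length();
        System.out.println("override bytes: " + overrideSize
                + ", blob bytes rewritten: " + blobSize);
    }
}
```
A dedicated storage path would instead write only the small override record per update, at the cost of a second store to manage and restore on failover.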
>
> On Wed, Apr 15, 2026 at 15:20, Jiangang Liu <[email protected]> wrote:
>
>> Thanks, xiongraorao. This is a valid theoretical concern, but in practice
>> the risk is mitigated by the existing design:
>>
>> 1. The PATCH handler forwards updates to CheckpointCoordinator on the
>> *JobMaster main thread*. Both fields are written in a single method
>> invocation within synchronized(lock), so any reader that also holds the
>> lock sees both updates atomically.
>> 2. The volatile keyword is primarily a safety net for unsynchronized
>> reads (e.g., metrics reporting or logging). All critical scheduling and
>> canceller logic operates within synchronized(lock).
>> 3. Even in the worst case of a transient inconsistent read, the next
>> periodic trigger cycle (seconds later) will observe both correct values.
>> There is no persistent corruption.
>>
>> On Wed, Apr 15, 2026 at 15:14, Verne Deng <[email protected]> wrote:
>>
>>> Thanks for the design. I think it is a common requirement in our
>>> production. I have one question: could dynamic timeout extension mask
>>> genuine checkpoint hangs, making problems harder to detect?
>>>
>>> The primary motivation is allowing SREs to extend `checkpointTimeout`
>>> to save near-complete checkpoints. However, this introduces an
>>> operational anti-pattern risk: operators might habitually extend
>>> timeouts instead of investigating root causes (e.g., state backend
>>> degradation, skewed key distribution, slow sinks). A checkpoint
>>> "stuck" at 95% might actually indicate a genuine hang in one subtask,
>>> and extending the timeout only delays the inevitable failure while
>>> consuming additional resources (holding barriers, buffering data).
>>>
>>> This could turn a clear, fast-failing signal into a slow, ambiguous
>>> one, which is exactly the opposite of what good observability requires.
>>>
>>> On Wed, Apr 15, 2026 at 15:03, 熊饶饶 <[email protected]> wrote:
>>>
>>>> It is a very useful feature in production.
>>>> Once a checkpoint fails, the job may get stuck and the next checkpoint
>>>> may fail without updating the config. The only thing I care about is
>>>> thread safety: will `volatile` fields cause consistency issues between
>>>> `checkpointInterval` and `checkpointTimeout`?
>>>>
>>>> The FLIP proposes changing `checkpointInterval` and `checkpointTimeout`
>>>> from `final` to `volatile` in `CheckpointCoordinator`. While `volatile`
>>>> guarantees visibility, it does not guarantee atomicity across multiple
>>>> fields. If a user updates both values simultaneously via a single PATCH
>>>> request, there is a window where `CheckpointCoordinator` could observe
>>>> the new `checkpointInterval` but the old `checkpointTimeout` (or vice
>>>> versa). This partial-update visibility could lead to unexpected
>>>> behavior, for example a shorter interval combined with the old
>>>> (shorter) timeout, causing checkpoints to be triggered more frequently
>>>> and to time out immediately.
>>>>
>>>>> On Mar 24, 2026, at 16:29, Jiangang Liu <[email protected]> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I would like to start a discussion on FLIP-571: Support Dynamically
>>>>> Updating Checkpoint Configuration at Runtime via REST API [1].
>>>>>
>>>>> Currently, checkpoint configuration (checkpointInterval,
>>>>> checkpointTimeout) is immutable after job submission. This creates
>>>>> significant operational challenges for long-running streaming jobs:
>>>>>
>>>>> 1. Cascading checkpoint failures cannot be resolved without
>>>>> restarting the job, causing data reprocessing delays.
>>>>> 2. Near-complete checkpoints (e.g., 95% persisted) are entirely
>>>>> discarded on timeout, wasting all I/O work and potentially creating
>>>>> a failure loop for large-state jobs.
>>>>> 3. Static configuration cannot adapt to variable workloads at runtime.
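[Editor's note] The mitigation described in the thread above (both volatile fields written in one method invocation under a shared lock, with lock-holding readers seeing a consistent pair) can be sketched in plain Java. This is an illustrative model with hypothetical names, not Flink's actual `CheckpointCoordinator`:

```java
// Minimal sketch of the atomic-pair-update pattern: two volatile fields
// written together under a shared lock. Unsynchronized readers get only
// visibility (and may observe a torn pair); readers that take the same
// lock observe both updates atomically.
public class CoordinatorSketch {
    private final Object lock = new Object();
    private volatile long checkpointIntervalMs = 60_000L;
    private volatile long checkpointTimeoutMs = 600_000L;

    // Both fields are written in a single method invocation while holding
    // the lock, mirroring what the thread describes for the PATCH handler.
    public void updateCheckpointConfig(long newIntervalMs, long newTimeoutMs) {
        synchronized (lock) {
            checkpointIntervalMs = newIntervalMs;
            checkpointTimeoutMs = newTimeoutMs;
        }
    }

    // A reader that also holds the lock sees a consistent pair.
    public long[] readConsistently() {
        synchronized (lock) {
            return new long[] {checkpointIntervalMs, checkpointTimeoutMs};
        }
    }

    public static void main(String[] args) {
        CoordinatorSketch c = new CoordinatorSketch();
        c.updateCheckpointConfig(30_000L, 900_000L);
        long[] pair = c.readConsistently();
        System.out.println(pair[0] + "," + pair[1]); // prints 30000,900000
    }
}
```
As the reply notes, the residual risk is confined to unsynchronized readers (metrics, logging), where a transiently torn pair is corrected by the next read.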
>>>>>
>>>>> FLIP-571 proposes a new REST API endpoint:
>>>>>
>>>>> PATCH /jobs/:jobid/checkpoints/configuration
>>>>>
>>>>> Key design points:
>>>>>
>>>>> - Timeout changes apply immediately to in-flight checkpoints by
>>>>> rescheduling their canceller timers, saving near-complete checkpoints
>>>>> from being discarded.
>>>>> - Interval changes take effect on the next checkpoint trigger cycle.
>>>>> - Configuration overrides are persisted to ExecutionPlanStore
>>>>> (following the JobResourceRequirements pattern) and automatically
>>>>> restored after failover.
>>>>>
>>>>> For more details, please refer to the FLIP [1].
>>>>>
>>>>> Looking forward to your feedback and suggestions!
>>>>>
>>>>> [1]
>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-571%3A+Support+Dynamically+Updating+Checkpoint+Configuration+at+Runtime+via+REST+API
>>>>>
>>>>> Best regards,
>>>>> Jiangang Liu
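[Editor's note] For reference, a client-side sketch of building a request against the proposed endpoint with Java's built-in HTTP client. Only the endpoint path comes from the FLIP announcement; the JSON field names, job id, and host/port are assumptions, since the FLIP's request schema is not quoted in this thread:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class PatchCheckpointConfig {
    // Build (but do not send) a PATCH request for the proposed endpoint.
    static HttpRequest buildRequest(String jobId, long intervalMs, long timeoutMs) {
        // Field names are assumed for illustration; the FLIP defines the
        // actual request schema.
        String body = "{\"checkpointInterval\":" + intervalMs
                + ",\"checkpointTimeout\":" + timeoutMs + "}";
        return HttpRequest.newBuilder()
                // Host/port are placeholders for a JobManager REST address.
                .uri(URI.create("http://localhost:8081/jobs/" + jobId
                        + "/checkpoints/configuration"))
                .header("Content-Type", "application/json")
                .method("PATCH", HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("a1b2c3", 30_000L, 900_000L);
        System.out.println(req.method() + " " + req.uri());
    }
}
```
`HttpRequest.Builder` has no dedicated `PATCH` shortcut, so the generic `method(...)` overload is used; sending the request would go through `HttpClient.send(...)`.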
