Thanks for the FLIP. I like the dynamic way to control Flink, but I am confused about why the FLIP reuses `ExecutionPlan.getJobConfiguration()` instead of a dedicated storage path. Also, what about the storage bloat risk?
The FLIP proposes persisting `JobCheckpointingOverrides` inside `ExecutionPlan.getJobConfiguration()` using the key `$internal.job-checkpoint-overrides`. This piggybacks on the existing `ExecutionPlan` blob in ZooKeeper or the Kubernetes ConfigMap. Two concerns arise:

1. Coupling risk. Embedding runtime overrides inside the ExecutionPlan blurs the boundary between the job definition (immutable after submission) and runtime state (mutable). This could cause confusion when debugging: the ExecutionPlan retrieved from the store may differ from the originally submitted plan.
2. Size and write amplification. Every dynamic update triggers a full `ExecutionPlan` re-serialization and write. For jobs with large execution plans (thousands of operators), this is a heavyweight operation for changing two numbers.

> On Apr 15, 2026, at 15:32, Jiangang Liu <[email protected]> wrote:
>
> Thanks for Verne Deng's valuable question. This is a legitimate operational
> concern, but it is an inherent trade-off of any runtime tuning capability,
> not a flaw specific to this design:
>
> 1. *The status quo is worse.* Today, the only option is to restart the
> entire job, which causes data reprocessing and downstream impact. Extending
> the timeout is strictly less disruptive than a restart, even if the
> checkpoint ultimately fails.
> 2. *Observability is preserved.* The existing checkpoint metrics
> (checkpointDuration, checkpointSize, per-subtask completion times) remain
> fully available. The FLIP does not suppress any signals; it only gives
> operators more time.
> 3. *Guardrails can be added incrementally.* Future iterations can
> introduce maximum timeout bounds or automatic alerting when dynamic
> overrides are active. This FLIP explicitly scopes Phase 1 to the mechanism;
> policy is a separate concern.
>
> The documentation and release notes should include best-practice guidance:
> use dynamic timeout extension as a *temporary bridge*, not a permanent
> workaround.
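The write-amplification concern raised above can be illustrated with a minimal, self-contained Java sketch. Only the key name `$internal.job-checkpoint-overrides` comes from the FLIP; the map-based stand-in for the serialized plan and the helper method are hypothetical, not Flink's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only (hypothetical names, not Flink internals):
// the override is two numbers, but because it lives inside the job
// configuration that travels with the serialized ExecutionPlan, every
// update rewrites the whole blob in ZooKeeper / the ConfigMap.
public class OverridePersistenceSketch {
    // Key name taken from the FLIP text.
    static final String OVERRIDES_KEY = "$internal.job-checkpoint-overrides";

    // Hypothetical helper: attach the override to the job configuration.
    static Map<String, String> withOverride(
            Map<String, String> jobConfiguration, long intervalMs, long timeoutMs) {
        jobConfiguration.put(OVERRIDES_KEY,
                "{\"checkpointInterval\":" + intervalMs
                        + ",\"checkpointTimeout\":" + timeoutMs + "}");
        return jobConfiguration;
    }

    public static void main(String[] args) {
        Map<String, String> jobConfiguration = new HashMap<>();
        // Stand-in for the rest of a large serialized ExecutionPlan.
        jobConfiguration.put("execution-plan-payload", "x".repeat(100_000));

        withOverride(jobConfiguration, 30_000L, 900_000L);

        // The override itself is tiny, but persisting it means writing
        // the full blob back to the high-availability store.
        int overrideSize = jobConfiguration.get(OVERRIDES_KEY).length();
        int blobSize = jobConfiguration.toString().length();
        System.out.println("override bytes: " + overrideSize
                + ", blob bytes rewritten: " + blobSize);
    }
}
```
A dedicated storage path would instead write only the small override record per update, at the cost of a second store to manage and restore on failover.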
>
> On Wed, Apr 15, 2026 at 15:20, Jiangang Liu <[email protected]> wrote:
>
>> Thanks, xiongraorao. This is a valid theoretical concern, but in practice
>> the risk is mitigated by the existing design:
>>
>> 1. The PATCH handler forwards updates to CheckpointCoordinator on the
>> *JobMaster main thread*. Both fields are written in a single method
>> invocation within synchronized(lock), so any reader that also holds the
>> lock sees both updates atomically.
>> 2. The volatile keyword is primarily a safety net for unsynchronized
>> reads (e.g., metrics reporting or logging). All critical scheduling and
>> canceller logic operates within synchronized(lock).
>> 3. Even in the worst case of a transient inconsistent read, the next
>> periodic trigger cycle (seconds later) will observe both correct values.
>> There is no persistent corruption.
>>
>> On Wed, Apr 15, 2026 at 15:14, Verne Deng <[email protected]> wrote:
>>
>>> Thanks for the design. I think it is a common requirement in our
>>> production. I have one question: could dynamic timeout extension mask
>>> genuine checkpoint hangs, making problems harder to detect?
>>>
>>> The primary motivation is allowing SREs to extend `checkpointTimeout`
>>> to save near-complete checkpoints. However, this introduces an
>>> operational anti-pattern risk: operators might habitually extend
>>> timeouts instead of investigating root causes (e.g., state backend
>>> degradation, skewed key distribution, slow sinks). A checkpoint
>>> "stuck" at 95% might actually indicate a genuine hang in one subtask,
>>> and extending the timeout only delays the inevitable failure while
>>> consuming additional resources (holding barriers, buffering data).
>>>
>>> This could turn a clear, fast-failing signal into a slow, ambiguous
>>> one, which is exactly the opposite of what good observability requires.
>>>
>>> On Wed, Apr 15, 2026 at 15:03, 熊饶饶 <[email protected]> wrote:
>>>
>>>> It is a very useful feature in production.
>>>> Once a checkpoint fails, the job may get stuck and the next checkpoint
>>>> may fail without updating the config. The only thing I care about is
>>>> thread safety: will `volatile` fields cause consistency issues between
>>>> `checkpointInterval` and `checkpointTimeout`?
>>>>
>>>> The FLIP proposes changing `checkpointInterval` and `checkpointTimeout`
>>>> from `final` to `volatile` in `CheckpointCoordinator`. While `volatile`
>>>> guarantees visibility, it does not guarantee atomicity across multiple
>>>> fields. If a user updates both values simultaneously via a single PATCH
>>>> request, there is a window where `CheckpointCoordinator` could observe
>>>> the new `checkpointInterval` but the old `checkpointTimeout` (or vice
>>>> versa). This partial-update visibility could lead to unexpected
>>>> behavior, for example a shorter interval combined with the old
>>>> (shorter) timeout, causing checkpoints to be triggered more frequently
>>>> and to time out immediately.
>>>>
>>>>> On Mar 24, 2026, at 16:29, Jiangang Liu <[email protected]> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I would like to start a discussion on FLIP-571: Support Dynamically
>>>>> Updating Checkpoint Configuration at Runtime via REST API [1].
>>>>>
>>>>> Currently, checkpoint configuration (checkpointInterval,
>>>>> checkpointTimeout) is immutable after job submission. This creates
>>>>> significant operational challenges for long-running streaming jobs:
>>>>>
>>>>> 1. Cascading checkpoint failures cannot be resolved without
>>>>> restarting the job, causing data reprocessing delays.
>>>>> 2. Near-complete checkpoints (e.g., 95% persisted) are entirely
>>>>> discarded on timeout, wasting all I/O work and potentially creating
>>>>> a failure loop for large-state jobs.
>>>>> 3. Static configuration cannot adapt to variable workloads at runtime.
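[Editor's note] The mitigation described in the thread above (both volatile fields written in one method invocation under a shared lock, with lock-holding readers seeing a consistent pair) can be sketched in plain Java. This is an illustrative model with hypothetical names, not Flink's actual `CheckpointCoordinator`:

```java
// Minimal sketch of the atomic-pair-update pattern: two volatile fields
// written together under a shared lock. Unsynchronized readers get only
// visibility (and may observe a torn pair); readers that take the same
// lock observe both updates atomically.
public class CoordinatorSketch {
    private final Object lock = new Object();
    private volatile long checkpointIntervalMs = 60_000L;
    private volatile long checkpointTimeoutMs = 600_000L;

    // Both fields are written in a single method invocation while holding
    // the lock, mirroring what the thread describes for the PATCH handler.
    public void updateCheckpointConfig(long newIntervalMs, long newTimeoutMs) {
        synchronized (lock) {
            checkpointIntervalMs = newIntervalMs;
            checkpointTimeoutMs = newTimeoutMs;
        }
    }

    // A reader that also holds the lock sees a consistent pair.
    public long[] readConsistently() {
        synchronized (lock) {
            return new long[] {checkpointIntervalMs, checkpointTimeoutMs};
        }
    }

    public static void main(String[] args) {
        CoordinatorSketch c = new CoordinatorSketch();
        c.updateCheckpointConfig(30_000L, 900_000L);
        long[] pair = c.readConsistently();
        System.out.println(pair[0] + "," + pair[1]); // prints 30000,900000
    }
}
```
As the reply notes, the residual risk is confined to unsynchronized readers (metrics, logging), where a transiently torn pair is corrected by the next read.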
>>>>>
>>>>> FLIP-571 proposes a new REST API endpoint:
>>>>>
>>>>> PATCH /jobs/:jobid/checkpoints/configuration
>>>>>
>>>>> Key design points:
>>>>>
>>>>> - Timeout changes apply immediately to in-flight checkpoints by
>>>>> rescheduling their canceller timers, saving near-complete checkpoints
>>>>> from being discarded.
>>>>> - Interval changes take effect on the next checkpoint trigger cycle.
>>>>> - Configuration overrides are persisted to ExecutionPlanStore
>>>>> (following the JobResourceRequirements pattern) and automatically
>>>>> restored after failover.
>>>>>
>>>>> For more details, please refer to the FLIP [1].
>>>>>
>>>>> Looking forward to your feedback and suggestions!
>>>>>
>>>>> [1]
>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-571%3A+Support+Dynamically+Updating+Checkpoint+Configuration+at+Runtime+via+REST+API
>>>>>
>>>>> Best regards,
>>>>> Jiangang Liu
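[Editor's note] For reference, a client-side sketch of building a request against the proposed endpoint with Java's built-in HTTP client. Only the endpoint path comes from the FLIP announcement; the JSON field names, job id, and host/port are assumptions, since the FLIP's request schema is not quoted in this thread:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class PatchCheckpointConfig {
    // Build (but do not send) a PATCH request for the proposed endpoint.
    static HttpRequest buildRequest(String jobId, long intervalMs, long timeoutMs) {
        // Field names are assumed for illustration; the FLIP defines the
        // actual request schema.
        String body = "{\"checkpointInterval\":" + intervalMs
                + ",\"checkpointTimeout\":" + timeoutMs + "}";
        return HttpRequest.newBuilder()
                // Host/port are placeholders for a JobManager REST address.
                .uri(URI.create("http://localhost:8081/jobs/" + jobId
                        + "/checkpoints/configuration"))
                .header("Content-Type", "application/json")
                .method("PATCH", HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("a1b2c3", 30_000L, 900_000L);
        System.out.println(req.method() + " " + req.uri());
    }
}
```
`HttpRequest.Builder` has no dedicated `PATCH` shortcut, so the generic `method(...)` overload is used; sending the request would go through `HttpClient.send(...)`.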
