Yunfeng Zhou created FLINK-34519:
------------------------------------
Summary: Refine checkpoint scheduling and canceling logic
Key: FLINK-34519
URL: https://issues.apache.org/jira/browse/FLINK-34519
Project: Flink
Issue Type: Technical Debt
Components: Runtime / Checkpointing
Affects Versions: 1.20.0
Reporter: Yunfeng Zhou
In the current implementation, CheckpointCoordinator#startCheckpointScheduler
stops the checkpoint scheduler before starting it, and
CheckpointCoordinator#stopCheckpointScheduler cancels all ongoing and
pending checkpoints. When a stop-with-savepoint request is received, the
checkpoint coordinator calls stopCheckpointScheduler before creating the
savepoint, and restarts the scheduler afterwards if the savepoint fails.
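The call order described above can be pictured with a small model. This is only a sketch: CurrentFlowSketch and its fields are hypothetical stand-ins that record the order of calls, not Flink's actual CheckpointCoordinator code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for the coordinator; it only records the order of calls.
final class CurrentFlowSketch {
    final List<String> calls = new ArrayList<>();

    // Models the current behavior: every ongoing and pending checkpoint is canceled.
    void stopCheckpointScheduler() {
        calls.add("stop");
    }

    // Models the current behavior: the scheduler is stopped before it is started.
    void startCheckpointScheduler() {
        stopCheckpointScheduler();
        calls.add("start");
    }

    // savepointAttempt returns true if the savepoint succeeded.
    void stopWithSavepoint(BooleanSupplier savepointAttempt) {
        stopCheckpointScheduler();
        calls.add("savepoint");
        if (!savepointAttempt.getAsBoolean()) {
            // A failed savepoint brings periodic checkpointing back.
            startCheckpointScheduler();
        }
    }
}
```

In this model a failed savepoint produces the sequence stop, savepoint, stop, start, which makes the two scheduler stops per request visible.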
The problem with this behavior is that it conflates different checkpoint
types. For example, stopCheckpointScheduler() only needs to cancel previous
periodic checkpoints, but the current behavior cancels ongoing savepoints as
well. This is still acceptable for now, given that periodic checkpointing
stays enabled as long as a job is running, and two users would rarely
trigger savepoints at the same time. However, since the Batch-Streaming
Unification optimizations need to change some of these assumptions, this
problem should be fixed in the checkpoint coordinator.
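One way to picture the intended separation is to tag each pending snapshot with how it was triggered and to cancel only the periodic ones when the scheduler stops. The sketch below assumes hypothetical types (TriggerKind, PendingSnapshot, SchedulerSketch); none of them are Flink classes, and the actual fix may look different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model; none of these types are Flink's actual classes.
enum TriggerKind { PERIODIC_CHECKPOINT, SAVEPOINT }

final class PendingSnapshot {
    final long id;
    final TriggerKind kind;
    boolean canceled;

    PendingSnapshot(long id, TriggerKind kind) {
        this.id = id;
        this.kind = kind;
    }
}

final class SchedulerSketch {
    private final List<PendingSnapshot> pending = new ArrayList<>();

    void register(PendingSnapshot snapshot) {
        pending.add(snapshot);
    }

    // Cancels only periodic checkpoints; user-triggered savepoints are left alone.
    // Returns the ids of the snapshots that were canceled by this call.
    List<Long> stopPeriodicScheduler() {
        List<Long> canceledIds = new ArrayList<>();
        for (PendingSnapshot snapshot : pending) {
            if (snapshot.kind == TriggerKind.PERIODIC_CHECKPOINT && !snapshot.canceled) {
                snapshot.canceled = true;
                canceledIds.add(snapshot.id);
            }
        }
        return canceledIds;
    }
}
```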
To be exact, the checkpoint coordinator should at least distinguish between
the following semantics.
- Periodic checkpointing is enabled to keep failover recovery time within a
time limit.
- Periodic checkpointing is disabled to reduce the corresponding performance
overhead, but the ability to checkpoint still exists and users can trigger a
savepoint at any time.
- Neither checkpoints nor savepoints are allowed due to job status or
topological requirements.
A Flink job should also be able to switch between the checkpointing
semantics mentioned above dynamically at runtime.
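The three semantics and the runtime switch between them could be modeled roughly as follows. This is a minimal sketch under assumed names: CheckpointingMode, ModeHolder, and their members are hypothetical and do not exist in Flink.

```java
// Hypothetical names for illustration only; not part of Flink's API.
enum CheckpointingMode {
    // Periodic checkpoints bound the failover recovery time.
    PERIODIC(true, true),
    // No periodic overhead, but users can still trigger a savepoint.
    ON_DEMAND_ONLY(false, true),
    // Job status or topology forbids any checkpoint or savepoint.
    DISABLED(false, false);

    final boolean periodicAllowed;
    final boolean savepointAllowed;

    CheckpointingMode(boolean periodicAllowed, boolean savepointAllowed) {
        this.periodicAllowed = periodicAllowed;
        this.savepointAllowed = savepointAllowed;
    }
}

final class ModeHolder {
    // volatile so that a switch made by one thread is visible to the others.
    private volatile CheckpointingMode mode = CheckpointingMode.PERIODIC;

    // Switches the semantics while the job is running.
    void switchTo(CheckpointingMode next) {
        mode = next;
    }

    boolean mayTriggerPeriodicCheckpoint() {
        return mode.periodicAllowed;
    }

    boolean mayTriggerSavepoint() {
        return mode.savepointAllowed;
    }
}
```

Keeping the two permissions as separate flags on the mode means that disabling periodic checkpointing does not implicitly reject a user-triggered savepoint.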
Besides, checkpoints canceled in stopCheckpointScheduler() fail with an
error message saying "Checkpoint Coordinator is suspending", which is too
ambiguous for debugging. The detailed reason should be recorded as well.
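As a rough illustration of recording the detailed reason, the generic message could be extended with the cause of the cancellation. The CancellationReason helper and the example detail text below are hypothetical; only the quoted generic message comes from this issue.

```java
// Hypothetical helper; not Flink's actual failure-reason handling.
final class CancellationReason {
    // The generic message quoted in this issue.
    static final String GENERIC = "Checkpoint Coordinator is suspending";

    private CancellationReason() {}

    // Appends the detailed cause when one is known, otherwise keeps the generic text.
    static String describe(String detail) {
        if (detail == null || detail.isEmpty()) {
            return GENERIC;
        }
        return GENERIC + ": " + detail;
    }
}
```

For example, CancellationReason.describe("stop-with-savepoint requested") would yield a message that names the trigger of the cancellation instead of only stating that the coordinator is suspending.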