[jira] [Created] (FLINK-34519) Refine checkpoint scheduling and canceling logic

Yunfeng Zhou (Jira) Mon, 26 Feb 2024 04:13:15 -0800

Yunfeng Zhou created FLINK-34519:
------------------------------------

             Summary: Refine checkpoint scheduling and canceling logic
                 Key: FLINK-34519
                 URL: https://issues.apache.org/jira/browse/FLINK-34519
             Project: Flink
          Issue Type: Technical Debt
          Components: Runtime / Checkpointing
    Affects Versions: 1.20.0
            Reporter: Yunfeng Zhou



In the current implementation, CheckpointCoordinator#startCheckpointScheduler 
stops the checkpoint scheduler before starting it, and 
CheckpointCoordinator#stopCheckpointScheduler cancels all ongoing and 
pending checkpoints. When a stop-with-savepoint request is received, the 
checkpoint coordinator calls stopCheckpointScheduler before creating the 
savepoint, and restarts the scheduler afterwards if the savepoint fails.

The problem with this behavior is that it mixes up different checkpoint 
types. For example, stopCheckpointScheduler() only needs to cancel previously 
triggered periodic checkpoints, but the current behavior cancels ongoing 
savepoints as well. This is still acceptable for now, given that periodic 
checkpointing stays enabled as long as a job is running, and two users are 
unlikely to trigger savepoints at the same time. However, the Batch-Streaming 
Unification optimizations need to change some of these assumptions, so the 
checkpoint coordinator should fix this problem.
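
As an illustration of the distinction above, the following is a minimal 
sketch of cancellation scoped by trigger type. It is not Flink code: 
TriggerKind, PendingRegistry and cancelPeriodic() are hypothetical names 
introduced here, and the real CheckpointCoordinator tracks pending checkpoints 
differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical: how a pending checkpoint was triggered.
enum TriggerKind { PERIODIC_CHECKPOINT, SAVEPOINT }

// Hypothetical registry of pending checkpoints, keyed by checkpoint ID.
final class PendingRegistry {
    private final Map<Long, TriggerKind> pending = new LinkedHashMap<>();

    void register(long checkpointId, TriggerKind kind) {
        pending.put(checkpointId, kind);
    }

    boolean isPending(long checkpointId) {
        return pending.containsKey(checkpointId);
    }

    // Cancels only periodic checkpoints and returns their IDs.
    // Savepoints stay registered and are left to complete.
    List<Long> cancelPeriodic() {
        List<Long> canceled = new ArrayList<>();
        pending.entrySet().removeIf(entry -> {
            if (entry.getValue() == TriggerKind.PERIODIC_CHECKPOINT) {
                canceled.add(entry.getKey());
                return true;
            }
            return false;
        });
        return canceled;
    }
}
```

With this shape, stopping the periodic scheduler would no longer affect a 
savepoint that another user triggered concurrently.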

To be exact, the checkpoint coordinator should at least distinguish between 
the following semantics.
 - Periodic checkpointing is enabled, so that failover recovery time is kept 
within a time limit.
 - Periodic checkpointing is disabled to reduce the corresponding performance 
overhead, but the ability to checkpoint still exists and users can trigger a 
savepoint at any time.
 - Checkpoints and savepoints are not allowed due to job status or topological 
requirements.

A Flink job should also be able to switch between the checkpointing 
semantics mentioned above dynamically at runtime.
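
The three semantics and the runtime switch between them could be modeled 
roughly as follows. This is only a sketch under assumptions: 
CheckpointingMode, its constants and CheckpointingModeSwitch are hypothetical 
names that do not exist in Flink, and a real implementation would also have to 
start or stop the periodic timer on each transition.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical modes matching the three semantics listed above.
enum CheckpointingMode {
    PERIODIC(true, true),         // periodic checkpoints bound recovery time
    ON_DEMAND_ONLY(false, true),  // no periodic overhead, savepoints still possible
    DISABLED(false, false);       // job status or topology forbids checkpointing

    private final boolean periodicTriggering;
    private final boolean manualTriggering;

    CheckpointingMode(boolean periodicTriggering, boolean manualTriggering) {
        this.periodicTriggering = periodicTriggering;
        this.manualTriggering = manualTriggering;
    }

    boolean schedulesPeriodicCheckpoints() {
        return periodicTriggering;
    }

    boolean allowsManualTrigger() {
        return manualTriggering;
    }
}

// Hypothetical holder that lets a running job change mode atomically.
final class CheckpointingModeSwitch {
    private final AtomicReference<CheckpointingMode> mode;

    CheckpointingModeSwitch(CheckpointingMode initial) {
        this.mode = new AtomicReference<>(initial);
    }

    CheckpointingMode current() {
        return mode.get();
    }

    // Switches to the new mode and returns the previous one.
    CheckpointingMode switchTo(CheckpointingMode next) {
        return mode.getAndSet(next);
    }
}
```

Keeping the two capabilities as separate flags makes the middle case explicit: 
periodic triggering can be off while user-triggered savepoints remain allowed.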

Besides, checkpoints canceled in stopCheckpointScheduler() fail with an 
error message saying "Checkpoint Coordinator is suspending", which is too 
vague for debugging. The detailed reason for the cancellation should be 
recorded as well.
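
One possible way to record the detailed reason is to attach a cause to the 
failure. The sketch below is an assumption for illustration only: CancelCause, 
its constants and CheckpointCanceledException are hypothetical and are not 
existing Flink classes.

```java
// Hypothetical: why the scheduler canceled a checkpoint.
enum CancelCause { STOP_WITH_SAVEPOINT, SCHEDULER_RESTART, JOB_SUSPENDED }

// Hypothetical exception that carries the checkpoint ID and the concrete cause.
final class CheckpointCanceledException extends Exception {
    private final long checkpointId;
    private final CancelCause cancelCause;

    CheckpointCanceledException(long checkpointId, CancelCause cancelCause) {
        super("Checkpoint " + checkpointId
                + " was canceled while the scheduler was stopping (cause: "
                + cancelCause + ")");
        this.checkpointId = checkpointId;
        this.cancelCause = cancelCause;
    }

    long getCheckpointId() {
        return checkpointId;
    }

    CancelCause getCancelCause() {
        return cancelCause;
    }
}
```

A log line built from such an exception would tell the reader whether the 
checkpoint was dropped because of a stop-with-savepoint request, a scheduler 
restart, or a job suspension.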



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
