Hey everyone, I will create a FLIP, but wanted to gauge community opinion first. The motivation is that production Flink applications frequently need to go through node/image patching to update the software and AMI with the latest security fixes. These patching-related restarts do not involve application jar or parallelism updates, so they can skip the costly savepoint completion and restore cycle and instead rely on the last checkpointed state, keeping downtime to a minimum. To achieve this, we currently rely on retained checkpoints and the following steps:
- Create a new stand-by Flink cluster and submit the application jar
- Delete the Flink TM deployment to stop processing & checkpoints on the old cluster (to reduce duplicates)
- Query the last completed checkpoint from the REST API on the JM of the old cluster (a sketch of this query is included after the reference links below)
- Submit the new job using the last available checkpoint in the new cluster, then delete the old cluster

We have observed that this process will sometimes not select the latest checkpoint, because partially completed checkpoints race and finish after the JM has been queried. Alternatives are to rely on other sources of checkpoint info, but this has complications, as discussed in [2]. Waiting increases downtime, and force-deleting task managers doesn't guarantee that the TM processes actually terminate.

To maintain low downtime, avoid duplicates, and resolve this race, we can introduce an API to suspend checkpointing. Querying the latest available checkpoint after checkpointing has been suspended guarantees that we can maintain exactly-once semantics in this scenario. This also acts as an extension to [1], where the feature to trigger checkpoints through a control plane has been discussed and added. It makes the checkpointing process flexible and gives the user more control in scenarios like migrating applications and letting data processing catch up temporarily.

We can implement this similar to [1] and expose a trigger to suspend and resume checkpointing via the CLI and REST API. We can add a parameter to suspend in 2 ways:
1. Suspend the scheduled checkpoint trigger: doesn't cancel any in-progress checkpoints/savepoints, but stops only future ones
2. Suspend the checkpoint coordinator: cancels in-progress checkpoints/savepoints. This guarantees no racing checkpoint completion and could also be used for canceling stuck checkpoints to help data processing catch up
A sketch of how a control plane could call such an API is included below as well.

[1] https://issues.apache.org/jira/browse/FLINK-27101
[2] https://issues.apache.org/jira/browse/FLINK-26916
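For illustration, here is a minimal sketch of the checkpoint query step, using the monitoring REST API's /jobs/:jobid/checkpoints endpoint, whose response contains latest.completed.external_path for the last retained checkpoint. The JobManager address and job id are placeholders, and the regex-based extraction is only to keep the sketch short; a real tool would use a JSON library.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LatestCheckpointQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder JobManager address and job id of the old cluster.
        String jmAddress = "http://old-jobmanager:8081";
        String jobId = "00000000000000000000000000000000";

        // GET /jobs/:jobid/checkpoints returns checkpoint statistics as JSON.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(jmAddress + "/jobs/" + jobId + "/checkpoints"))
                .GET()
                .build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Naive extraction of the first "external_path" field, which for a job
        // with completed retained checkpoints is typically the latest completed
        // one; a JSON parser would be more robust.
        Matcher m = Pattern.compile("\"external_path\"\\s*:\\s*\"([^\"]+)\"").matcher(body);
        if (m.find()) {
            System.out.println("Latest retained checkpoint: " + m.group(1));
        }
    }
}

The returned path can then be handed to the new cluster's job submission, e.g. via the CLI's -s/--fromSavepoint option, which also accepts a retained checkpoint path.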
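And here is a rough sketch of how a control plane could drive the proposed suspension right before querying the checkpoint. Note that the endpoint path, the "mode" parameter, and its values are purely hypothetical placeholders; the actual REST surface would be defined in the FLIP.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SuspendCheckpointingSketch {

    // Suspends checkpointing on the old cluster so that no checkpoint can
    // complete after the latest one has been queried.
    static void suspendCheckpointing(String jmAddress, String jobId, String mode) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                // Hypothetical endpoint; not part of Flink's current REST API.
                .uri(URI.create(jmAddress + "/jobs/" + jobId + "/checkpointing/suspend?mode=" + mode))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Suspend request returned HTTP " + response.statusCode());
    }

    public static void main(String[] args) throws Exception {
        // "SCHEDULER" would correspond to variant 1 (stop only future triggers),
        // "COORDINATOR" to variant 2 (also cancel in-progress checkpoints).
        suspendCheckpointing("http://old-jobmanager:8081", "00000000000000000000000000000000", "COORDINATOR");
    }
}

With variant 2, the checkpoint queried right after this call is guaranteed to remain the latest one, so the new cluster can be started from it without racing completions.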