Hey everyone,

I will create a FLIP, but wanted to gauge community opinion first. The
motivation is that production Flink applications frequently need to go
through node/image patching to update the software and AMI with the latest
security fixes. These patching-related restarts do not involve application
jar or parallelism updates, so they can avoid a costly savepoint completion
and restore cycle by relying on the last checkpoint state, keeping downtime
to a minimum. Today we achieve this with retained checkpoints and the
following steps:

   - Create a new stand-by Flink cluster and submit the application jar
   - Delete the Flink TM deployment to stop processing and checkpoints on
   the old cluster (to reduce duplicates)
   - Query the last completed checkpoint from the REST API on the JM of the
   old cluster (see the sketch after this list)
   - Submit the new job in the new cluster using the last available
   checkpoint, then delete the old cluster
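
For concreteness, below is a minimal sketch of the "query last completed
checkpoint" step, using the existing GET /jobs/:jobid/checkpoints endpoint
on the old cluster's JM; the JM address and job id are placeholders:

    import json
    import urllib.request

    JM_URL = "http://old-jm-host:8081"  # assumed JobManager REST address (placeholder)
    JOB_ID = "<job-id>"                 # id of the job running on the old cluster


    def latest_completed_checkpoint_path(jm_url: str, job_id: str) -> str:
        # GET /jobs/:jobid/checkpoints returns checkpoint statistics; the
        # latest completed checkpoint carries the external (retained) path.
        with urllib.request.urlopen(f"{jm_url}/jobs/{job_id}/checkpoints") as resp:
            stats = json.load(resp)
        completed = stats["latest"]["completed"]
        if not completed:
            raise RuntimeError("no completed checkpoint available")
        return completed["external_path"]


    # The returned path is then used when submitting the job on the new
    # cluster, e.g. `flink run -s <external_path> ...`.
    print(latest_completed_checkpoint_path(JM_URL, JOB_ID))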

We have observed that this process sometimes does not select the latest
checkpoint, because in-progress checkpoints can race and complete after the
JM has been queried. One alternative is to rely on other sources of
checkpoint information, but that has its own complications, as discussed in
[2]. Waiting longer increases downtime, and force deleting task managers
does not guarantee TM process termination. To keep downtime and duplicates
low and resolve this race, we can introduce an API to suspend
checkpointing. Querying the latest available checkpoint after checkpointing
has been suspended guarantees that we can maintain exactly-once semantics
in this scenario.

This also acts as an extension to [1], where the feature to trigger
checkpoints through a control plane was discussed and added. It makes the
checkpointing process more flexible and gives the user more control in
scenarios such as migrating applications or temporarily letting data
processing catch up.
We can implement this similarly to [1] and expose a trigger to suspend and
resume checkpointing via the CLI and REST API. A parameter could select
between two suspension modes, as sketched after the list below:

   1. Suspend the scheduled checkpoint trigger: does not cancel any
   in-progress checkpoints/savepoints, only stops future ones
   2. Suspend the checkpoint coordinator: cancels in-progress
   checkpoints/savepoints. This guarantees that no racing checkpoint can
   complete and could also be used to cancel stuck checkpoints and help
   data processing catch up
[1] https://issues.apache.org/jira/browse/FLINK-27101
[2] https://issues.apache.org/jira/browse/FLINK-26916
