Hi Flink devs, Through recent discussion on job stop processing, we found its semantic is incomplete, mainly reflected in two aspects:
1. In our released versions [1], the stop process is as follows: - A “stop” call is a graceful way of stopping a running streaming job. When the user requests to stop a job, all sources will receive a stop() method call. The job will keep running until all sources properly shut down. This allows the job to finish processing all inflight data. However, for stateful operators with retained checkpointing, the stop call won’t take any checkpoint, thus when resuming the job it needs to recover from the latest checkpoint with source rewinding, which causes the wait for processing all inflight data meaningless (all need to be processed again). 2. In the latest master branch after FLIP-34 [2], job stop will always be accompanied by a savepoint, which has below problems: - It's an unexpected behavior change from user perspective, that old stop command will fail on jobs w/o savepoint configuration. - It slows down the job stop procedure and might block launching new jobs when resource is contended. To resolve the above issues and reinforce job stop semantic, we have opened FLIP-45. In the document we carefully compared the Flink concepts with database, clarified what's missing in Flink job stop process, and proposed changes to complete it. Since this involves conception level clarification and user-facing changes, we would like to collect feedback on the proposal before continuing efforts. This is the FLIP: https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic [1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/cli.html [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103090212 Best Regards, Yu