Hi Flink devs,

Through recent discussion on job stop processing, we found its semantic is
incomplete, mainly reflected in two aspects:

1. In our released versions [1], the stop process is as follows:

   - A “stop” call is a graceful way of stopping a running streaming job.
   When the user requests to stop a job, all sources will receive a stop()
   method call. The job will keep running until all sources properly shut
   down. This allows the job to finish processing all inflight data.

    However, for stateful operators with retained checkpointing, the stop
call won’t take any checkpoint, thus when resuming the job it needs to
recover from the latest checkpoint with source rewinding, which causes the
wait for processing all inflight data meaningless (all need to be processed
again).

2. In the latest master branch after FLIP-34 [2], job stop will always be
accompanied by a savepoint, which has below problems:

   - It's an unexpected behavior change from user perspective, that old
   stop command will fail on jobs w/o savepoint configuration.
   - It slows down the job stop procedure and might block launching new
   jobs when resource is contended.

To resolve the above issues and reinforce job stop semantic, we have opened
FLIP-45. In the document we carefully compared the Flink concepts with
database, clarified what's missing in Flink job stop process, and proposed
changes to complete it.

Since this involves conception level clarification and user-facing
changes, we would like to collect feedback on the proposal before
continuing efforts.

This is the FLIP:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/cli.html
[2]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103090212

Best Regards,
Yu

Reply via email to