GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/6279
[FLINK-9706] Properly wait for termination of JobManagerRunner before restarting jobs

## What is the purpose of the change

In order to avoid race conditions during resource cleanup, we now wait for the proper termination of a previously running JobMaster responsible for the same job (e.g. originating from a job recovery or a re-submission).

This PR also fixes [FLINK-9439](https://issues.apache.org/jira/browse/FLINK-9439).

## Brief change log

- Cache the termination future per `JobManagerRunner`
- Before submitting a job, wait for the termination of a previously running `JobManagerRunner` responsible for the same `JobID`

## Verifying this change

- Added `DispatcherResourceCleanupTest#testJobSubmissionUnderSameJobId` and `DispatcherResourceCleanupTest#testJobRecoveryWithPendingTermination`
- Previously, `DispatcherTest#testJobRecovery` and `DispatcherTest#testSubmittedJobGraphListener` failed because the termination was not properly awaited

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
- The S3 file system connector: (no)

## Documentation

- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented? (not applicable)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixJobManagerRunnerTermination

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6279.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6279

----
commit 0e3a19cfa083030f81458dfd36f9bab32d64577a
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-06T10:38:25Z

    [hotfix] Exclude generated Avro types in flink-confluent-schema-registry from rat check

commit a5d9ff2c16b47b87efc469196c320bd7ba492a95
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-07T08:53:38Z

    [FLINK-9706] Properly wait for termination of JobManagerRunner before restarting jobs

    In order to avoid race conditions between resource clean up, we now wait for the proper
    termination of a previously running JobMaster responsible for the same job (e.g. originating
    from a job recovery or a re-submission).

----

---
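
The two items in the change log of the PR description (cache a termination future per runner, and wait for it before starting a new runner for the same job) can be illustrated with a minimal sketch. This is not the code from the PR: the class `TerminationAwareDispatcher`, the method `submitJob`, the use of a plain `String` as job ID, and the `events` list are all hypothetical names chosen for illustration. The sketch also assumes that all calls arrive on a single thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical, simplified stand-in for a dispatcher; NOT Flink's actual Dispatcher.
// Assumption: all methods are called from a single thread, so no locking is done.
class TerminationAwareDispatcher {

    // Cached termination future of the most recent runner per job ID
    // (a plain String is used here instead of Flink's JobID type).
    private final Map<String, CompletableFuture<Void>> terminationFutures = new HashMap<>();

    // Records the order in which runners were started, for demonstration only.
    final List<String> events = new ArrayList<>();

    /**
     * Registers a new runner for the given job and starts it only after the
     * previous runner for the same job ID (if any) has terminated.
     *
     * @param runnerTermination future the caller completes once the new runner has shut down
     * @return future that completes once the new runner has been started; if the
     *         previous termination fails, this future fails as well
     */
    CompletableFuture<Void> submitJob(
            String jobId, String runnerName, CompletableFuture<Void> runnerTermination) {
        // No earlier runner for this job: nothing to wait for.
        CompletableFuture<Void> previousTermination =
                terminationFutures.getOrDefault(jobId, CompletableFuture.completedFuture(null));

        // Cache the new runner's termination future for later submissions.
        terminationFutures.put(jobId, runnerTermination);

        // Start the new runner only after the previous one has terminated.
        return previousTermination.thenRun(() -> events.add(runnerName + " started"));
    }

    public static void main(String[] args) {
        TerminationAwareDispatcher dispatcher = new TerminationAwareDispatcher();
        CompletableFuture<Void> terminationA = new CompletableFuture<>();

        dispatcher.submitJob("job-1", "runner-A", terminationA);
        // Re-submission under the same job ID while runner-A is still shutting down.
        dispatcher.submitJob("job-1", "runner-B", new CompletableFuture<>());
        System.out.println(dispatcher.events); // prints [runner-A started]

        terminationA.complete(null); // runner-A has released its resources
        System.out.println(dispatcher.events); // prints [runner-A started, runner-B started]
    }
}
```

Chaining the start of the new runner onto the cached future keeps the submission path non-blocking: the caller gets a future back immediately, and the start is deferred until cleanup of the previous runner has finished.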