GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/6279

    [FLINK-9706] Properly wait for termination of JobManagerRunner before 
restarting jobs

    ## What is the purpose of the change
    
    In order to avoid race conditions between resource clean up, we now wait 
for the proper
    termination of a previously running JobMaster responsible for the same job 
(e.g. originating
    from a job recovery or a re-submission).
    
    This PR also fixes 
[FLINK-9439](https://issues.apache.org/jira/browse/FLINK-9439).
    
    ## Brief change log
    
    - Cache per `JobManagerRunner` the termination future
    - Before submitting a job wait for the termination of a previously running 
`JobManagerRunner` responsible for the same `JobID`
    
    ## Verifying this change
    
    - Added `DispatcherResourceCleanupTest#testJobSubmissionUnderSameJobId` and 
`DispatcherResourceCleanupTest#testJobRecoveryWithPendingTermination`
    - Before `DispatcherTest#testJobRecovery` and 
`DispatcherTest#testSubmittedJobGraphListener` failed due to not properly 
waiting for the termination
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink 
fixJobManagerRunnerTermination

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6279.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6279
    
----
commit 0e3a19cfa083030f81458dfd36f9bab32d64577a
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-06T10:38:25Z

    [hotfix] Exclude generated Avro types in flink-confluent-schema-registry 
from rat check

commit a5d9ff2c16b47b87efc469196c320bd7ba492a95
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-07T08:53:38Z

    [FLINK-9706] Properly wait for termination of JobManagerRunner before 
restarting jobs
    
    In order to avoid race conditions between resource clean up, we now wait 
for the proper
    termination of a previously running JobMaster responsible for the same job 
(e.g. originating
    from a job recovery or a re-submission).

----


---

Reply via email to