[ 
https://issues.apache.org/jira/browse/FLINK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535694#comment-16535694
 ] 

ASF GitHub Bot commented on FLINK-9706:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/6279

    [FLINK-9706] Properly wait for termination of JobManagerRunner before 
restarting jobs

    ## What is the purpose of the change
    
    In order to avoid race conditions between resource clean up, we now wait 
for the proper
    termination of a previously running JobMaster responsible for the same job 
(e.g. originating
    from a job recovery or a re-submission).
    
    This PR also fixes 
[FLINK-9439](https://issues.apache.org/jira/browse/FLINK-9439).
    
    ## Brief change log
    
    - Cache per `JobManagerRunner` the termination future
    - Before submitting a job wait for the termination of a previously running 
`JobManagerRunner` responsible for the same `JobID`
    
    ## Verifying this change
    
    - Added `DispatcherResourceCleanupTest#testJobSubmissionUnderSameJobId` and 
`DispatcherResourceCleanupTest#testJobRecoveryWithPendingTermination`
    - Before `DispatcherTest#testJobRecovery` and 
`DispatcherTest#testSubmittedJobGraphListener` failed due to not properly 
waiting for the termination
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink 
fixJobManagerRunnerTermination

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6279.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6279
    
----
commit 0e3a19cfa083030f81458dfd36f9bab32d64577a
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-06T10:38:25Z

    [hotfix] Exclude generated Avro types in flink-confluent-schema-registry 
from rat check

commit a5d9ff2c16b47b87efc469196c320bd7ba492a95
Author: Till Rohrmann <trohrmann@...>
Date:   2018-07-07T08:53:38Z

    [FLINK-9706] Properly wait for termination of JobManagerRunner before 
restarting jobs
    
    In order to avoid race conditions between resource clean up, we now wait 
for the proper
    termination of a previously running JobMaster responsible for the same job 
(e.g. originating
    from a job recovery or a re-submission).

----


> DispatcherTest#testSubmittedJobGraphListener fails on Travis
> ------------------------------------------------------------
>
>                 Key: FLINK-9706
>                 URL: https://issues.apache.org/jira/browse/FLINK-9706
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination, Tests
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Chesnay Schepler
>            Assignee: Till Rohrmann
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>             Fix For: 1.5.2, 1.6.0
>
>
> https://travis-ci.org/apache/flink/jobs/399331775
> {code:java}
> testSubmittedJobGraphListener(org.apache.flink.runtime.dispatcher.DispatcherTest)
>   Time elapsed: 0.103 sec  <<< FAILURE!
> java.lang.AssertionError: 
> Expected: a collection with size <1>
>      but: collection size was <0>
>       at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
>       at org.junit.Assert.assertThat(Assert.java:956)
>       at org.junit.Assert.assertThat(Assert.java:923)
>       at 
> org.apache.flink.runtime.dispatcher.DispatcherTest.testSubmittedJobGraphListener(DispatcherTest.java:294)
> testSubmittedJobGraphListener(org.apache.flink.runtime.dispatcher.DispatcherTest)
>   Time elapsed: 0.11 sec  <<< ERROR!
> org.apache.flink.runtime.util.TestingFatalErrorHandler$TestingException: 
> org.apache.flink.runtime.dispatcher.DispatcherException: Could not start the 
> added job b8ab3b7fa8a929bf608a5b65896a2b17
>       at 
> org.apache.flink.runtime.util.TestingFatalErrorHandler.rethrowError(TestingFatalErrorHandler.java:51)
>       at 
> org.apache.flink.runtime.dispatcher.DispatcherTest.tearDown(DispatcherTest.java:219)
> Caused by: org.apache.flink.runtime.dispatcher.DispatcherException: Could not 
> start the added job b8ab3b7fa8a929bf608a5b65896a2b17
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$onAddedJobGraph$28(Dispatcher.java:845)
>       at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>       at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>       at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>       at 
> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
>       at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
>       at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>       at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>       at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
>       at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>       at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.flink.util.FlinkException: Failed to submit job 
> b8ab3b7fa8a929bf608a5b65896a2b17.
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.submitJob(Dispatcher.java:254)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$onAddedJobGraph$27(Dispatcher.java:836)
>       at 
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
>       at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>       at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>       at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>       at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
>       at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>       at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not 
> set up JobManager
>       at 
> org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:901)
>       at 
> org.apache.flink.runtime.dispatcher.DispatcherTest$ExpectedJobIdJobManagerRunnerFactory.createJobManagerRunner(DispatcherTest.java:603)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:287)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:277)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.persistAndRunJob(Dispatcher.java:262)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.submitJob(Dispatcher.java:249)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$onAddedJobGraph$27(Dispatcher.java:836)
>       at 
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
>       at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>       at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>       at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>       at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
>       at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>       at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.lang.IllegalStateException: No libraries are registered for 
> job b8ab3b7fa8a929bf608a5b65896a2b17
>       at 
> org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.getClassLoader(BlobLibraryCacheManager.java:175)
>       at 
> org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:137)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:901)
>       at 
> org.apache.flink.runtime.dispatcher.DispatcherTest$ExpectedJobIdJobManagerRunnerFactory.createJobManagerRunner(DispatcherTest.java:603)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:287)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:277)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.persistAndRunJob(Dispatcher.java:262)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.submitJob(Dispatcher.java:249)
>       at 
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$onAddedJobGraph$27(Dispatcher.java:836)
>       at 
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
>       at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>       at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>       at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>       at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
>       at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>       at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to