[ https://issues.apache.org/jira/browse/FLINK-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535694#comment-16535694 ]
ASF GitHub Bot commented on FLINK-9706: --------------------------------------- GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/6279 [FLINK-9706] Properly wait for termination of JobManagerRunner before restarting jobs ## What is the purpose of the change In order to avoid race conditions between resource clean up, we now wait for the proper termination of a previously running JobMaster responsible for the same job (e.g. originating from a job recovery or a re-submission). This PR also fixes [FLINK-9439](https://issues.apache.org/jira/browse/FLINK-9439). ## Brief change log - Cache per `JobManagerRunner` the termination future - Before submitting a job wait for the termination of a previously running `JobManagerRunner` responsible for the same `JobID` ## Verifying this change - Added `DispatcherResourceCleanupTest#testJobSubmissionUnderSameJobId` and `DispatcherResourceCleanupTest#testJobRecoveryWithPendingTermination` - Before `DispatcherTest#testJobRecovery` and `DispatcherTest#testSubmittedJobGraphListener` failed due to not properly waiting for the termination ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) - If yes, how is the feature documented? (not applicable) You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink fixJobManagerRunnerTermination Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/6279.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #6279 ---- commit 0e3a19cfa083030f81458dfd36f9bab32d64577a Author: Till Rohrmann <trohrmann@...> Date: 2018-07-06T10:38:25Z [hotfix] Exclude generated Avro types in flink-confluent-schema-registry from rat check commit a5d9ff2c16b47b87efc469196c320bd7ba492a95 Author: Till Rohrmann <trohrmann@...> Date: 2018-07-07T08:53:38Z [FLINK-9706] Properly wait for termination of JobManagerRunner before restarting jobs In order to avoid race conditions between resource clean up, we now wait for the proper termination of a previously running JobMaster responsible for the same job (e.g. originating from a job recovery or a re-submission). ---- > DispatcherTest#testSubmittedJobGraphListener fails on Travis > ------------------------------------------------------------ > > Key: FLINK-9706 > URL: https://issues.apache.org/jira/browse/FLINK-9706 > Project: Flink > Issue Type: Improvement > Components: Distributed Coordination, Tests > Affects Versions: 1.5.0, 1.6.0 > Reporter: Chesnay Schepler > Assignee: Till Rohrmann > Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.5.2, 1.6.0 > > > https://travis-ci.org/apache/flink/jobs/399331775 > {code:java} > testSubmittedJobGraphListener(org.apache.flink.runtime.dispatcher.DispatcherTest) > Time elapsed: 0.103 sec <<< FAILURE! > java.lang.AssertionError: > Expected: a collection with size <1> > but: collection size was <0> > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) > at org.junit.Assert.assertThat(Assert.java:956) > at org.junit.Assert.assertThat(Assert.java:923) > at > org.apache.flink.runtime.dispatcher.DispatcherTest.testSubmittedJobGraphListener(DispatcherTest.java:294) > testSubmittedJobGraphListener(org.apache.flink.runtime.dispatcher.DispatcherTest) > Time elapsed: 0.11 sec <<< ERROR! > org.apache.flink.runtime.util.TestingFatalErrorHandler$TestingException: > org.apache.flink.runtime.dispatcher.DispatcherException: Could not start the > added job b8ab3b7fa8a929bf608a5b65896a2b17 > at > org.apache.flink.runtime.util.TestingFatalErrorHandler.rethrowError(TestingFatalErrorHandler.java:51) > at > org.apache.flink.runtime.dispatcher.DispatcherTest.tearDown(DispatcherTest.java:219) > Caused by: org.apache.flink.runtime.dispatcher.DispatcherException: Could not > start the added job b8ab3b7fa8a929bf608a5b65896a2b17 > at > org.apache.flink.runtime.dispatcher.Dispatcher.lambda$onAddedJobGraph$28(Dispatcher.java:845) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561) > at > java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929) > at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158) > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142) > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) > at > akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) > at akka.actor.Actor$class.aroundReceive(Actor.scala:502) > at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > at akka.actor.ActorCell.invoke(ActorCell.scala:495) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > at akka.dispatch.Mailbox.run(Mailbox.scala:224) > at akka.dispatch.Mailbox.exec(Mailbox.scala:234) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: org.apache.flink.util.FlinkException: Failed to submit job > b8ab3b7fa8a929bf608a5b65896a2b17. > at > org.apache.flink.runtime.dispatcher.Dispatcher.submitJob(Dispatcher.java:254) > at > org.apache.flink.runtime.dispatcher.Dispatcher.lambda$onAddedJobGraph$27(Dispatcher.java:836) > at > java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952) > at > java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926) > at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158) > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142) > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) > at > akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) > at akka.actor.Actor$class.aroundReceive(Actor.scala:502) > at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > at akka.actor.ActorCell.invoke(ActorCell.scala:495) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > at akka.dispatch.Mailbox.run(Mailbox.scala:224) > at akka.dispatch.Mailbox.exec(Mailbox.scala:234) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not > set up JobManager > at > org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176) > at > org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:901) > at > org.apache.flink.runtime.dispatcher.DispatcherTest$ExpectedJobIdJobManagerRunnerFactory.createJobManagerRunner(DispatcherTest.java:603) > at > org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:287) > at > org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:277) > at > org.apache.flink.runtime.dispatcher.Dispatcher.persistAndRunJob(Dispatcher.java:262) > at > org.apache.flink.runtime.dispatcher.Dispatcher.submitJob(Dispatcher.java:249) > at > org.apache.flink.runtime.dispatcher.Dispatcher.lambda$onAddedJobGraph$27(Dispatcher.java:836) > at > java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952) > at > java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926) > at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158) > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142) > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) > at > akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) > at akka.actor.Actor$class.aroundReceive(Actor.scala:502) > at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > at akka.actor.ActorCell.invoke(ActorCell.scala:495) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > at akka.dispatch.Mailbox.run(Mailbox.scala:224) > at akka.dispatch.Mailbox.exec(Mailbox.scala:234) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: java.lang.IllegalStateException: No libraries are registered for > job b8ab3b7fa8a929bf608a5b65896a2b17 > at > org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.getClassLoader(BlobLibraryCacheManager.java:175) > at > org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:137) > at > org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:901) > at > org.apache.flink.runtime.dispatcher.DispatcherTest$ExpectedJobIdJobManagerRunnerFactory.createJobManagerRunner(DispatcherTest.java:603) > at > org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:287) > at > org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:277) > at > org.apache.flink.runtime.dispatcher.Dispatcher.persistAndRunJob(Dispatcher.java:262) > at > org.apache.flink.runtime.dispatcher.Dispatcher.submitJob(Dispatcher.java:249) > at > org.apache.flink.runtime.dispatcher.Dispatcher.lambda$onAddedJobGraph$27(Dispatcher.java:836) > at > java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952) > at > java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926) > at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158) > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142) > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) > at > akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) > at akka.actor.Actor$class.aroundReceive(Actor.scala:502) > at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > at akka.actor.ActorCell.invoke(ActorCell.scala:495) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > at akka.dispatch.Mailbox.run(Mailbox.scala:224) > at akka.dispatch.Mailbox.exec(Mailbox.scala:234) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)