[ https://issues.apache.org/jira/browse/FLINK-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199828#comment-17199828 ]
Matthias commented on FLINK-19237:
----------------------------------

The problem seems to be caused by [65ed0393|https://github.com/apache/flink/commit/65ed0393], which is related to FLINK-16866. We were able to reproduce it by running the test LeaderChangeClusterComponentsTest#testReelectionOfJobMaster repeatedly until the failure happened (after 5825 runs in one attempt, and fewer in others). We failed to reproduce it on 65ed0393's parent commit [7da74dc2|https://github.com/apache/flink/commit/7da74dc2795abfb8806f14768a95327c8c204dc8] after 20000 runs. The runtime of the test increases by a factor of 3 with the change introduced in 65ed0393.

> LeaderChangeClusterComponentsTest.testReelectionOfJobMaster failed with "NoResourceAvailableException: Could not allocate the required slot within slot request timeout"
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-19237
>                 URL: https://issues.apache.org/jira/browse/FLINK-19237
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Dian Fu
>            Priority: Critical
>              Labels: test-stability
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6499&view=logs&j=6bfdaf55-0c08-5e3f-a2d2-2a0285fd41cf&t=fd9796c3-9ce8-5619-781c-42f873e126a6]
> {code}
> 2020-09-14T21:11:02.8200203Z [ERROR] testReelectionOfJobMaster(org.apache.flink.runtime.leaderelection.LeaderChangeClusterComponentsTest) Time elapsed: 300.14 s <<< FAILURE!
> 2020-09-14T21:11:02.8201761Z java.lang.AssertionError: Job failed.
> 2020-09-14T21:11:02.8202749Z at org.apache.flink.runtime.jobmaster.utils.JobResultUtils.throwAssertionErrorOnFailedResult(JobResultUtils.java:54)
> 2020-09-14T21:11:02.8203794Z at org.apache.flink.runtime.jobmaster.utils.JobResultUtils.assertSuccess(JobResultUtils.java:30)
> 2020-09-14T21:11:02.8205177Z at org.apache.flink.runtime.leaderelection.LeaderChangeClusterComponentsTest.testReelectionOfJobMaster(LeaderChangeClusterComponentsTest.java:152)
> 2020-09-14T21:11:02.8206191Z at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-09-14T21:11:02.8206985Z at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-09-14T21:11:02.8207930Z at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-09-14T21:11:02.8208927Z at java.lang.reflect.Method.invoke(Method.java:498)
> 2020-09-14T21:11:02.8209753Z at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-09-14T21:11:02.8210710Z at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-09-14T21:11:02.8211608Z at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-09-14T21:11:02.8214473Z at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-09-14T21:11:02.8215398Z at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-09-14T21:11:02.8216199Z at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-09-14T21:11:02.8216947Z at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-09-14T21:11:02.8217695Z at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-09-14T21:11:02.8218635Z at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-09-14T21:11:02.8219499Z at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-09-14T21:11:02.8220313Z at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-09-14T21:11:02.8221060Z at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-09-14T21:11:02.8222171Z at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-09-14T21:11:02.8222937Z at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-09-14T21:11:02.8223688Z at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-09-14T21:11:02.8225191Z at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-09-14T21:11:02.8226086Z at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 2020-09-14T21:11:02.8226761Z at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-09-14T21:11:02.8227453Z at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> 2020-09-14T21:11:02.8228392Z at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> 2020-09-14T21:11:02.8229256Z at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> 2020-09-14T21:11:02.8235798Z at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> 2020-09-14T21:11:02.8237650Z at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> 2020-09-14T21:11:02.8239039Z at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> 2020-09-14T21:11:02.8239894Z at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> 2020-09-14T21:11:02.8240591Z at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> 2020-09-14T21:11:02.8241325Z Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
> 2020-09-14T21:11:02.8242225Z at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
> 2020-09-14T21:11:02.8243358Z at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
> 2020-09-14T21:11:02.8244425Z at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:215)
> 2020-09-14T21:11:02.8245291Z at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:208)
> 2020-09-14T21:11:02.8246150Z at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:202)
> 2020-09-14T21:11:02.8247006Z at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:523)
> 2020-09-14T21:11:02.8247960Z at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
> 2020-09-14T21:11:02.8249102Z at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1722)
> 2020-09-14T21:11:02.8249971Z at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1283)
> 2020-09-14T21:11:02.8250675Z at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1251)
> 2020-09-14T21:11:02.8251369Z at org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1082)
> 2020-09-14T21:11:02.8252104Z at org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748)
> 2020-09-14T21:11:02.8253060Z at org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41)
> 2020-09-14T21:11:02.8253956Z at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:458)
> 2020-09-14T21:11:02.8254967Z at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:445)
> 2020-09-14T21:11:02.8393562Z at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> 2020-09-14T21:11:02.8394920Z at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> 2020-09-14T21:11:02.8396122Z at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-09-14T21:11:02.8397194Z at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-09-14T21:11:02.8398150Z at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:169)
> 2020-09-14T21:11:02.8399234Z at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> 2020-09-14T21:11:02.8400048Z at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> 2020-09-14T21:11:02.8401048Z at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-09-14T21:11:02.8402025Z at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-09-14T21:11:02.8403171Z at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:731)
> 2020-09-14T21:11:02.8404708Z at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
> 2020-09-14T21:11:02.8405751Z at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432)
> 2020-09-14T21:11:02.8406633Z at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> 2020-09-14T21:11:02.8407378Z at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> 2020-09-14T21:11:02.8408120Z at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-09-14T21:11:02.8408948Z at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-09-14T21:11:02.8409748Z at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1168)
> 2020-09-14T21:11:02.8410511Z at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> 2020-09-14T21:11:02.8411543Z at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> 2020-09-14T21:11:02.8412553Z at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-09-14T21:11:02.8413340Z at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-09-14T21:11:02.8414204Z at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1072)
> 2020-09-14T21:11:02.8415364Z at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
> 2020-09-14T21:11:02.8416128Z at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
> 2020-09-14T21:11:02.8417172Z at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> 2020-09-14T21:11:02.8417995Z at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> 2020-09-14T21:11:02.8418997Z at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> 2020-09-14T21:11:02.8419692Z at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> 2020-09-14T21:11:02.8420336Z at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> 2020-09-14T21:11:02.8421055Z at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> 2020-09-14T21:11:02.8421655Z at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> 2020-09-14T21:11:02.8422336Z at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-09-14T21:11:02.8423049Z at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-09-14T21:11:02.8423681Z at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> 2020-09-14T21:11:02.8424505Z at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> 2020-09-14T21:11:02.8425209Z at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> 2020-09-14T21:11:02.8425760Z at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> 2020-09-14T21:11:02.8426376Z at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> 2020-09-14T21:11:02.8427252Z at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> 2020-09-14T21:11:02.8427931Z at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> 2020-09-14T21:11:02.8428684Z at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 2020-09-14T21:11:02.8429375Z at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2020-09-14T21:11:02.8430118Z at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 2020-09-14T21:11:02.8430853Z at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-09-14T21:11:02.8431971Z Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
> 2020-09-14T21:11:02.8433179Z at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:464)
> 2020-09-14T21:11:02.8434082Z ... 45 more
> 2020-09-14T21:11:02.8434809Z Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
> 2020-09-14T21:11:02.8435611Z at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> 2020-09-14T21:11:02.8436379Z at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> 2020-09-14T21:11:02.8437159Z at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
> 2020-09-14T21:11:02.8437976Z at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 2020-09-14T21:11:02.8438658Z ... 25 more
> 2020-09-14T21:11:02.8439085Z Caused by: java.util.concurrent.TimeoutException
> 2020-09-14T21:11:02.8439476Z ... 23 more
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)