Till Rohrmann created FLINK-25069:
-------------------------------------

             Summary: YARNHighAvailabilityITCase.testJobRecoversAfterKillingTaskManager failed on AZP
                 Key: FLINK-25069
                 URL: https://issues.apache.org/jira/browse/FLINK-25069
             Project: Flink
          Issue Type: Bug
          Components: Deployment / YARN
    Affects Versions: 1.15.0
            Reporter: Till Rohrmann
             Fix For: 1.15.0

The test {{YARNHighAvailabilityITCase.testJobRecoversAfterKillingTaskManager}} fails on AZP with:

{code}
2021-11-25T18:28:27.9848753Z Nov 25 18:28:27 [ERROR] Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 3,676.541 s <<< FAILURE! - in org.apache.flink.yarn.YARNHighAvailabilityITCase
2021-11-25T18:28:27.9849967Z Nov 25 18:28:27 [ERROR] org.apache.flink.yarn.YARNHighAvailabilityITCase.testJobRecoversAfterKillingTaskManager Time elapsed: 70.846 s <<< ERROR!
2021-11-25T18:28:27.9850929Z Nov 25 18:28:27 java.util.concurrent.ExecutionException: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
2021-11-25T18:28:27.9854591Z Nov 25 18:28:27 at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
2021-11-25T18:28:27.9855441Z Nov 25 18:28:27 at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
2021-11-25T18:28:27.9856301Z Nov 25 18:28:27 at org.apache.flink.yarn.YARNHighAvailabilityITCase.submitJob(YARNHighAvailabilityITCase.java:378)
2021-11-25T18:28:27.9857202Z Nov 25 18:28:27 at org.apache.flink.yarn.YARNHighAvailabilityITCase.lambda$testJobRecoversAfterKillingTaskManager$1(YARNHighAvailabilityITCase.java:204)
2021-11-25T18:28:27.9858300Z Nov 25 18:28:27 at org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:288)
2021-11-25T18:28:27.9859245Z Nov 25 18:28:27 at org.apache.flink.yarn.YARNHighAvailabilityITCase.testJobRecoversAfterKillingTaskManager(YARNHighAvailabilityITCase.java:197)
2021-11-25T18:28:27.9860026Z Nov 25 18:28:27 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2021-11-25T18:28:27.9860705Z Nov 25 18:28:27 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2021-11-25T18:28:27.9861466Z Nov 25 18:28:27 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2021-11-25T18:28:27.9862158Z Nov 25 18:28:27 at java.lang.reflect.Method.invoke(Method.java:498)
2021-11-25T18:28:27.9863016Z Nov 25 18:28:27 at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
2021-11-25T18:28:27.9863959Z Nov 25 18:28:27 at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
2021-11-25T18:28:27.9864829Z Nov 25 18:28:27 at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
2021-11-25T18:28:27.9865604Z Nov 25 18:28:27 at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
2021-11-25T18:28:27.9866300Z Nov 25 18:28:27 at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
2021-11-25T18:28:27.9867044Z Nov 25 18:28:27 at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
2021-11-25T18:28:27.9867692Z Nov 25 18:28:27 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2021-11-25T18:28:27.9868220Z Nov 25 18:28:27 at java.lang.Thread.run(Thread.java:748)
2021-11-25T18:28:27.9869072Z Nov 25 18:28:27 Suppressed: java.lang.AssertionError: There is at least one application on the cluster that is not finished.[App application_1637861234319_0001 is in state RUNNING.]
2021-11-25T18:28:27.9870263Z Nov 25 18:28:27 at org.junit.Assert.fail(Assert.java:89)
2021-11-25T18:28:27.9870862Z Nov 25 18:28:27 at org.apache.flink.yarn.YarnTestBase$CleanupYarnApplication.close(YarnTestBase.java:325)
2021-11-25T18:28:27.9871516Z Nov 25 18:28:27 at org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:289)
2021-11-25T18:28:27.9871986Z Nov 25 18:28:27 ... 13 more
2021-11-25T18:28:27.9872665Z Nov 25 18:28:27 Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
2021-11-25T18:28:27.9873393Z Nov 25 18:28:27 at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$11(RestClusterClient.java:433)
2021-11-25T18:28:27.9874102Z Nov 25 18:28:27 at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
2021-11-25T18:28:27.9874774Z Nov 25 18:28:27 at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
2021-11-25T18:28:27.9875454Z Nov 25 18:28:27 at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
2021-11-25T18:28:27.9876123Z Nov 25 18:28:27 at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
2021-11-25T18:28:27.9876837Z Nov 25 18:28:27 at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:373)
2021-11-25T18:28:27.9877539Z Nov 25 18:28:27 at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
2021-11-25T18:28:27.9878393Z Nov 25 18:28:27 at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
2021-11-25T18:28:27.9879043Z Nov 25 18:28:27 at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
2021-11-25T18:28:27.9879768Z Nov 25 18:28:27 at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
2021-11-25T18:28:27.9880461Z Nov 25 18:28:27 at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:943)
2021-11-25T18:28:27.9881229Z Nov 25 18:28:27 at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
2021-11-25T18:28:27.9881883Z Nov 25 18:28:27 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2021-11-25T18:28:27.9882700Z Nov 25 18:28:27 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2021-11-25T18:28:27.9883223Z Nov 25 18:28:27 ... 1 more
2021-11-25T18:28:27.9883780Z Nov 25 18:28:27 Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Internal server error., <Exception on server side:
2021-11-25T18:28:27.9884529Z Nov 25 18:28:27 org.apache.flink.runtime.client.DuplicateJobSubmissionException: Job has already been submitted.
2021-11-25T18:28:27.9885242Z Nov 25 18:28:27 at org.apache.flink.runtime.client.DuplicateJobSubmissionException.of(DuplicateJobSubmissionException.java:29)
2021-11-25T18:28:27.9885954Z Nov 25 18:28:27 at org.apache.flink.runtime.dispatcher.Dispatcher.submitJob(Dispatcher.java:320)
2021-11-25T18:28:27.9886536Z Nov 25 18:28:27 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2021-11-25T18:28:27.9887090Z Nov 25 18:28:27 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2021-11-25T18:28:27.9887751Z Nov 25 18:28:27 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2021-11-25T18:28:27.9888357Z Nov 25 18:28:27 at java.lang.reflect.Method.invoke(Method.java:498)
2021-11-25T18:28:27.9888989Z Nov 25 18:28:27 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316)
2021-11-25T18:28:27.9889817Z Nov 25 18:28:27 at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
2021-11-25T18:28:27.9890560Z Nov 25 18:28:27 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314)
2021-11-25T18:28:27.9891256Z Nov 25 18:28:27 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
2021-11-25T18:28:27.9891961Z Nov 25 18:28:27 at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78)
2021-11-25T18:28:27.9892834Z Nov 25 18:28:27 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
2021-11-25T18:28:27.9893462Z Nov 25 18:28:27 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
2021-11-25T18:28:27.9894044Z Nov 25 18:28:27 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
2021-11-25T18:28:27.9894632Z Nov 25 18:28:27 at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
2021-11-25T18:28:27.9895213Z Nov 25 18:28:27 at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
2021-11-25T18:28:27.9895795Z Nov 25 18:28:27 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
2021-11-25T18:28:27.9896393Z Nov 25 18:28:27 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
2021-11-25T18:28:27.9896996Z Nov 25 18:28:27 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
2021-11-25T18:28:27.9897602Z Nov 25 18:28:27 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
2021-11-25T18:28:27.9898166Z Nov 25 18:28:27 at akka.actor.Actor.aroundReceive(Actor.scala:537)
2021-11-25T18:28:27.9898683Z Nov 25 18:28:27 at akka.actor.Actor.aroundReceive$(Actor.scala:535)
2021-11-25T18:28:27.9899307Z Nov 25 18:28:27 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
2021-11-25T18:28:27.9900000Z Nov 25 18:28:27 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
2021-11-25T18:28:27.9900547Z Nov 25 18:28:27 at akka.actor.ActorCell.invoke(ActorCell.scala:548)
2021-11-25T18:28:27.9901085Z Nov 25 18:28:27 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
2021-11-25T18:28:27.9901616Z Nov 25 18:28:27 at akka.dispatch.Mailbox.run(Mailbox.scala:231)
2021-11-25T18:28:27.9902200Z Nov 25 18:28:27 at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
2021-11-25T18:28:27.9902967Z Nov 25 18:28:27 at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
2021-11-25T18:28:27.9903587Z Nov 25 18:28:27 at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
2021-11-25T18:28:27.9904182Z Nov 25 18:28:27 at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
2021-11-25T18:28:27.9904805Z Nov 25 18:28:27 at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
2021-11-25T18:28:27.9905290Z Nov 25 18:28:27
2021-11-25T18:28:27.9905666Z Nov 25 18:28:27 End of exception on server side>]
2021-11-25T18:28:27.9906179Z Nov 25 18:28:27 at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:532)
2021-11-25T18:28:27.9906842Z Nov 25 18:28:27 at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:512)
2021-11-25T18:28:27.9907507Z Nov 25 18:28:27 at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:966)
2021-11-25T18:28:27.9908163Z Nov 25 18:28:27 at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940)
2021-11-25T18:28:27.9908681Z Nov 25 18:28:27 ... 4 more
2021-11-25T18:28:27.9909001Z Nov 25 18:28:27
2021-11-25T18:28:27.9909632Z Nov 25 18:28:27 [ERROR] org.apache.flink.yarn.YARNHighAvailabilityITCase.testKillYarnSessionClusterEntrypoint Time elapsed: 1,800.315 s <<< ERROR!
2021-11-25T18:28:27.9910379Z Nov 25 18:28:27 org.junit.runners.model.TestTimedOutException: test timed out after 1800000 milliseconds
2021-11-25T18:28:27.9910924Z Nov 25 18:28:27 at java.lang.Thread.sleep(Native Method)
2021-11-25T18:28:27.9911487Z Nov 25 18:28:27 at org.apache.flink.yarn.YarnClusterDescriptor.startAppMaster(YarnClusterDescriptor.java:1240)
2021-11-25T18:28:27.9912182Z Nov 25 18:28:27 at org.apache.flink.yarn.YarnClusterDescriptor.deployInternal(YarnClusterDescriptor.java:607)
2021-11-25T18:28:27.9913034Z Nov 25 18:28:27 at org.apache.flink.yarn.YarnClusterDescriptor.deploySessionCluster(YarnClusterDescriptor.java:419)
2021-11-25T18:28:27.9913782Z Nov 25 18:28:27 at org.apache.flink.yarn.YARNHighAvailabilityITCase.deploySessionCluster(YARNHighAvailabilityITCase.java:364)
2021-11-25T18:28:27.9914595Z Nov 25 18:28:27 at org.apache.flink.yarn.YARNHighAvailabilityITCase.lambda$testKillYarnSessionClusterEntrypoint$0(YARNHighAvailabilityITCase.java:174)
2021-11-25T18:28:27.9915326Z Nov 25 18:28:27 at org.apache.flink.yarn.YARNHighAvailabilityITCase$$Lambda$503/1259621657.run(Unknown Source)
2021-11-25T18:28:27.9915947Z Nov 25 18:28:27 at org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:288)
2021-11-25T18:28:27.9916650Z Nov 25 18:28:27 at org.apache.flink.yarn.YARNHighAvailabilityITCase.testKillYarnSessionClusterEntrypoint(YARNHighAvailabilityITCase.java:162)
2021-11-25T18:28:27.9917328Z Nov 25 18:28:27 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2021-11-25T18:28:27.9917905Z Nov 25 18:28:27 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2021-11-25T18:28:27.9918570Z Nov 25 18:28:27 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2021-11-25T18:28:27.9919246Z Nov 25 18:28:27 at java.lang.reflect.Method.invoke(Method.java:498)
2021-11-25T18:28:27.9919847Z Nov 25 18:28:27 at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
2021-11-25T18:28:27.9920514Z Nov 25 18:28:27 at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
2021-11-25T18:28:27.9921293Z Nov 25 18:28:27 at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
2021-11-25T18:28:27.9921936Z Nov 25 18:28:27 at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
2021-11-25T18:28:27.9922772Z Nov 25 18:28:27 at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
2021-11-25T18:28:27.9923503Z Nov 25 18:28:27 at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
2021-11-25T18:28:27.9924238Z Nov 25 18:28:27 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2021-11-25T18:28:27.9924757Z Nov 25 18:28:27 at java.lang.Thread.run(Thread.java:748)
2021-11-25T18:28:27.9925156Z Nov 25 18:28:27
2021-11-25T18:28:27.9925694Z Nov 25 18:28:27 [ERROR] org.apache.flink.yarn.YARNHighAvailabilityITCase.testClusterClientRetrieval Time elapsed: 1,800.087 s <<< ERROR!
2021-11-25T18:28:27.9926411Z Nov 25 18:28:27 org.junit.runners.model.TestTimedOutException: test timed out after 1800000 milliseconds
2021-11-25T18:28:27.9926957Z Nov 25 18:28:27 at java.lang.Thread.sleep(Native Method)
2021-11-25T18:28:27.9927499Z Nov 25 18:28:27 at org.apache.flink.yarn.YarnClusterDescriptor.startAppMaster(YarnClusterDescriptor.java:1240)
2021-11-25T18:28:27.9928190Z Nov 25 18:28:27 at org.apache.flink.yarn.YarnClusterDescriptor.deployInternal(YarnClusterDescriptor.java:607)
2021-11-25T18:28:27.9928899Z Nov 25 18:28:27 at org.apache.flink.yarn.YarnClusterDescriptor.deploySessionCluster(YarnClusterDescriptor.java:419)
2021-11-25T18:28:27.9929731Z Nov 25 18:28:27 at org.apache.flink.yarn.YARNHighAvailabilityITCase.deploySessionCluster(YARNHighAvailabilityITCase.java:364)
2021-11-25T18:28:27.9930513Z Nov 25 18:28:27 at org.apache.flink.yarn.YARNHighAvailabilityITCase.lambda$testClusterClientRetrieval$2(YARNHighAvailabilityITCase.java:230)
2021-11-25T18:28:27.9931236Z Nov 25 18:28:27 at org.apache.flink.yarn.YARNHighAvailabilityITCase$$Lambda$504/1893740748.run(Unknown Source)
2021-11-25T18:28:27.9931852Z Nov 25 18:28:27 at org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:288)
2021-11-25T18:28:27.9932684Z Nov 25 18:28:27 at org.apache.flink.yarn.YARNHighAvailabilityITCase.testClusterClientRetrieval(YARNHighAvailabilityITCase.java:225)
2021-11-25T18:28:27.9933406Z Nov 25 18:28:27 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2021-11-25T18:28:27.9933989Z Nov 25 18:28:27 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2021-11-25T18:28:27.9934647Z Nov 25 18:28:27 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2021-11-25T18:28:27.9935251Z Nov 25 18:28:27 at java.lang.reflect.Method.invoke(Method.java:498)
2021-11-25T18:28:27.9935839Z Nov 25 18:28:27 at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
2021-11-25T18:28:27.9936502Z Nov 25 18:28:27 at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
2021-11-25T18:28:27.9937158Z Nov 25 18:28:27 at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
2021-11-25T18:28:27.9937813Z Nov 25 18:28:27 at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
2021-11-25T18:28:27.9938497Z Nov 25 18:28:27 at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
2021-11-25T18:28:27.9939288Z Nov 25 18:28:27 at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
2021-11-25T18:28:27.9939947Z Nov 25 18:28:27 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2021-11-25T18:28:27.9940452Z Nov 25 18:28:27 at java.lang.Thread.run(Thread.java:748)
2021-11-25T18:28:27.9940854Z Nov 25 18:28:27
2021-11-25T18:28:28.9205416Z Nov 25 18:28:28 [ERROR] Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
{code}

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=27085&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=29849



--
This message was sent by Atlassian Jira
(v8.20.1#820001)