[ https://issues.apache.org/jira/browse/FLINK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309649#comment-17309649 ]
Matthias commented on FLINK-21148:
----------------------------------

I looked over the issue with [~rmetzger]. The actual reason seems to be that the YARN containers get [killed at the end of the test|https://github.com/apache/flink/blob/5e08e55caede0c81100d7032257133854de1155c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java#L192]. There's a race condition between stopping the TaskManager and stopping the JobManager. If the JM is stopped first, the TM may still be trying to access the JM's BLOB server at that moment. The TM then loses the connection and logs the connection error. The exception ends up in the TaskManager's log output, which triggers the test failure.

The following logs showcase this based on the build reported in the Jira issue's description (application folder: {{./container_1611618440792_0002_01_000001/}}).

{code}
[...]
23:48:07,987 [ Time-limited test] INFO org.apache.flink.yarn.YARNSessionFIFOITCase [] - Two containers are running. Killing the application
23:48:07,988 [ Time-limited test] INFO org.apache.hadoop.yarn.client.RMProxy [] - Connecting to ResourceManager at 29c91476178c/172.21.0.2:37502
23:48:07,991 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - application_1611618440792_0002 State change from RUNNING to KILLING
23:48:07,991 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] - Updating application attempt appattempt_1611618440792_0002_000001 with final state: KILLED
23:48:07,991 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] - appattempt_1611618440792_0002_000001 State change from RUNNING to FINAL_SAVING
23:48:07,991 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService [] - Unregistering app attempt : appattempt_1611618440792_0002_000001
23:48:07,992 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] - appattempt_1611618440792_0002_000001 State change from FINAL_SAVING to KILLED
23:48:07,992 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - Updating application application_1611618440792_0002 with final state: KILLED
23:48:07,992 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - application_1611618440792_0002 State change from KILLING to FINAL_SAVING
23:48:07,992 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore [] - Storing info for app: application_1611618440792_0002
23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl [] - container_1611618440792_0002_01_000001 Container Transitioned from RUNNING to KILLED
23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp [] - Completed container: container_1611618440792_0002_01_000001 in state: KILLED event:KILL
23:48:07,992 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - application_1611618440792_0002 State change from FINAL_SAVING to KILLED
23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1611618440792_0002 CONTAINERID=container_1611618440792_0002_01_000001
23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode [] - Released container container_1611618440792_0002_01_000001 of capacity <memory:1024, vCores:1> on host 29c91476178c:36323, which currently has 1 containers, <memory:1024, vCores:1> used and <memory:3072, vCores:665> available, release resources=true
23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler [] - Application attempt appattempt_1611618440792_0002_000001 released container container_1611618440792_0002_01_000001 on node: host: 29c91476178c:36323 #containers=1 available=3072 used=1024 with event: KILL
23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl [] - container_1611618440792_0002_01_000002 Container Transitioned from RUNNING to KILLED
23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp [] - Completed container: container_1611618440792_0002_01_000002 in state: KILLED event:KILL
23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1611618440792_0002 CONTAINERID=container_1611618440792_0002_01_000002
23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode [] - Released container container_1611618440792_0002_01_000002 of capacity <memory:1024, vCores:1> on host 29c91476178c:36323, which currently has 0 containers, <memory:0, vCores:0> used and <memory:4096, vCores:666> available, release resources=true
23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler [] - Application attempt appattempt_1611618440792_0002_000001 released container container_1611618440792_0002_01_000002 on node: host: 29c91476178c:36323 #containers=0 available=4096 used=0 with event: KILL
23:48:07,993 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo [] - Application application_1611618440792_0002 requests cleared
23:48:07,993 [ pool-3-thread-4] INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher [] - Cleaning master appattempt_1611618440792_0002_000001
23:48:07,993 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop OPERATION=Application Finished - Killed TARGET=RMAppManager RESULT=SUCCESS APPID=application_1611618440792_0002
23:48:07,993 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary [] - appId=application_1611618440792_0002,name=MyCustomName,user=hadoop,queue=default,state=KILLED,trackingUrl=http://29c91476178c:46794/cluster/app/application_1611618440792_0002,appMasterHost=N/A,startTime=1611618467077,finishTime=1611618487992,finalStatus=KILLED
23:48:07,996 [Socket Reader #1 for port 36323] INFO SecurityLogger.org.apache.hadoop.ipc.Server [] - Auth successful for appattempt_1611618440792_0002_000001 (auth:SIMPLE)
23:48:07,998 [Socket Reader #1 for port 36323] INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager [] - Authorization successful for appattempt_1611618440792_0002_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
23:48:08,000 [IPC Server handler 11 on 36323] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl [] - Stopping container with container Id: container_1611618440792_0002_01_000001
23:48:08,000 [IPC Server handler 11 on 36323] INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger [] - USER=hadoop IP=172.21.0.2 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1611618440792_0002 CONTAINERID=container_1611618440792_0002_01_000001
23:48:08,000 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000001 transitioned from RUNNING to KILLING
23:48:08,000 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch [] - Cleaning up container container_1611618440792_0002_01_000001
23:48:08,008 [ContainersLauncher #0] WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - Exit code from container container_1611618440792_0002_01_000001 is : 143
23:48:08,023 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000002 transitioned from RUNNING to KILLING
23:48:08,023 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application [] - Application application_1611618440792_0002 transitioned from RUNNING to FINISHING_CONTAINERS_WAIT
23:48:08,023 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000001 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
23:48:08,023 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch [] - Cleaning up container container_1611618440792_0002_01_000002
23:48:08,029 [ContainersLauncher #1] WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - Exit code from container container_1611618440792_0002_01_000002 is : 143
23:48:08,043 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000002 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
23:48:08,043 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger [] - USER=hadoop OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1611618440792_0002 CONTAINERID=container_1611618440792_0002_01_000001
23:48:08,043 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000001 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
23:48:08,043 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application [] - Removing container_1611618440792_0002_01_000001 from application application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl [] - Neither virutal-memory nor physical-memory monitoring is needed. Not running the monitor-thread
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got event CONTAINER_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger [] - USER=hadoop OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1611618440792_0002 CONTAINERID=container_1611618440792_0002_01_000002
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000002 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application [] - Removing container_1611618440792_0002_01_000002 from application application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application [] - Application application_1611618440792_0002 transitioned from FINISHING_CONTAINERS_WAIT to APPLICATION_RESOURCES_CLEANINGUP
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl [] - Neither virutal-memory nor physical-memory monitoring is needed. Not running the monitor-thread
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got event CONTAINER_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got event APPLICATION_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application [] - Application application_1611618440792_0002 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler [] - Scheduling Log Deletion for application: application_1611618440792_0002, with delay of 10800 seconds
23:48:08,192 [IPC Server handler 35 on 37502] INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop IP=172.21.0.2 OPERATION=Kill Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1611618440792_0002
23:48:08,193 [ Time-limited test] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl [] - Killed application application_1611618440792_0002
[...]
{code}

> YARNSessionFIFOSecuredITCase cannot connect to BlobServer
> ---------------------------------------------------------
>
>                 Key: FLINK-21148
>                 URL: https://issues.apache.org/jira/browse/FLINK-21148
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Tests
>    Affects Versions: 1.11.3, 1.13.0
>            Reporter: Dawid Wysakowicz
>            Assignee: Matthias
>            Priority: Major
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12483&view=logs&j=f450c1a5-64b1-5955-e215-49cb1ad5ec88&t=ea63c80c-957f-50d1-8f67-3671c14686b9
> {code}
> java.io.IOException: Could not connect to BlobServer at address 29c91476178c/172.21.0.2:44412
> java.io.IOException: Could not connect to BlobServer at address 29c91476178c/172.21.0.2:44412
>     at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:102) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>     at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:137) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>     at org.apache.flink.yarn.YarnTestBase.ensureNoProhibitedStringInLogFiles(YarnTestBase.java:538)
>     at org.apache.flink.yarn.YARNSessionFIFOITCase.checkForProhibitedLogContents(YARNSessionFIFOITCase.java:84)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>     at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
>     at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>     at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>     at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>     at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>     at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>     at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>     at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>     at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>     at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>     at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>     at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>     at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>     at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>     at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>     at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>     at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)