[ https://issues.apache.org/jira/browse/FLINK-4142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379079#comment-15379079 ]
Robert Metzger commented on FLINK-4142:
---------------------------------------

Thank you for posting a log as well. It seems to be a YARN-specific issue:

{code}
2016-07-01 15:45:03,452 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Launching TaskManager in container ContainerInLaunch @ 1467387903451: Container: [ContainerId: container_1467387670862_0001_02_000002, NodeId: hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal:40436, NodeHttpAddress: hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal:8042, Resource: <memory:4096, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.240.0.18:40436 }, ] on host hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal
2016-07-01 15:45:03,455 INFO org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - Opening proxy : hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal:40436
2016-07-01 15:45:03,508 ERROR org.apache.flink.yarn.YarnFlinkResourceManager - Could not start TaskManager in container ContainerInLaunch @ 1467387903451: Container: [ContainerId: container_1467387670862_0001_02_000002, NodeId: hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal:40436, NodeHttpAddress: hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal:8042, Resource: <memory:4096, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.240.0.18:40436 }, ]
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
NMToken for application attempt : appattempt_1467387670862_0001_000001 was used for starting container with container token issued for application attempt : appattempt_1467387670862_0001_000002
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
	at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
	at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:206)
	at org.apache.flink.yarn.YarnFlinkResourceManager.containersAllocated(YarnFlinkResourceManager.java:403)
	at org.apache.flink.yarn.YarnFlinkResourceManager.handleMessage(YarnFlinkResourceManager.java:164)
	at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:90)
	at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:70)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-07-01 15:45:03,508 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Requesting new TaskManager container with 4096 megabytes memory. Pending requests: 1
2016-07-01 15:45:03,959 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Container ResourceID{resourceId='container_1467387670862_0001_02_000002'} completed successfully with diagnostics: Container released by application
{code}

The problem was a major bug in Hadoop 2.4.0 that has been fixed in Hadoop 2.5.0: https://issues.apache.org/jira/browse/YARN-2065

I'll add a warning to the YARN documentation page that there are issues with HA on YARN < 2.5.0.


> Recovery problem in HA on Hadoop Yarn 2.4.1
> -------------------------------------------
>
> Key: FLINK-4142
> URL: https://issues.apache.org/jira/browse/FLINK-4142
> Project: Flink
> Issue Type: Bug
> Components: YARN Client
> Affects Versions: 1.0.3
> Reporter: Stefan Richter
>
> On Hadoop Yarn 2.4.1, recovery in HA fails in the following scenario:
> 1) Kill application master, let it recover normally.
> 2) After that, kill a task manager.
> Now, Yarn tries to restart the killed task manager in an endless loop.
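
For anyone debugging a similar trace: the failing frame in the log above is the plain YARN NMClient container launch. Below is a minimal, hypothetical Java sketch of that call path (not Flink's actual YarnFlinkResourceManager code; the class, method, and parameter names are made up for illustration), showing where the "Unauthorized request to start container" rejection from YARN-2065 surfaces on Hadoop 2.4.x:

{code}
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

// Hypothetical helper, for illustration only -- not Flink code.
public class ContainerLaunchSketch {

    public static void launchTaskManager(NMClient nmClient, Container container, String startCommand)
            throws Exception {
        // A real launch context would also carry environment variables and local
        // resources (Flink jars, configuration); only the start command is sketched here.
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        ctx.setCommands(Collections.singletonList(startCommand));

        // This is the call that fails in the log above
        // (org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer).
        // On Hadoop 2.4.x, after the ApplicationMaster has restarted, the NodeManager can
        // reject it with "Unauthorized request to start container" because the cached NMToken
        // still belongs to the previous application attempt (YARN-2065). Flink then releases
        // the container and requests a new one, which produces the endless loop from the issue.
        nmClient.startContainer(container, ctx);
    }

    public static void main(String[] args) throws Exception {
        // The NMClient is created once and kept running for the lifetime of the
        // ApplicationMaster; a Container handed out by the AMRMClient would be passed in.
        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(new YarnConfiguration());
        nmClient.start();
        // launchTaskManager(nmClient, allocatedContainer, "...TaskManager start command...");
    }
}
{code}

The sketch only shows where the rejection happens; the actual fix is entirely on the Hadoop side (YARN-2065), so upgrading to Hadoop 2.5.0 or later is the way around it.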