Hi All,

I started my Flink application on YARN using flink run -m yarn-cluster. After running smoothly for 20 hours, it failed. Ideally the application should have recovered after losing the JobManager (which runs in the same container as the application master), given the fault-tolerant nature of Flink on YARN, but it did not recover and failed.
Please help me debug the logs.

Thank you

Regards,
Anchit

Below are the logs:

2016-11-01 14:12:37,592 INFO org.apache.flink.runtime.client.JobClientActor - 11/01/2016 14:12:36 Parse & Map Record - (Visitor ID, Product List) -> Filtering None Objects -> Fetching Output(148/200) switched to RUNNING
2016-11-02 10:16:42,960 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm1
2016-11-02 10:17:24,026 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
2016-11-02 10:17:40,882 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm1
2016-11-02 10:24:41,964 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@10.66.245.26:47722] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
2016-11-02 10:24:56,311 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@10.66.245.26:47722]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /10.66.245.26:47722
2016-11-02 10:24:56,315 INFO org.apache.flink.runtime.client.JobClientActor - Lost connection to JobManager akka.tcp://flink@10.66.245.26:47722/user/jobmanager. Triggering connection timeout.
2016-11-02 10:24:56,315 INFO org.apache.flink.runtime.client.JobClientActor - Disconnect from JobManager Actor[akka.tcp://flink@10.66.245.26:47722/user/jobmanager#1251121709].
2016-11-02 10:25:56,330 INFO org.apache.flink.runtime.client.JobClientActor - Terminate JobClientActor.
2016-11-02 10:25:56,331 INFO org.apache.flink.runtime.client.JobClientActor - Disconnect from JobManager null.
2016-11-02 10:25:56,333 ERROR org.apache.flink.client.CliFrontend - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Communication with JobManager failed: Lost connection to the JobManager.
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:405)
    at org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:204)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:378)
    at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:68)
    at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:585)
    at com.tgt.prz.streaming.recs.drivers.SessionRecs2$.main(SessionRecs2.scala:126)
    at com.tgt.prz.streaming.recs.drivers.SessionRecs2.main(SessionRecs2.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:509)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:403)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:320)
    at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:777)
    at org.apache.flink.client.CliFrontend.run(CliFrontend.java:253)
    at org.apache.flink.client.CliFrontend$2.run(CliFrontend.java:997)
    at org.apache.flink.client.CliFrontend$2.run(CliFrontend.java:994)
    at org.apache.flink.runtime.security.SecurityUtils$1.run(SecurityUtils.java:56)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
    at org.apache.flink.runtime.security.SecurityUtils.runSecured(SecurityUtils.java:53)
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:994)
    at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1048)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Communication with JobManager failed: Lost connection to the JobManager.
    at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:137)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:401)
    ... 24 more
Caused by: org.apache.flink.runtime.client.JobClientActorConnectionTimeoutException: Lost connection to the JobManager.
    at org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:252)
    at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:90)
    at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:70)
    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
    at akka.dispatch.Mailbox.run(Mailbox.scala:221)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-11-02 10:25:56,341 INFO org.apache.flink.yarn.YarnClusterClient - Sending shutdown request to the Application Master
2016-11-02 10:25:56,341 INFO org.apache.flink.yarn.YarnClusterClient - Start application client.
2016-11-02 10:25:56,344 WARN org.apache.flink.yarn.YarnClusterClient - YARN reported application state FAILED
2016-11-02 10:25:56,344 WARN org.apache.flink.yarn.YarnClusterClient - Diagnostics: Application application_1476277440022_40328 failed 1 times due to Attempt recovered after RM restartAM Container for appattempt_1476277440022_40328_000001 exited with exitCode: 243
For more detailed output, check application tracking page:http://d-3zkvk02.target.com:8088/cluster/app/application_1476277440022_40328Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e3066_1476277440022_40328_01_000001
Exit code: 243
Stack trace: ExitCodeException exitCode=243:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
    at org.apache.hadoop.util.Shell.run(Shell.java:487)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:371)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Shell output: main : command provided 1
main : run as user is A12345
main : requested yarn user is A12345
Container exited with a non-zero exit code 243
Failing this attempt. Failing the application.
2016-11-02 10:25:56,346 INFO org.apache.flink.yarn.ApplicationClient - Notification about new leader address akka.tcp://flink@10.66.245.26:47722/user/jobmanager with session ID null.
2016-11-02 10:25:56,349 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2016-11-02 10:25:56,350 INFO org.apache.flink.yarn.ApplicationClient - Received address of new leader akka.tcp://flink@10.66.245.26:47722/user/jobmanager with session ID null.
2016-11-02 10:25:56,351 INFO org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager null.
2016-11-02 10:25:56,353 INFO org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink@10.66.245.26:47722/user/jobmanager.
2016-11-02 10:25:56,363 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@10.66.245.26:47722]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /10.66.245.26:47722
2016-11-02 10:25:56,870 INFO org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink@10.66.245.26:47722/user/jobmanager.
2016-11-02 10:25:57,369 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2016-11-02 10:25:57,889 INFO org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink@10.66.245.26:47722/user/jobmanager.
2016-11-02 10:25:58,389 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2016-11-02 10:25:59,410 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2016-11-02 10:25:59,909 INFO org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink@10.66.245.26:47722/user/jobmanager.
2016-11-02 10:26:00,429 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2016-11-02 10:26:01,449 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2016-11-02 10:26:02,469 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2016-11-02 10:26:03,489 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2016-11-02 10:26:03,929 INFO org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink@10.66.245.26:47722/user/jobmanager.
2016-11-02 10:26:03,935 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@10.66.245.26:47722]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /10.66.245.26:47722
2016-11-02 10:26:04,509 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2016-11-02 10:26:05,529 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2016-11-02 10:26:06,345 WARN org.apache.flink.yarn.YarnClusterClient - Error while stopping YARN cluster.
java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
    at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:86)
    at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:86)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.ready(package.scala:86)
    at scala.concurrent.Await.ready(package.scala)
    at org.apache.flink.yarn.YarnClusterClient.shutdownCluster(YarnClusterClient.java:366)
    at org.apache.flink.yarn.YarnClusterClient.finalizeCluster(YarnClusterClient.java:336)
    at org.apache.flink.client.program.ClusterClient.shutdown(ClusterClient.java:206)
    at org.apache.flink.client.CliFrontend.run(CliFrontend.java:260)
    at org.apache.flink.client.CliFrontend$2.run(CliFrontend.java:997)
    at org.apache.flink.client.CliFrontend$2.run(CliFrontend.java:994)
    at org.apache.flink.runtime.security.SecurityUtils$1.run(SecurityUtils.java:56)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
    at org.apache.flink.runtime.security.SecurityUtils.runSecured(SecurityUtils.java:53)
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:994)
    at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1048)
2016-11-02 10:26:06,347 INFO org.apache.flink.yarn.YarnClusterClient - Deleting files in hdfs://littleredns/user/A12345/.flink/application_1476277440022_40328
2016-11-02 10:26:06,530 INFO org.apache.flink.yarn.YarnClusterClient - Application application_1476277440022_40328 finished with state FAILED and final state FAILED at 1478100282775
2016-11-02 10:26:06,530 WARN org.apache.flink.yarn.YarnClusterClient - Application failed. Diagnostics Application application_1476277440022_40328 failed 1 times due to Attempt recovered after RM restartAM Container for appattempt_1476277440022_40328_000001 exited with exitCode: 243
For more detailed output, check application tracking page:http://d-3zkvk02.target.com:8088/cluster/app/application_1476277440022_40328Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e3066_1476277440022_40328_01_000001
Exit code: 243
Stack trace: ExitCodeException exitCode=243:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
    at org.apache.hadoop.util.Shell.run(Shell.java:487)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:371)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Shell output: main : command provided 1
main : run as user is A12345
main : requested yarn user is A12345
Container exited with a non-zero exit code 243
Failing this attempt. Failing the application.
2016-11-02 10:26:06,531 WARN org.apache.flink.yarn.YarnClusterClient - If log aggregation is activated in the Hadoop cluster, we recommend to retrieve the full application log using this command:
    yarn logs -appReport application_1476277440022_40328
(It sometimes takes a few seconds until the logs are aggregated)
2016-11-02 10:26:06,531 INFO org.apache.flink.yarn.YarnClusterClient - YARN Client is shutting down
2016-11-02 10:26:06,532 INFO org.apache.flink.yarn.ApplicationClient - Stopped Application client.
2016-11-02 10:26:06,533 INFO org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager null.
2016-11-02 10:26:06,536 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2016-11-02 10:26:06,537 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2016-11-02 10:26:06,558 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
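
For anyone triaging a client log like the one above, here is a minimal sketch that splits the text into entries and keeps only the WARN and ERROR ones. It assumes every entry starts with a "yyyy-MM-dd HH:mm:ss,SSS LEVEL logger - " prefix, as the entries in this log do, and it also works when the archive has flattened the log onto a few long lines. The names split_entries, problems and SAMPLE are made up for illustration; they are not part of Flink or YARN.

```python
import re
from collections import Counter

# Start of an entry, e.g. "2016-11-02 10:24:41,964 WARN some.logger.Name - "
ENTRY_START = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(TRACE|DEBUG|INFO|WARN|ERROR)\s+(\S+)\s+-\s+"
)

# A few entries copied from the log above, flattened the way the archive shows them.
SAMPLE = (
    "2016-11-02 10:16:42,960 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm1 "
    "2016-11-02 10:24:41,964 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system "
    "[akka.tcp://flink@10.66.245.26:47722] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. "
    "2016-11-02 10:24:56,315 INFO org.apache.flink.runtime.client.JobClientActor - Lost connection to JobManager "
    "akka.tcp://flink@10.66.245.26:47722/user/jobmanager. Triggering connection timeout. "
    "2016-11-02 10:25:56,333 ERROR org.apache.flink.client.CliFrontend - Error while running the command. "
    "org.apache.flink.client.program.ProgramInvocationException: The program execution failed: "
    "Communication with JobManager failed: Lost connection to the JobManager."
)


def split_entries(text):
    """Return (timestamp, level, logger, message) tuples, one per log entry."""
    matches = list(ENTRY_START.finditer(text))
    entries = []
    for i, m in enumerate(matches):
        # The message runs until the next entry prefix (or the end of the text),
        # so stack traces stay attached to the entry that logged them.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        entries.append((m.group(1), m.group(2), m.group(3), text[m.end():end].strip()))
    return entries


def problems(entries, levels=("WARN", "ERROR")):
    """Keep only the entries whose level is in `levels`."""
    return [e for e in entries if e[1] in levels]


if __name__ == "__main__":
    entries = split_entries(SAMPLE)
    print(Counter(level for _, level, _, _ in entries))
    for ts, level, logger, message in problems(entries):
        first_line = message.splitlines()[0] if message else ""
        print(ts, level, logger, first_line[:100])
```

To run it against a saved client log, read the file into a string and pass it to split_entries in place of SAMPLE.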