Emily, I did not get a chance to capture the logs on the container. Since I have erased the instances, I have lost access to the logs. I have moved to non-HA mode (single master), and it is running OK.
Aljoscha, network connectivity is good. I am able to ssh to 10.200.0.6. I will try HA mode again, capture all the logs, and send them over.

On Tue, Sep 26, 2017 at 6:37 PM, Aljoscha Krettek <aljos...@apache.org> wrote:

> Is the IP 10.200.0.6 reachable from the machine that runs the JobManager?
>
> On 25. Sep 2017, at 19:58, Emily McMahon <emil...@remitly.com> wrote:
>
> What's in the container log for the container that failed?
>
> On Sep 11, 2017 2:17 AM, "Sridhar Chellappa" <flinken...@gmail.com> wrote:
>
> I am trying to start Flink (version 1.3.0) on YARN (Hadoop 2.8.1) by
> issuing the following command:
>
> ~/flink-1.3.0/bin/yarn-session.sh -s 4 -n 10 -jm 4096 -tm 4096 -d
>
> I am seeing a flurry of these errors:
>
> 2017-09-11 08:17:11,410 INFO org.apache.flink.yarn.YarnClusterDescriptor - Deployment took more than 60 seconds.
> Please check if the requested resources are available in the YARN cluster
> 2017-09-11 08:17:11,661 INFO org.apache.flink.yarn.YarnClusterDescriptor - Deployment took more than 60 seconds.
> Please check if the requested resources are available in the YARN cluster
> 2017-09-11 08:17:11,912 INFO org.apache.flink.yarn.YarnClusterDescriptor - Deployment took more than 60 seconds.
> Please check if the requested resources are available in the YARN cluster
> 2017-09-11 08:17:12,163 INFO org.apache.flink.yarn.YarnClusterDescriptor - Deployment took more than 60 seconds.
> Please check if the requested resources are available in the YARN cluster
>
>
> And then, my deployment fails with the following exception:
>
> Error while deploying YARN cluster: Couldn't deploy Yarn cluster
> java.lang.RuntimeException: Couldn't deploy Yarn cluster
>     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:439)
>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:630)
>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:486)
>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:483)
>     at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>     at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:483)
> Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
> Diagnostics from YARN: Application application_1504851547322_0003 failed 2 times due to AM Container for appattempt_1504851547322_0003_000002 exited with exitCode: 31
> Failing this attempt.Diagnostics: Exception from container-launch.
> Container id: container_1504851547322_0003_02_000001
> Exit code: 31
> Stack trace: ExitCodeException exitCode=31:
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
>     at org.apache.hadoop.util.Shell.run(Shell.java:869)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
>
>
> Further debugging in the JobManager logs shows:
>
> Resetting connection and trying again with a new connection.
> 2017-09-11 08:17:11,820 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=high-availability.zookeeper.quorum: 10.200.0.6:2181,10.200.0.7:2181,10.200.0.9:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@57bd802b
> 2017-09-11 08:17:11,927 ERROR org.apache.flink.yarn.YarnApplicationMasterRunner - YARN Application Master initialization failed
> java.net.UnknownHostException: high-availability.zookeeper.quorum: 10.200.0.6: Name or service not known
>     at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
>     at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
>     at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
>     at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>     at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
>
>
> Any help in figuring this out will be appreciated.
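
A note on the trace above: the name the ZooKeeper client tried to resolve is `high-availability.zookeeper.quorum: 10.200.0.6`, so the first host entry of the connect string appears to carry the configuration key itself. One possible cause (an assumption, since the flink-conf.yaml is not shown in this thread) is that the key text ended up inside the configured value. The following is a minimal Python sketch of a sanity check for such a quorum string. `check_quorum` is a hypothetical helper, not part of Flink or ZooKeeper, and it deliberately ignores optional chroot suffixes.

```python
import re

# Hypothetical helper, not part of Flink or ZooKeeper.
# Simplifying assumption: every comma-separated entry must be a plain
# host:port pair (no chroot path suffix, no IPv6 literals).
_ENTRY = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?:\d{1,5}$")


def check_quorum(connect_string):
    """Return the entries that do not look like a plain host:port pair."""
    bad = []
    for entry in connect_string.split(","):
        entry = entry.strip()
        if not _ENTRY.match(entry):
            bad.append(entry)
    return bad


# Value exactly as it appears in the connectString log line above.
from_log = ("high-availability.zookeeper.quorum: "
            "10.200.0.6:2181,10.200.0.7:2181,10.200.0.9:2181")
print(check_quorum(from_log))
# prints ['high-availability.zookeeper.quorum: 10.200.0.6:2181']

# The same three hosts without the stray key prefix.
print(check_quorum("10.200.0.6:2181,10.200.0.7:2181,10.200.0.9:2181"))
# prints []
```

If the check flags the first entry, as it does for the logged value, the quorum setting in the HA configuration is worth inspecting before retrying the YARN session.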