Since you're running in a container, the question is whether the container where the JM is running can access the ZooKeeper at 10.200.0.6.
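
One way to check this is to run a small TCP probe from inside the JM container (or at least on the NodeManager host that launched it) rather than from the machine you ssh from. The following is only a minimal sketch: the helper names `parse_quorum` and `can_reach` are made up for illustration, and the addresses in the usage note are the quorum entries and port 2181 taken from your log. A successful ssh to 10.200.0.6 exercises port 22 from a different machine, so it does not by itself show that the container can open a connection to the ZooKeeper client port.

```python
import socket


def parse_quorum(quorum):
    """Split 'host1:port1,host2:port2' into a list of (host, port) pairs."""
    pairs = []
    for entry in quorum.split(","):
        # Split on the last colon so the port is always the trailing part.
        host, _, port = entry.strip().rpartition(":")
        pairs.append((host, int(port)))
    return pairs


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Name-resolution failures, refused connections and timeouts are all
    subclasses of OSError, so each of them yields False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `[(h, p, can_reach(h, p)) for h, p in parse_quorum("10.200.0.6:2181,10.200.0.7:2181,10.200.0.9:2181")]` reports each quorum member separately, which helps tell a single unreachable node apart from a container that has no route to the ZooKeeper network at all.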

> On 27. Sep 2017, at 04:31, Sridhar Chellappa <flinken...@gmail.com> wrote:
>
> Emily,
>
> I did not get a chance to capture the logs on the container. Since I have
> erased the instances, I have lost access to the logs. I have moved to no-ha
> mode (single master) and it is running OK.
>
> Aljoscha,
>
> Network connectivity is good. I am able to ssh to 10.200.0.6.
>
> Will try the HA mode and capture all the logs and send them over
>
> On Tue, Sep 26, 2017 at 6:37 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
> Is the IP 10.200.0.6 reachable from the machine that runs the JobManager?
>
>> On 25. Sep 2017, at 19:58, Emily McMahon <emil...@remitly.com> wrote:
>>
>> What's in the container log for the container that failed?
>>
>> On Sep 11, 2017 2:17 AM, "Sridhar Chellappa" <flinken...@gmail.com> wrote:
>> I am trying to start Flink (version 1.3.0) on YARN (Hadoop 2.8.1) by issuing
>> the following command:
>>
>> ~/flink-1.3.0/bin/yarn-session.sh -s 4 -n 10 -jm 4096 -tm 4096-d
>>
>> I am seeing a flurry of these errors:
>>
>> 2017-09-11 08:17:11,410 INFO  org.apache.flink.yarn.YarnClusterDescriptor - Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster
>> 2017-09-11 08:17:11,661 INFO  org.apache.flink.yarn.YarnClusterDescriptor - Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster
>> 2017-09-11 08:17:11,912 INFO  org.apache.flink.yarn.YarnClusterDescriptor - Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster
>> 2017-09-11 08:17:12,163 INFO  org.apache.flink.yarn.YarnClusterDescriptor - Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster
>>
>> And then my deployment fails with the following exception:
>>
>> Error while deploying YARN cluster: Couldn't deploy Yarn cluster
>> java.lang.RuntimeException: Couldn't deploy Yarn cluster
>>     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:439)
>>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:630)
>>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:486)
>>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:483)
>>     at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:422)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>     at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
>>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:483)
>> Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
>> Diagnostics from YARN: Application application_1504851547322_0003 failed 2 times due to AM Container for appattempt_1504851547322_0003_000002 exited with exitCode: 31
>> Failing this attempt.Diagnostics: Exception from container-launch.
>> Container id: container_1504851547322_0003_02_000001
>> Exit code: 31
>> Stack trace: ExitCodeException exitCode=31:
>>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
>>     at org.apache.hadoop.util.Shell.run(Shell.java:869)
>>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
>>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
>>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
>>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:748)
>>
>> Further debugging in the JobManager logs shows:
>>
>> Resetting connection and trying again with a new connection.
>> 2017-09-11 08:17:11,820 INFO  org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=high-availability.zookeeper.quorum: 10.200.0.6:2181,10.200.0.7:2181,10.200.0.9:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@57bd802b
>> 2017-09-11 08:17:11,927 ERROR org.apache.flink.yarn.YarnApplicationMasterRunner - YARN Application Master initialization failed
>> java.net.UnknownHostException: high-availability.zookeeper.quorum: 10.200.0.6: Name or service not known
>>     at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
>>     at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
>>     at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
>>     at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
>>     at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>>     at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>>     at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
>>
>> Any help in figuring this out will be appreciated.