Hi All,

I'm trying to set up a Flink 1.0.0 cluster on Docker (separate containers for the JobManager and the TaskManager) inside AWS, using the AWS ECS service. I tested it locally and it works fine, but on AWS Docker I am running into the following issue:
2016-03-09 18:04:12,114 PST [INFO] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager.JobManager - Starting JobManager with high-availability
2016-03-09 18:04:12,118 PST [INFO] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager.JobManager - Starting JobManager on 172.31.63.152:8079 with execution mode CLUSTER
2016-03-09 18:04:12,172 PST [INFO] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager.JobManager - Security is not enabled. Starting non-authenticated JobManager.
2016-03-09 18:04:12,174 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] org.apache.flink.util.NetUtils - Trying to open socket on port 8079
2016-03-09 18:04:12,176 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] org.apache.flink.util.NetUtils - Unable to allocate socket on port
java.net.BindException: Cannot assign requested address
    at java.net.PlainSocketImpl.socketBind(Native Method)
    at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
    at java.net.ServerSocket.bind(ServerSocket.java:375)
    at java.net.ServerSocket.<init>(ServerSocket.java:237)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2$$anon$3.createSocket(JobManager.scala:1722)
    at org.apache.flink.util.NetUtils.createSocketFromPorts(NetUtils.java:237)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:1719)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:1717)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:1717)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:1772)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:1717)
    at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1653)
    at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2016-03-09 18:04:12,180 PST [ERROR] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager.JobManager - Failed to run JobManager.
java.lang.RuntimeException: Unable to do further retries starting the actor system
    at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:1777)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:1717)
    at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1653)
    at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2016-03-09 18:04:12,991 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.h.m.lib.MutableMetricsFactory - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, value=[Rate of successful kerberos logins and latency (milliseconds)], valueName=Time)

Initially the JobManager tried to bind to port 0, which did not work. Looking into it further, I tried setting recovery.jobmanager.port with different port combinations, but that does not seem to work either. I've exposed the ports in the Docker Compose file as well.

Please find attached the JobManager log file and the JobManager config file for details.

--
Thanks,
Deepak Jha
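A note for anyone reproducing this: the log shows the JobManager starting on 172.31.63.152:8079 and then failing with "Cannot assign requested address". The JVM typically reports that error when the address is not assigned to any network interface visible to the process. One possible explanation, not confirmed in this thread, is that the container runs on Docker's bridge network and so does not see the EC2 host's address. The sketch below is a hypothetical diagnostic (the `can_bind` helper is an assumption for illustration and is not part of Flink) that can be run inside the container to see which addresses are bindable. It uses port 0 so that an "address already in use" conflict is not mistaken for an unassignable address.

```python
import socket


def can_bind(host: str, port: int = 0) -> bool:
    """Return True if a TCP socket can be bound to (host, port).

    Port 0 asks the OS for any free port, so a False result points at
    the address itself rather than at a port that is already taken.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind((host, port))
        return True
    except OSError:
        # Covers "Cannot assign requested address" as well as
        # name-resolution failures (socket.gaierror is an OSError).
        return False


if __name__ == "__main__":
    # 172.31.63.152 is the address taken from the log above; the other
    # two are the wildcard and loopback addresses for comparison.
    for host in ("0.0.0.0", "127.0.0.1", "172.31.63.152"):
        print(host, "bindable" if can_bind(host) else "NOT bindable")
```

If the wildcard and loopback addresses are bindable but the address from the log is not, the process is being told to bind to an address that its network namespace does not own.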
2016-03-09 18:04:11,887 PST [INFO] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager.JobManager - --------------------------------------------------------------------------------
2016-03-09 18:04:11,888 PST [INFO] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager.JobManager - Registered UNIX signal handlers for [TERM, HUP, INT]
2016-03-09 18:04:12,070 PST [INFO] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager.JobManager - Loading configuration from /opt/flink-1.0.0/conf
2016-03-09 18:04:12,082 PST [WARN] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Error while reading configuration: Cannot read property 0
2016-03-09 18:04:12,083 PST [WARN] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Error while reading configuration: Cannot read property 1
2016-03-09 18:04:12,083 PST [WARN] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Error while reading configuration: Cannot read property 2
2016-03-09 18:04:12,091 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: recovery.jobmanager.port, 8079
2016-03-09 18:04:12,095 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, ec2-52-3-248-202.compute-1.amazonaws.com
2016-03-09 18:04:12,095 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2016-03-09 18:04:12,095 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 512
2016-03-09 18:04:12,096 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: blob.server.port, 50100-50200
2016-03-09 18:04:12,096 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8080
2016-03-09 18:04:12,097 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: state.backend, filesystem
2016-03-09 18:04:12,097 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: state.backend.fs.checkpointdir, s3://flink-dev/checkpoints
2016-03-09 18:04:12,104 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: fs.hdfs.hadoopconf, /opt/flink/conf
2016-03-09 18:04:12,105 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: fs.overwrite-files, true
2016-03-09 18:04:12,105 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: fs.output.always-create-directory, true
2016-03-09 18:04:12,105 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: recovery.mode, zookeeper
2016-03-09 18:04:12,105 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: recovery.zookeeper.quorum, 52.87.232.166:2181,54.88.145.121:2181,52.3.253.96:2181
2016-03-09 18:04:12,106 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: recovery.zookeeper.path.root, /flink-dev
2016-03-09 18:04:12,106 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: recovery.zookeeper.storageDir, s3://flink-dev/zk_recovery
2016-03-09 18:04:12,106 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: savepoints.state.backend, filesystem
2016-03-09 18:04:12,107 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.f.c.GlobalConfiguration - Loading configuration property: savepoints.state.backend.fs.dir, s3://flink-dev/savepoints
2016-03-09 18:04:12,114 PST [INFO] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager.JobManager - Starting JobManager with high-availability
2016-03-09 18:04:12,118 PST [INFO] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager.JobManager - Starting JobManager on 172.31.63.152:8079 with execution mode CLUSTER
2016-03-09 18:04:12,172 PST [INFO] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager.JobManager - Security is not enabled. Starting non-authenticated JobManager.
2016-03-09 18:04:12,174 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] org.apache.flink.util.NetUtils - Trying to open socket on port 8079
2016-03-09 18:04:12,176 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] org.apache.flink.util.NetUtils - Unable to allocate socket on port
java.net.BindException: Cannot assign requested address
    at java.net.PlainSocketImpl.socketBind(Native Method)
    at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
    at java.net.ServerSocket.bind(ServerSocket.java:375)
    at java.net.ServerSocket.<init>(ServerSocket.java:237)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2$$anon$3.createSocket(JobManager.scala:1722)
    at org.apache.flink.util.NetUtils.createSocketFromPorts(NetUtils.java:237)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:1719)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:1717)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:1717)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:1772)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:1717)
    at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1653)
    at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2016-03-09 18:04:12,180 PST [ERROR] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager.JobManager - Failed to run JobManager.
java.lang.RuntimeException: Unable to do further retries starting the actor system
    at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:1777)
    at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:1717)
    at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1653)
    at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2016-03-09 18:04:12,991 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.h.m.lib.MutableMetricsFactory - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, value=[Rate of successful kerberos logins and latency (milliseconds)], valueName=Time)
2016-03-09 18:04:13,006 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.h.m.lib.MutableMetricsFactory - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, value=[Rate of failed kerberos logins and latency (milliseconds)], valueName=Time)
2016-03-09 18:04:13,007 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.h.m.lib.MutableMetricsFactory - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, value=[GetGroups], valueName=Time)
2016-03-09 18:04:13,008 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.h.m.impl.MetricsSystemImpl - UgiMetrics, User and group related metrics
2016-03-09 18:04:13,217 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] org.apache.hadoop.util.Shell - Failed to detect a valid hadoop home directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
    at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:303)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:328)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
    at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:272)
    at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:790)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:760)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:633)
    at org.apache.flink.runtime.util.EnvironmentInformation.getUserRunning(EnvironmentInformation.java:90)
    at org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:284)
    at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1595)
    at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
2016-03-09 18:04:13,319 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] org.apache.hadoop.util.Shell - setsid exited with exit code 0
2016-03-09 18:04:13,325 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.h.s.a.util.KerberosName - Kerberos krb5 configuration not found, setting default realm to empty
2016-03-09 18:04:13,328 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] org.apache.hadoop.security.Groups - Creating new Groups object
2016-03-09 18:04:13,329 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.hadoop.util.NativeCodeLoader - Trying to load the custom-built native-hadoop library...
2016-03-09 18:04:13,330 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.hadoop.util.NativeCodeLoader - Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
2016-03-09 18:04:13,330 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.hadoop.util.NativeCodeLoader - java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2016-03-09 18:04:13,330 PST [WARN] ec2-52-3-248-202.compute-1.ama [main] o.a.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-03-09 18:04:13,331 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.hadoop.util.PerformanceAdvisory - Falling back to shell based
2016-03-09 18:04:13,332 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.h.s.JniBasedUnixGroupsMappingWithFallback - Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
2016-03-09 18:04:13,462 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] org.apache.hadoop.security.Groups - Group mapping impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; cacheTimeout=300000; warningDeltaMs=5000
2016-03-09 18:04:13,469 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.h.security.UserGroupInformation - hadoop login
2016-03-09 18:04:13,470 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.h.security.UserGroupInformation - hadoop login commit
2016-03-09 18:04:13,474 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.h.security.UserGroupInformation - using local user:UnixPrincipal: root
2016-03-09 18:04:13,476 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.h.security.UserGroupInformation - Using user: "UnixPrincipal: root" with name root
2016-03-09 18:04:13,476 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.h.security.UserGroupInformation - User entry: "root"
2016-03-09 18:04:13,477 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.h.security.UserGroupInformation - UGI loginUser:root (auth:SIMPLE)
2016-03-09 18:04:13,478 PST [INFO] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager