Hi Stephan,

I was able to figure out the issue. Here is my explanation.

As I said, I'm trying to set up a Flink HA cluster in Docker containers managed by Amazon ECS. I have a remote ZooKeeper cluster running in AWS. There are a few issues when deploying it with Docker:

--- Flink uses *jobmanager.rpc.address* both to bind and as the address it stores in ZooKeeper. This address could be either the host's IP address or the running container's IP address. If I set it to the host's IP address, the jobmanager cannot bind, because that is not the container's IP address. If I use the container's IP address, it can bind, but the address it then publishes to ZooKeeper is the container's IP address, so remote taskmanagers cannot discover it. Ideally *jobmanager.rpc.address* should be split into *jobmanager.bind.address* (the address the jobmanager binds to) and *jobmanager.discovery.address* (the address published in ZooKeeper so that remote taskmanagers can discover it).

For example, let's assume:

EC2_Instance_Ip = 1.1.1.1
Container_Ip = 2.2.2.2 (this container is running on this EC2 instance)
recovery.jobmanager.port = 3000
jobmanager.web.port = 8080

I mapped port 3000 on the container to 3000 on the host, and 8080 on the container to 8080 on the host.

In flink-conf.yaml, assume:

*Case 1*
jobmanager.rpc.address = 2.2.2.2 (container's IP address)
Now 2.2.2.2 is written to ZooKeeper. An external taskmanager will try to use this address to communicate with the jobmanager, but it cannot connect because 2.2.2.2 is not reachable from outside the EC2 instance.

*Case 2*
jobmanager.rpc.address = 1.1.1.1 (EC2 instance's IP address)
The container does not know this address, so the jobmanager cannot bind at all.

As you can see, we need two IP addresses: one for binding and another for discovery.

--- In the Docker world we have to expose every port we want to use (in bridged network mode). By default the jobmanager uses a random port number for communication. Since we do not know that port number in advance, we set *recovery.jobmanager.port* and exposed it in the Dockerfile. The same applies to blob.server.port on the taskmanagers.

I hope that clarifies it. Please let me know if you have any other questions.

On Thu, Mar 10, 2016 at 10:47 AM, Stephan Ewen <se...@apache.org> wrote:

> Hi!
>
> Is it possible that the docker container config forbids to open ports?
> Flink will try to open some ports and needs the OS or container to permit
> that.
>
> Greetings,
> Stephan
>
>
> On Thu, Mar 10, 2016 at 6:27 PM, Deepak Jha <dkjhan...@gmail.com> wrote:
>
> > Hi Stephan,
> > I tried 0.10.2 as well still running into the same issue.
> >
> > On Thursday, March 10, 2016, Deepak Jha <dkjhan...@gmail.com> wrote:
> >
> > > Yes. Flink 1.0.0
> > >
> > > On Thursday, March 10, 2016, Stephan Ewen <se...@apache.org> wrote:
> > >
> > >> Hi!
> > >>
> > >> Is this Flink 1.0.0 ?
> > >>
> > >> Stephan
> > >>
> > >>
> > >> On Thu, Mar 10, 2016 at 6:02 AM, Deepak Jha <dkjhan...@gmail.com> wrote:
> > >>
> > >> > Hi All,
> > >> >
> > >> > I'm trying to setup Flink 1.0.0 cluster on Docker (separate containers for
> > >> > jobmanager and taskmanager) inside AWS (Using AWS ECS service). I tested it
> > >> > locally and its working fine but on AWS Docker, I am running into following
> > >> > issue
> > >> >
> > >> > 2016-03-09 18:04:12,114 PST [INFO] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager.JobManager - Starting JobManager with high-availability
> > >> > 2016-03-09 18:04:12,118 PST [INFO] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager.JobManager - Starting JobManager on 172.31.63.152:8079 with execution mode CLUSTER
> > >> > 2016-03-09 18:04:12,172 PST [INFO] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager.JobManager - Security is not enabled. Starting non-authenticated JobManager.
> > >> > 2016-03-09 18:04:12,174 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] org.apache.flink.util.NetUtils - Trying to open socket on port 8079
> > >> > 2016-03-09 18:04:12,176 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] org.apache.flink.util.NetUtils - Unable to allocate socket on port
> > >> > java.net.BindException: Cannot assign requested address
> > >> >   at java.net.PlainSocketImpl.socketBind(Native Method)
> > >> >   at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
> > >> >   at java.net.ServerSocket.bind(ServerSocket.java:375)
> > >> >   at java.net.ServerSocket.<init>(ServerSocket.java:237)
> > >> >   at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2$$anon$3.createSocket(JobManager.scala:1722)
> > >> >   at org.apache.flink.util.NetUtils.createSocketFromPorts(NetUtils.java:237)
> > >> >   at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:1719)
> > >> >   at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:1717)
> > >> >   at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:1717)
> > >> >   at scala.util.Try$.apply(Try.scala:192)
> > >> >   at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:1772)
> > >> >   at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:1717)
> > >> >   at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1653)
> > >> >   at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
> > >> > 2016-03-09 18:04:12,180 PST [ERROR] ec2-52-3-248-202.compute-1.ama [main] o.a.f.runtime.jobmanager.JobManager - Failed to run JobManager.
> > >> > java.lang.RuntimeException: Unable to do further retries starting the actor system
> > >> >   at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:1777)
> > >> >   at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:1717)
> > >> >   at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1653)
> > >> >   at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
> > >> > 2016-03-09 18:04:12,991 PST [DEBUG] ec2-52-3-248-202.compute-1.ama [main] o.a.h.m.lib.MutableMetricsFactory - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, value=[Rate of successful kerberos logins and latency (milliseconds)], valueName=Time)
> > >> >
> > >> >
> > >> > Initially Jobmanager tries to bind to port 0 which did not work. On
> > >> > looking further into it, I tried using recovery jobmanager port using
> > >> > different port combinations, but it does not seems to be working... I've
> > >> > exposed the ports in the docker compose file as well....
> > >> >
> > >> >
> > >> > PFA the jobmanager log file for details also the jobmanager config file...
> > >> > --
> > >> > Thanks,
> > >> > Deepak Jha
> > >
> > >
> > > --
> > > Sent from Gmail Mobile
> >
> >
> > --
> > Sent from Gmail Mobile
>

--
Thanks,
Deepak Jha
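
P.S. To make the bind vs. discovery point from the top of this mail concrete, here is a minimal Python sketch. It is not Flink code. `Endpoint`, `try_bind` and `published_address` are hypothetical names made up for illustration, 127.0.0.1 stands in for the container's own address, and 192.0.2.1 (a documentation-only address) stands in for a host address that the container does not own. It only shows that a process can bind to an address assigned to its own network namespace, while the address it publishes for remote peers may need to be a different one.

```python
import socket
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical split of one address setting into two roles."""
    bind_host: str        # address the process binds to (inside the container)
    advertised_host: str  # address published for remote peers (e.g. in ZooKeeper)
    port: int


def try_bind(host: str, port: int = 0) -> bool:
    """Return True if a TCP socket can be bound to host:port on this machine.

    Port 0 asks the OS for any free port. Binding to an address that is not
    assigned to a local interface fails with an OSError, which is the same
    class of failure as the "Cannot assign requested address" BindException
    in the quoted log.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind((host, port))
        return True
    except OSError:
        return False


def published_address(endpoint: Endpoint) -> str:
    """Address a remote peer would be told to connect to."""
    return f"{endpoint.advertised_host}:{endpoint.port}"


if __name__ == "__main__":
    # Stand-in values only: bind locally, advertise a different address.
    ep = Endpoint(bind_host="127.0.0.1", advertised_host="192.0.2.1", port=3000)
    print("can bind to bind_host:", try_bind(ep.bind_host))
    print("can bind to advertised_host:", try_bind(ep.advertised_host))
    print("published for remote peers:", published_address(ep))
```

With a single setting playing both roles, one of the two checks has to lose, which is exactly Case 1 versus Case 2 above.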