Great to hear that you got it working :-) Looking forward to your blog post
to learn more about your experiences :-)

Cheers,
Till

On Fri, Nov 10, 2017 at 10:18 PM, Vergilio, Thalita <
t.vergilio4...@student.leedsbeckett.ac.uk> wrote:

> Hi Till,
>
>
> Thank you very much for that. And thanks for your help. I have finally
> managed to get the multi-cloud setup on Docker Swarm working by tweaking
> the Flink image slightly to set these configuration options to known
> values. I have also used the Weave Net Docker plugin to create my
> cross-cloud network.
>
>
> I am in the process of documenting my experience in a blog article, which
> I will share in this list so others can hopefully benefit from it.
>
>
> Thank you and the rest of the Flink team once again for all your help and
> support.
>
>
> Best wishes,
>
>
> Thalita
> ------------------------------
> *From:* Till Rohrmann <trohrm...@apache.org>
> *Sent:* 10 November 2017 12:15:00
>
> *To:* Vergilio, Thalita
> *Cc:* Piotr Nowojski; user@flink.apache.org; Patrick Lucas
> *Subject:* Re: Docker-Flink Project: TaskManagers can't talk to
> JobManager if they are on different nodes
>
>
> Hi Thalita, yes you can use the mentioned configuration parameters to set
> the ports for the TaskManager and the BlobServer. However, you must make
> sure that there is at most one TM running on a host, otherwise you run into
> port collisions.
>
> For taskmanager.rpc.port and blob.server.port you can define a range.
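The advice above can be sketched as a flink-conf.yaml fragment. This is only an illustration: the specific numbers (6122, 50100-50200) are assumptions of this sketch, not values from the thread.

```shell
# Append illustrative port settings to a local flink-conf.yaml copy.
# With at most one TaskManager per host, a fixed RPC port cannot collide;
# blob.server.port additionally accepts a list or a range.
cat >> flink-conf.yaml <<'EOF'
taskmanager.rpc.port: 6122
blob.server.port: 50100-50200
EOF
```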
>
> Cheers,
> Till
>
> On Fri, Nov 10, 2017 at 11:47 AM, Vergilio, Thalita <
> t.vergilio4...@student.leedsbeckett.ac.uk> wrote:
>
> Hi All,
>
>
> I just wanted to let you know that I have finally managed to get the
> multi-cloud setup working!! I honestly can't believe my eyes. I used a
> Docker plugin called Weave to create the Swarm network, a public external
> IP address for each node and opened a range of ports, and I can now get my
> Google Cloud machine to connect to the Azure machines.
>
>
> There are still some minor issues, e.g. I don't know exactly which ports to
> open for TaskManager communication in Flink. They seem to be assigned
> randomly at runtime, so I had to open a wide range of ports to allow the
> communication to happen, which is far from ideal.
>
>
> Is there a way of finding out what these ports are and setting them to a
> constant value? Looking at the documentation, the suspects are:
>
>
>
>    - taskmanager.rpc.port: The task manager’s IPC port (DEFAULT: 0, which
>      lets the OS choose a free port).
>    - taskmanager.data.port: The task manager’s port used for data exchange
>      operations (DEFAULT: 0, which lets the OS choose a free port).
>    - blob.server.port: Port definition for the blob server (serving user
>      JARs) on the TaskManagers. By default the port is set to 0, which means
>      that the operating system is picking an ephemeral port. Flink also
>      accepts a list of ports (“50100,50101”), ranges (“50100-50200”) or a
>      combination of both. It is recommended to set a range of ports to avoid
>      collisions when multiple JobManagers are running on the same machine.
>
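As a way of finding out which ports are actually in use, one can ask a running TaskManager container which TCP sockets it bound. This is only a sketch, and it assumes Docker is available on the node, the service is named taskmanager, and the image ships ss or netstat:

```shell
# Write a small helper that lists the listening TCP sockets inside the
# first container whose name matches "taskmanager".
cat > inspect-tm-ports.sh <<'EOF'
#!/bin/sh
CID=$(docker ps -q -f name=taskmanager | head -n 1)
[ -n "$CID" ] || { echo "no taskmanager container found" >&2; exit 1; }
# show listening TCP sockets inside the container
docker exec "$CID" sh -c 'ss -tln 2>/dev/null || netstat -tln'
EOF
chmod +x inspect-tm-ports.sh
```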
> Many thanks,
>
>
> Thalita
>
> ------------------------------
> *From:* Vergilio, Thalita
> *Sent:* 09 November 2017 22:04:24
> *To:* Till Rohrmann
>
> *Cc:* Piotr Nowojski; user@flink.apache.org; Patrick Lucas
> *Subject:* Re: Docker-Flink Project: TaskManagers can't talk to
> JobManager if they are on different nodes
>
>
> Hi Till,
>
>
> I have made some progress with the name resolution for machines that are
> not in the same subnet. The problem I am facing now is Flink-specific, so I
> wonder if you could help me.
>
>
> It is all running fine in a multi-cloud setup with the jobmanager in Azure
> and the taskmanager in Google Cloud. However, when I scale the
> taskmanager up and it starts running on Azure nodes as well, I get an Akka
> error which I presume means the taskmanagers can't talk to each other when
> parallelising the task.
>
>
> Do you know what the IP address and port below are? Are they assigned by
> Flink?
>
>
> Thank you very much.
>
>
> Thalita
>
>
> java.lang.Exception: Cannot deploy task Source: Read(UnboundedKafkaSource) -> 
> Flat Map -> KafkaPuePipelineProcessor/Window.Into()/Window.Assign.out -> 
> ParMultiDo(Anonymous) -> ToKeyedWorkItem (2/3) 
> (b9f31626fb7d83d39e24e570e034f03e) - TaskManager 
> (3a9c37463c88510a44097df0c99b5f90 @ 172.18.0.3 (dataPort=38963)) not 
> responding after a timeout of 10000 ms
>       at 
> org.apache.flink.runtime.executiongraph.Execution$2.apply(Execution.java:437)
>       at 
> org.apache.flink.runtime.executiongraph.Execution$2.apply(Execution.java:429)
>       at 
> org.apache.flink.runtime.concurrent.impl.FlinkFuture$3.recover(FlinkFuture.java:201)
>       at akka.dispatch.Recover.internal(Future.scala:268)
>       at akka.dispatch.japi$RecoverBridge.apply(Future.scala:184)
>       at akka.dispatch.japi$RecoverBridge.apply(Future.scala:182)
>       at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
>       at scala.util.Try$.apply(Try.scala:161)
>       at scala.util.Failure.recover(Try.scala:185)
>       at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>       at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>       at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>       at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on 
> [Actor[akka.tcp://flink@172.18.0.3:37959/user/taskmanager#364916492]] after 
> [10000 ms]
>       at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
>       at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
>       at 
> scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>       at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
>       at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
>       at 
> akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
>       at 
> akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
>       at 
> akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
>       ... 1 more
>
>
> ------------------------------
> *From:* Till Rohrmann <trohrm...@apache.org>
> *Sent:* 06 November 2017 13:48:59
> *To:* Vergilio, Thalita
> *Cc:* Piotr Nowojski; user@flink.apache.org; Patrick Lucas
> *Subject:* Re: Docker-Flink Project: TaskManagers can't talk to
> JobManager if they are on different nodes
>
> I'm not entirely sure how docker swarm works but from the Flink
> perspective there mustn't be two TaskManagers running on the same host
> (meaning an entity where you share the same address) if you set the
> TaskManager data port to a fixed value (otherwise only one of them can be
> started due to port conflicts). If you can ensure that this is the case,
> then it should be safe to specify a port for the data transmission.
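The at-most-one-TaskManager-per-host requirement maps naturally onto Swarm's global service mode, which schedules exactly one task per node. A sketch, written out as a script (the service and network names follow the thread; treat the exact invocation as an assumption):

```shell
# Generate the service-creation command as a script. --mode global replaces
# --replicas and guarantees one TaskManager task per Swarm node, so a fixed
# taskmanager.data.port cannot collide within a host.
cat > create-taskmanager-global.sh <<'EOF'
#!/bin/sh
docker service create --name taskmanager --mode global \
  --env JOB_MANAGER_RPC_ADDRESS=jobmanager \
  --network overlay \
  --constraint 'node.hostname != ubuntu-swarm-manager' \
  flink taskmanager
EOF
chmod +x create-taskmanager-global.sh
```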
>
> Cheers,
> Till
>
> On Mon, Nov 6, 2017 at 2:37 PM, Vergilio, Thalita <
> t.vergilio4...@student.leedsbeckett.ac.uk> wrote:
>
> Hi Till,
>
>
> Thanks a lot for your answer.
>
>
> Is the taskmanager.data.port unique per TaskManager? The documentation
> says it is assigned at runtime by the OS. My thinking here is that you
> would need to know what that is at service creation time, which would go
> against the whole idea of how services are scaled in Docker Swarm.
>
>
> When you create a Swarm service using 'docker stack deploy' or 'docker
> service create', the configuration that is used at that point is the same
> that will be used by all instances of the service. If you then scale
> TaskManager to 8 or 10 containers, each of them gets the same service
> configuration (the one used to create the service).
>
>
> I have in fact tried to map specific ports in the TaskManager service
> configuration, but then I got "port already in use" when I tried to scale
> up the service.
>
>
> I wonder if there is a way around it.
>
>
> Maybe the people who developed the create-docker-swarm-service.sh script
> in the docker-flink project would be able to shed some light?
>
>
> ------------------------------
> *From:* Till Rohrmann <trohrm...@apache.org>
> *Sent:* 06 November 2017 12:40:33
> *To:* Piotr Nowojski
> *Cc:* Vergilio, Thalita; user@flink.apache.org; Patrick Lucas
>
> *Subject:* Re: Docker-Flink Project: TaskManagers can't talk to
> JobManager if they are on different nodes
>
> Hi Thalita,
>
> in order to make Flink work, I think you have to expose the JobManager RPC
> port, the Blob server port and make sure that the TaskManager can talk to
> each other by exposing the `taskmanager.data.port`. The query server port
> is only necessary if you want to use queryable state.
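The three ports named above could be published like this. A sketch only: it assumes blob.server.port has been pinned to 6124 in flink-conf.yaml (which this thread does not show), and that the data port is published on the taskmanager service instead.

```shell
# Generate a jobmanager service-creation script.
# 8081: Web UI, 6123: JobManager RPC, 6124: blob server (assumed pinned).
cat > create-jobmanager-ports.sh <<'EOF'
#!/bin/sh
docker service create --name jobmanager \
  --env JOB_MANAGER_RPC_ADDRESS=jobmanager \
  -p 8081:8081 -p 6123:6123 -p 6124:6124 \
  --network overlay \
  --constraint 'node.hostname == ubuntu-swarm-manager' \
  flink jobmanager
EOF
chmod +x create-jobmanager-ports.sh
```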
>
> I've pulled in Patrick who has more experience with running Flink on top
> of Docker. He'll definitely be able to provide more detailed
> recommendations.
>
> Cheers,
> Till
>
> On Mon, Nov 6, 2017 at 9:22 AM, Piotr Nowojski <pi...@data-artisans.com>
> wrote:
>
> Till, is there somewhere a list of ports that need to exposed that’s more
> up to date compared to the docker-flink README?
>
> Piotrek
>
> On 3 Nov 2017, at 10:23, Vergilio, Thalita
> <t.vergilio4822@student.leedsbeckett.ac.uk> wrote:
>
> Just an update: by changing the JOB_MANAGER_RPC_ADDRESS to the public IP
> of the JobManager and exposing port 6123 as {{PUBLIC_IP}}:6123:6123, I
> managed to get the TaskManagers from different nodes and even different
> subnets to talk to the JobManager.
>
> This is how I created the services:
>
> docker network create -d overlay overlay
>
> docker service create --name jobmanager \
>   --env JOB_MANAGER_RPC_ADDRESS={{PUBLIC_IP}} \
>   -p 8081:8081 -p {{PUBLIC_IP}}:6123:6123 -p 48081:48081 \
>   -p 6124:6124 -p 6125:6125 \
>   --network overlay \
>   --constraint 'node.hostname == ubuntu-swarm-manager' flink jobmanager
>
> docker service create --name taskmanager \
>   --env JOB_MANAGER_RPC_ADDRESS={{PUBLIC_IP}} \
>   -p 6121:6121 -p 6122:6122 \
>   --network overlay \
>   --constraint 'node.hostname != ubuntu-swarm-manager' flink taskmanager
>
> However, I am still encountering errors further down the line. When I
> submit a job using the Web UI, it fails because the JobManager can't talk
> to the TaskManager on port 35033. I presume this is the
> taskmanager.data.port, which needs to be set to a range and this range
> exposed when I create the service?
>
> Are there any other ports that I need to open at service creation time?
>
> Connecting the channel failed: Connecting to remote task manager + 
> '/{{IP_ADDRESS_OF_MANAGER}}:35033' has failed. This might indicate that the 
> remote task manager has been lost.
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:196)
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:131)
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:83)
>       at 
> org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:59)
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:112)
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:433)
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:455)
>       at 
> org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:91)
>       at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:213)
>       at 
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
>       at java.lang.Thread.run(Thread.java:748)
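Rather than exposing a wide range, the data port in the error above could be pinned and then published at service-creation time. A sketch: 6121 matches the README default for the data port, but pinning it in flink-conf.yaml is an assumption of this example.

```shell
# Pin the TaskManager data port so the value published with -p 6121:6121
# is the one the TaskManager actually binds, instead of a random ephemeral
# port such as the 35033 in the error above.
cat >> flink-conf.yaml <<'EOF'
taskmanager.data.port: 6121
EOF
```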
>
>
>
> ------------------------------
> *From:* Piotr Nowojski <pi...@data-artisans.com>
> *Sent:* 02 November 2017 14:26:32
> *To:* Vergilio, Thalita
> *Cc:* user@flink.apache.org
> *Subject:* Re: Docker-Flink Project: TaskManagers can't talk to
> JobManager if they are on different nodes
>
> Did you try to expose required ports that are listed in the README when
> starting the containers?
>
> https://github.com/apache/flink/tree/master/flink-contrib/docker-flink
>
> Ports:
> • The Web Client is on port 48081
> • JobManager RPC port 6123 (default, not exposed to host)
> • TaskManagers RPC port 6122 (default, not exposed to host)
> • TaskManagers Data port 6121 (default, not exposed to host)
>
> Piotrek
>
> On 2 Nov 2017, at 14:44, javalass
> <t.vergilio4822@student.leedsbeckett.ac.uk> wrote:
>
> I am using the Docker-Flink project in:
> https://github.com/apache/flink/tree/master/flink-contrib/docker-flink
>
> I am creating the services with the following commands:
> docker network create -d overlay overlay
> docker service create --name jobmanager \
>   --env JOB_MANAGER_RPC_ADDRESS=jobmanager -p 8081:8081 --network overlay \
>   --constraint 'node.hostname == ubuntu-swarm-manager' flink jobmanager
> docker service create --name taskmanager \
>   --env JOB_MANAGER_RPC_ADDRESS=jobmanager --network overlay \
>   --constraint 'node.hostname != ubuntu-swarm-manager' flink taskmanager
>
> I wonder if there's any configuration I'm missing. This is the error I get:
> - Trying to register at JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager
>   (attempt 4, timeout: 4000 milliseconds)
>
>
>
>
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>
>
> To view the terms under which this email is distributed, please go to:-
> http://disclaimer.leedsbeckett.ac.uk/disclaimer/disclaimer.html
>
>
>
>
