YARN JobManager HA using wrong network interface

Maximilian Bode Thu, 03 Mar 2016 00:23:55 -0800

Hi everyone,

we are trying to get to work JobManager HA in the context of a per-job YARN 
session using the 1.0.0-rc3 from a few days ago and are having a problem 
concerning task managers with several network interfaces.


After manually killing the job manager process, the jobmanager.log on the newly 
allocated second job manager reads:
---
2016-03-02 18:01:09,635 WARN  Remoting                                          
            - Tried to associate with unreachable remote address 
[akka.tcp://[email protected]:34811]. Address is now gated for 5000 ms, all 
messages to this address will be delivered to dead letters. Reason: Connection 
refused: /10.127.68.136:34811
2016-03-02 18:01:09,644 WARN  
org.apache.flink.runtime.webmonitor.JobManagerRetriever       - Failed to 
retrieve leader gateway and port.
akka.actor.ActorNotFound: Actor not found for: 
ActorSelection[Anchor(akka.tcp://[email protected]:34811/), 
Path(/user/jobmanager)]
        at 
akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
        at 
akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
        at 
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
        at 
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
        at 
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
        at 
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
        at 
scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
        at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
        at 
akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
        at 
akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
        at 
akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
        at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
        at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
        at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
        at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)
        at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)
        at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)
        at 
akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
        at akka.remote.EndpointWriter.postStop(Endpoint.scala:561)
        at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
        at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415)
        at 
akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
        at 
akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
        at akka.actor.ActorCell.terminate(ActorCell.scala:369)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
---
where the IP not found is from the old job manager. So far, is this the 
expected behavior?

The problem then arises on a new task manager, which also tries to connect to 
the old job manager unsuccessfully. The ZooKeeperLeaderRetrievalService starts 
cycling through the available network interfaces, as can be seen in the 
relevant taskmanager.log:
---
2016-03-02 18:01:13,636 INFO  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
Starting ZooKeeperLeaderRetrievalService.
2016-03-02 18:01:13,646 INFO  
org.apache.flink.runtime.util.LeaderRetrievalUtils            - Trying to 
select the network interface and address to use by connecting to the leading 
JobManager.
2016-03-02 18:01:13,646 INFO  
org.apache.flink.runtime.util.LeaderRetrievalUtils            - TaskManager 
will try to connect for 10000 milliseconds before falling back to heuristics
2016-03-02 18:01:13,712 INFO  org.apache.flink.runtime.net.ConnectionUtils      
            - Retrieved new target address /10.127.68.136:34811.
2016-03-02 18:01:14,079 INFO  org.apache.flink.runtime.net.ConnectionUtils      
            - Trying to connect to address /10.127.68.136:34811
2016-03-02 18:01:14,082 INFO  org.apache.flink.runtime.net.ConnectionUtils      
            - Failed to connect from address 
'task.manager.eth0.hostname.com/10.127.68.136': Connection refused
2016-03-02 18:01:14,082 INFO  org.apache.flink.runtime.net.ConnectionUtils      
            - Failed to connect from address '/10.127.68.136': Connection 
refused
2016-03-02 18:01:14,082 INFO  org.apache.flink.runtime.net.ConnectionUtils      
            - Failed to connect from address '/10.120.193.110': Connection 
refused
2016-03-02 18:01:14,082 INFO  org.apache.flink.runtime.net.ConnectionUtils      
            - Failed to connect from address '/10.127.68.136': Connection 
refused
2016-03-02 18:01:14,083 INFO  org.apache.flink.runtime.net.ConnectionUtils      
            - Failed to connect from address '/127.0.0.1': Connection refused
2016-03-02 18:01:14,083 INFO  org.apache.flink.runtime.net.ConnectionUtils      
            - Failed to connect from address '/10.120.193.110': Connection 
refused
2016-03-02 18:01:14,083 INFO  org.apache.flink.runtime.net.ConnectionUtils      
            - Failed to connect from address '/10.127.68.136': Connection 
refused
2016-03-02 18:01:14,083 INFO  org.apache.flink.runtime.net.ConnectionUtils      
            - Failed to connect from address '/127.0.0.1': Connection refused
---
After five repetitions, the task manager stops trying to retrieve the leader 
and using the HEURISTIC strategy ends up using  eth1 (10.120.193.110) from now 
on:
---
2016-03-02 18:01:23,650 INFO  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
Stopping ZooKeeperLeaderRetrievalService.
2016-03-02 18:01:23,655 INFO  org.apache.zookeeper.ClientCnxn                   
            - EventThread shut down
2016-03-02 18:01:23,655 INFO  org.apache.zookeeper.ZooKeeper                    
            - Session: 0x25229757cff035b closed
2016-03-02 18:01:23,664 INFO  org.apache.flink.runtime.taskmanager.TaskManager  
            - TaskManager will use hostname/address 
'task.manager.eth1.hostname.com' (10.120.193.110) for communication.
---
Following the new jobmanager is discovered and the taskmanager is able to 
register at the jobmanager using eth1. The problem is that connections TO eth1 
are not possible. So flink should always use eth0. The exception we later see 
is:
---
java.io.IOException: Connecting the channel failed: Connecting to remote task 
manager + 'other.task.manager.eth1.hostname/10.120.193.111:46620' has failed. 
This might indicate that the remote task manager has been lost.
        at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:196)
        at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:131)
        at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:83)
        at 
org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:60)
        at 
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:115)
        at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:388)
        at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:411)
        at 
org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:108)
        at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:175)
        at 
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:65)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:224)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
        at java.lang.Thread.run(Thread.java:744)
---
The root cause seems to be that network interface selection is still using the 
old jobmanager location and hence is not able to choose the right interface. In 
particular, it seems that iteration order over the network interfaces differs 
between the HEURISTIC and SLOW strategy, which then leads to the wrong 
interface being selected.

Cheers,
 Max
—
Maximilian Bode * Junior Consultant * [email protected]
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082

signature.asc
Description: Message signed with OpenPGP using GPGMail

YARN JobManager HA using wrong network interface

Reply via email to