I hadn't run into this until today.  I spun up a fresh cluster to do some more 
testing, and it seems that every single executor fails because it can't connect 
to the driver.  This is in the YARN logs:

14/10/02 16:24:11 INFO executor.CoarseGrainedExecutorBackend: Connecting to 
driver: akka.tcp://sparkDriver@GATEWAY-1:60855/user/CoarseGrainedScheduler
14/10/02 16:24:11 ERROR executor.CoarseGrainedExecutorBackend: Driver 
Disassociated [akka.tcp://sparkExecutor@DATANODE-3:58232] -> 
[akka.tcp://sparkDriver@GATEWAY-1:60855] disassociated! Shutting down.

And this is what shows up from the driver:

14/10/02 16:43:06 INFO cluster.YarnClientSchedulerBackend: Registered executor: 
Actor[akka.tcp://sparkExecutor@DATANODE-1:60341/user/Executor#1289950113] with 
ID 2
14/10/02 16:43:06 INFO util.RackResolver: Resolved DATANODE-1 to 
/rack/node8da83a04def73517bf437e95aeefa2469b1daf14
14/10/02 16:43:06 INFO cluster.YarnClientSchedulerBackend: Executor 2 
disconnected, so removing it

It doesn't appear to be a networking issue.  Networking works in both 
directions, and there's no firewall blocking ports.  From googling the issue, 
it sounds like the most common cause is overallocating memory, but I'm not 
doing that.  I've got these settings for a cluster of 3 nodes with 128GB each:

spark.executor.instances            17
spark.executor.memory               12424m
spark.yarn.executor.memoryOverhead  3549

That makes it at most 6 * 15973 = 95838 MB per node, which is well beneath the 
128GB limit.
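
To spell that arithmetic out (each executor's YARN container request is 
spark.executor.memory plus spark.yarn.executor.memoryOverhead, and the 6 
assumes YARN never packs more than 6 of the 17 executors onto a single node):

per-executor container = 12424 MB + 3549 MB = 15973 MB
worst case per node    = 6 containers * 15973 MB = 95838 MB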

Frankly, I'm stumped.  It worked fine when I spun up a cluster last week, but 
now it doesn't.  The logs give me no indication as to what the problem actually 
is.  Any pointers to where else I might look?

Thanks in advance.

Greg
