I hadn't run into this until today. I spun up a fresh cluster to do some more testing, and every single executor fails because it can't connect to the driver. This is what's in the YARN logs:
14/10/02 16:24:11 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://sparkDriver@GATEWAY-1:60855/user/CoarseGrainedScheduler
14/10/02 16:24:11 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@DATANODE-3:58232] -> [akka.tcp://sparkDriver@GATEWAY-1:60855] disassociated! Shutting down.

And this is what shows up from the driver:

14/10/02 16:43:06 INFO cluster.YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@DATANODE-1:60341/user/Executor#1289950113] with ID 2
14/10/02 16:43:06 INFO util.RackResolver: Resolved DATANODE-1 to /rack/node8da83a04def73517bf437e95aeefa2469b1daf14
14/10/02 16:43:06 INFO cluster.YarnClientSchedulerBackend: Executor 2 disconnected, so removing it

It doesn't appear to be a networking issue: connectivity works in both directions and no firewall is blocking the ports. Googling the error, the most common cause seems to be over-allocating memory, but I'm not doing that. These are my settings for a cluster of 3 nodes with 128 GB each:

spark.executor.instances            17
spark.executor.memory               12424m
spark.yarn.executor.memoryOverhead  3549

With 17 executors spread over 3 nodes, that's at most 6 executors per node, i.e. 6 * 15973 = 95838 MB per node, well under the 128 GB limit (the quick sanity check is spelled out in the P.S. below).

Frankly, I'm stumped. It worked fine when I spun up a cluster last week, but now it doesn't, and the logs give no indication of what the problem actually is. Any pointers to where else I might look?

Thanks in advance.
Greg
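
P.S. For reference, here is the memory arithmetic written out as a small standalone Scala snippet. The numbers are just the settings quoted above; the 6-executors-per-node figure is my worst-case assumption from spreading 17 executors across 3 nodes, and the object name is only for illustration.

// Rough per-node memory sanity check, using the settings quoted above.
object MemorySanityCheck {
  def main(args: Array[String]): Unit = {
    val executorsPerNode = math.ceil(17.0 / 3).toInt        // worst case: 6 executors land on one node
    val executorMemoryMb = 12424                            // spark.executor.memory
    val overheadMb       = 3549                             // spark.yarn.executor.memoryOverhead
    val perNodeMb        = executorsPerNode * (executorMemoryMb + overheadMb)
    println(s"Worst-case per-node request: $perNodeMb MB")  // prints 95838, well under 128 GB
  }
}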