So, I actually figured it out, and it's all my fault.  I had an older version 
of Spark on the datanodes and was passing in spark.executor.extraClassPath to 
pick it up.  It was a holdover from some initial work before I got everything 
working right.  Once I removed that setting, the executors picked up the Spark 
JAR from HDFS instead and ran without issue.
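For anyone who hits the same thing, the culprit was a line along these lines in 
my spark-defaults.conf (the path here is illustrative, not my actual layout):

spark.executor.extraClassPath    /opt/old-spark/lib/*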

Sorry for the false alarm.

The AM container logs were what I had pasted in the original email, btw.

Greg

From: Andrew Or <and...@databricks.com>
Date: Thursday, October 2, 2014 12:24 PM
To: Greg <greg.h...@rackspace.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: weird YARN errors on new Spark on Yarn cluster

Hi Greg,

Have you looked at the AM container logs? (You may already know this, but) you 
can get these through the RM web UI or through:

yarn logs -applicationId <your app ID>

If an AM throws an exception then the executors may not be started properly.
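For example, with a made-up application ID (yours shows up in the RM web UI and 
in the spark-submit client output):

yarn logs -applicationId application_1412277770702_0001

Note that 'yarn logs' only returns aggregated logs after the application has 
finished, and only if yarn.log-aggregation-enable is turned on.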

-Andrew



2014-10-02 9:47 GMT-07:00 Greg Hill <greg.h...@rackspace.com>:
I haven't run into this until today.  I spun up a fresh cluster to do some more 
testing, and it seems that every single executor fails because it can't connect 
to the driver.  This is in the YARN logs:

14/10/02 16:24:11 INFO executor.CoarseGrainedExecutorBackend: Connecting to 
driver: akka.tcp://sparkDriver@GATEWAY-1:60855/user/CoarseGrainedScheduler
14/10/02 16:24:11 ERROR executor.CoarseGrainedExecutorBackend: Driver 
Disassociated [akka.tcp://sparkExecutor@DATANODE-3:58232] -> 
[akka.tcp://sparkDriver@GATEWAY-1:60855] disassociated! Shutting down.

And this is what shows up from the driver:

14/10/02 16:43:06 INFO cluster.YarnClientSchedulerBackend: Registered executor: 
Actor[akka.tcp://sparkExecutor@DATANODE-1:60341/user/Executor#1289950113] with 
ID 2
14/10/02 16:43:06 INFO util.RackResolver: Resolved DATANODE-1 to 
/rack/node8da83a04def73517bf437e95aeefa2469b1daf14
14/10/02 16:43:06 INFO cluster.YarnClientSchedulerBackend: Executor 2 
disconnected, so removing it

It doesn't appear to be a networking issue.  Connectivity works in both 
directions and there's no firewall blocking ports.  From what I can find by 
googling, the most common cause of this is overallocation of memory, but I'm 
not doing that.  I've got these settings for a 3-node cluster with 128GB per 
node:

spark.executor.instances            17
spark.executor.memory               12424m
spark.yarn.executor.memoryOverhead  3549

With 17 executors spread across 3 nodes, that's at most 6 per node, or 
6 * 15973 = 95838 MB, which is well beneath the 128GB limit.
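(The one other ceiling I can think of is YARN's own 
yarn.nodemanager.resource.memory-mb, since that's what YARN enforces rather 
than physical RAM.  Assuming the usual config path, something like

grep -A1 yarn.nodemanager.resource.memory-mb /etc/hadoop/conf/yarn-site.xml

would show what the NodeManagers are actually configured with.)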

Frankly I'm stumped.  It worked fine when I spun up a cluster last week, but 
now it doesn't.  The logs give me no indication as to what the problem actually 
is.  Any pointers to where else I might look?

Thanks in advance.

Greg
