For one of my Spark jobs, my workers/executors are dying and leaving the 
cluster.

On the master, I see something like the following in the log file.  I'm 
surprised to see the '60' seconds in the master log below because I explicitly 
set it to '600' (or so I thought) in my spark job (see below).   This is 
happening at the end of my job when I'm trying to persist a large RDD (probably 
around 300+GB) back to S3 (in 256 partitions).  My cluster consists of 6 
r3.8xlarge machines.  The job successfully works when I'm outputting 100GB or 
200GB.

If  you have any thoughts/insights, it would be appreciated. 

Thanks.

Darin.

Here is where I'm setting the 'timeout' in my spark job.
SparkConf conf = new SparkConf().setAppName("SparkSync 
Application").set("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer").set("spark.rdd.compress","true")  
 .set("spark.core.connection.ack.wait.timeout","600");​
On the master, I see the following in the log file.

4/11/13 17:20:39 WARN master.Master: Removing 
worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 because we got no 
heartbeat in 60 seconds14/11/13 17:20:39 INFO master.Master: Removing worker 
worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 on 
ip-10-35-184-232.ec2.internal:5187714/11/13 17:20:39 INFO master.Master: 
Telling app of lost executor: 2

On a worker, I see something like the following in the log file.

14/11/13 17:20:58 WARN util.AkkaUtils: Error sending message in 1 
attemptsjava.util.concurrent.TimeoutException: Futures timed out after [30 
seconds]  at 
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)  at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)  at 
scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)  at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  at scala.concurrent.Await$.result(package.scala:107)  at 
org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)  at 
org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362)14/11/13 
17:21:11 INFO httpclient.HttpMethodDirector: I/O exception 
(java.net.SocketException) caught when processing request: Broken pipe14/11/13 
17:21:11 INFO httpclient.HttpMethodDirector: Retrying request14/11/13 17:21:32 
INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) 
caught when processing request: Broken pipe14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: Retrying request14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when 
processing request: Resetting to invalid mark14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: Retrying request14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when 
processing request: Resetting to invalid mark14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: Retrying request14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when 
processing request: Resetting to invalid mark14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: Retrying request14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when 
processing request: Resetting to invalid mark14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: Retrying request14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when 
processing request: Resetting to invalid mark14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: Retrying request14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when 
processing request: Resetting to invalid mark14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: Retrying request14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when 
processing request: Resetting to invalid mark14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: Retrying request14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when 
processing request: Resetting to invalid mark14/11/13 17:21:34 INFO 
httpclient.HttpMethodDirector: Retrying request14/11/13 17:21:58 WARN 
utils.RestUtils: Retried connection 6 times, which exceeds the maximum retry 
count of 514/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, 
which exceeds the maximum retry count of 514/11/13 17:22:57 WARN 
util.AkkaUtils: Error sending message in 1 
attemptsjava.util.concurrent.TimeoutException: Futures timed out after [30 
seconds]

Reply via email to