Hi there,

I am running 30 apps on my Spark cluster, and some of them fail with an
exception like the one below:

[root@slave3 0]# cat stderr
15/06/29 17:20:08 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
15/06/29 17:20:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/06/29 17:20:09 INFO spark.SecurityManager: Changing view acls to: root
15/06/29 17:20:09 INFO spark.SecurityManager: Changing modify acls to: root
15/06/29 17:20:09 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/06/29 17:20:09 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/06/29 17:20:09 INFO Remoting: Starting remoting
15/06/29 17:20:10 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@slave3:51026]
15/06/29 17:20:10 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 51026.
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1643)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:128)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:224)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:144)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    ... 4 more
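
To check which executors hit this same failure, the stderr text of each one could be scanned for the timeout message. The following is only a minimal sketch: the function name `find_future_timeouts` is hypothetical, and the pattern is based solely on the "Futures timed out" line shown in the trace above.

```python
import re

# Pattern taken from the "Caused by" line in the trace above.
TIMEOUT_RE = re.compile(
    r"TimeoutException: Futures timed out after \[(\d+)\s+seconds\]"
)

def find_future_timeouts(log_text):
    """Return the timeout values (in seconds) reported in the given log text.

    Pasted logs are often hard-wrapped, so whitespace is collapsed
    before matching.
    """
    flat = " ".join(log_text.split())
    return [int(seconds) for seconds in TIMEOUT_RE.findall(flat)]

if __name__ == "__main__":
    sample = (
        "Caused by: java.util.concurrent.TimeoutException: "
        "Futures timed out after [30\nseconds]\n"
    )
    print(find_future_timeouts(sample))
```

Running this over the stderr of every executor would show how many of them failed while fetching properties from the driver.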

When I run 20 apps, everything is OK. So I suspect that the executors are
getting disassociated from the driver because of high I/O pressure or network
latency. However, I have no idea which Spark parameter could fix this. Any
ideas would be appreciated.

Here is some information about my cluster: 1 master and 6 workers. Every node
has 8 cores and 12 GB of memory.
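
As a rough back-of-the-envelope check of what those numbers mean at 20 versus 30 apps, here is a small sketch. It assumes that only the 6 workers contribute cores and memory, and that cores are split evenly across applications; the standalone scheduler does not guarantee an even split, so this is illustrative only. The function names are hypothetical.

```python
# Figures taken from the cluster description above.
WORKERS = 6
CORES_PER_WORKER = 8
MEM_GB_PER_WORKER = 12

def cluster_capacity(workers, cores_per_worker, mem_gb_per_worker):
    """Return (total cores, total memory in GB) across all workers."""
    return workers * cores_per_worker, workers * mem_gb_per_worker

def cores_per_app(total_cores, n_apps):
    """Average cores per application, ASSUMING an even split (illustrative only)."""
    return total_cores / n_apps

if __name__ == "__main__":
    total_cores, total_mem_gb = cluster_capacity(
        WORKERS, CORES_PER_WORKER, MEM_GB_PER_WORKER
    )
    print(f"total cores: {total_cores}, total memory: {total_mem_gb} GB")
    for n_apps in (20, 30):
        share = cores_per_app(total_cores, n_apps)
        print(f"{n_apps} apps: about {share:.1f} cores per app")
```

Under these assumptions the cluster has 48 cores in total, so going from 20 to 30 apps lowers the average share from about 2.4 to about 1.6 cores per app.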

The settings in my spark-default.conf and spark-env.sh are as follows:

spark-default.conf:
spark.master spark://master:7077
spark.eventLog.enabled true
spark.eventLog.dir /var/log/spark
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 8g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.kryoserializer.buffer.max.mb 128
spark.storage.memoryFraction 0.2
spark.shuffle.memoryFraction 0.4
spark.sql.shuffle.partitions 32
spark.scheduler.mode FAIR
spark.worker.cleanup.appDataTtl 259200
spark.port.maxRetries 10000
spark.scheduler.maxRegisteredResourcesWaitingTime 40

spark-env.sh:
export SPARK_WORKER_INSTANCES=1
export SPARK_EXECUTOR_INSTANCES=8
export SPARK_EXECUTOR_CORES=1
export SPARK_EXECUTOR_MEMORY=1g
--------------------------------
Thanks & best regards!
San.Luo