Hi, I'm new to Spark. I have built a small Spark-on-YARN cluster with 1 master (20 GB RAM, 8 cores) and 3 workers (4 GB RAM, 4 cores each). When I run sc.parallelize(1 to 1000).count() through $SPARK_HOME/bin/spark-shell, the job sometimes submits and runs successfully, and sometimes fails with the exception below.
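To be concrete, the whole reproduction is just the two steps visible at the top of the log below; nothing else is passed on the command line:

    # on the master host, xpan-biqa1
    $SPARK_HOME/bin/spark-shell

    scala> sc.parallelize(1 to 1000).count()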
I have confirmed in the Spark web UI that all three workers are registered with the master. The memory-related parameters in spark-env.sh are set as follows: SPARK_EXECUTOR_MEMORY=2G, SPARK_DRIVER_MEMORY=1G, SPARK_WORKER_MEMORY=4G (see the sketch in the P.S. below). Could anyone give me a hint on how to resolve this issue? I have not been able to find anything helpful by searching Google.

# bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/02/11 12:21:39 INFO SecurityManager: Changing view acls to: root,
15/02/11 12:21:39 INFO SecurityManager: Changing modify acls to: root,
15/02/11 12:21:39 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, ); users with modify permissions: Set(root, )
15/02/11 12:21:39 INFO HttpServer: Starting HTTP Server
15/02/11 12:21:39 INFO Utils: Successfully started service 'HTTP class server' on port 28968.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.6.0_24)
Type in expressions to have them evaluated.
Type :help for more information.
15/02/11 12:21:43 INFO SecurityManager: Changing view acls to: root,
15/02/11 12:21:43 INFO SecurityManager: Changing modify acls to: root,
15/02/11 12:21:43 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, ); users with modify permissions: Set(root, )
15/02/11 12:21:44 INFO Slf4jLogger: Slf4jLogger started
15/02/11 12:21:44 INFO Remoting: Starting remoting
15/02/11 12:21:44 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@xpan-biqa1:6862]
15/02/11 12:21:44 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@xpan-biqa1:6862]
15/02/11 12:21:44 INFO Utils: Successfully started service 'sparkDriver' on port 6862.
15/02/11 12:21:44 INFO SparkEnv: Registering MapOutputTracker
15/02/11 12:21:44 INFO SparkEnv: Registering BlockManagerMaster
15/02/11 12:21:44 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20150211122144-ed26
15/02/11 12:21:44 INFO Utils: Successfully started service 'Connection manager for block manager' on port 40502.
15/02/11 12:21:44 INFO ConnectionManager: Bound socket to port 40502 with id = ConnectionManagerId(xpan-biqa1,40502)
15/02/11 12:21:44 INFO MemoryStore: MemoryStore started with capacity 265.0 MB
15/02/11 12:21:44 INFO BlockManagerMaster: Trying to register BlockManager
15/02/11 12:21:44 INFO BlockManagerMasterActor: Registering block manager xpan-biqa1:40502 with 265.0 MB RAM
15/02/11 12:21:44 INFO BlockManagerMaster: Registered BlockManager
15/02/11 12:21:44 INFO HttpFileServer: HTTP File server directory is /tmp/spark-0a80ce6b-6a05-4163-a97d-07753f627ec8
15/02/11 12:21:44 INFO HttpServer: Starting HTTP Server
15/02/11 12:21:44 INFO Utils: Successfully started service 'HTTP file server' on port 25939.
15/02/11 12:21:44 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/02/11 12:21:44 INFO SparkUI: Started SparkUI at http://xpan-biqa1:4040
15/02/11 12:21:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/11 12:21:46 INFO EventLoggingListener: Logging events to hdfs://xpan-biqa1:7020/spark/spark-shell-1423628505431
15/02/11 12:21:46 INFO AppClient$ClientActor: Connecting to master spark://xpan-biqa1:7077...
15/02/11 12:21:46 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/02/11 12:21:46 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> 15/02/11 12:22:06 INFO AppClient$ClientActor: Connecting to master spark://xpan-biqa1:7077...

scala> sc.parallelize(1 to 1000).count()
15/02/11 12:22:24 INFO SparkContext: Starting job: count at <console>:13
15/02/11 12:22:24 INFO DAGScheduler: Got job 0 (count at <console>:13) with 2 output partitions (allowLocal=false)
15/02/11 12:22:24 INFO DAGScheduler: Final stage: Stage 0(count at <console>:13)
15/02/11 12:22:24 INFO DAGScheduler: Parents of final stage: List()
15/02/11 12:22:24 INFO DAGScheduler: Missing parents: List()
15/02/11 12:22:24 INFO DAGScheduler: Submitting Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:13), which has no missing parents
15/02/11 12:22:24 INFO MemoryStore: ensureFreeSpace(1088) called with curMem=0, maxMem=277842493
15/02/11 12:22:24 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1088.0 B, free 265.0 MB)
15/02/11 12:22:24 INFO MemoryStore: ensureFreeSpace(800) called with curMem=1088, maxMem=277842493
15/02/11 12:22:24 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 800.0 B, free 265.0 MB)
15/02/11 12:22:24 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on xpan-biqa1:40502 (size: 800.0 B, free: 265.0 MB)
15/02/11 12:22:24 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/02/11 12:22:24 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:13)
15/02/11 12:22:24 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/02/11 12:22:26 INFO AppClient$ClientActor: Connecting to master spark://xpan-biqa1:7077...
15/02/11 12:22:39 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
15/02/11 12:22:46 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/02/11 12:22:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/02/11 12:22:46 INFO TaskSchedulerImpl: Cancelling stage 0
15/02/11 12:22:46 INFO DAGScheduler: Failed to run count at <console>:13
15/02/11 12:22:46 INFO SparkUI: Stopped Spark web UI at http://xpan-biqa1:4040
15/02/11 12:22:46 INFO DAGScheduler: Stopping DAGScheduler
15/02/11 12:22:46 INFO SparkDeploySchedulerBackend: Shutting down all executors
15/02/11 12:22:46 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up.
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

scala> 15/02/11 12:22:47 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!

Regards,
Ryan
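P.S. For completeness, this is roughly how the memory settings mentioned above look in conf/spark-env.sh; only these three lines, and the comments are just my reading of what each one controls:

    # conf/spark-env.sh (relevant lines only)
    export SPARK_EXECUTOR_MEMORY=2G   # memory requested per executor
    export SPARK_DRIVER_MEMORY=1G     # memory for the driver (the spark-shell process)
    export SPARK_WORKER_MEMORY=4G     # total memory a worker may hand out to executors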