Hi Akhil et al.,

I made the following changes. In spark-env.sh I added the following three entries (standalone mode):

export SPARK_MASTER_IP=pzxnvm2018.x.y.name.org
export SPARK_WORKER_MEMORY=4G
export SPARK_WORKER_CORES=3

I then use the start-master and start-slaves commands to start the services.

Another thing I have noticed is that the number of cores I specified is not being used: 2022 shows up with only 1 core, while 2023 and 2024 show up with 4 cores.

In the Web UI:
URL: spark://pzxnvm2018.x.y.name.org:7077

/etc/hosts on my master node has the following entry:
master-ip pzxnvm2018.x.y.name.org pzxnvm2018

/etc/hosts on a worker node has the following entry:
worker-ip pzxnvm2023.x.y.name.org pzxnvm2023

I run the spark-shell command from pzxnvm2018.
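The invocation is along these lines (shown for completeness rather than as a verbatim transcript; --master is the standard way to point spark-shell at a standalone master):

    ./bin/spark-shell --master spark://pzxnvm2018.x.y.name.org:7077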
However, in my master node log file I still see this:

ERROR EndpointWriter: AssociationError [akka.tcp://sparkmas...@pzxnvm2018.x.y.name.org:7077] -> [akka.tcp://spark@localhost:43569]: Error [Association failed with [akka.tcp://spark@localhost:43569]]

My spark-shell shows the following output:

scala> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140708100139-0000
14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/0 on worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/0 on hostPort pzxnvm2024.x.y.name.org:50218 with 4 cores, 512.0 MB RAM
14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/1 on worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/1 on hostPort pzxnvm2023.x.y.name.org:38294 with 4 cores, 512.0 MB RAM
14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/2 on worker-20140708095559-pzxnvm2022.x.y.name.org-41826 (pzxnvm2022.dcld.pldc.kp.org:41826) with 1 cores
14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/2 on hostPort pzxnvm2022.x.y.name.org:41826 with 1 cores, 512.0 MB RAM
14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/0 is now RUNNING
14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/1 is now RUNNING
14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/2 is now RUNNING
14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/0 is now FAILED (Command exited with code 1)
14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Executor app-20140708100139-0000/0 removed: Command exited with code 1
14/07/08 10:01:42 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/3 on worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/3 on hostPort pzxnvm2024.x.y.name.org:50218 with 4 cores, 512.0 MB RAM
14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/3 is now RUNNING
14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/1 is now FAILED (Command exited with code 1)
14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Executor app-20140708100139-0000/1 removed: Command exited with code 1
14/07/08 10:01:42 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/4 on worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/4 on hostPort pzxnvm2023.x.y.name.org:38294 with 4 cores, 512.0 MB RAM
14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/4 is now RUNNING
14/07/08 10:01:43 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/2 is now FAILED (Command exited with code 1)
14/07/08 10:01:43 INFO SparkDeploySchedulerBackend: Executor app-20140708100139-0000/2 removed: Command exited with code 1
14/07/08 10:01:43 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/5 on worker-20140708095559-pzxnvm2022.x.y.name.org-41826 (pzxnvm2022.dcld.pldc.kp.org:41826) with 1 cores
14/07/08 10:01:43 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/5 on hostPort pzxnvm2022.x.y.name.org:41826 with 1 cores, 512.0 MB RAM
14/07/08 10:01:43 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/5 is now RUNNING
14/07/08 10:01:44 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/3 is now FAILED (Command exited with code 1)
14/07/08 10:01:44 INFO SparkDeploySchedulerBackend: Executor app-20140708100139-0000/3 removed: Command exited with code 1
14/07/08 10:01:44 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/6 on worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
14/07/08 10:01:44 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/6 on hostPort pzxnvm2024.x.y.name.org:50218 with 4 cores, 512.0 MB RAM
14/07/08 10:01:44 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/6 is now RUNNING
14/07/08 10:01:45 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/4 is now FAILED (Command exited with code 1)
14/07/08 10:01:45 INFO SparkDeploySchedulerBackend: Executor app-20140708100139-0000/4 removed: Command exited with code 1
14/07/08 10:01:45 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/7 on worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
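One detail in that AssociationError stands out: the failing endpoint is spark@localhost:43569, i.e. the driver appears to be registering itself as localhost rather than by its hostname. A guess I plan to try (untested; SPARK_LOCAL_IP is the documented knob for the address Spark binds to on a node, and the hostname below is simply my driver/master machine):

    # conf/spark-env.sh on the node where spark-shell runs (untested guess);
    # the name must resolve to an address the workers can reach
    export SPARK_LOCAL_IP=pzxnvm2018.x.y.name.org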
Date: Tue, 8 Jul 2014 12:29:21 +0530
Subject: Re: Spark: All masters are unresponsive!
From: ak...@sigmoidanalytics.com
To: user@spark.apache.org

Are you sure this is your master URL: spark://pzxnvm2018:7077? You can look it up in the Web UI (most likely http://pzxnvm2018:8080), top left corner. Also make sure you are able to telnet pzxnvm2018 7077 from the machines where you are running the spark shell.

Thanks
Best Regards

On Tue, Jul 8, 2014 at 12:21 PM, Sameer Tilak <ssti...@live.com> wrote:

Hi All,

I am having a few issues with stability and scheduling on a small 4-node PoC cluster. I tried both manual and script-based cluster setup (bring-up commands below), and I also tried using the FQDN when specifying the master node, but no luck. When I use spark-shell to submit my application, it crashes with the error output shown below.
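First, for completeness, the script-based bring-up (the stock sequence, run from the master node; this assumes the standard sbin layout with the three workers listed in conf/slaves):

    ./sbin/start-master.sh
    ./sbin/start-slaves.sh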
14/07/07 23:44:35 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MappedRDD[6] at map at JaccardScore.scala:83)
14/07/07 23:44:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:0 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:0 as 2322 bytes in 0 ms
14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:1 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:1 as 2322 bytes in 0 ms
14/07/07 23:44:35 INFO Executor: Running task ID 1
14/07/07 23:44:35 INFO Executor: Running task ID 2
14/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally
14/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally
14/07/07 23:44:35 INFO HadoopRDD: Input split: hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:0+97239389
14/07/07 23:44:35 INFO HadoopRDD: Input split: hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:97239389+97239390
14/07/07 23:44:54 INFO AppClient$ClientActor: Connecting to master spark://pzxnvm2018:7077...
14/07/07 23:45:14 INFO AppClient$ClientActor: Connecting to master spark://pzxnvm2018:7077...
14/07/07 23:45:35 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
14/07/07 23:45:35 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
14/07/07 23:45:35 WARN HadoopRDD: Exception in RecordReader.close()
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
    at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:2135)
    at java.io.FilterInputStream.close(FilterInputStream.java:181)
    at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
    at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:168)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:208)
    at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
    at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:193)
    at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
    at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:113)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
14/07/07 23:45:35 ERROR Executor: Exception in task ID 2
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
    at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2213)
    at java.io.DataInputStream.read(DataInputStream.java:100)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
    at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)