Hello, has anyone run into this problem before? Sorry to insist, but I cannot figure out what is happening. Should I ask on the dev mailing list? Many thanks in advance.

On 05/03/2014 23:57, "Christian" <chri...@gmail.com> wrote:
> I have deployed a Spark cluster in standalone mode with 3 machines:
>
> node1/192.168.1.2 -> master
> node2/192.168.1.3 -> worker, 20 cores, 12g
> node3/192.168.1.4 -> worker, 20 cores, 12g
>
> The web interface shows the workers correctly.
>
> When I launch the Scala job (which only requires 256m of memory), these
> are the logs:
>
> 14/03/05 23:24:06 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 55 tasks
> 14/03/05 23:24:21 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
> 14/03/05 23:24:23 INFO client.AppClient$ClientActor: Connecting to master spark://node1:7077...
> 14/03/05 23:24:36 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
> 14/03/05 23:24:43 INFO client.AppClient$ClientActor: Connecting to master spark://node1:7077...
> 14/03/05 23:24:51 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
> 14/03/05 23:25:03 ERROR client.AppClient$ClientActor: All masters are unresponsive! Giving up.
> 14/03/05 23:25:03 ERROR cluster.SparkDeploySchedulerBackend: Spark cluster looks dead, giving up.
> 14/03/05 23:25:03 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 0.0 from pool
> 14/03/05 23:25:03 INFO scheduler.DAGScheduler: Failed to run saveAsNewAPIHadoopFile at CondelCalc.scala:146
> Exception in thread "main" org.apache.spark.SparkException: Job aborted: Spark cluster looks down
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
> ...
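Given that the driver reports "All masters are unresponsive" even though the web UI looks fine, a first sanity check could be whether the master's RPC port is reachable at all from the machine running the driver. A minimal sketch (host and port taken from my setup above; `port_open` is just an illustrative helper, not part of Spark):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False

# Check the standalone master's RPC port (names from my cluster)
print(port_open("node1", 7077))
```

If this prints False from the driver host while the master process is clearly running, the problem is network/naming rather than Spark itself.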
> The logs generated by the master and the 2 workers are attached, but I
> found something odd in the master logs:
>
> 14/03/05 23:37:43 INFO master.Master: Registering worker *node1:57297* with 20 cores, 12.0 GB RAM
> 14/03/05 23:37:43 INFO master.Master: Registering worker *node1:34188* with 20 cores, 12.0 GB RAM
>
> It reports the two workers as node1:57297 and node1:34188 instead of
> node3 and node2 respectively.
>
> $ cat /etc/hosts
> ...
> 192.168.1.2  node1
> 192.168.1.3  node2
> 192.168.1.4  node3
> ...
>
> $ nslookup node2
> Server:   192.168.1.1
> Address:  192.168.1.1#53
>
> Name:     node2.cluster.local
> Address:  192.168.1.3
>
> $ nslookup node3
> Server:   192.168.1.1
> Address:  192.168.1.1#53
>
> Name:     node3.cluster.local
> Address:  192.168.1.4
>
> $ ssh node1 "ps aux | grep spark"
> cperez 17023 1.4 0.1 4691944 154532 pts/3 Sl 23:37 0:15 /data/users/cperez/opt/jdk/bin/java -cp :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip node1 --port 7077 --webui-port 8080
>
> $ ssh node2 "ps aux | grep spark"
> cperez 17511 2.7 0.1 4625248 156304 ? Sl 23:37 0:07 /data/users/cperez/opt/jdk/bin/java -cp :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://node1:7077
>
> $ ssh node2 "netstat -lptun | grep 17511"
> tcp   0   0 :::8081                    :::*   LISTEN   17511/java
> tcp   0   0 ::ffff:192.168.1.3:34188   :::*   LISTEN   17511/java
>
> $ ssh node3 "ps aux | grep spark"
> cperez 7543 1.9 0.1 4625248 158600 ? Sl 23:37 0:09 /data/users/cperez/opt/jdk/bin/java -cp :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://node1:7077
>
> $ ssh node3 "netstat -lptun | grep 7543"
> tcp   0   0 :::8081                    :::*   LISTEN   7543/java
> tcp   0   0 ::ffff:192.168.1.4:57297   :::*   LISTEN   7543/java
>
> I am completely stuck on this; any help would be much appreciated.
> Many thanks in advance.
> Christian
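Since the master registers both workers under the name node1 (with the workers' ephemeral ports), one thing still worth checking is whether reverse DNS on the worker IPs resolves to node1. A minimal sketch of comparing forward and reverse lookups (IPs from the /etc/hosts above; `reverse_lookup` is just an illustrative helper):

```python
import socket

def reverse_lookup(ip: str):
    """Return the PTR hostname for ip, or None if no reverse record exists."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # no PTR record / lookup failure
        return None

for ip in ("192.168.1.2", "192.168.1.3", "192.168.1.4"):
    print(ip, "->", reverse_lookup(ip))
```

If 192.168.1.3 and 192.168.1.4 both resolve back to node1, that would explain the master's log lines.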
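In case the hostname resolution turns out to be the issue, one workaround sometimes suggested for standalone mode is to pin the address each daemon binds to via SPARK_LOCAL_IP in conf/spark-env.sh on every node. A sketch (the IP shown is node2's from my /etc/hosts; it would need to be adjusted per machine):

```shell
# conf/spark-env.sh on node2 -- adjust the IP per machine.
# SPARK_LOCAL_IP forces the daemon to bind to this address instead of
# whatever the local hostname happens to resolve to.
export SPARK_LOCAL_IP=192.168.1.3
```

I have not tried this yet, so this is only a guess at a workaround, not a confirmed fix.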