I have run into the same problem on Spark 0.9. The master lost all of its workers because the workers' heartbeats timed out. When a worker restarts, the master logs "Registering worker 10.2.6.134:56158 with 24 cores, 32.0 GB RAM", but it never adds the restarted worker's id to its worker set.
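In case it helps anyone reason about the symptom, here is a minimal sketch of the failure mode I am describing. This is a toy model, not Spark's actual Master code: ToyMaster, WorkerInfo, and the map names are made up for illustration. The idea is that the master tracks workers by id and by address, refuses to register a restarted worker that comes back on an address it already knows, and so the heartbeats the new worker instance sends under its freshly generated id are reported as coming from an unregistered worker.

    // Toy model (hypothetical, not Spark's source) of a master that tracks workers
    // by id and by address. Re-registration from a known address is refused, so the
    // restarted worker's new id never enters the worker set and its heartbeats are
    // rejected as coming from an unregistered worker.
    import scala.collection.mutable

    case class WorkerInfo(id: String, host: String, port: Int)

    class ToyMaster {
      private val idToWorker = mutable.HashMap[String, WorkerInfo]()
      private val addressToWorker = mutable.HashMap[(String, Int), WorkerInfo]()

      def registerWorker(w: WorkerInfo): Unit = {
        println(s"Registering worker ${w.host}:${w.port}")
        if (addressToWorker.contains((w.host, w.port))) {
          // The old entry is kept; the new worker id is never added to the set.
          println(s"Attempted to re-register worker at same address: ${w.host}:${w.port}")
        } else {
          idToWorker(w.id) = w
          addressToWorker((w.host, w.port)) = w
        }
      }

      def heartbeat(workerId: String): Unit = {
        if (!idToWorker.contains(workerId)) {
          println(s"Got heartbeat from unregistered worker $workerId")
        }
      }
    }

    object ToyMasterDemo extends App {
      val master = new ToyMaster
      master.registerWorker(WorkerInfo("worker-A", "10.2.6.134", 56158)) // initial registration
      master.registerWorker(WorkerInfo("worker-B", "10.2.6.134", 56158)) // restarted worker, same address
      master.heartbeat("worker-B") // prints "Got heartbeat from unregistered worker worker-B"
    }

Under that toy model the master keeps warning about heartbeats from an unregistered worker until the old entry times out, which looks like what both my master log and the one quoted below show.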
On Thu, Feb 27, 2014 at 8:14 AM, Shirish <shirish.ku...@gmail.com> wrote:
> I am a newbie!! I am running Spark 0.9.0 in standalone mode on my Mac. The
> master and worker run on the same machine. Both of them start up fine (at
> least that is what I see in the log).
>
> *Upon start-up, the master log is:*
>
> 14/02/26 15:38:08 INFO Slf4jLogger: Slf4jLogger started
> 14/02/26 15:38:08 INFO Remoting: Starting remoting
> 14/02/26 15:38:08 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@Shirishs-MacBook-Pro.local:7077]
> 14/02/26 15:38:08 INFO Master: Starting Spark master at spark://Shirishs-MacBook-Pro.local:7077
> 14/02/26 15:38:08 INFO MasterWebUI: Started Master web UI at http://192.168.1.106:8080
> 14/02/26 15:38:08 INFO Master: I have been elected leader! New state: ALIVE
> 14/02/26 15:38:22 INFO Master: Registering worker Shirishs-MacBook-Pro.local:56830 with 4 cores, 15.0 GB RAM
>
> *and the worker log is:*
>
> 14/02/26 15:38:21 INFO Slf4jLogger: Slf4jLogger started
> 14/02/26 15:38:21 INFO Remoting: Starting remoting
> 14/02/26 15:38:21 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker@192.168.1.106:56830]
> 14/02/26 15:38:21 INFO Worker: Starting Spark worker 192.168.1.106:56830 with 4 cores, 15.0 GB RAM
> 14/02/26 15:38:21 INFO Worker: Spark home: /Users/shirish_kumar/Developer/spark-0.9.0-incubating
> 14/02/26 15:38:22 INFO WorkerWebUI: Started Worker web UI at http://192.168.1.106:8081
> 14/02/26 15:38:22 INFO Worker: Connecting to master spark://Shirishs-MacBook-Pro.local:7077...
> 14/02/26 15:38:22 INFO Worker: Successfully registered with master spark://Shirishs-MacBook-Pro.local:7077
>
> When I launch my job using:
>
> ./bin/spark-class org.apache.spark.deploy.Client launch spark://Shirishs-MacBook-Pro.local:7077 file:///Users/shirish_kumar/Developer/spark_app/SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar SimpleApp
>
> *Here is what I see in the master log:*
>
> 14/02/26 15:38:36 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
> 14/02/26 15:38:36 INFO Master: Launching driver driver-20140226153836-0000 on worker worker-20140226153821-192.168.1.106-56830
> 14/02/26 15:38:39 INFO Master: Registering worker Shirishs-MacBook-Pro.local:56830 with 4 cores, 15.0 GB RAM
> 14/02/26 15:38:39 INFO Master: Attempted to re-register worker at same address: akka.tcp://sparkWorker@192.168.1.106:56830
> 14/02/26 15:38:39 WARN Master: Got heartbeat from unregistered worker worker-20140226153839-192.168.1.106-56830
> 14/02/26 15:38:42 INFO Master: akka.tcp://driverClient@192.168.1.106:56834 got disassociated, removing it.
> 14/02/26 15:38:42 INFO Master: akka.tcp://driverClient@192.168.1.106:56834 got disassociated, removing it.
> 14/02/26 15:38:42 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40192.168.1.106%3A56835-2#330912359] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/02/26 15:38:42 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@Shirishs-MacBook-Pro.local:7077] -> [akka.tcp://driverClient@192.168.1.106:56834]: Error [Association failed with [akka.tcp://driverClient@192.168.1.106:56834]] [
> akka.remote.EndpointAssociationException: Association failed with [akka.tcp://driverClient@192.168.1.106:56834]
> Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /192.168.1.106:56834
> ]
> 14/02/26 15:38:42 INFO Master: akka.tcp://driverClient@192.168.1.106:56834 got disassociated, removing it.
> 14/02/26 15:38:42 INFO Master: akka.tcp://driverClient@192.168.1.106:56834 got disassociated, removing it.
> 14/02/26 15:38:42 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@Shirishs-MacBook-Pro.local:7077] -> [akka.tcp://driverClient@192.168.1.106:56834]: Error [Association failed with [akka.tcp://driverClient@192.168.1.106:56834]] [
> akka.remote.EndpointAssociationException: Association failed with [akka.tcp://driverClient@192.168.1.106:56834]
> Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /192.168.1.106:56834
> ]
> 14/02/26 15:38:42 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@Shirishs-MacBook-Pro.local:7077] -> [akka.tcp://driverClient@192.168.1.106:56834]: Error [Association failed with [akka.tcp://driverClient@192.168.1.106:56834]] [
> akka.remote.EndpointAssociationException: Association failed with [akka.tcp://driverClient@192.168.1.106:56834]
> Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /192.168.1.106:56834
> ]
> 14/02/26 15:38:42 INFO Master: akka.tcp://driverClient@192.168.1.106:56834 got disassociated, removing it.
> 14/02/26 15:40:52 WARN Master: Got heartbeat from unregistered worker worker-20140226153839-192.168.1.106-56830
> 14/02/26 15:41:09 WARN Master: Got heartbeat from unregistered worker worker-20140226153839-192.168.1.106-56830
>
> *The worker log is:*
>
> 14/02/26 15:38:36 INFO Worker: Asked to launch driver driver-20140226153836-0000
> 2014-02-26 15:38:36.790 java[14619:3c0b] Unable to load realm info from SCDynamicStore
> 14/02/26 15:38:36 INFO DriverRunner: Copying user jar file:/Users/shirish_kumar/Developer/spark_app/SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar to /Users/shirish_kumar/Developer/spark-0.9.0-incubating/work/driver-20140226153836-0000/simple-project_2.10-1.0.jar
> 14/02/26 15:38:37 INFO DriverRunner: Launch Command: "/Library/Java/JavaVirtualMachines/jdk1.7.0_40.jdk/Contents/Home/bin/java" "-cp" ":/Users/shirish_kumar/Developer/spark-0.9.0-incubating/work/driver-20140226153836-0000/simple-project_2.10-1.0.jar:/Users/shirish_kumar/Developer/spark-0.9.0-incubating/conf:/Users/shirish_kumar/Developer/spark-0.9.0-incubating/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop1.0.4.jar" "-Dspark.worker.timeout=600" "-Dspark.akka.timeout=200" "-Dspark.worker.timeout=600" "-Dspark.akka.timeout=200" "-Xms512M" "-Xmx512M" "org.apache.spark.deploy.worker.DriverWrapper" "akka.tcp://sparkWorker@192.168.1.106:56830/user/Worker" "SimpleApp"
> 14/02/26 15:38:39 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val)
> scala.MatchError: FAILED (of class scala.Enumeration$Val)
>         at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 14/02/26 15:38:39 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40192.168.1.106%3A56838-2#531095069] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/02/26 15:38:39 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@192.168.1.106:56830] -> [akka.tcp://Driver@192.168.1.106:56836]: Error [Association failed with [akka.tcp://Driver@192.168.1.106:56836]] [
> akka.remote.EndpointAssociationException: Association failed with [akka.tcp://Driver@192.168.1.106:56836]
> Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /192.168.1.106:56836
> ]
> 14/02/26 15:38:39 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@192.168.1.106:56830] -> [akka.tcp://Driver@192.168.1.106:56836]: Error [Association failed with [akka.tcp://Driver@192.168.1.106:56836]] [
> akka.remote.EndpointAssociationException: Association failed with [akka.tcp://Driver@192.168.1.106:56836]
> Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /192.168.1.106:56836
> ]
> 14/02/26 15:38:39 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@192.168.1.106:56830] -> [akka.tcp://Driver@192.168.1.106:56836]: Error [Association failed with [akka.tcp://Driver@192.168.1.106:56836]] [
> akka.remote.EndpointAssociationException: Association failed with [akka.tcp://Driver@192.168.1.106:56836]
> Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /192.168.1.106:56836
> ]
> 14/02/26 15:38:39 INFO Worker: Starting Spark worker 192.168.1.106:56830 with 4 cores, 15.0 GB RAM
> 14/02/26 15:38:39 INFO Worker: Spark home: /Users/shirish_kumar/Developer/spark-0.9.0-incubating
> 14/02/26 15:38:39 INFO WorkerWebUI: Started Worker web UI at http://192.168.1.106:8081
> 14/02/26 15:38:39 INFO Worker: Connecting to master spark://Shirishs-MacBook-Pro.local:7077...
> 14/02/26 15:38:39 INFO Worker: Successfully registered with master spark://Shirishs-MacBook-Pro.local:7077
>
> The web UI (port 8080) shows the worker as dead, the "new" worker never gets registered, and I can no longer submit any jobs.
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/worker-keeps-getting-disassociated-upon-a-failed-job-spark-version-0-90-tp2099.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
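The quoted worker log also shows where things appear to start going wrong: the scala.MatchError FAILED thrown from the Worker's receive at Worker.scala:277 kills the worker actor, which seems to be why the worker immediately restarts and tries to re-register. Below is a minimal, self-contained sketch of that kind of failure; it is hypothetical code, not the actual Worker.scala, and the DriverState enumeration is only stubbed out here to reproduce the error message.

    // Standalone sketch: a non-exhaustive match on an Enumeration value throws
    // scala.MatchError: FAILED (of class scala.Enumeration$Val), the same error
    // reported in the worker log above.
    object DriverState extends Enumeration {
      val SUBMITTED, RUNNING, FINISHED, FAILED, ERROR = Value
    }

    object MatchErrorDemo {
      // Handles only some driver states; FAILED has no case, so passing it in
      // throws a MatchError instead of being processed.
      def onDriverStateChanged(state: DriverState.Value): Unit = state match {
        case DriverState.FINISHED => println("driver finished cleanly")
        case DriverState.ERROR    => println("driver could not be started")
      }

      def main(args: Array[String]): Unit = {
        onDriverStateChanged(DriverState.FAILED)
        // => scala.MatchError: FAILED (of class scala.Enumeration$Val)
      }
    }

Inside an Akka actor, an unhandled exception like that is escalated to the supervisor (the "ERROR OneForOneStrategy" line), the actor is restarted, and the restarted worker then re-registers with the master, which brings us back to the registration problem described at the top of this thread.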