Hi Ted. Taking a look at the logs, I get the feeling that an uncaught exception may be blowing up the SparkContext.stop() method, so it never reaches the point where the context is marked inactive. The line referenced in the trace below (SparkContext.scala:1644) is the call to _dagScheduler.stop().
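For reference, my reading of the shape of stop() in 1.4.x is roughly the following (paraphrased from memory, not the verbatim source, so the elided steps are approximate):

def stop() {
  // The only early return: bail out if stop() has already run.
  if (!stopped.compareAndSet(false, true)) {
    logInfo("SparkContext already stopped.")
    return
  }
  // ... stop the web UI, metrics system, event logger, etc. ...
  if (_dagScheduler != null) {
    _dagScheduler.stop()              // SparkContext.scala:1644 per the trace below
    _dagScheduler = null
  }
  // ... stop SparkEnv, listener bus, etc. ...
  SparkContext.clearActiveContext()   // never reached if anything above throws
  logInfo("Successfully stopped SparkContext")
}

So if _dagScheduler.stop() (or anything else between the compareAndSet and clearActiveContext()) throws, the context stays registered as the active one even though it is stopped.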
15/07/29 15:17:09 INFO Client: Deleting staging directory .sparkStaging/application_1436825124867_0223
15/07/29 15:17:09 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state KILLED!
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
15/07/29 15:17:09 INFO SparkUI: Stopped Spark web UI at http://<address removed>
15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: <address removed>
15/07/29 15:17:09 INFO DAGScheduler: Stopping DAGScheduler
15/07/29 15:17:09 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkYarnAM@<address removed>] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: <address removed>
15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all executors
Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
    at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325)
    at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:190)
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
    ... 6 more
15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Asking each executor to shut down
15/07/29 15:17:09 ERROR YarnScheduler: Lost executor 2 on node09.tresata.com: remote Rpc client disassociated
15/07/29 15:17:09 ERROR YarnScheduler: Lost executor 1 on node06.tresata.com: remote Rpc client disassociated
15/07/29 15:19:09 WARN HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 128182 ms exceeds timeout 120000 ms
15/07/29 15:19:09 ERROR YarnScheduler: Lost an executor 2 (already removed): Executor heartbeat timed out after 128182 ms
15/07/29 15:19:09 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 126799 ms exceeds timeout 120000 ms
15/07/29 15:19:09 ERROR YarnScheduler: Lost an executor 1 (already removed): Executor heartbeat timed out after 126799 ms
15/07/29 15:19:09 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 2
15/07/29 15:19:09 WARN YarnClientSchedulerBackend: Executor to kill 2 does not exist!
15/07/29 15:19:09 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkYarnAM@<address removed>]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /<address removed>
15/07/29 15:19:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: <address removed>

It seems like the call driverEndpoint.askWithRetry[Boolean](StopExecutors) inside the stopExecutors() method of CoarseGrainedSchedulerBackend is throwing an exception (an InterruptedException, per the "Caused by" above), which then bubbles up as a SparkException ("Error asking standalone scheduler to shut down executors"). That exception is not caught inside SparkContext.stop(), so stop() never reaches the clearActiveContext() call.
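If that's right, then naively I'd expect the fix to be wrapping each shutdown step in its own guard, so one failing step (here, the scheduler shutdown racing the killed application master) can't abort the rest of stop(). This is just a sketch of the idea against my paraphrased version above, not a patch against the real source; tryStop is a name I made up (if I remember right, Utils.tryLogNonFatalError already does something similar for some of the steps):

// Sketch only: keep going through the remaining cleanup steps even if one throws,
// so that SparkContext.clearActiveContext() always runs. logWarning comes from the
// Logging trait that SparkContext already mixes in.
private def tryStop(step: String)(body: => Unit): Unit = {
  try {
    body
  } catch {
    case scala.util.control.NonFatal(e) =>
      logWarning(s"Exception while stopping $step, continuing with shutdown", e)
  }
}

tryStop("DAGScheduler") {
  if (_dagScheduler != null) {
    _dagScheduler.stop()   // the SparkException above would be caught and logged here
    _dagScheduler = null
  }
}
// ... remaining steps wrapped the same way ...
SparkContext.clearActiveContext()   // now reached even when a step above fails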
Does this sound right? Also, does bq == be quiet?

Thanks for the reply!

-Andres

On Wed, Jul 29, 2015 at 1:10 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> bq. it seems like we never get to the clearActiveContext() call by the end
>
> Looking at stop() method, there is only one early return
> after stopped.compareAndSet() call.
> Is there any clue from driver log ?
>
> Cheers
>
> On Wed, Jul 29, 2015 at 9:38 AM, Andres Perez <and...@tresata.com> wrote:
>
>> Hi everyone. I'm running into an issue with SparkContexts when running on
>> Yarn. The issue is observable when I reproduce these steps in the
>> spark-shell (version 1.4.1):
>>
>> scala> sc
>> res0: org.apache.spark.SparkContext =
>> org.apache.spark.SparkContext@7b965dee
>>
>> *Note the pointer address of sc.
>>
>> (Then yarn application -kill <application-id> on the corresponding yarn
>> application)
>>
>> scala> val rdd = sc.parallelize(List(1,2,3))
>> java.lang.IllegalStateException: Cannot call methods on a stopped
>> SparkContext
>>   at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
>>   at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1914)
>>   at org.apache.spark.SparkContext.parallelize$default$2(SparkContext.scala:695)
>>   ... 49 elided
>>
>> (Great, the SparkContext has been stopped by the killed yarn application,
>> as expected.)
>>
>> alternatively:
>>
>> scala> sc.stop()
>> 15/07/29 12:10:14 INFO SparkContext: SparkContext already stopped.
>>
>> (OK, so it's confirmed that it has been stopped.)
>>
>> scala> org.apache.spark.SparkContext.getOrCreate
>> res3: org.apache.spark.SparkContext =
>> org.apache.spark.SparkContext@7b965dee
>>
>> (Hm, that's the same SparkContext, note the pointer address.)
>>
>> The issue here is that the SparkContext.getOrCreate method returns either
>> the active SparkContext, if it exists, or creates a new one. Here it is
>> returning the original SparkContext, meaning the one we verified was
>> stopped above is still active. How can we recover from this? We can't use
>> the current one once it's been stopped (unless we allow for multiple
>> contexts to run using the spark.driver.allowMultipleContexts flag, but
>> that's a band-aid solution), and we can't seem to create a new one,
>> because the old one is still marked as active.
>>
>> Digging a little deeper, in the body of the stop() method of SparkContext,
>> it seems like we never get to the clearActiveContext() call by the end,
>> which would have marked the context as inactive. Any future call to
>> stop(), however, will exit early since the stopped variable is true (hence
>> the "SparkContext already stopped." log message). So I don't see any other
>> way to mark the context as not active. Something about how the SparkContext
>> was stopped after killing the yarn application is preventing the
>> SparkContext from cleaning up properly.
>>
>> Any ideas about this?
>>
>> Thanks,
>>
>> Andres
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/stopped-SparkContext-remaining-active-tp24065.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
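P.S. For anyone else hitting this on 1.4.x: the only way I've found to even detect the bad state from application code is to poke the context and catch the IllegalStateException (defaultParallelism is the same call that blew up in the transcript above). This is detection only and doesn't get you a fresh context, since the stopped one is still registered as active; the helper name below is my own, nothing official:

import org.apache.spark.SparkContext

// Returns false once the context has been stopped out from under us:
// assertNotStopped throws IllegalStateException("Cannot call methods on a
// stopped SparkContext"), e.g. via defaultParallelism (see the trace above).
def usable(sc: SparkContext): Boolean =
  try { sc.defaultParallelism; true }
  catch { case _: IllegalStateException => false }

// usable(sc)                          // false after the yarn application is killed
// usable(SparkContext.getOrCreate())  // still false: the same stopped instance comes back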