Hi Ted. Looking at the driver logs, it appears there may be an uncaught
exception blowing up the SparkContext.stop() method, so it never reaches the
line where the context is marked inactive. The line referenced in the stack
trace below (SparkContext.scala:1644) is the call: _dagScheduler.stop()

15/07/29 15:17:09 INFO Client: Deleting staging directory
.sparkStaging/application_1436825124867_0223
15/07/29 15:17:09 ERROR YarnClientSchedulerBackend: Yarn application has
already exited with state KILLED!
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/metrics/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/api,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/static,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/threadDump,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/environment/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/environment,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/rdd,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/pool/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/pool,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/job/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/job,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs,null}
15/07/29 15:17:09 INFO SparkUI: Stopped Spark web UI at http://<address
removed>
15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
ApplicationMaster has disassociated: <address removed>
15/07/29 15:17:09 INFO DAGScheduler: Stopping DAGScheduler
15/07/29 15:17:09 WARN ReliableDeliverySupervisor: Association with remote
system [akka.tcp://sparkYarnAM@<address removed>] has failed, address is
now gated for [5000] ms. Reason is: [Disassociated].
15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
ApplicationMaster has disassociated: <address removed>
15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all
executors
Exception in thread "Yarn application state monitor"
org.apache.spark.SparkException: Error asking standalone scheduler to shut
down executors
        at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
        at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
        at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
        at
org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
        at
org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
        at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
        at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
Caused by: java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325)
        at
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
        at
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
        at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
        at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:190)
15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Asking each executor to shut down
        at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
        at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
        at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
        ... 6 more
15/07/29 15:17:09 ERROR YarnScheduler: Lost executor 2 on node09.tresata.com:
remote Rpc client disassociated
15/07/29 15:17:09 ERROR YarnScheduler: Lost executor 1 on node06.tresata.com:
remote Rpc client disassociated
15/07/29 15:19:09 WARN HeartbeatReceiver: Removing executor 2 with no recent
heartbeats: 128182 ms exceeds timeout 120000 ms
15/07/29 15:19:09 ERROR YarnScheduler: Lost an executor 2 (already
removed): Executor heartbeat timed out after 128182 ms
15/07/29 15:19:09 WARN HeartbeatReceiver: Removing executor 1 with no
recent heartbeats: 126799 ms exceeds timeout 120000 ms
15/07/29 15:19:09 ERROR YarnScheduler: Lost an executor 1 (already
removed): Executor heartbeat timed out after 126799 ms
15/07/29 15:19:09 INFO YarnClientSchedulerBackend: Requesting to kill
executor(s) 2
15/07/29 15:19:09 WARN YarnClientSchedulerBackend: Executor to kill 2 does
not exist!
15/07/29 15:19:09 WARN Remoting: Tried to associate with unreachable remote
address [akka.tcp://sparkYarnAM@<address removed>]. Address is now gated
for 5000 ms, all messages to this address will be delivered to dead
letters. Reason: Connection refused: /<address removed>
15/07/29 15:19:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
ApplicationMaster has disassociated: <address removed>

It seems like the call

driverEndpoint.askWithRetry[Boolean](StopExecutors)

inside the stopExecutors() method of CoarseGrainedSchedulerBackend is hitting
an InterruptedException, which is wrapped and rethrown as a SparkException
("Error asking standalone scheduler to shut down executors"). That exception
is never caught in SparkContext.stop(), so stop() bails out before it reaches
the clearActiveContext() call. Does this sound right?
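
If that reading is right, the obvious hardening would be to make sure the
active-context flag gets cleared even when the scheduler shutdown throws.
Here's a minimal, self-contained sketch of the idea in plain Scala; to be
clear, this is not Spark's actual code, and the names (StopSketch,
stopSchedulers, clearActiveContext, the two flags) are just stand-ins for
illustration:

import java.util.concurrent.atomic.AtomicBoolean

// Toy model of the shutdown path; hypothetical names, not Spark internals.
object StopSketch {
  private val stopped = new AtomicBoolean(false)
  private val active  = new AtomicBoolean(true)

  // Simulates _dagScheduler.stop() failing the way the trace above shows
  // (an InterruptedException surfacing as a SparkException).
  private def stopSchedulers(): Unit =
    throw new RuntimeException("Error asking standalone scheduler to shut down executors")

  private def clearActiveContext(): Unit = active.set(false)

  // What I think happens today: the exception escapes before the context
  // is marked inactive, so getOrCreate keeps handing back the dead context.
  def stopUnguarded(): Unit = {
    if (!stopped.compareAndSet(false, true)) { println("already stopped"); return }
    stopSchedulers()      // throws here...
    clearActiveContext()  // ...so this line is never reached
  }

  // Possible guard: always clear the flag, even if the shutdown blows up.
  def stopGuarded(): Unit = {
    if (!stopped.compareAndSet(false, true)) { println("already stopped"); return }
    try stopSchedulers()
    finally clearActiveContext()
  }

  def main(args: Array[String]): Unit = {
    try stopGuarded()
    catch { case e: Exception => println(s"shutdown failed: ${e.getMessage}") }
    println(s"still marked active? ${active.get()}")  // prints false with the guard
  }
}

If something like that try/finally were around the scheduler teardown in
SparkContext.stop(), a later getOrCreate could construct a fresh context
instead of handing back the stopped one.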

Also, does "bq" stand for "be quiet"?

Thanks for the reply!

-Andres

On Wed, Jul 29, 2015 at 1:10 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. it seems like we never get to the clearActiveContext() call by the end
>
> Looking at stop() method, there is only one early return
> after stopped.compareAndSet() call.
> Is there any clue from driver log ?
>
> Cheers
>
> On Wed, Jul 29, 2015 at 9:38 AM, Andres Perez <and...@tresata.com> wrote:
>
>> Hi everyone. I'm running into an issue with SparkContexts when running on
>> Yarn. The issue is observable when I reproduce these steps in the
>> spark-shell (version 1.4.1):
>>
>> scala> sc
>> res0: org.apache.spark.SparkContext =
>> org.apache.spark.SparkContext@7b965dee
>>
>> *Note the pointer address of sc.
>>
>> (Then yarn application -kill <application-id> on the corresponding yarn
>> application)
>>
>> scala> val rdd = sc.parallelize(List(1,2,3))
>> java.lang.IllegalStateException: Cannot call methods on a stopped
>> SparkContext
>>   at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
>>   at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1914)
>>   at org.apache.spark.SparkContext.parallelize$default$2(SparkContext.scala:695)
>>   ... 49 elided
>>
>> (Great, the SparkContext has been stopped by the killed yarn application,
>> as
>> expected.)
>>
>> alternatively:
>>
>> scala> sc.stop()
>> 15/07/29 12:10:14 INFO SparkContext: SparkContext already stopped.
>>
>> (OK, so it's confirmed that it has been stopped.)
>>
>> scala> org.apache.spark.SparkContext.getOrCreate
>> res3: org.apache.spark.SparkContext =
>> org.apache.spark.SparkContext@7b965dee
>>
>> (Hm, that's the same SparkContext, note the pointer address.)
>>
>> The issue here is that the SparkContext.getOrCreate method returns either
>> the active SparkContext, if it exists, or creates a new one. Here it is
>> returning the original SparkContext, meaning the one we verified was
>> stopped
>> above is still active. How can we recover from this? We can't use the
>> current one once it's been stopped (unless we allow for multiple contexts
>> to
>> run using the spark.driver.allowMultipleContexts flag, but that's a
>> band-aid
>> solution), and we can't seem to create a new one, because the old one is
>> still marked as active.
>>
>> Digging a little deeper, in the body of the stop() method of SparkContext,
>> it seems like we never get to the clearActiveContext() call by the end,
>> which would have marked the context as inactive. Any future call to
>> stop(),
>> however, will exit early since the stopped variable is true (hence the
>> "SparkContext already stopped." log message). So I don't see any other way
>> to mark the context as not active. Something about how the SparkContext
>> was
>> stopped after killing the yarn application is preventing the SparkContext
>> from cleaning up properly.
>>
>> Any ideas about this?
>>
>> Thanks,
>>
>> Andres
>>
>>
>
