Hi Jeff,

Thanks for your response. This happens during long-running yarn-client Spark jobs: everything is going fine, with lots of output in the interpreter log, and then we see a failed message send:
INFO [2016-10-05 17:31:49,586] ({spark-dynamic-executor-allocation} Logging.scala[logInfo]:58) - Requesting to kill executor(s) 202
INFO [2016-10-05 17:31:49,625] ({spark-dynamic-executor-allocation} Logging.scala[logInfo]:58) - Removing executor 202 because it has been idle for 60 seconds (new desired total will be 197)
INFO [2016-10-05 17:31:49,626] ({spark-dynamic-executor-allocation} Logging.scala[logInfo]:58) - Requesting to kill executor(s) 201
WARN [2016-10-05 17:33:49,630] ({spark-dynamic-executor-allocation} Logging.scala[logWarning]:91) - Error sending message [message = RequestExecutors(196,69600,Map....

Then:

org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
        at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
        at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
        at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
        at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:62)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.killExecutors(CoarseGrainedSchedulerBackend.scala:513)
        at org.apache.spark.SparkContext.killExecutors(SparkContext.scala:1472)
        at org.apache.spark.ExecutorAllocationClient$class.killExecutor(ExecutorAllocationClient.scala:61)
        at org.apache.spark.SparkContext.killExecutor(SparkContext.scala:1491)
        at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$removeExecutor(ExecutorAllocationManager.scala:418)
        at org.apache.spark.ExecutorAllocationManager$$anonfun$org$apache$spark$ExecutorAllocationManager$$schedule$1.apply(ExecutorAllocationManager.scala:284)
        at org.apache.spark.ExecutorAllocationManager$$anonfun$org$apache$spark$ExecutorAllocationManager$$schedule$1.apply(ExecutorAllocationManager.scala:280)
        at scala.collection.mutable.MapLike$$anonfun$retain$2.apply(MapLike.scala:213)
        at scala.collection.mutable.MapLike$$anonfun$retain$2.apply(MapLike.scala:212)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at scala.collection.mutable.MapLike$class.retain(MapLike.scala:212)
        at scala.collection.mutable.AbstractMap.retain(Map.scala:91)
        at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:280)
        at org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:224)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        ... 26 more

There is no recovery, even though we see the Spark job still running on the Hadoop cluster. Worse, sometimes the Zeppelin notebook can't be cancelled and we have to restart Zeppelin to reuse the notebook.
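In case it helps narrow this down: the exception says the timeout is controlled by spark.rpc.askTimeout. As far as I understand, that and the related timeouts can be raised either in spark-defaults.conf or as extra properties on the Zeppelin Spark interpreter setting. A rough sketch of what we could try is below; spark.network.timeout and spark.dynamicAllocation.executorIdleTimeout are not mentioned in the error, they are just the related properties, and the values are placeholders, not something we have verified fixes the problem:

  # Sketch only: assumed to go in spark-defaults.conf or the Zeppelin
  # Spark interpreter properties; values are illustrative placeholders.
  spark.rpc.askTimeout                          600s
  spark.network.timeout                         600s
  # Raising the idle timeout (default 60s) would presumably make dynamic
  # allocation try to kill idle executors less often in the first place.
  spark.dynamicAllocation.executorIdleTimeout   300s

From the stack trace, it looks like the kill request from dynamic allocation (the idle-executor removal) is what sends the RequestExecutors message that times out, which is why the idle timeout seems relevant.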
Let me know if you'd like more info/logs.

Thanks,

Mark

On Fri, Oct 7, 2016 at 10:13 PM, Jianfeng (Jeff) Zhang <jzh...@hortonworks.com> wrote:

>
> Could you paste the log ?
>
>
> Best Regard,
> Jeff Zhang
>
>
> From: Mark Libucha <mlibu...@gmail.com>
> Reply-To: "users@zeppelin.apache.org" <users@zeppelin.apache.org>
> Date: Friday, October 7, 2016 at 12:11 AM
> To: "users@zeppelin.apache.org" <users@zeppelin.apache.org>
> Subject: Re: No active SparkContext black hole
>
> Actually, it's stuck in the Running state. Trying to cancel it causes the
> No active SparkContext to appear in the log. Seems like a bug.
>
> On Thu, Oct 6, 2016 at 9:06 AM, Mark Libucha <mlibu...@gmail.com> wrote:
>
>> Hello again,
>>
>> On "longer" running jobs (I'm using yarn-client mode), I sometimes get
>> RPC timeouts. Seems like Zeppelin is losing connectivity with the Spark
>> cluster. I can deal with that.
>>
>> But my notebook has sections stuck in the "Cancel" state, and I can't get
>> them out. When I re-click on cancel, I see "No active SparkContext" in the
>> log. But I can't reload a new instance of the notebook, or kill the one
>> that's stuck, without restarting all of zeppelin.
>>
>> Suggestions?
>>
>> Thanks,
>>
>> Mark
>>
>