Hi Jeff,

Thanks for your response. This happens during long-running yarn-client Spark jobs: everything is going fine, with lots of output in the interpreter log, and then we see a failed message send:
INFO [2016-10-05 17:31:49,586] ({spark-dynamic-executor-allocation} Logging.scala[logInfo]:58) - Requesting to kill executor(s) 202
INFO [2016-10-05 17:31:49,625] ({spark-dynamic-executor-allocation} Logging.scala[logInfo]:58) - Removing executor 202 because it has been idle for 60 seconds (new desired total will be 197)
INFO [2016-10-05 17:31:49,626] ({spark-dynamic-executor-allocation} Logging.scala[logInfo]:58) - Requesting to kill executor(s) 201
WARN [2016-10-05 17:33:49,630] ({spark-dynamic-executor-allocation} Logging.scala[logWarning]:91) - Error sending message [message = RequestExecutors(196,69600,Map....

Then:

org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
        at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
        at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
        at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
        at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:62)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.killExecutors(CoarseGrainedSchedulerBackend.scala:513)
        at org.apache.spark.SparkContext.killExecutors(SparkContext.scala:1472)
        at org.apache.spark.ExecutorAllocationClient$class.killExecutor(ExecutorAllocationClient.scala:61)
        at org.apache.spark.SparkContext.killExecutor(SparkContext.scala:1491)
        at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$removeExecutor(ExecutorAllocationManager.scala:418)
        at org.apache.spark.ExecutorAllocationManager$$anonfun$org$apache$spark$ExecutorAllocationManager$$schedule$1.apply(ExecutorAllocationManager.scala:284)
        at org.apache.spark.ExecutorAllocationManager$$anonfun$org$apache$spark$ExecutorAllocationManager$$schedule$1.apply(ExecutorAllocationManager.scala:280)
        at scala.collection.mutable.MapLike$$anonfun$retain$2.apply(MapLike.scala:213)
        at scala.collection.mutable.MapLike$$anonfun$retain$2.apply(MapLike.scala:212)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at scala.collection.mutable.MapLike$class.retain(MapLike.scala:212)
        at scala.collection.mutable.AbstractMap.retain(Map.scala:91)
        at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:280)
        at org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:224)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        ... 26 more

There is no recovery, even though we see the Spark job still running on the Hadoop cluster. Worse, sometimes the Zeppelin notebook can't be cancelled and we have to restart Zeppelin to reuse the notebook.
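In case it helps narrow this down: the exception says the timeout is controlled by spark.rpc.askTimeout. As far as I understand, that and the related timeouts can be raised either in spark-defaults.conf or as extra properties on the Zeppelin Spark interpreter setting. A rough sketch of what we could try is below; spark.network.timeout and spark.dynamicAllocation.executorIdleTimeout are not mentioned in the error, they are just the related properties, and the values are placeholders, not something we have verified fixes the problem:

  # Sketch only: assumed to go in spark-defaults.conf or the Zeppelin
  # Spark interpreter properties; values are illustrative placeholders.
  spark.rpc.askTimeout                          600s
  spark.network.timeout                         600s
  # Raising the idle timeout (default 60s) would presumably make dynamic
  # allocation try to kill idle executors less often in the first place.
  spark.dynamicAllocation.executorIdleTimeout   300s

From the stack trace, it looks like the kill request from dynamic allocation (the idle-executor removal) is what sends the RequestExecutors message that times out, which is why the idle timeout seems relevant.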
Let me know if you'd like more info/logs.

Thanks,

Mark

On Fri, Oct 7, 2016 at 10:13 PM, Jianfeng (Jeff) Zhang <jzh...@hortonworks.com> wrote:

>
> Could you paste the log ?
>
>
> Best Regard,
> Jeff Zhang
>
>
> From: Mark Libucha <mlibu...@gmail.com>
> Reply-To: "users@zeppelin.apache.org" <users@zeppelin.apache.org>
> Date: Friday, October 7, 2016 at 12:11 AM
> To: "users@zeppelin.apache.org" <users@zeppelin.apache.org>
> Subject: Re: No active SparkContext black hole
>
> Actually, it's stuck in the Running state. Trying to cancel it causes the
> No active SparkContext to appear in the log. Seems like a bug.
>
> On Thu, Oct 6, 2016 at 9:06 AM, Mark Libucha <mlibu...@gmail.com> wrote:
>
>> Hello again,
>>
>> On "longer" running jobs (I'm using yarn-client mode), I sometimes get
>> RPC timeouts. Seems like Zeppelin is losing connectivity with the Spark
>> cluster. I can deal with that.
>>
>> But my notebook has sections stuck in the "Cancel" state, and I can't get
>> them out. When I re-click on cancel, I see "No active SparkContext" in the
>> log. But I can't reload a new instance of the notebook, or kill the one
>> that's stuck, without restarting all of zeppelin.
>>
>> Suggestions?
>>
>> Thanks,
>>
>> Mark
>>
>