The end of your example is the same as SPARK-1749. When a Mesos job causes an exception to be thrown in the DAGScheduler, the DAGScheduler has to shut down the SparkContext. As part of that shutdown it tries to kill any running jobs, but Mesos doesn't support that, which is why you see the doCancelAllJobs failure in your stack trace.
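
To make the sequence concrete, here is a rough, self-contained Scala sketch of that control flow. It is not the real Spark code: Backend, MesosLikeBackend, Scheduler and the simplified doCancelAllJobs/processEvent signatures are all made-up stand-ins for the actual 1.0 classes, just to show how the secondary failure arises.

object SchedulerShutdownSketch {

  // Stand-in for a backend that, like the Mesos backend here, cannot kill
  // tasks that have already been launched.
  trait Backend { def killTask(taskId: Long): Unit }

  class MesosLikeBackend extends Backend {
    def killTask(taskId: Long): Unit =
      throw new UnsupportedOperationException("Cannot kill tasks on this backend")
  }

  class Scheduler(backend: Backend) {
    private val runningTasks = List(1L, 2L, 3L)

    // Stand-in for doCancelAllJobs: tries to kill every running task as
    // part of tearing the context down.
    def doCancelAllJobs(): Unit = runningTasks.foreach(backend.killTask)

    // Stand-in for the supervised event loop: any exception here triggers a
    // full shutdown, which in turn calls doCancelAllJobs.
    def processEvent(event: () => Unit): Unit =
      try event() catch {
        case e: Exception =>
          println(s"event processing failed (${e.getMessage}); shutting down SparkContext")
          doCancelAllJobs() // secondary failure: the backend cannot kill tasks
      }
  }

  def main(args: Array[String]): Unit = {
    val scheduler = new Scheduler(new MesosLikeBackend)
    // Simulate the original failure reported by the event processor.
    scheduler.processEvent(() =>
      throw new RuntimeException("EOF reached before Python server acknowledged"))
  }
}

Running that prints the shutdown message and then dies with the UnsupportedOperationException from the fake backend, which is the same masking effect as in your log: the doCancelAllJobs failure hides the error that actually started the shutdown.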
I don't believe any of that is the root cause of your problem, though. That has to be found in whatever is causing the DAGScheduler to need to shut down in the first place.

On Sun, May 25, 2014 at 12:10 PM, Perttu Ranta-aho <perttu.ranta...@gmail.com> wrote:

> Hi,
>
> We have a small Mesos (0.18.1) cluster with 4 nodes. We upgraded to Spark
> 1.0.0-rc9 to overcome some PySpark bugs, but now we are experiencing
> random crashes with almost every job. Local jobs run fine, but the same
> code with the same data set on the Mesos cluster leads to errors like:
>
> 14/05/22 15:03:34 ERROR DAGSchedulerActorSupervisor: eventProcesserActor
> failed due to the error EOF reached before Python server acknowledged;
> shutting down SparkContext
> 14/05/22 15:03:34 INFO DAGScheduler: Failed to run saveAsTextFile at
> NativeMethodAccessorImpl.java:-2
> Traceback (most recent call last):
>   File "tag_prefixes.py", line 58, in <module>
>     tag_prefix_counts.saveAsTextFile('tag_prefix_counts.data')
>   File "/srv/spark/spark-1.0.0-bin-2.0.5-alpha/python/pyspark/rdd.py",
>     line 910, in saveAsTextFile
>     keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
>   File "/srv/spark/spark-1.0.0-bin-2.0.5-alpha/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
>     line 537, in __call__
>   File "/srv/spark/spark-1.0.0-bin-2.0.5-alpha/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py",
>     line 300, in get_return_value
> py4j.protocol.Py4JJavaError14/05/22 15:03:34 INFO TaskSchedulerImpl:
> Cancelling stage 0
> : An error occurred while calling o44.saveAsTextFile.
> : org.apache.spark.SparkException: Job 0 cancelled as part of cancellation
> of all jobs
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
>   at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:998)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:499)
>
> This looks similar to https://issues.apache.org/jira/browse/SPARK-1749,
> except that our code isn't "bad". Furthermore, we are seeing lots of
> Mesos(?) warnings like this:
>
> W0522 14:51:19.045565 10497 sched.cpp:901] Attempting to launch task 869
> with an unknown offer 20140516-155535-170164746-5050-22001-112345
>
> which we didn't see with previous Mesos & Spark versions. There aren't any
> related errors in the Mesos slave logs; instead they report the jobs as
> finished without problems. Scala code seems to run fine, so I don't think
> this is an issue with our Mesos installation.
>
> Any ideas about what might be wrong? Or is this a bug in Spark?
>
>
> -Perttu