The end of your example is the same as SPARK-1749.  When a Mesos job causes
an exception to be thrown in the DAGScheduler, the DAGScheduler has to shut
down the SparkContext.  As part of that shutdown procedure, the DAGScheduler
tries to cancel any running jobs; but the Mesos backend doesn't support
killing running tasks, so that's why you see the failure of doCancelAllJobs
in your stack trace.
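
To make that flow concrete, here is a rough sketch of the cancellation path
(not the actual Spark source: the method names come from the stack trace in
your message, while the backend trait and everything else are made up just to
illustrate the idea):

trait BackendScheduler {
  def cancelTasks(stageId: Int): Unit
}

// Stand-in for the fine-grained Mesos backend, which cannot kill
// already-launched tasks.
class MesosLikeBackend extends BackendScheduler {
  override def cancelTasks(stageId: Int): Unit =
    throw new UnsupportedOperationException("cannot kill running tasks")
}

class SimplifiedDAGScheduler(backend: BackendScheduler, activeJobIds: Seq[Int]) {

  // Called while the SparkContext is being shut down after the event
  // processing actor has failed.
  def doCancelAllJobs(): Unit =
    activeJobIds.foreach { jobId =>
      handleJobCancellation(jobId, "as part of cancellation of all jobs")
    }

  private def handleJobCancellation(jobId: Int, reason: String): Unit =
    failJobAndIndependentStages(jobId, s"Job $jobId cancelled $reason")

  private def failJobAndIndependentStages(jobId: Int, message: String): Unit = {
    println(message)  // e.g. "Job 0 cancelled as part of cancellation of all jobs"
    // Cancelling the job means asking the backend to kill its running tasks;
    // on the Mesos backend that is unsupported, which is the exception you see
    // surfacing from doCancelAllJobs.
    backend.cancelTasks(stageId = 0)  // simplified: the real code walks the job's stages
  }
}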

I don't believe that any of that is the root cause of your problem, though;
the root cause has to be found in whatever is causing the DAGScheduler to
need to shut down in the first place.


On Sun, May 25, 2014 at 12:10 PM, Perttu Ranta-aho <
perttu.ranta...@gmail.com> wrote:

> Hi,
>
> We have a small Mesos (0.18.1) cluster with 4 nodes. We upgraded to Spark
> 1.0.0-rc9 to overcome some PySpark bugs, but now we are experiencing
> random crashes with almost every job. Local jobs run fine, but the same
> code with the same data set on the Mesos cluster leads to errors like:
>
> 14/05/22 15:03:34 ERROR DAGSchedulerActorSupervisor: eventProcesserActor
> failed due to the error EOF reached before Python server acknowledged;
> shutting down SparkContext
> 14/05/22 15:03:34 INFO DAGScheduler: Failed to run saveAsTextFile at
> NativeMethodAccessorImpl.java:-2
> Traceback (most recent call last):
>   File "tag_prefixes.py", line 58, in <module>
>     tag_prefix_counts.saveAsTextFile('tag_prefix_counts.data')
>   File "/srv/spark/spark-1.0.0-bin-2.0.5-alpha/python/pyspark/rdd.py",
> line 910, in saveAsTextFile
>     keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
>   File
> "/srv/spark/spark-1.0.0-bin-2.0.5-alpha/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> line 537, in __call__
>   File
> "/srv/spark/spark-1.0.0-bin-2.0.5-alpha/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py",
> line 300, in get_return_value
> 14/05/22 15:03:34 INFO TaskSchedulerImpl: Cancelling stage 0
> py4j.protocol.Py4JJavaError: An error occurred while calling o44.saveAsTextFile.
> : org.apache.spark.SparkException: Job 0 cancelled as part of cancellation
> of all jobs
>         at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
>         at
> org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:998)
>         at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:499)
>
>
> This looks similar to https://issues.apache.org/jira/browse/SPARK-1749,
> except that in our case the code isn't "bad". Furthermore, we are seeing
> lots of Mesos(?) warnings like this:
>
> W0522 14:51:19.045565 10497 sched.cpp:901] Attempting to launch task 869
> with an unknown offer 20140516-155535-170164746-5050-22001-112345
>
> We didn't see these warnings with previous Mesos & Spark versions. There
> aren't any related errors in the Mesos slave logs; instead they report the
> jobs finishing without problems. Scala code seems to run without problems,
> so I suppose this isn't an issue with our Mesos installation.
>
> Any ideas what might be wrong? Or is this a bug in Spark?
>
>
> -Perttu
>
