Brad and I looked into this error and I have a few hunches about what might be happening.
We didn't observe any failed tasks in the logs. For some reason, the Python driver is failing to acknowledge an accumulator update from a successfully-completed task. Our program doesn't explicitly use accumulators, but it looks like PySpark task results always contain a single Java accumulator with a PythonAccumulatorParam. DAGScheduler doesn't catch the exception thrown by the failed addInPlace accumulator operation, which crashes the entire DAGScheduler and leaves the job hanging.

We should try to identify the root cause of the unacknowledged accumulator updates, but in the meantime it would be a good idea to add more exception handling to ensure that the DAGScheduler's main loop never crashes. This might mask the presence of bugs like this accumulator issue, but it would prevent rare bugs from taking out the entire SparkContext (this is especially important for job servers that share a SparkContext across multiple jobs).

A general fix might be to add a "catch Exception" block in handleTaskCompletion that turns uncaught exceptions into task failures, and possibly a top-level "catch Exception" block in DAGScheduler.run() that causes any uncaught exceptions to immediately cancel/crash all running jobs, so we fail fast instead of hanging; there is a rough sketch of that pattern below the quoted message. It would be nice if there were a way to selectively enable Java-style checked exceptions to avoid introducing these kinds of failure-handling bugs.

On Mon, Mar 3, 2014 at 10:34 AM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:

> Hi All,
>
> After switching from standalone Spark to Mesos I'm experiencing some
> instability. I'm running pyspark interactively through iPython
> notebook, and get this crash non-deterministically (although pretty
> reliably in the first 2000 tasks, often much sooner).
>
> Exception in thread "DAGScheduler" org.apache.spark.SparkException:
> EOF reached before Python server acknowledged
>   at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:340)
>   at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:311)
>   at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:70)
>   at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:253)
>   at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:251)
>   at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
>   at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:772)
>   at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
>   at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
>   at org.apache.spark.Accumulators$.add(Accumulators.scala:251)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:662)
>   at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:437)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
>   at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
>
> I'm running the following software versions on all machines:
> Spark: 0.8.1 (md5: 5d3c56eaf91c7349886d5c70439730b3)
> Mesos: 0.13.0 (md5: 220dc9c1db118bc7599d45631da578b9)
> Python 2.7.3 (Stack Overflow mentioned differing Python versions may be
> to blame --- unless Spark or iPython is specifically invoking an older
> version under the hood, mine are all the same)
> Ubuntu 12.04
>
> I've modified mesos-daemon.sh as follows:
> I had problems launching the cluster with mesos-start-cluster.sh and
> traced the problem to (what seemed to be) a bug in mesos-daemon.sh,
> which used a "--conf" flag that mesos-slave and mesos-master didn't
> recognize. I removed the flag and instead added code to read in
> environment variables from mesos-deploy-env.sh.
> mesos-start-cluster.sh then worked as advertised.
>
> In case it's helpful, I've attached several files as follows:
> * spark_full_output: output of the ipython process where the SparkContext
>   was created
> * mesos-deploy-env.sh: mesos config file from a slave (identical to the
>   master's except for MESOS_MASTER)
> * spark-env.sh: spark config file
> * mesos-master.INFO: log file from mesos-master
> * mesos-master.WARNING: log file from mesos-master
> * mesos-daemon.sh: my modified version of mesos-daemon.sh
>
> In case anybody from Berkeley is interested enough to want to interact
> with my deployment, my office is in Soda Hall, so that can definitely be
> arranged. My apologies if anybody received a duplicate message; I
> encountered some delays and complications while joining the list.
>
> -Brad Miller
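For concreteness, here is a rough, hypothetical sketch of the two-layer defensive handling described above. This is not the actual DAGScheduler code: the Event type, the updateAccumulators callback, and the markTaskFailed/cancelAllJobs helpers are placeholders invented for illustration only.

// Hypothetical sketch only -- not the real DAGScheduler. The event type and
// the markTaskFailed/cancelAllJobs helpers are placeholders for illustration.
object SchedulerLoopSketch {

  sealed trait Event
  // updateAccumulators stands in for the accumulator merge that currently
  // throws when the Python driver never acknowledges the update.
  final case class TaskCompleted(taskId: Long, updateAccumulators: () => Unit) extends Event

  def markTaskFailed(taskId: Long, cause: Throwable): Unit =
    println(s"task $taskId marked failed: ${cause.getMessage}")

  def cancelAllJobs(cause: Throwable): Unit =
    println(s"cancelling all running jobs (fail-fast): ${cause.getMessage}")

  // Layer 1: an exception while processing a single task's result becomes a
  // task failure instead of killing the scheduler thread.
  def handleTaskCompletion(event: TaskCompleted): Unit =
    try {
      event.updateAccumulators()
      // ... the rest of the completion handling would go here ...
    } catch {
      case e: Exception => markTaskFailed(event.taskId, e)
    }

  // Layer 2: anything that still escapes the event loop cancels every running
  // job, so the application fails fast rather than hanging forever.
  def run(events: Iterator[Event]): Unit =
    try {
      events.foreach {
        case e: TaskCompleted => handleTaskCompletion(e)
      }
    } catch {
      case e: Throwable => cancelAllJobs(e)
    }

  def main(args: Array[String]): Unit = {
    val events = Iterator[Event](
      TaskCompleted(1L, () => throw new RuntimeException("EOF reached before Python server acknowledged")),
      TaskCompleted(2L, () => ()) // still gets processed after the failure above
    )
    run(events)
  }
}

The point is just the two layers: a per-event catch turns an exception from one task's result into a task failure, and a top-level catch around the loop fails all jobs immediately instead of silently killing the scheduler thread and leaving the application hung.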