I'm using a current Spark 1.0.0-SNAPSHOT for Hadoop 2.2.0 on Mesos 0.17.0.
If I run a single Spark job, it runs fine on Mesos. Running
multiple Spark jobs also works if I use the coarse-grained mode
("spark.mesos.coarse" = true).
But if I run two Spark jobs in parallel using the fine-grained mode, the
jobs seem to block each other after a few seconds, and in this state the
Mesos UI reports neither idle nor used CPUs.
As soon as I kill one job, the other continues normally. See below for
some log output.
It looks to me as if something strange is happening with the CPU resources.
Can anybody give me a hint about the cause? The jobs read some HDFS
files but have no other communication with external processes.
Any other suggestions on how to analyze this problem would also be welcome.
Thanks,
Martin
-----
Here is the relevant log output from the driver of job 1:
INFO 17:53:09,247 Missing parents for Stage 2: List()
INFO 17:53:09,250 Submitting Stage 2 (MapPartitionsRDD[9] at
mapPartitions at HighTemperatureSpansPerLogfile.java:92), which is now
runnable
INFO 17:53:09,269 Submitting 1 missing tasks from Stage 2
(MapPartitionsRDD[9] at mapPartitions at
HighTemperatureSpansPerLogfile.java:92)
INFO 17:53:09,269 Adding task set 2.0 with 1 tasks
................................................................................
*** at this point the job was killed ***
Log output from the driver of job 2:
INFO 17:53:04,874 Missing parents for Stage 6: List()
INFO 17:53:04,875 Submitting Stage 6 (MappedRDD[23] at values at
ComputeLogFileTimespan.java:71), which is now runnable
INFO 17:53:04,881 Submitting 1 missing tasks from Stage 6
(MappedRDD[23] at values at ComputeLogFileTimespan.java:71)
INFO 17:53:04,882 Adding task set 6.0 with 1 tasks
................................................................................
*** at this point job 1 was killed ***
INFO 18:01:39,307 Starting task 6.0:0 as TID 7 on executor
20140501-141732-308511242-5050-2657-1:myclusternode (PROCESS_LOCAL)
INFO 18:01:39,307 Serialized task 6.0:0 as 3052 bytes in 0 ms
INFO 18:01:39,328 Asked to send map output locations for shuffle 2 to
spark@myclusternode:40542
INFO 18:01:39,328 Size of output statuses for shuffle 2 is 178 bytes