I'm using a current Spark 1.0.0-SNAPSHOT for Hadoop 2.2.0 on Mesos 0.17.0.

If I run a single Spark job, it runs fine on Mesos. Running multiple Spark jobs in parallel also works if I use the coarse-grained mode ("spark.mesos.coarse" = true).
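
For reference, this is roughly how I enable the coarse-grained mode in the driver (simplified sketch; the app name and Mesos master URL are just placeholders):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf conf = new SparkConf()
            .setAppName("MyJob")                      // placeholder name
            .setMaster("mesos://mesos-master:5050")   // placeholder Mesos master URL
            .set("spark.mesos.coarse", "true");       // omit or set to "false" for fine-grained mode
    JavaSparkContext sc = new JavaSparkContext(conf);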

But if I run two Spark jobs in parallel using the fine-grained mode, the jobs seem to block each other after a few seconds.
In this state the Mesos UI reports neither idle nor used CPUs.

As soon as I kill one job, the other continues normally. See below for some log output.
It looks to me as if something strange is happening with the CPU resources.

Can anybody give me a hint about the cause? The jobs read some HDFS files but have no other communication with external processes.
Any other suggestions on how to analyze this problem would also be welcome.

Thanks,

Martin

-----
Here is the relevant log output from the driver of job 1:

INFO 17:53:09,247 Missing parents for Stage 2: List()
INFO 17:53:09,250 Submitting Stage 2 (MapPartitionsRDD[9] at mapPartitions at HighTemperatureSpansPerLogfile.java:92), which is now runnable
INFO 17:53:09,269 Submitting 1 missing tasks from Stage 2 (MapPartitionsRDD[9] at mapPartitions at HighTemperatureSpansPerLogfile.java:92)
INFO 17:53:09,269 Adding task set 2.0 with 1 tasks
................................................................................
*** at this point the job was killed ***


Log output from the driver of job 2:
INFO 17:53:04,874 Missing parents for Stage 6: List()
INFO 17:53:04,875 Submitting Stage 6 (MappedRDD[23] at values at ComputeLogFileTimespan.java:71), which is now runnable
INFO 17:53:04,881 Submitting 1 missing tasks from Stage 6 (MappedRDD[23] at values at ComputeLogFileTimespan.java:71)
INFO 17:53:04,882 Adding task set 6.0 with 1 tasks
................................................................................
*** at this point job 1 was killed ***
INFO 18:01:39,307 Starting task 6.0:0 as TID 7 on executor 20140501-141732-308511242-5050-2657-1:myclusternode (PROCESS_LOCAL)
INFO 18:01:39,307 Serialized task 6.0:0 as 3052 bytes in 0 ms
INFO 18:01:39,328 Asked to send map output locations for shuffle 2 to spark@myclusternode:40542
INFO 18:01:39,328 Size of output statuses for shuffle 2 is 178 bytes
