I'm using a current Spark 1.0.0-SNAPSHOT for Hadoop 2.2.0 on Mesos 0.17.0.
If I run a single Spark job, it runs fine on Mesos. Running
multiple Spark jobs also works if I use the coarse-grained mode
("spark.mesos.coarse" = true).
But if I run two Spark jobs in parallel using the fine-grained mode, the
jobs seem to block each other after a few seconds, and in this state the
Mesos UI reports neither idle nor used CPUs.
As soon as I kill one job, the other continues normally. See below for
some log output.
It looks to me as if something strange is happening with the CPU resources.
Can anybody give me a hint about the cause? The jobs read some HDFS
files but have no other communication with external processes.
Any other suggestions on how to analyze this problem would also be welcome.
Thanks,
Martin
-----
Here is the relevant log output from the driver of job 1:
INFO 17:53:09,247 Missing parents for Stage 2: List()
INFO 17:53:09,250 Submitting Stage 2 (MapPartitionsRDD[9] at
mapPartitions at HighTemperatureSpansPerLogfile.java:92), which is now
runnable
INFO 17:53:09,269 Submitting 1 missing tasks from Stage 2
(MapPartitionsRDD[9] at mapPartitions at
HighTemperatureSpansPerLogfile.java:92)
INFO 17:53:09,269 Adding task set 2.0 with 1 tasks
................................................................................
*** at this point the job was killed ***
Log output from the driver of job 2:
INFO 17:53:04,874 Missing parents for Stage 6: List()
INFO 17:53:04,875 Submitting Stage 6 (MappedRDD[23] at values at
ComputeLogFileTimespan.java:71), which is now runnable
INFO 17:53:04,881 Submitting 1 missing tasks from Stage 6
(MappedRDD[23] at values at ComputeLogFileTimespan.java:71)
INFO 17:53:04,882 Adding task set 6.0 with 1 tasks
................................................................................
*** at this point job 1 was killed ***
INFO 18:01:39,307 Starting task 6.0:0 as TID 7 on executor
20140501-141732-308511242-5050-2657-1:myclusternode (PROCESS_LOCAL)
INFO 18:01:39,307 Serialized task 6.0:0 as 3052 bytes in 0 ms
INFO 18:01:39,328 Asked to send map output locations for shuffle 2 to
spark@myclusternode:40542
INFO 18:01:39,328 Size of output statuses for shuffle 2 is 178 bytes