Andrew,
thanks for your response. When using coarse-grained mode, the jobs run fine.
My problem is with fine-grained mode: there the parallel jobs nearly
always end in a deadlock. It seems to have something to do with
resource allocation, as Mesos shows neither used nor idle CPU resources
in this state. I do not understand what this means.
Any ideas on how to analyze this problem are welcome.
Martin
On 13.05.2014 08:48, Andrew Ash wrote:
Are you setting a core limit with spark.cores.max? If you don't, in
coarse-grained mode each Spark job uses all available cores on Mesos and
doesn't let them go until the job is terminated, at which point the
other job can access the cores.
https://spark.apache.org/docs/latest/running-on-mesos.html -- "Mesos
Run Modes" section
The quick fix should be to set spark.cores.max to half of your
cluster's cores to support running two jobs concurrently.
Alternatively, switching to fine-grained mode would help here too at
the expense of higher latency on startup.
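As a rough sketch of that quick fix (the class name, Mesos master URL, and
the core count of 8 below are only placeholders, not taken from your setup),
the cap could be set in the driver like this:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Hypothetical job setup: cap the cores one job may hold so that
    // two jobs can run on the cluster at the same time.
    public class CappedCoresJob {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("CappedCoresJob")
                    .setMaster("mesos://mesos-master:5050") // placeholder master URL
                    .set("spark.mesos.coarse", "true")      // coarse-grained mode
                    .set("spark.cores.max", "8");           // e.g. half of a 16-core cluster
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... read HDFS files and run the job as usual ...
            sc.stop();
        }
    }

The same two properties can of course also be passed via spark-defaults or
on the command line instead of being hard-coded in the driver.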
On Mon, May 12, 2014 at 12:37 PM, Martin Weindel
<martin.wein...@gmail.com> wrote:
I'm using a current Spark 1.0.0-SNAPSHOT for Hadoop 2.2.0 on Mesos
0.17.0.
If I run a single Spark Job, the job runs fine on Mesos. Running
multiple Spark jobs also works if I'm using coarse-grained
mode ("spark.mesos.coarse" = true).
But if I run two Spark Jobs in parallel using the fine-grained
mode, the jobs seem to block each other after a few seconds.
In this state, the Mesos UI reports neither idle nor used CPUs.
As soon as I kill one job, the other continues normally. See below
for some log output.
Looks to me as if something strange happens with the CPU resources.
Can anybody give me a hint about the cause? The jobs read some
HDFS files, but have no other communication with external processes.
Or any other suggestions on how to analyze this problem?
Thanks,
Martin
-----
Here is the relevant log output of the driver of job1:
INFO 17:53:09,247 Missing parents for Stage 2: List()
INFO 17:53:09,250 Submitting Stage 2 (MapPartitionsRDD[9] at
mapPartitions at HighTemperatureSpansPerLogfile.java:92), which is
now runnable
INFO 17:53:09,269 Submitting 1 missing tasks from Stage 2
(MapPartitionsRDD[9] at mapPartitions at
HighTemperatureSpansPerLogfile.java:92)
INFO 17:53:09,269 Adding task set 2.0 with 1 tasks
................................................................................
*** at this point the job was killed ***
log output of driver of job2:
INFO 17:53:04,874 Missing parents for Stage 6: List()
INFO 17:53:04,875 Submitting Stage 6 (MappedRDD[23] at values at
ComputeLogFileTimespan.java:71), which is now runnable
INFO 17:53:04,881 Submitting 1 missing tasks from Stage 6
(MappedRDD[23] at values at ComputeLogFileTimespan.java:71)
INFO 17:53:04,882 Adding task set 6.0 with 1 tasks
................................................................................
*** at this point job 1 was killed ***
INFO 18:01:39,307 Starting task 6.0:0 as TID 7 on executor
20140501-141732-308511242-5050-2657-1:myclusternode (PROCESS_LOCAL)
INFO 18:01:39,307 Serialized task 6.0:0 as 3052 bytes in 0 ms
INFO 18:01:39,328 Asked to send map output locations for shuffle
2 to spark@myclusternode:40542
INFO 18:01:39,328 Size of output statuses for shuffle 2 is 178 bytes