Are you setting a core limit with spark.cores.max?  If you don't, then in
coarse-grained mode each Spark job grabs all available cores on Mesos and
doesn't release them until the job terminates, at which point the other job
can access the cores.

https://spark.apache.org/docs/latest/running-on-mesos.html -- "Mesos Run
Modes" section

The quick fix should be to set spark.cores.max to half of your cluster's
cores so that two jobs can run concurrently.  Alternatively, switching to
fine-grained mode would help here too, at the expense of higher latency on
task startup.
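
For example, something like this in the driver (just a sketch; the class
name, master URL, and core count are placeholders for your setup -- here I'm
assuming an 8-core cluster, so each job gets 4):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ConfiguredJob {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("HighTemperatureSpansPerLogfile")
                    // placeholder Mesos master URL
                    .setMaster("mesos://<mesos-master-host>:5050")
                    // coarse-grained: executors hold their cores for the app's lifetime
                    .set("spark.mesos.coarse", "true")
                    // cap this job at half the cluster so a second job still gets offers
                    .set("spark.cores.max", "4");

            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... build RDDs and run the job as before ...
            sc.stop();
        }
    }

The same two properties can also go into conf/spark-defaults.conf instead of
the code, so both jobs pick them up without rebuilding anything.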



On Mon, May 12, 2014 at 12:37 PM, Martin Weindel
<martin.wein...@gmail.com> wrote:

>  I'm using a current Spark 1.0.0-SNAPSHOT for Hadoop 2.2.0 on Mesos
> 0.17.0.
>
> If I run a single Spark job, it runs fine on Mesos. Running multiple
> Spark jobs also works if I'm using the coarse-grained mode
> ("spark.mesos.coarse" = true).
>
> But if I run two Spark jobs in parallel using the fine-grained mode, the
> jobs seem to block each other after a few seconds.
> And in this state the Mesos UI reports neither idle nor used CPUs.
>
> As soon as I kill one job, the other continues normally. See below for
> some log output.
> It looks to me as if something strange is happening with the CPU resources.
>
> Can anybody give me a hint about the cause, or any other suggestions on
> how to analyze this problem? The jobs read some HDFS files, but have no
> other communication with external processes.
>
> Thanks,
>
> Martin
>
> -----
> Here is the relevant log output of the driver of job1:
>
> INFO 17:53:09,247 Missing parents for Stage 2: List()
>  INFO 17:53:09,250 Submitting Stage 2 (MapPartitionsRDD[9] at
> mapPartitions at HighTemperatureSpansPerLogfile.java:92), which is now
> runnable
>  INFO 17:53:09,269 Submitting 1 missing tasks from Stage 2
> (MapPartitionsRDD[9] at mapPartitions at
> HighTemperatureSpansPerLogfile.java:92)
>  INFO 17:53:09,269 Adding task set 2.0 with 1 tasks
>
> ................................................................................
>
> *** at this point the job was killed ***
>
>
> log output of driver of job2:
>  INFO 17:53:04,874 Missing parents for Stage 6: List()
>  INFO 17:53:04,875 Submitting Stage 6 (MappedRDD[23] at values at
> ComputeLogFileTimespan.java:71), which is now runnable
>  INFO 17:53:04,881 Submitting 1 missing tasks from Stage 6 (MappedRDD[23]
> at values at ComputeLogFileTimespan.java:71)
>  INFO 17:53:04,882 Adding task set 6.0 with 1 tasks
>
> ................................................................................
>
> *** at this point the job 1 was killed ***
> INFO 18:01:39,307 Starting task 6.0:0 as TID 7 on executor
> 20140501-141732-308511242-5050-2657-1:myclusternode (PROCESS_LOCAL)
>  INFO 18:01:39,307 Serialized task 6.0:0 as 3052 bytes in 0 ms
>  INFO 18:01:39,328 Asked to send map output locations for shuffle 2 to
> spark@myclusternode:40542
>
>  INFO 18:01:39,328 Size of output statuses for shuffle 2 is 178 bytes
>
