Hi,

Currently, I am running Spark with the standalone scheduler on a 3-machine
cluster: one machine runs the Spark Master and the other two each run a
Spark Worker.

We run a machine learning application on this small-scale testbed. A
particular stage in the application is divided into 10 parallel tasks, and
I would like to understand the pros and cons of different cluster
configurations for it.

Conf 1: Multiple executors, each running one task.
Each worker has 5 executors, and each executor has 1 CPU core. In this
configuration, the scheduler gives one task to each executor, so each task
presumably runs in a separate JVM.
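
For concreteness, I believe Conf 1 would be expressed roughly like this
(assuming a Spark version where the standalone master honors
spark.executor.cores; the app name and master URL are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Conf 1: cap each executor at 1 core, 10 cores in total, so the
    // standalone master should spawn 5 one-core executors per worker.
    val conf1 = new SparkConf()
      .setAppName("ml-app")                 // placeholder name
      .setMaster("spark://master:7077")     // placeholder master URL
      .set("spark.executor.cores", "1")
      .set("spark.cores.max", "10")
    val sc = new SparkContext(conf1)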

Conf 2: One executor running multiple tasks.
Each worker has only one executor, and each executor has 5 CPU cores. In
this case, the scheduler gives 5 tasks to each executor, and tasks in the
same executor presumably run in the same process but on different threads.
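
And Conf 2 would, I think, look like this (same placeholders as above):

    import org.apache.spark.{SparkConf, SparkContext}

    // Conf 2: 5 cores per executor, 10 cores in total, so each worker
    // should host a single 5-core executor running 5 task threads.
    val conf2 = new SparkConf()
      .setAppName("ml-app")                 // placeholder name
      .setMaster("spark://master:7077")     // placeholder master URL
      .set("spark.executor.cores", "5")
      .set("spark.cores.max", "10")
    val sc = new SparkContext(conf2)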

I think in many cases Conf 2 is preferable to Conf 1, since tasks in the
same executor share the block manager, so data shared among these tasks
(e.g., broadcast variables) does not need to be transferred multiple
times. However, I am wondering whether there are scenarios where Conf 1 is
preferable, and whether the same conclusion holds when the cluster manager
is YARN or Mesos.
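
To make the broadcast point concrete, here is a toy sketch (the data and
sizes are made up): under Conf 2, the 5 tasks on each worker share the
single in-process copy of the broadcast value, while under Conf 1 each of
the 5 one-core executor JVMs fetches and caches its own copy.

    // 10 tasks all reading one broadcast table. With Conf 2 each
    // executor fetches the table once for its 5 tasks; with Conf 1
    // every 1-core JVM keeps its own copy in its block manager.
    val table = sc.broadcast(Map(1 -> "a", 2 -> "b"))  // placeholder data
    val out = sc.parallelize(1 to 10, 10)              // 10 parallel tasks
      .map(k => table.value.getOrElse(k, "?"))
      .collect()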

Thanks!

Best,
Xiaoye
