Greetings,
I would like to write a script that will launch several different tasks on
multiple nodes that will run simultaneously. Additionally, I would like to
restrict the number of tasks that will run on each node. The following is
pseudo-code for what I have currently, assuming W different tasks running on X
nodes with Y cores per node and a maximum of Z tasks running on each node,
where Z < Y:
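#!/bin/bash
#SBATCH -t <walltime>
#SBATCH -N X
# Launch each of the W tasks as its own single-task job step in the background.
for i in $(seq 1 W); do
srun -N 1 -n 1 --ntasks-per-node Z --exclusive <ith command> &
done
# Block until every background job step has finished.
wait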
I chose X to account for the --ntasks-per-node limit on the srun command.
My intention in writing this was for srun to populate each node with tasks up
to the --ntasks-per-node maximum Z before starting to launch tasks on the next
node. However, when I submitted this job with sbatch and examined the allocated
nodes while the job was running, I found that srun launched a task on every
core of a node before launching tasks on the next node. In other words, the
--ntasks-per-node option on the srun command does not work as I expected it to,
and Y tasks were launched on each node until the desired number of tasks was
launched. As a consequence, multiple nodes at the end of the nodelist were
allocated for the job and left empty, as I selected the number of nodes X
assuming a smaller number of tasks would run on each node.
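To make the mismatch concrete, here is a small shell sketch of the arithmetic. The values of W, Y, and Z are made up for illustration and are not from my actual job, and the sketch assumes X was chosen as W/Z rounded up:

```shell
#!/bin/bash
# Illustrative values only (not from the actual job).
W=24   # total number of tasks
Y=16   # cores per node
Z=4    # intended maximum number of tasks per node (Z < Y)

# Integer ceiling division: smallest n such that n * $2 >= $1.
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

nodes_requested=$(ceil_div "$W" "$Z")   # X, assuming at most Z tasks per node
nodes_used=$(ceil_div "$W" "$Y")        # nodes occupied when Y tasks land on each node
nodes_idle=$(( nodes_requested - nodes_used ))

echo "requested=$nodes_requested used=$nodes_used idle=$nodes_idle"
```

With these example numbers, 6 nodes are requested but only 2 are occupied, leaving 4 idle.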
I figure the job turned out this way because the "--ntasks-per-node Z" option
applies to each invocation of srun individually. As long as no single
invocation of srun launches more than Z tasks, tasks from separate invocations
keep landing on the same node until it is full. Is this correct?
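One way to check how the tasks were actually distributed would be to have each task record its hostname and then tally the results. A minimal sketch, assuming one hostname per line is collected somewhere; the function name tally_per_node, the file name placement.log, and the node names below are all hypothetical:

```shell
#!/bin/bash
# Count how many tasks ran on each node, given one hostname per line on stdin.
# Prints one "<hostname> <count>" line per node, sorted by hostname.
tally_per_node() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Hypothetical sample of what placement.log might contain after five tasks.
printf '%s\n' node01 node01 node01 node02 node02 | tally_per_node
```

In the job script, each task could run "hostname >> placement.log" before its real command, and the log could be piped through tally_per_node once the job finishes. Any node with a count above Z would confirm that the limit is not being applied across invocations.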
Would the following modified pseudo-code accomplish my goal by setting a
tasks-per-node limit for the entire allocation?
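#!/bin/bash
#SBATCH -t <walltime>
#SBATCH -N X
#SBATCH --ntasks-per-node Z
# Launch each of the W tasks as its own single-task job step in the background.
for i in $(seq 1 W); do
srun -N 1 -n 1 --exclusive <ith command> &
done
# Block until every background job step has finished.
wait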
My concern with this option is that, because I am not also setting "#SBATCH
-n", this code will launch the entire script Z times on each node of the
allocation. Is this true?
Is there a better way to go about this?
Thanks very much,
Tyler Jordan