On 7/30/19 6:03 PM, Brian Andrus wrote:
I think this has more to do with how you are calling mpirun and
how it maps processes.
With the "--exclusive" option, the processes are given access
to all the cores on each box, so mpirun has a choice of where to
place them. IIRC, the default is to pack them by slot: fill one
node, then move to the next. You want to map by node instead
(one process per node, cycling through the nodes).
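To make the difference concrete, here is a rough bash sketch that models the two placement policies with the numbers from the quoted report (980 ranks, 62 nodes, 28 cores per node). It is a toy model under those assumptions, not Open MPI's actual mapper, and all variable names are made up for illustration.

```shell
#!/bin/bash
# Toy model only: this is not Open MPI's actual mapper.
nprocs=980   # MPI ranks requested (-n 980)
nnodes=62    # nodes in the allocation
slots=28     # cores per node

# "By slot": fill each node completely before moving to the next.
by_slot=()
left=$nprocs
for ((n = 0; n < nnodes; n++)); do
    take=$(( left < slots ? left : slots ))
    by_slot[n]=$take
    left=$(( left - take ))
done

# "By node": hand out one rank per node, cycling over the nodes.
# (The per-node slot limit is ignored; it is never reached here.)
by_node=()
for ((n = 0; n < nnodes; n++)); do by_node[n]=0; done
for ((r = 0; r < nprocs; r++)); do
    n=$(( r % nnodes ))
    by_node[n]=$(( by_node[n] + 1 ))
done

# Count how many nodes actually received ranks under each policy.
used_slot=0; used_node=0
for ((n = 0; n < nnodes; n++)); do
    if (( by_slot[n] > 0 )); then used_slot=$(( used_slot + 1 )); fi
    if (( by_node[n] > 0 )); then used_node=$(( used_node + 1 )); fi
done

echo "by slot: $used_slot nodes used, ${by_slot[0]} ranks on the first node"
echo "by node: $used_node nodes used, ${by_node[0]} ranks on the first node"
```

Packing by slot uses only 35 of the 62 nodes at 28 ranks each, which matches what was reported. Cycling by node touches all 62 nodes, with 15 or 16 ranks on each.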
From the mpirun man page (Open MPI):

    --map-by <foo>
        Map to the specified object, defaults to socket.
        Supported options include slot, hwthread, core, L1cache,
        L2cache, L3cache, socket, numa, board, node, sequential,
        distance, and ppr. Any object can include modifiers by adding
        a : and any combination of PE=n (bind n processing elements to
        each proc), SPAN (load balance the processes across the
        allocation), OVERSUBSCRIBE (allow more processes on a node
        than processing elements), and NOOVERSUBSCRIBE. This includes
        PPR, where the pattern would be terminated by another colon to
        separate it from the modifiers.

So adding "--map-by node" should give you what you are looking
for. Of course, this syntax is for Open MPI's mpirun command,
so YMMV.
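As a minimal sketch of where that option would go, the batch script below reuses the #SBATCH lines from the quoted message and passes "--map-by node" to mpirun. The application path ./my_mpi_app and the APP variable are hypothetical placeholders, and the fallback branch is only there so the script can be checked outside a Slurm job.

```shell
#!/bin/bash
#SBATCH -n 980
#SBATCH --ntasks-per-node=16
#SBATCH --exclusive

# APP is a hypothetical placeholder; point it at the real MPI binary.
APP=${APP:-./my_mpi_app}

# Ask Open MPI to place ranks round-robin across the nodes instead of
# filling each node before moving to the next. SLURM_NTASKS is set by
# Slurm inside the job; 980 is used as a fallback outside of one.
cmd=(mpirun --map-by node -np "${SLURM_NTASKS:-980}" "$APP")

if command -v mpirun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
    "${cmd[@]}"
else
    # Not inside a Slurm job, or no mpirun on PATH: just show the command.
    echo "would run: ${cmd[*]}"
fi
```

Whether this gives exactly the layout you want depends on the Open MPI version and how it was built against Slurm, so treat it as a starting point.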
If you use srun (as recommended) instead of invoking mpirun
directly, you can still achieve the same thing with exported
environment variables, as per the mpirun man page, like

    OMPI_MCA_rmaps_base_mapping_policy=node \
        srun --export OMPI_MCA_rmaps_base_mapping_policy ...

in your sbatch script.
Brian Andrus
On 7/30/2019 5:14 AM, CB wrote:
Hi Everyone,
I've recently discovered that when an MPI job is
submitted with the --exclusive flag, Slurm fills up each
node even if the --ntasks-per-node flag is used to set how
many MPI processes are scheduled on each node. Without the
--exclusive flag, Slurm works as expected.
Our system is running with Slurm 17.11.7.
With the following options, each node gets 16 MPI processes
until all 980 MPI processes are scheduled, using a total of
62 compute nodes. Each of the first 61 nodes has 16 MPI
processes and the last one has 4 MPI processes, which is 980
MPI processes in total.
#SBATCH -n 980
#SBATCH --ntasks-per-node=16
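The node count above can be double-checked with a few lines of shell arithmetic. This is only a sanity check of the numbers quoted in this message; the variable names are made up for illustration.

```shell
#!/bin/bash
ntasks=980     # from "#SBATCH -n 980"
per_node=16    # from "#SBATCH --ntasks-per-node=16"

# Nodes that are completely filled with 16 ranks each.
full_nodes=$(( ntasks / per_node ))

# Ranks left over for the final, partially filled node.
remainder=$(( ntasks % per_node ))

# Total nodes needed, rounding up.
nodes=$(( (ntasks + per_node - 1) / per_node ))

echo "$full_nodes full nodes, $remainder ranks on the last node, $nodes nodes in total"
```

This prints 61 full nodes, 4 ranks on the last node, and 62 nodes in total.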
However, if the --exclusive option is added, Slurm fills
up each node with 28 MPI processes (each compute node has 28
cores). Interestingly, Slurm still allocates 62 compute
nodes, although only 35 of them are actually used to
run the 980 MPI processes.
#SBATCH -n 980
#SBATCH --ntasks-per-node=16
#SBATCH --exclusive
Has anyone seen this behavior?
- Chansup