Hi All,
We are running Open MPI version 1.10.2, built with support for Slurm
version 16.05.0. When a user specifies "--cpu_bind=none", Open MPI still
tries to bind by core, which segfaults if there are more processes than
cores.
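A possible workaround sketch, assuming the 1.10-series MCA parameter
hwloc_base_binding_policy is the right knob (an assumption; it has not
been verified against this failure), would be to tell Open MPI itself
not to bind:

% export OMPI_MCA_hwloc_base_binding_policy=none
% srun --ntasks-per-node=8 --cpu_bind=none \
    env SHMEM_SYMMETRIC_HEAP_SIZE=1024M bin/all2all.shmem.exe 0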
The user reports:
What I found is that
% srun --ntasks-per-node=8 --cpu_bind=none \
env SHMEM_SYMMETRIC_HEAP_SIZE=1024M bin/all2all.shmem.exe 0
will have the problem, but:
% srun --ntasks-per-node=8 --cpu_bind=none \
env SHMEM_SYMMETRIC_HEAP_SIZE=1024M ./bindit.sh bin/all2all.shmem.exe 0
will run as expected and print the usage message, because I didn’t
provide the right arguments to the code.
So, it appears that the binding has something to do with the issue;
presumably, once numactl has bound each process, Open MPI treats it as
externally bound and skips its own bind-by-core path. My binding script
is as follows:
% cat bindit.sh
#!/bin/bash
#echo SLURM_LOCALID=$SLURM_LOCALID
stride=1
if [ -n "$SLURM_LOCALID" ]; then
    # Pin local rank N to physical CPU N*stride, memory on NUMA node 0
    let bindCPU=$SLURM_LOCALID*$stride
    exec numactl --membind=0 --physcpubind=$bindCPU "$@"
fi
# Not launched under Slurm: run the command unbound
# ("$@" preserves argument quoting, unlike $*)
"$@"
%
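A quick way to sanity-check the wrapper independent of the application
(illustrative; numactl --show is a real flag, but the expected bindings
below assume stride=1 and one task per CPU):

% srun --ntasks-per-node=2 --cpu_bind=none ./bindit.sh numactl --show
# each rank should report physcpubind matching its SLURM_LOCALID,
# i.e. local rank 0 -> CPU 0, local rank 1 -> CPU 1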
--
Andy Riebs
andy.ri...@hpe.com
Hewlett Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!