Hi All,

We are running Open MPI version 1.10.2, built with support for Slurm version 16.05.0. When a user specifies "--cpu_bind=none", Open MPI still tries to bind by core, which segfaults if there are more processes than cores.
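A possible workaround, assuming Open MPI's usual MCA environment variables are honored under srun's direct launch, would be to disable Open MPI's binding policy explicitly via hwloc_base_binding_policy, e.g.

% srun --ntasks-per-node=8 --cpu_bind=none  \
     env OMPI_MCA_hwloc_base_binding_policy=none bin/all2all.shmem.exe 0

though we have not yet verified that this avoids the crash on 1.10.2.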

The user reports:

What I found is that

% srun --ntasks-per-node=8 --cpu_bind=none  \
     env SHMEM_SYMMETRIC_HEAP_SIZE=1024M bin/all2all.shmem.exe 0

will have the problem, but:

% srun --ntasks-per-node=8 --cpu_bind=none  \
     env SHMEM_SYMMETRIC_HEAP_SIZE=1024M ./bindit.sh bin/all2all.shmem.exe 0

will run as expected and print the usage message, because I didn't provide the right arguments to the code.

So, it appears that the binding has something to do with the issue. My binding script is as follows:

% cat bindit.sh
#!/bin/bash
# Bind each Slurm task to its own CPU so ranks don't all land on one core.

#echo SLURM_LOCALID=$SLURM_LOCALID

stride=1

if [ -n "$SLURM_LOCALID" ]; then
   # CPU number = this task's node-local rank times the stride
   bindCPU=$((SLURM_LOCALID * stride))
   # Pin memory to NUMA node 0 and the process to its CPU;
   # "$@" (rather than $*) preserves quoting in the wrapped command.
   exec numactl --membind=0 --physcpubind=$bindCPU "$@"
fi

# Not running under Slurm: just run the command unbound.
exec "$@"

%
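For what it's worth, a quick sanity check on the wrapper (a sketch, reusing the job geometry from above) is to have it wrap numactl --show, which prints each task's effective CPU and memory binding:

% srun --ntasks-per-node=8 --cpu_bind=none ./bindit.sh numactl --show

Each task should then report a physcpubind equal to its SLURM_LOCALID and a membind of node 0.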


--
Andy Riebs
andy.ri...@hpe.com
Hewlett Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
    May the source be with you!
