Couple of things: 1. Please do send the output from ompi_info.
You can find them attached to this email.
2. Please send the Slurm envars from your allocation, i.e., after you do your salloc.
Here is an example:

$ salloc -N 2 -n 20 --qos=debug
salloc: Granted job allocation 1917048
$ srun hostname | sort | uniq -c
     12 cn0331
      8 cn0333
$ env | grep ^SLURM
SLURM_NODELIST=cn[0331,0333]
SLURM_NNODES=2
SLURM_JOBID=1917048
SLURM_NTASKS=20
SLURM_TASKS_PER_NODE=12,8
SLURM_JOB_ID=1917048
SLURM_SUBMIT_DIR=/gpfs/home/H76170
SLURM_NPROCS=20
SLURM_JOB_NODELIST=cn[0331,0333]
SLURM_JOB_CPUS_PER_NODE=12,8
SLURM_JOB_NUM_NODES=2
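Note that the dump above is the allocation environment that salloc exports; whether a given step actually requests binding shows up separately, in the per-task environment. A quick way to compare (a sketch, assuming this Slurm series exports the SLURM_CPU_BIND* variables into task environments):

$ srun env | grep ^SLURM_CPU_BIND | sort -u

An empty result would suggest no CPU binding was requested for the step itself.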
Are you sure that Slurm is actually "binding" us during this launch? If you just srun your get-allowed-cpu program, what does it show? I'm wondering if the binding just gets reflected in the allocation envars and Slurm is not actually binding the orteds.
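One way to separate the two cases (a sketch, assuming Linux /proc and the --cpu_bind option of this Slurm series) is to read the kernel's affinity mask in a bare job step, with no Open MPI involved:

$ # affinity as the kernel sees it; 0-23 on a 24-CPU node means "not bound"
$ srun bash -c 'grep Cpus_allowed_list /proc/self/status'
$ # now ask Slurm to bind the step explicitly and compare
$ srun --cpu_bind=cores bash -c 'grep Cpus_allowed_list /proc/self/status'

If only the second command shows narrowed masks, the binding lives in the allocation envars and nothing actually binds the tasks unless the launcher asks for it.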
Core binding with Slurm 2.3.3 + Open MPI 1.4.3 works well:

$ mpirun -V
mpirun (Open MPI) 1.4.3
Report bugs to http://www.open-mpi.org/community/help/
$ mpirun get-allowed-cpu-ompi 1
Launch 1 Task 01 of 20 (cn0331): 1
Launch 1 Task 03 of 20 (cn0331): 3
Launch 1 Task 04 of 20 (cn0331): 5
Launch 1 Task 02 of 20 (cn0331): 2
Launch 1 Task 09 of 20 (cn0331): 11
Launch 1 Task 11 of 20 (cn0331): 10
Launch 1 Task 12 of 20 (cn0333): 0
Launch 1 Task 13 of 20 (cn0333): 1
Launch 1 Task 14 of 20 (cn0333): 2
Launch 1 Task 15 of 20 (cn0333): 3
Launch 1 Task 16 of 20 (cn0333): 6
Launch 1 Task 17 of 20 (cn0333): 4
Launch 1 Task 18 of 20 (cn0333): 5
Launch 1 Task 19 of 20 (cn0333): 7
Launch 1 Task 00 of 20 (cn0331): 0
Launch 1 Task 05 of 20 (cn0331): 7
Launch 1 Task 06 of 20 (cn0331): 6
Launch 1 Task 07 of 20 (cn0331): 4
Launch 1 Task 08 of 20 (cn0331): 8
Launch 1 Task 10 of 20 (cn0331): 9

But it fails as soon as I switch to Open MPI 1.7a1r26338:

$ module load openmpi_1.7a1r26338
$ mpirun -V
mpirun (Open MPI) 1.7a1r26338
Report bugs to http://www.open-mpi.org/community/help/
$ unset OMPI_MCA_mtl OMPI_MCA_pml
$ mpirun get-allowed-cpu-ompi 1
Launch 1 Task 12 of 20 (cn0333): 0-23
Launch 1 Task 13 of 20 (cn0333): 0-23
Launch 1 Task 14 of 20 (cn0333): 0-23
Launch 1 Task 15 of 20 (cn0333): 0-23
Launch 1 Task 16 of 20 (cn0333): 0-23
Launch 1 Task 17 of 20 (cn0333): 0-23
Launch 1 Task 18 of 20 (cn0333): 0-23
Launch 1 Task 19 of 20 (cn0333): 0-23
Launch 1 Task 07 of 20 (cn0331): 0-23
Launch 1 Task 08 of 20 (cn0331): 0-23
Launch 1 Task 09 of 20 (cn0331): 0-23
Launch 1 Task 10 of 20 (cn0331): 0-23
Launch 1 Task 11 of 20 (cn0331): 0-23
Launch 1 Task 00 of 20 (cn0331): 0-23
Launch 1 Task 01 of 20 (cn0331): 0-23
Launch 1 Task 02 of 20 (cn0331): 0-23
Launch 1 Task 03 of 20 (cn0331): 0-23
Launch 1 Task 04 of 20 (cn0331): 0-23
Launch 1 Task 05 of 20 (cn0331): 0-23
Launch 1 Task 06 of 20 (cn0331): 0-23

Using srun in the Open MPI 1.4.3 environment fails with the following error:

Error obtaining unique transport key from ORTE (orte_precondition_transports not present in the environment).
[...]

In Open MPI 1.7a1r26338, the result of srun is the same as with mpirun:

$ module load openmpi_1.7a1r26338
$ srun get-allowed-cpu-ompi 1
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23

Regards,
--
Rémi Palancher
http://rezib.org
Attachments (gzip):
ompi_info_1.7a1r26338_error_binding.txt.gz
ompi_info_1.7a1r26338_psm_undefined_symbol.txt.gz
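For what it's worth, one more datapoint that may help isolate this (a sketch; --report-bindings exists in both series, though the bind-to syntax differs between them, e.g. --bind-to-core in 1.4.x vs --bind-to core in the 1.7 series) would be to have mpirun report the bindings it believes it is applying:

$ mpirun --report-bindings get-allowed-cpu-ompi 1

If the orteds report no binding at all under 1.7a1r26338, the regression is in how the launcher picks up the Slurm allocation, rather than in how a binding is applied once chosen.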