Okay, your tests confirmed my suspicion: Slurm isn't doing any binding at all - that's why your srun of get-allowed-cpu-ompi showed no bindings. I don't see anything in your commands that would tell Slurm to bind us to anything. All your salloc did was tell Slurm what to allocate - an allocation by itself doesn't imply binding.
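If you want to double-check what Slurm is actually handing each task, something along these lines should show it (just a sketch - get-allowed-cpu-ompi is your own test program, and the --cpu_bind option assumes your Slurm build has the task/affinity plugin enabled):

$ srun grep Cpus_allowed_list /proc/self/status
  (with no binding requested, every task should report the full 0-23 range)
$ srun --cpu_bind=cores grep Cpus_allowed_list /proc/self/status
  (here Slurm itself does the binding, so each task should report a narrower set)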
You can get the trunk to bind by adding "--bind-to core" to your command line. That should yield the pattern you show from your 1.4.3 test. Of more interest is why the 1.4.3 installation is binding at all. I suspect you have an MCA param set somewhere that tells us to bind-to-core - perhaps in the default MCA param file, or in your environment. It certainly wouldn't be doing that by default. A quick way to hunt for such a param is sketched at the end of this message.

On May 2, 2012, at 8:49 AM, Rémi Palancher wrote:

> On Fri, 27 Apr 2012 08:56:15 -0600, Ralph Castain <r...@open-mpi.org> wrote:
>> Couple of things:
>>
>> 1. please do send the output from ompi_info
>
> You can find them attached to this email.
>
>> 2. please send the slurm envars from your allocation - i.e., after
>> you do your salloc.
>
> Here is an example:
> $ salloc -N 2 -n 20 --qos=debug
> salloc: Granted job allocation 1917048
> $ srun hostname | sort | uniq -c
>      12 cn0331
>       8 cn0333
> $ env | grep ^SLURM
> SLURM_NODELIST=cn[0331,0333]
> SLURM_NNODES=2
> SLURM_JOBID=1917048
> SLURM_NTASKS=20
> SLURM_TASKS_PER_NODE=12,8
> SLURM_JOB_ID=1917048
> SLURM_SUBMIT_DIR=/gpfs/home/H76170
> SLURM_NPROCS=20
> SLURM_JOB_NODELIST=cn[0331,0333]
> SLURM_JOB_CPUS_PER_NODE=12,8
> SLURM_JOB_NUM_NODES=2
>
>> Are you sure that slurm is actually "binding" us during this launch?
>> If you just srun your get-allowed-cpu, what does it show? I'm
>> wondering if it just gets reflected in the allocation envar and not
>> actually binding the orteds.
>
> Core binding with Slurm 2.3.3 + OpenMPI 1.4.3 works well:
>
> $ mpirun -V
> mpirun (Open MPI) 1.4.3
>
> Report bugs to http://www.open-mpi.org/community/help/
> $ mpirun get-allowed-cpu-ompi 1
> Launch 1 Task 01 of 20 (cn0331): 1
> Launch 1 Task 03 of 20 (cn0331): 3
> Launch 1 Task 04 of 20 (cn0331): 5
> Launch 1 Task 02 of 20 (cn0331): 2
> Launch 1 Task 09 of 20 (cn0331): 11
> Launch 1 Task 11 of 20 (cn0331): 10
> Launch 1 Task 12 of 20 (cn0333): 0
> Launch 1 Task 13 of 20 (cn0333): 1
> Launch 1 Task 14 of 20 (cn0333): 2
> Launch 1 Task 15 of 20 (cn0333): 3
> Launch 1 Task 16 of 20 (cn0333): 6
> Launch 1 Task 17 of 20 (cn0333): 4
> Launch 1 Task 18 of 20 (cn0333): 5
> Launch 1 Task 19 of 20 (cn0333): 7
> Launch 1 Task 00 of 20 (cn0331): 0
> Launch 1 Task 05 of 20 (cn0331): 7
> Launch 1 Task 06 of 20 (cn0331): 6
> Launch 1 Task 07 of 20 (cn0331): 4
> Launch 1 Task 08 of 20 (cn0331): 8
> Launch 1 Task 10 of 20 (cn0331): 9
>
> But it fails as soon as I switch to OpenMPI 1.7a1r26338:
>
> $ module load openmpi_1.7a1r26338
> $ mpirun -V
> mpirun (Open MPI) 1.7a1r26338
>
> Report bugs to http://www.open-mpi.org/community/help/
> $ unset OMPI_MCA_mtl OMPI_MCA_pml
> $ mpirun get-allowed-cpu-ompi 1
> Launch 1 Task 12 of 20 (cn0333): 0-23
> Launch 1 Task 13 of 20 (cn0333): 0-23
> Launch 1 Task 14 of 20 (cn0333): 0-23
> Launch 1 Task 15 of 20 (cn0333): 0-23
> Launch 1 Task 16 of 20 (cn0333): 0-23
> Launch 1 Task 17 of 20 (cn0333): 0-23
> Launch 1 Task 18 of 20 (cn0333): 0-23
> Launch 1 Task 19 of 20 (cn0333): 0-23
> Launch 1 Task 07 of 20 (cn0331): 0-23
> Launch 1 Task 08 of 20 (cn0331): 0-23
> Launch 1 Task 09 of 20 (cn0331): 0-23
> Launch 1 Task 10 of 20 (cn0331): 0-23
> Launch 1 Task 11 of 20 (cn0331): 0-23
> Launch 1 Task 00 of 20 (cn0331): 0-23
> Launch 1 Task 01 of 20 (cn0331): 0-23
> Launch 1 Task 02 of 20 (cn0331): 0-23
> Launch 1 Task 03 of 20 (cn0331): 0-23
> Launch 1 Task 04 of 20 (cn0331): 0-23
> Launch 1 Task 05 of 20 (cn0331): 0-23
> Launch 1 Task 06 of 20 (cn0331): 0-23
>
> Using srun fails in OpenMPI 1.4.3 environment with the following error:
>
> Error obtaining unique transport key from ORTE (orte_precondition_transports
> not present in the environment).
> [...]
>
> In OpenMPI 1.7a1r26338, the result of srun is the same as with mpirun:
>
> $ module load openmpi_1.7a1r26338
> $ srun get-allowed-cpu-ompi 1
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
>
> Regards,
> --
> Rémi Palancher
> http://rezib.org
> <ompi_info_1.7a1r26338_error_binding.txt.gz><ompi_info_1.7a1r26338_psm_undefined_symbol.txt.gz>
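P.S. Regarding the MCA param suspicion above, here is a quick way to hunt for it (a sketch only - the file locations depend on your install prefix, and the exact param names differ between 1.4 and the trunk):

$ env | grep OMPI_MCA
  (anything binding-related exported in your environment?)
$ grep -i bind $HOME/.openmpi/mca-params.conf
  (your per-user default MCA param file, if you have one)
$ grep -i bind <1.4.3-prefix>/etc/openmpi-mca-params.conf
  (the system-wide default MCA param file for the 1.4.3 install)
$ ompi_info --param all all | grep -i bind
  (shows the current value and the data source for each param)

And on the trunk, binding has to be requested explicitly:

$ mpirun --bind-to core get-allowed-cpu-ompi 1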