Okay, your tests confirmed my suspicions. Slurm isn't doing any binding at all 
- that's why your srun of get-allowed-cpu-ompi showed no bindings. I don't see 
anything in your commands that would tell Slurm to bind us to anything. All your 
salloc did was tell Slurm what to allocate - that doesn't imply binding.
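
If you actually wanted Slurm to do the binding, you would have to ask it explicitly. 
As a rough sketch (the option spelling may differ with your Slurm version, and it 
requires the task/affinity plugin to be enabled):

$ srun --cpu_bind=cores get-allowed-cpu-ompi 1

Something along those lines is what actually constrains each task to a core; the 
salloc by itself does not.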

You can get the trunk to bind by adding "--bind-to core" to your command line. That 
should yield the pattern you show from your 1.4.3 test.
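
As a rough sketch (same test program as yours; this is the trunk/1.7 option syntax, 
not the 1.4 one):

$ mpirun --bind-to core get-allowed-cpu-ompi 1

With that flag each task should report a single core (or a small set) instead of 0-23.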

Of more interest is why the 1.4.3 installation is binding at all. I suspect you 
have an MCA param set somewhere that tells us to bind-to-core - perhaps in the 
default MCA param file, or in your environment. It certainly wouldn't be doing 
that by default.
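
A quick way to check is to look in the usual places Open MPI reads MCA params from. 
The paths below are the standard user and system param files (adjust the install 
prefix to wherever your 1.4.3 tree lives); mpi_paffinity_alone is the usual suspect 
for binding in the 1.4.x series:

$ env | grep OMPI_MCA
$ cat $HOME/.openmpi/mca-params.conf
$ grep -v '^#' /path/to/openmpi-1.4.3/etc/openmpi-mca-params.conf

If something like mpi_paffinity_alone = 1 (or another processor-affinity setting) 
shows up in any of those, that would explain the 1.4.3 behavior.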


On May 2, 2012, at 8:49 AM, Rémi Palancher wrote:

> On Fri, 27 Apr 2012 08:56:15 -0600, Ralph Castain <r...@open-mpi.org> wrote:
>> Couple of things:
>> 
>> 1. please do send the output from ompi_info
> 
> You can find them attached to this email.
> 
>> 2. please send the slurm envars from your allocation - i.e., after
>> you do your salloc.
> 
> Here is an example:
> $ salloc -N 2 -n 20 --qos=debug
> salloc: Granted job allocation 1917048
> $ srun hostname | sort | uniq -c
>     12 cn0331
>      8 cn0333
> $ env | grep ^SLURM
> SLURM_NODELIST=cn[0331,0333]
> SLURM_NNODES=2
> SLURM_JOBID=1917048
> SLURM_NTASKS=20
> SLURM_TASKS_PER_NODE=12,8
> SLURM_JOB_ID=1917048
> SLURM_SUBMIT_DIR=/gpfs/home/H76170
> SLURM_NPROCS=20
> SLURM_JOB_NODELIST=cn[0331,0333]
> SLURM_JOB_CPUS_PER_NODE=12,8
> SLURM_JOB_NUM_NODES=2
> 
>> Are you sure that slurm is actually "binding" us during this launch?
>> If you just srun your get-allowed-cpu, what does it show? I'm
>> wondering if it just gets reflected in the allocation envar and not
>> actually binding the orteds.
> 
> Core binding with Slurm 2.3.3 + OpenMPI 1.4.3 works well:
> 
> $ mpirun -V
> mpirun (Open MPI) 1.4.3
> 
> Report bugs to http://www.open-mpi.org/community/help/
> $ mpirun get-allowed-cpu-ompi 1
> Launch 1 Task 01 of 20 (cn0331): 1
> Launch 1 Task 03 of 20 (cn0331): 3
> Launch 1 Task 04 of 20 (cn0331): 5
> Launch 1 Task 02 of 20 (cn0331): 2
> Launch 1 Task 09 of 20 (cn0331): 11
> Launch 1 Task 11 of 20 (cn0331): 10
> Launch 1 Task 12 of 20 (cn0333): 0
> Launch 1 Task 13 of 20 (cn0333): 1
> Launch 1 Task 14 of 20 (cn0333): 2
> Launch 1 Task 15 of 20 (cn0333): 3
> Launch 1 Task 16 of 20 (cn0333): 6
> Launch 1 Task 17 of 20 (cn0333): 4
> Launch 1 Task 18 of 20 (cn0333): 5
> Launch 1 Task 19 of 20 (cn0333): 7
> Launch 1 Task 00 of 20 (cn0331): 0
> Launch 1 Task 05 of 20 (cn0331): 7
> Launch 1 Task 06 of 20 (cn0331): 6
> Launch 1 Task 07 of 20 (cn0331): 4
> Launch 1 Task 08 of 20 (cn0331): 8
> Launch 1 Task 10 of 20 (cn0331): 9
> 
> But it fails as soon as I switch to OpenMPI 1.7a1r26338:
> 
> $ module load openmpi_1.7a1r26338
> $ mpirun -V
> mpirun (Open MPI) 1.7a1r26338
> 
> Report bugs to http://www.open-mpi.org/community/help/
> $ unset OMPI_MCA_mtl OMPI_MCA_pml
> $ mpirun get-allowed-cpu-ompi 1
> Launch 1 Task 12 of 20 (cn0333): 0-23
> Launch 1 Task 13 of 20 (cn0333): 0-23
> Launch 1 Task 14 of 20 (cn0333): 0-23
> Launch 1 Task 15 of 20 (cn0333): 0-23
> Launch 1 Task 16 of 20 (cn0333): 0-23
> Launch 1 Task 17 of 20 (cn0333): 0-23
> Launch 1 Task 18 of 20 (cn0333): 0-23
> Launch 1 Task 19 of 20 (cn0333): 0-23
> Launch 1 Task 07 of 20 (cn0331): 0-23
> Launch 1 Task 08 of 20 (cn0331): 0-23
> Launch 1 Task 09 of 20 (cn0331): 0-23
> Launch 1 Task 10 of 20 (cn0331): 0-23
> Launch 1 Task 11 of 20 (cn0331): 0-23
> Launch 1 Task 00 of 20 (cn0331): 0-23
> Launch 1 Task 01 of 20 (cn0331): 0-23
> Launch 1 Task 02 of 20 (cn0331): 0-23
> Launch 1 Task 03 of 20 (cn0331): 0-23
> Launch 1 Task 04 of 20 (cn0331): 0-23
> Launch 1 Task 05 of 20 (cn0331): 0-23
> Launch 1 Task 06 of 20 (cn0331): 0-23
> 
> Using srun in the OpenMPI 1.4.3 environment fails with the following error:
> 
> Error obtaining unique transport key from ORTE (orte_precondition_transports 
> not present in
> the environment).
> [...]
> 
> In OpenMPI 1.7a1r26338, the result of srun is the same as with mpirun:
> 
> $ module load openmpi_1.7a1r26338
> $ srun get-allowed-cpu-ompi 1
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0333): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> Launch 1 Task 00 of 01 (cn0331): 0-23
> 
> Regards,
> -- 
> Rémi Palancher
> http://rezib.org