On Fri, 27 Apr 2012 08:56:15 -0600, Ralph Castain <r...@open-mpi.org> wrote:
> Couple of things:
>
> 1. please do send the output from ompi_info

The ompi_info outputs are attached to this email.

> 2. please send the slurm envars from your allocation - i.e., after
> you do your salloc.

Here is an example:
$ salloc -N 2 -n 20 --qos=debug
salloc: Granted job allocation 1917048
$ srun hostname | sort | uniq -c
     12 cn0331
      8 cn0333
$ env | grep ^SLURM
SLURM_NODELIST=cn[0331,0333]
SLURM_NNODES=2
SLURM_JOBID=1917048
SLURM_NTASKS=20
SLURM_TASKS_PER_NODE=12,8
SLURM_JOB_ID=1917048
SLURM_SUBMIT_DIR=/gpfs/home/H76170
SLURM_NPROCS=20
SLURM_JOB_NODELIST=cn[0331,0333]
SLURM_JOB_CPUS_PER_NODE=12,8
SLURM_JOB_NUM_NODES=2
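
As a cross-check on the variables above, here is a minimal sketch that expands SLURM_TASKS_PER_NODE and compares the total with SLURM_NTASKS. The function name expand_tasks_per_node is hypothetical, and the sketch assumes the usual Slurm notation in which "2(x3)" means three consecutive nodes with 2 tasks each.

```python
import re


def expand_tasks_per_node(spec):
    """Expand a SLURM_TASKS_PER_NODE value such as "12,8" or "2(x3),1".

    Returns one task count per node, in node order.
    """
    counts = []
    for item in spec.split(","):
        # Each entry is either "N" or "N(xR)" (N tasks on R consecutive nodes).
        m = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", item.strip())
        if m is None:
            raise ValueError(f"unrecognised entry: {item!r}")
        tasks = int(m.group(1))
        repeat = int(m.group(2)) if m.group(2) else 1
        counts.extend([tasks] * repeat)
    return counts


# Values taken from the allocation shown above.
counts = expand_tasks_per_node("12,8")
print(counts, sum(counts))
```

For this allocation the per-node counts add up to the 20 tasks requested with -n 20, which also matches the srun hostname output (12 on cn0331, 8 on cn0333).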

> Are you sure that slurm is actually "binding" us during this launch?
> If you just srun your get-allowed-cpu, what does it show? I'm
> wondering if it just gets reflected in the allocation envar and not
> actually binding the orteds.

Core binding with Slurm 2.3.3 + OpenMPI 1.4.3 works well:

$ mpirun -V
mpirun (Open MPI) 1.4.3

Report bugs to http://www.open-mpi.org/community/help/
$ mpirun get-allowed-cpu-ompi 1
Launch 1 Task 01 of 20 (cn0331): 1
Launch 1 Task 03 of 20 (cn0331): 3
Launch 1 Task 04 of 20 (cn0331): 5
Launch 1 Task 02 of 20 (cn0331): 2
Launch 1 Task 09 of 20 (cn0331): 11
Launch 1 Task 11 of 20 (cn0331): 10
Launch 1 Task 12 of 20 (cn0333): 0
Launch 1 Task 13 of 20 (cn0333): 1
Launch 1 Task 14 of 20 (cn0333): 2
Launch 1 Task 15 of 20 (cn0333): 3
Launch 1 Task 16 of 20 (cn0333): 6
Launch 1 Task 17 of 20 (cn0333): 4
Launch 1 Task 18 of 20 (cn0333): 5
Launch 1 Task 19 of 20 (cn0333): 7
Launch 1 Task 00 of 20 (cn0331): 0
Launch 1 Task 05 of 20 (cn0331): 7
Launch 1 Task 06 of 20 (cn0331): 6
Launch 1 Task 07 of 20 (cn0331): 4
Launch 1 Task 08 of 20 (cn0331): 8
Launch 1 Task 10 of 20 (cn0331): 9
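
For readers without the get-allowed-cpu-ompi test program (its source is not included in this message), the value after the colon can be approximated with a small Linux-only sketch. This is an assumption about what such a tool reports, not the actual program: it queries the CPU affinity mask of the calling process and prints it as a compact range list. The name format_cpu_list is hypothetical.

```python
import os


def format_cpu_list(cpus):
    """Render CPU ids as a compact range list, e.g. {0, 1, 2, 5} -> "0-2,5"."""
    ranges = []
    start = prev = None
    for cpu in sorted(cpus):
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu  # extend the current run
        else:
            ranges.append((start, prev))
            start = prev = cpu
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


if __name__ == "__main__":
    # os.sched_getaffinity exists on Linux only; 0 means "this process".
    allowed = os.sched_getaffinity(0)
    print(f"{os.uname().nodename}: {format_cpu_list(allowed)}")
```

A process bound to a single core prints one id; an unbound process on a 24-core node prints 0-23.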

But it fails as soon as I switch to OpenMPI 1.7a1r26338:

$ module load openmpi_1.7a1r26338
$ mpirun -V
mpirun (Open MPI) 1.7a1r26338

Report bugs to http://www.open-mpi.org/community/help/
$ unset OMPI_MCA_mtl OMPI_MCA_pml
$ mpirun get-allowed-cpu-ompi 1
Launch 1 Task 12 of 20 (cn0333): 0-23
Launch 1 Task 13 of 20 (cn0333): 0-23
Launch 1 Task 14 of 20 (cn0333): 0-23
Launch 1 Task 15 of 20 (cn0333): 0-23
Launch 1 Task 16 of 20 (cn0333): 0-23
Launch 1 Task 17 of 20 (cn0333): 0-23
Launch 1 Task 18 of 20 (cn0333): 0-23
Launch 1 Task 19 of 20 (cn0333): 0-23
Launch 1 Task 07 of 20 (cn0331): 0-23
Launch 1 Task 08 of 20 (cn0331): 0-23
Launch 1 Task 09 of 20 (cn0331): 0-23
Launch 1 Task 10 of 20 (cn0331): 0-23
Launch 1 Task 11 of 20 (cn0331): 0-23
Launch 1 Task 00 of 20 (cn0331): 0-23
Launch 1 Task 01 of 20 (cn0331): 0-23
Launch 1 Task 02 of 20 (cn0331): 0-23
Launch 1 Task 03 of 20 (cn0331): 0-23
Launch 1 Task 04 of 20 (cn0331): 0-23
Launch 1 Task 05 of 20 (cn0331): 0-23
Launch 1 Task 06 of 20 (cn0331): 0-23
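
The difference between the two runs can also be checked mechanically. The sketch below is only an illustration: it assumes the "Launch N Task T of M (node): cpus" line format shown in the outputs above, and the names cpus_in and looks_bound are hypothetical. It treats a run as bound when every task is allowed on exactly one CPU and no two tasks on the same node share that CPU.

```python
import re
from collections import defaultdict

# Matches lines such as "Launch 1 Task 01 of 20 (cn0331): 1".
LINE_RE = re.compile(r"Launch \d+ Task (\d+) of (\d+) \((\S+)\): (\S+)")


def cpus_in(spec):
    """Expand a CPU list such as "3" or "0-23" or "0-2,5" into a set of ids."""
    cpus = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def looks_bound(report):
    """True if each task has exactly one CPU and CPUs are unique per node."""
    seen = defaultdict(set)
    for line in report.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip lines that are not task reports
        node, cpus = m.group(3), cpus_in(m.group(4))
        if len(cpus) != 1 or cpus & seen[node]:
            return False
        seen[node] |= cpus
    return bool(seen)  # an empty report is not evidence of binding


print(looks_bound("Launch 1 Task 12 of 20 (cn0333): 0"))
print(looks_bound("Launch 1 Task 12 of 20 (cn0333): 0-23"))
```

Under this criterion the 1.4.3 output qualifies as bound, while the 1.7a1r26338 output does not, since every task there reports 0-23.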

In the OpenMPI 1.4.3 environment, srun fails with the following error:

Error obtaining unique transport key from ORTE (orte_precondition_transports not present in
the environment).
[...]

With OpenMPI 1.7a1r26338, srun shows the same lack of binding as mpirun
(note that each task also reports itself as task 00 of 01):

$ module load openmpi_1.7a1r26338
$ srun get-allowed-cpu-ompi 1
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0333): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23
Launch 1 Task 00 of 01 (cn0331): 0-23

Regards,
--
Rémi Palancher
http://rezib.org

Attachment: ompi_info_1.7a1r26338_error_binding.txt.gz
Description: GNU Zip compressed data

Attachment: ompi_info_1.7a1r26338_psm_undefined_symbol.txt.gz
Description: GNU Zip compressed data
