Couple of things:

1. please do send the output from ompi_info

2. please send the Slurm envars from your allocation - i.e., capture them after 
you do your salloc.
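
For example, grabbing something like this from inside the allocation would be 
enough (a rough sketch - the exact set of SLURM_* variables you see depends on 
your Slurm version and configuration):

$ salloc --qos=debug -N 2 -n 20
$ env | grep '^SLURM_'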

Are you sure that Slurm is actually "binding" us during this launch? If you 
just srun your get-allowed-cpu program, what does it show? I'm wondering if the 
binding is only reflected in the allocation envars and Slurm isn't actually 
binding the orteds.
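
A quick way to check (a rough diagnostic, assuming Linux compute nodes where 
/proc/self/status exposes a Cpus_allowed_list field) is to have srun report the 
kernel-level affinity of each launched task directly:

$ srun grep Cpus_allowed_list /proc/self/status

If every task still reports the full 0-23 range there, then Slurm is only 
advertising the allocation through the envars and isn't actually constraining 
the tasks.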


On Apr 27, 2012, at 8:41 AM, Rémi Palancher wrote:

> Hi there,
> 
> First, thank you for all your helpful answers!
> 
> On Mon, 2 Apr 2012 20:30:37 -0700, Ralph Castain <r...@open-mpi.org> wrote:
>> I'm afraid the 1.5 series doesn't offer any help in this regard. The
>> required changes only exist in the developers trunk, which will be
>> released as the 1.7 series in the not-too-distant future.
> 
> I've tested the same use case with 1.5.5 and I get exactly the same result 
> as with 1.4.5. I can confirm this version doesn't offer any help here.
> 
> I've also tested the latest available trunk snapshot, 1.7a1r26338, but it 
> seems to have two regressions:
> 
>  - when PSM is enabled, there is an undefined symbol error in mca_mtl_psm.so:
> 
> $ mpirun -n 1 get-allowed-cpu-ompi
> [cn0286:23252] mca: base: component_find: unable to open 
> /home/H76170/openmpi/1.7a1r26338/lib/openmpi/mca_mtl_psm: 
> /home/H76170/openmpi/1.7a1r26338/lib/openmpi/mca_mtl_psm.so: undefined 
> symbol: ompi_mtl_psm_imrecv (ignored)
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
> 
> Host:      cn0286
> Framework: mtl
> Component: psm
> --------------------------------------------------------------------------
> [cn0286:23252] mca: base: components_open: component pml / cm open function 
> failed
> --------------------------------------------------------------------------
> No available pml components were found!
> 
> This means that there are no components of this type installed on your
> system or all the components reported that they could not be used.
> 
> This is a fatal error; your MPI process is likely to abort.  Check the
> output of the "ompi_info" command and ensure that components of this
> type are available on your system.  You may also wish to check the
> value of the "component_path" MCA parameter and ensure that it has at
> least one directory that contains valid MCA components.
> --------------------------------------------------------------------------
> [cn0286:23252] PML cm cannot be selected
> 
>  - when PSM support is disabled (to avoid the previous error), binding to the 
> cores allocated by Slurm fails:
> 
> $ salloc --qos=debug -N 2 -n 20
> $ srun hostname | sort | uniq -c
>     12 cn0564
>      8 cn0565
> $ module load openmpi_1.7a1r26338
> $ unset OMPI_MCA_mtl OMPI_MCA_pml
> $ mpicc -o get-allowed-cpu-ompi get-allowed-cpu.c
> $ mpirun get-allowed-cpu-ompi
> Launch (null) Task 12 of 20 (cn0565): 0-23
> Launch (null) Task 13 of 20 (cn0565): 0-23
> Launch (null) Task 14 of 20 (cn0565): 0-23
> Launch (null) Task 15 of 20 (cn0565): 0-23
> Launch (null) Task 16 of 20 (cn0565): 0-23
> Launch (null) Task 17 of 20 (cn0565): 0-23
> Launch (null) Task 18 of 20 (cn0565): 0-23
> Launch (null) Task 19 of 20 (cn0565): 0-23
> Launch (null) Task 07 of 20 (cn0564): 0-23
> Launch (null) Task 08 of 20 (cn0564): 0-23
> Launch (null) Task 09 of 20 (cn0564): 0-23
> Launch (null) Task 10 of 20 (cn0564): 0-23
> Launch (null) Task 11 of 20 (cn0564): 0-23
> Launch (null) Task 00 of 20 (cn0564): 0-23
> Launch (null) Task 01 of 20 (cn0564): 0-23
> Launch (null) Task 02 of 20 (cn0564): 0-23
> Launch (null) Task 03 of 20 (cn0564): 0-23
> Launch (null) Task 04 of 20 (cn0564): 0-23
> Launch (null) Task 05 of 20 (cn0564): 0-23
> Launch (null) Task 06 of 20 (cn0564): 0-23
> 
> FYI, I am using Slurm 2.3.3.
> 
> Did I miss something with this trunk version?
> 
> Do you want me to send the corresponding generated config.log, the "ompi_info" 
> output, and the full "mpirun" output?
> 
> Regards,
> -- 
> Rémi Palancher
> http://rezib.org

