Re: [OMPI users] Error with multiple MPI runs inside one Slurm allocation (with QLogic PSM)
Hi there,

First, thank you for all your helpful answers!

On Mon, 2 Apr 2012 20:30:37 -0700, Ralph Castain wrote:
> I'm afraid the 1.5 series doesn't offer any help in this regard. The
> required changes only exist in the developers trunk, which will be
> released as the 1.7 series in the not-too-distant future.

I've tested the same use case with 1.5.5 and I obtain exactly the same result as with 1.4.5. I confirm this version doesn't offer any help on this.

I've also tested the last available snapshot 1.7a1r26338 of the trunk, but it seems to have 2 regressions:

- when PSM is enabled, an undefined symbol error within mca_mtl_psm.so:

$ mpirun -n 1 get-allowed-cpu-ompi
[cn0286:23252] mca: base: component_find: unable to open /home/H76170/openmpi/1.7a1r26338/lib/openmpi/mca_mtl_psm: /home/H76170/openmpi/1.7a1r26338/lib/openmpi/mca_mtl_psm.so: undefined symbol: ompi_mtl_psm_imrecv (ignored)
--
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host:      cn0286
Framework: mtl
Component: psm
--
[cn0286:23252] mca: base: components_open: component pml / cm open function failed
--
No available pml components were found!

This means that there are no components of this type installed on your
system or all the components reported that they could not be used.

This is a fatal error; your MPI process is likely to abort. Check the
output of the "ompi_info" command and ensure that components of this
type are available on your system. You may also wish to check the
value of the "component_path" MCA parameter and ensure that it has at
least one directory that contains valid MCA components.
--
[cn0286:23252] PML cm cannot be selected

- when PSM support is disabled (in order to avoid the previous error), binding to the cores allocated by Slurm fails:

$ salloc --qos=debug -N 2 -n 20
$ srun hostname | sort | uniq -c
     12 cn0564
      8 cn0565
$ module load openmpi_1.7a1r26338
$ unset OMPI_MCA_mtl OMPI_MCA_pml
$ mpicc -o get-allowed-cpu-ompi get-allowed-cpu.c
$ mpirun get-allowed-cpu-ompi
Launch (null) Task 12 of 20 (cn0565): 0-23
Launch (null) Task 13 of 20 (cn0565): 0-23
Launch (null) Task 14 of 20 (cn0565): 0-23
Launch (null) Task 15 of 20 (cn0565): 0-23
Launch (null) Task 16 of 20 (cn0565): 0-23
Launch (null) Task 17 of 20 (cn0565): 0-23
Launch (null) Task 18 of 20 (cn0565): 0-23
Launch (null) Task 19 of 20 (cn0565): 0-23
Launch (null) Task 07 of 20 (cn0564): 0-23
Launch (null) Task 08 of 20 (cn0564): 0-23
Launch (null) Task 09 of 20 (cn0564): 0-23
Launch (null) Task 10 of 20 (cn0564): 0-23
Launch (null) Task 11 of 20 (cn0564): 0-23
Launch (null) Task 00 of 20 (cn0564): 0-23
Launch (null) Task 01 of 20 (cn0564): 0-23
Launch (null) Task 02 of 20 (cn0564): 0-23
Launch (null) Task 03 of 20 (cn0564): 0-23
Launch (null) Task 04 of 20 (cn0564): 0-23
Launch (null) Task 05 of 20 (cn0564): 0-23
Launch (null) Task 06 of 20 (cn0564): 0-23

FYI, I am using Slurm 2.3.3.

Did I miss something with this trunk version?

Do you want me to send the corresponding generated config.log and the "ompi_info" and "mpirun ompi full" outputs?

Regards,
--
Rémi Palancher
http://rezib.org
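(The source of get-allowed-cpu.c is not included in the thread. For readers who want to reproduce the test, the listing below is only a rough sketch of what such a program might look like: it assumes Linux and sched_getaffinity(), and the leading "Launch" field, including which environment variable it reads, is a guess based on the "(null)" visible in the output above.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, hostlen, i, start = -1;
    char host[MPI_MAX_PROCESSOR_NAME];
    char cpus[4096] = "";
    char buf[64];
    cpu_set_t mask;
    const char *launch;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &hostlen);

    /* Ask the kernel which CPUs this process is allowed to run on. */
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);

    /* Render the mask as a compact range list such as "0-23" or "0-5,12-17". */
    for (i = 0; i <= CPU_SETSIZE; i++) {
        int allowed = (i < CPU_SETSIZE) && CPU_ISSET(i, &mask);
        if (allowed && start < 0) {
            start = i;
        } else if (!allowed && start >= 0) {
            if (i - 1 == start)
                snprintf(buf, sizeof(buf), "%s%d", *cpus ? "," : "", start);
            else
                snprintf(buf, sizeof(buf), "%s%d-%d", *cpus ? "," : "", start, i - 1);
            strcat(cpus, buf);
            start = -1;
        }
    }

    /* Guess: the real program prints a launch identifier taken from an
     * srun-only variable such as SLURM_LAUNCH_NODE_IPADDR, which would
     * explain why it shows up as "(null)" when started by mpirun. */
    launch = getenv("SLURM_LAUNCH_NODE_IPADDR");
    printf("Launch %s Task %02d of %02d (%s): %s\n",
           launch ? launch : "(null)", rank, size, host, cpus);

    MPI_Finalize();
    return 0;
}

It would be built and run inside the allocation exactly as in the transcript above: mpicc -o get-allowed-cpu-ompi get-allowed-cpu.c, then mpirun get-allowed-cpu-ompi.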
Re: [OMPI users] Error with multiple MPI runs inside one Slurm allocation (with QLogic PSM)
Couple of things:

1. please do send the output from ompi_info

2. please send the slurm envars from your allocation - i.e., after you do your salloc.

Are you sure that slurm is actually "binding" us during this launch? If you just srun your get-allowed-cpu, what does it show? I'm wondering if it just gets reflected in the allocation envar and not actually binding the orteds.

On Apr 27, 2012, at 8:41 AM, Rémi Palancher wrote:

> Hi there,
>
> First, thank you for all your helpful answers!
>
> On Mon, 2 Apr 2012 20:30:37 -0700, Ralph Castain wrote:
>> I'm afraid the 1.5 series doesn't offer any help in this regard. The
>> required changes only exist in the developers trunk, which will be
>> released as the 1.7 series in the not-too-distant future.
>
> I've tested the same use case with 1.5.5 and I obtain exactly the same result
> as with 1.4.5. I confirm this version doesn't offer any help on this.
>
> I've also tested the last available snapshot 1.7a1r26338 of the trunk, but it
> seems to have 2 regressions:
>
> - when PSM is enabled, an undefined symbol error within mca_mtl_psm.so:
>
> $ mpirun -n 1 get-allowed-cpu-ompi
> [cn0286:23252] mca: base: component_find: unable to open
> /home/H76170/openmpi/1.7a1r26338/lib/openmpi/mca_mtl_psm:
> /home/H76170/openmpi/1.7a1r26338/lib/openmpi/mca_mtl_psm.so: undefined
> symbol: ompi_mtl_psm_imrecv (ignored)
> --
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded). Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:      cn0286
> Framework: mtl
> Component: psm
> --
> [cn0286:23252] mca: base: components_open: component pml / cm open function failed
> --
> No available pml components were found!
>
> This means that there are no components of this type installed on your
> system or all the components reported that they could not be used.
>
> This is a fatal error; your MPI process is likely to abort. Check the
> output of the "ompi_info" command and ensure that components of this
> type are available on your system. You may also wish to check the
> value of the "component_path" MCA parameter and ensure that it has at
> least one directory that contains valid MCA components.
> --
> [cn0286:23252] PML cm cannot be selected
>
> - when PSM support is disabled (in order to avoid the previous error), binding to
> the cores allocated by Slurm fails:
>
> $ salloc --qos=debug -N 2 -n 20
> $ srun hostname | sort | uniq -c
>      12 cn0564
>       8 cn0565
> $ module load openmpi_1.7a1r26338
> $ unset OMPI_MCA_mtl OMPI_MCA_pml
> $ mpicc -o get-allowed-cpu-ompi get-allowed-cpu.c
> $ mpirun get-allowed-cpu-ompi
> Launch (null) Task 12 of 20 (cn0565): 0-23
> Launch (null) Task 13 of 20 (cn0565): 0-23
> Launch (null) Task 14 of 20 (cn0565): 0-23
> Launch (null) Task 15 of 20 (cn0565): 0-23
> Launch (null) Task 16 of 20 (cn0565): 0-23
> Launch (null) Task 17 of 20 (cn0565): 0-23
> Launch (null) Task 18 of 20 (cn0565): 0-23
> Launch (null) Task 19 of 20 (cn0565): 0-23
> Launch (null) Task 07 of 20 (cn0564): 0-23
> Launch (null) Task 08 of 20 (cn0564): 0-23
> Launch (null) Task 09 of 20 (cn0564): 0-23
> Launch (null) Task 10 of 20 (cn0564): 0-23
> Launch (null) Task 11 of 20 (cn0564): 0-23
> Launch (null) Task 00 of 20 (cn0564): 0-23
> Launch (null) Task 01 of 20 (cn0564): 0-23
> Launch (null) Task 02 of 20 (cn0564): 0-23
> Launch (null) Task 03 of 20 (cn0564): 0-23
> Launch (null) Task 04 of 20 (cn0564): 0-23
> Launch (null) Task 05 of 20 (cn0564): 0-23
> Launch (null) Task 06 of 20 (cn0564): 0-23
>
> FYI, I am using Slurm 2.3.3.
>
> Did I miss something with this trunk version?
>
> Do you want me to send the corresponding generated config.log and the
> "ompi_info" and "mpirun ompi full" outputs?
>
> Regards,
> --
> Rémi Palancher
> http://rezib.org
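(One way to capture the Slurm environment variables Ralph asks about, either from the salloc shell or from within each launched task, is a tiny helper along these lines. This is only a sketch, not something posted in the thread; running env and filtering for SLURM_ in the allocation shell gives the same information.)

#include <stdio.h>
#include <string.h>

extern char **environ;   /* POSIX: the process environment */

int main(void)
{
    char **e;

    /* Print every SLURM_* variable visible to this process. */
    for (e = environ; *e != NULL; e++) {
        if (strncmp(*e, "SLURM_", 6) == 0)
            printf("%s\n", *e);
    }
    return 0;
}

Compiled as, say, print-slurm-env (a hypothetical name), it can be run directly in the salloc shell to show the allocation envars, or via srun to show what each task actually inherits.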
[OMPI users] Multithreading applications with OMPI 1.7
Hi,

we've tried to use a multithreaded application with a more recent trunk version (March 21) of OpenMPI. We need to use this version because of CUDA RDMA support. OpenMPI was binding all the threads to a single core, which is undesirable.

In OpenMPI 1.5 there was an option --cpus-per-rank, which should have helped in this case, or --bind-to-none. Unfortunately, these options are now gone and I couldn't figure out how to make it work with the newest version.

Can anyone offer any hints on this?

Thanks,
Jens.
Re: [OMPI users] Multithreading applications with OMPI 1.7
On Apr 27, 2012, at 5:20 PM, Jens Glaser wrote:

> Hi,
>
> we've tried to use a multithreaded application with a more recent trunk
> version (March 21) of OpenMPI. We need to use this version because of CUDA
> RDMA support. OpenMPI was binding all the threads to a single core, which is
> undesirable.
> In OpenMPI 1.5 there was an option --cpus-per-rank, which should have helped
> in this case, or --bind-to-none.

--cpus-per-rank is turned "off" at the moment - needs to be updated

--bind-to none is the appropriate syntax - that should be the default setting, unless you are specifying a binding policy somewhere in an MCA param.

see mpirun -h for the full list of options

> Unfortunately, these options are now gone and I couldn't figure out how to
> make it work with the newest version.
>
> Can anyone offer any hints on this?
>
> Thanks,
> Jens.
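(For readers hitting the same binding issue with threaded codes: a quick way to check what a given binding policy does to a hybrid application is a small MPI+OpenMP test like the sketch below. It is not from the thread, assumes Linux and sched_getaffinity(), and uses hypothetical file names. Each thread reports how many CPUs it is allowed to run on: with "mpirun --bind-to none" every thread should see all cores of the node, while binding the rank to a single core would show 1.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request threaded MPI; FUNNELED is enough if only the master thread
     * makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        cpu_set_t mask;
        int i, allowed = 0;

        /* Affinity mask of the calling thread. */
        CPU_ZERO(&mask);
        sched_getaffinity(0, sizeof(mask), &mask);
        for (i = 0; i < CPU_SETSIZE; i++)
            if (CPU_ISSET(i, &mask))
                allowed++;

        printf("rank %d thread %d: may run on %d CPU(s)\n",
               rank, omp_get_thread_num(), allowed);
    }

    MPI_Finalize();
    return 0;
}

Built and launched roughly like this inside the allocation:

$ mpicc -fopenmp -o thread-affinity thread-affinity.c
$ OMP_NUM_THREADS=4 mpirun --bind-to none -np 2 ./thread-affinity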