[OMPI users] Orted path with module manager on cluster
Hi all,

I have built OpenMPI 1.10.2 with RoCE network support on our test cluster. On the cluster we use lmod to manage paths to different versions of software. The problem is that I receive an "orted: command not found" message, because the path to the orted binary is not exported to the other nodes, where my run is launched via a non-interactive ssh connection.

I temporarily worked around the problem by exporting PATH with the correct path to orted in my .bashrc file, but this is obviously not a solution.

Any idea how I can fix it?

Thank you.

Davide
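For what it's worth, this kind of problem can usually be confirmed by checking what a non-interactive shell on a remote node actually sees. A minimal sketch, where the node name is a placeholder:

  $ ssh node042 'echo $PATH'    # non-interactive shell: the lmod/module init may not run
  $ ssh node042 'which orted'   # prints nothing if orted is not on the remote PATH

If orted does not resolve here, the daemons that mpirun starts over ssh will fail the same way.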
[OMPI users] Pass RoCE flags to srun under SLURM
Hi all,

In our cluster the nodes are interconnected with RoCE and I want to set up OpenMPI to run on it via SLURM. I initially compiled OpenMPI 1.10.2 only with IB verbs support and I have no problem making it run over RoCE. Then I successfully built it with SLURM support as follows:

./configure --with-slurm --with-pmi=/usr/scheduler/slurm --with-verbs --with-hwloc

The problem is that I cannot get it to use the RoCE network when I'm using srun. I also tried to export the OpenMPI runtime options, but I still cannot correctly initialize the network:

$ echo $OMPI_MCA_btl
openib,self,sm
$ echo $OMPI_MCA_btl_openib_cpc_include
rdmacm
$ srun -n 2 --mpi=pmi2 ./osu_latency
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:      test-vmp1245
  Local device:    mlx4_0
  Local port:      2
  CPCs attempted:  udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:      test-vmp1244
  Local device:    mlx4_0
  Local port:      2
  CPCs attempted:  udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[27,4],0]) is on host: test-vmp1244
  Process 2 ([[27,4],1]) is on host: test-vmp1245
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another. This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used. Your MPI job will now abort.

You may wish to try to narrow down the problem:

* Check the output of ompi_info to see which BTL/MTL plugins are
  available.
* Run your application with MPI_THREAD_SINGLE.
* Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
  if using MTL-based communications) to see exactly which
  communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[test-vmp1245:3603] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: test-vmp1244: task 0: Exited with exit code 1
srun: error: test-vmp1245: task 1: Exited with exit code 1

Any suggestion?
Thanks!

Davide
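Since the "CPCs attempted: udcm" lines show that the rdmacm setting never reached the openib BTL, a useful first check is whether the OMPI_MCA_* variables actually arrive in the tasks that srun launches, and then to raise BTL verbosity as the error text itself suggests. A minimal sketch, reusing the flags and binary from the session above:

  $ srun -n 2 --mpi=pmi2 env | grep OMPI_MCA_btl   # what the launched tasks actually see
  $ export OMPI_MCA_btl_base_verbose=100
  $ srun -n 2 --mpi=pmi2 ./osu_latency             # verbose output shows which CPCs are considered per port

If the cpc_include variable shows up as udcm (or not at all) in the task environment, that points at the shell environment at submission time rather than at the Open MPI build.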
Re: [OMPI users] Orted path with module manager on cluster
That did the trick! Thank you guys.

Davide

On Fri, 2016-03-04 at 08:40 +0900, Gilles Gouaillardet wrote:
> Davide,
>
> you can invoke `which mpirun` instead of mpirun, or mpirun --prefix=...
> another option is to rebuild OpenMPI with --enable-mpirun-prefix-by-default
>
> Cheers,
>
> Gilles
>
> On 3/4/2016 7:22 AM, Davide Vanzo wrote:
> > Hi all,
> > I have built OpenMPI 1.10.2 with RoCE network support on our test
> > cluster. On the cluster we use lmod to manage paths to different
> > versions of software. The problem is that I receive an
> > "orted: command not found" message, because the path to
> > the orted binary is not exported to the other nodes, where my run is
> > launched via a non-interactive ssh connection.
> > I temporarily worked around the problem by exporting PATH with the
> > correct path to orted in my .bashrc file, but this is obviously not
> > a solution.
> > Any idea how I can fix it?
> >
> > Thank you.
> >
> > Davide
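Spelled out, the three suggestions look roughly like this (a sketch only; the install prefix, process count, and application name are placeholders):

  $ $(which mpirun) -np 4 ./my_app                       # option 1: invoke mpirun by its full path
  $ mpirun --prefix /opt/openmpi/1.10.2 -np 4 ./my_app   # option 2: tell the remote orted where Open MPI lives
  $ ./configure --enable-mpirun-prefix-by-default ...    # option 3: bake the prefix in when building

The last option avoids depending on lmod being initialized by the non-interactive ssh shells on the compute nodes.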
Re: [OMPI users] Pass RoCE flags to srun under SLURM
I solved the problem. For some reason the OMPI_MCA_btl_openib_cpc_include environment variable was set to udcm during the tests. Ensuring that it is set to rdmacm solved the issue.

Thanks anyway!

Davide

On Thu, 2016-03-03 at 16:40 -0600, Davide Vanzo wrote:
> Hi all,
> In our cluster the nodes are interconnected with RoCE and I want to
> set up OpenMPI to run on it via SLURM.
> I initially compiled OpenMPI 1.10.2 only with IB verbs support and I
> have no problem making it run over RoCE.
> Then I successfully built it with SLURM support as follows:
>
> ./configure --with-slurm --with-pmi=/usr/scheduler/slurm --with-verbs --with-hwloc
>
> The problem is that I cannot get it to use the RoCE network when I'm
> using srun. I also tried to export the OpenMPI runtime options, but I
> still cannot correctly initialize the network:
>
> $ echo $OMPI_MCA_btl
> openib,self,sm
> $ echo $OMPI_MCA_btl_openib_cpc_include
> rdmacm
> $ srun -n 2 --mpi=pmi2 ./osu_latency
> [...]
>
> Any suggestion?
> Thanks!
>
> Davide
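One way to keep the setting from being clobbered by stale exports in future sessions is to put it in the per-user MCA parameter file that Open MPI reads at startup. A minimal sketch, using the values from the posts above:

  $ cat $HOME/.openmpi/mca-params.conf
  # rdmacm is the connection manager that worked over RoCE here; udcm did not
  btl = openib,self,sm
  btl_openib_cpc_include = rdmacm

Note that OMPI_MCA_* environment variables still take precedence over this file, so a leftover export of udcm would still win; the file just provides a sane default when nothing is exported.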