[OMPI users] Orted path with module manager on cluster

2016-03-03 Thread Davide Vanzo
Hi all,
I have built OpenMPI 1.10.2 with RoCE network support on our test
cluster. On the cluster we use lmod to manage paths to different
versions of software. The problem is that I get an "orted: command not
found" error because the path to the orted binary is not exported to
the other nodes, where my run is launched via a non-interactive ssh
connection. I temporarily worked around this by exporting PATH with the
correct path to orted in my .bashrc file, but that is obviously not a
real solution.
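
For reference, the .bashrc workaround looks roughly like the following;
the install prefix below is only a placeholder for wherever this
lmod-managed OpenMPI 1.10.2 build actually lives:

# hypothetical prefix; adjust to the real OpenMPI 1.10.2 install location
export PATH=/path/to/openmpi-1.10.2/bin:$PATH
# the remote orted may also need the matching libraries
export LD_LIBRARY_PATH=/path/to/openmpi-1.10.2/lib:$LD_LIBRARY_PATH

The fragile part is that these exports must be picked up by the
non-interactive ssh shells that start orted on the remote nodes.
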
Any idea how I can fix it?

Thank you.

Davide

[OMPI users] Pass RoCE flags to srun under SLURM

2016-03-03 Thread Davide Vanzo
Hi all,
In our cluster the nodes are interconnected with RoCE, and I want to
set up OpenMPI to run over it via SLURM.
I initially compiled OpenMPI 1.10.2 with only IB verbs support and had
no problem running it over RoCE.
I then successfully rebuilt it with SLURM support as follows:

./configure --with-slurm --with-pmi=/usr/scheduler/slurm --with-verbs --with-hwloc
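
A quick sanity check that the SLURM support really made it into the
build is to look for the SLURM components in ompi_info, for example
(the grep pattern is just illustrative):

$ ompi_info | grep -i slurm
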

The problem is that I cannot get it to use the RoCE network when
launching with srun. I also tried exporting the OpenMPI runtime
options, but the network still fails to initialize correctly:

$ echo $OMPI_MCA_btl
openib,self,sm
$ echo $OMPI_MCA_btl_openib_cpc_include 
rdmacm
$ srun -n 2 --mpi=pmi2 ./osu_latency
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   test-vmp1245
  Local device: mlx4_0
  Local port:   2
  CPCs attempted:   udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   test-vmp1244
  Local device: mlx4_0
  Local port:   2
  CPCs attempted:   udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[27,4],0]) is on host: test-vmp1244
  Process 2 ([[27,4],1]) is on host: test-vmp1245
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[test-vmp1245:3603] Local abort before MPI_INIT completed successfully;
not able to aggregate error messages, and not able to guarantee that
all other processes were killed!
srun: error: test-vmp1244: task 0: Exited with exit code 1
srun: error: test-vmp1245: task 1: Exited with exit code 1
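
One way to narrow this down, following the hints in the help text
above, is to check which BTLs were built and to rerun with verbose BTL
selection enabled (output not shown here):

$ ompi_info | grep btl
$ export OMPI_MCA_btl_base_verbose=100
$ srun -n 2 --mpi=pmi2 ./osu_latency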

Any suggestion?
Thanks!

Davide

Re: [OMPI users] Orted path with module manager on cluster

2016-03-04 Thread Davide Vanzo
That did the trick!
Thank you guys.
Davide
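
For the archives, the options Gilles suggests below look roughly like
this; the install prefix is a placeholder and the command lines are
only illustrative:

# 1. launch mpirun via its full path so it can infer its own prefix
$(which mpirun) -np 2 ./a.out
# 2. or pass the prefix explicitly
mpirun --prefix=/path/to/openmpi-1.10.2 -np 2 ./a.out
# 3. or rebuild with the prefix baked in (keeping the existing configure flags)
./configure --enable-mpirun-prefix-by-default ...
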
On Fri, 2016-03-04 at 08:40 +0900, Gilles Gouaillardet wrote:
> Davide,
> 
> you can invoke mpirun via its full path (`which mpirun`) instead of
> plain mpirun, or use mpirun --prefix=...
> Another option is to rebuild OpenMPI with
> --enable-mpirun-prefix-by-default.
> 
> Cheers,
> 
> Gilles
> 
> On 3/4/2016 7:22 AM, Davide Vanzo wrote:
> > Hi all,
> > I have built OpenMPI 1.10.2 with RoCE network support on our test
> > cluster. On the cluster we use lmod to manage paths to different
> > versions of software. The problem is that I get an "orted: command
> > not found" error because the path to the orted binary is not
> > exported to the other nodes, where my run is launched via a
> > non-interactive ssh connection.
> > I temporarily worked around this by exporting PATH with the correct
> > path to orted in my .bashrc file, but that is obviously not a real
> > solution.
> > Any idea how I can fix it?
> > 
> > Thank you.
> > 
> > Davide
> > 
> > 

Re: [OMPI users] Pass RoCE flags to srun under SLURM

2016-03-04 Thread Davide Vanzo
I solved the problem. For some reason the
OMPI_MCA_btl_openib_cpc_include environment variable was set to udcm
during the tests. Ensuring that it is set to rdmacm solved the issue.
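
For the archives, pinning the connection manager to rdmacm in the srun
environment looks roughly like this; the per-user MCA parameter file
(~/.openmpi/mca-params.conf) can be used instead of the environment
variables:

export OMPI_MCA_btl=openib,self,sm
export OMPI_MCA_btl_openib_cpc_include=rdmacm
srun -n 2 --mpi=pmi2 ./osu_latency

# or, persistently, in ~/.openmpi/mca-params.conf:
#   btl_openib_cpc_include = rdmacm
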
Thanks anyway!
Davide
On Thu, 2016-03-03 at 16:40 -0600, Davide Vanzo wrote:
> Hi all,
> In our cluster the nodes are interconnected with RoCE, and I want to
> set up OpenMPI to run over it via SLURM.
> I initially compiled OpenMPI 1.10.2 with only IB verbs support and had
> no problem running it over RoCE.
> I then successfully rebuilt it with SLURM support as follows:
> 
> ./configure --with-slurm --with-pmi=/usr/scheduler/slurm --with-verbs --with-hwloc
> 
> The problem is that I cannot get it to use the RoCE network when
> launching with srun. I also tried exporting the OpenMPI runtime
> options, but the network still fails to initialize correctly:
> 
> $ echo $OMPI_MCA_btl
> openib,self,sm
> $ echo $OMPI_MCA_btl_openib_cpc_include 
> rdmacm
> $ srun -n 2 --mpi=pmi2 ./osu_latency
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> 
>   Local host:   test-vmp1245
>   Local device: mlx4_0
>   Local port:   2
>   CPCs attempted:   udcm
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> 
>   Local host:   test-vmp1244
>   Local device: mlx4_0
>   Local port:   2
>   CPCs attempted:   udcm
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> 
>   Process 1 ([[27,4],0]) is on host: test-vmp1244
>   Process 2 ([[27,4],1]) is on host: test-vmp1245
>   BTLs attempted: self
> 
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is unreachable
> from another.  This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used.  Your MPI job will now abort.
> 
> You may wish to try to narrow down the problem;
> 
>  * Check the output of ompi_info to see which BTL/MTL plugins are
>    available.
>  * Run your application with MPI_THREAD_SINGLE.
>  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>    if using MTL-based communications) to see exactly which
>    communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [test-vmp1245:3603] Local abort before MPI_INIT completed
> successfully; not able to aggregate error messages, and not able to
> guarantee that all other processes were killed!
> srun: error: test-vmp1244: task 0: Exited with exit code 1
> srun: error: test-vmp1245: task 1: Exited with exit code 1
> 
> Any suggestion?
> Thanks!
> 
> Davide