I solved the problem. For some reason the
OMPI_MCA_btl_openib_cpc_include environment variable was set to udcm
during the tests. Setting it to rdmacm solved the issue.
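
For reference, exporting the variables before calling srun (in the job
script, for example) should be enough for the launched tasks to pick
them up, assuming srun propagates the environment as it does by
default:

$ export OMPI_MCA_btl=openib,self,sm
$ export OMPI_MCA_btl_openib_cpc_include=rdmacm
$ srun -n 2 --mpi=pmi2 ./osu_latency
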
Thanks anyway!
Davide
On Thu, 2016-03-03 at 16:40 -0600, Davide Vanzo wrote:
> Hi all,
> In our cluster the nodes are interconnected with RoCE and I want to
> set up Open MPI to run over it via SLURM.
> I initially compiled Open MPI 1.10.2 with only IB verbs support and
> had no problem making it run over RoCE.
> I then successfully built it with SLURM support as follows:
> 
> ./configure --with-slurm --with-pmi=/usr/scheduler/slurm \
>     --with-verbs --with-hwloc
> 
> The problem is that I cannot get it to use the RoCE network when
> launching with srun. I also tried exporting the Open MPI runtime
> options, but the network still fails to initialize correctly:
> 
> $ echo $OMPI_MCA_btl
> openib,self,sm
> $ echo $OMPI_MCA_btl_openib_cpc_include 
> rdmacm
> $ srun -n 2 --mpi=pmi2 ./osu_latency
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> 
>   Local host:           test-vmp1245
>   Local device:         mlx4_0
>   Local port:           2
>   CPCs attempted:       udcm
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> 
>   Local host:           test-vmp1244
>   Local device:         mlx4_0
>   Local port:           2
>   CPCs attempted:       udcm
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> 
>   Process 1 ([[27,4],0]) is on host: test-vmp1244
>   Process 2 ([[27,4],1]) is on host: test-vmp1245
>   BTLs attempted: self
> 
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is unreachable
> from another.  This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used.  Your MPI job will now abort.
> 
> You may wish to try to narrow down the problem;
> 
>  * Check the output of ompi_info to see which BTL/MTL plugins are
>    available.
>  * Run your application with MPI_THREAD_SINGLE.
>  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>    if using MTL-based communications) to see exactly which
>    communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [test-vmp1245:3603] Local abort before MPI_INIT completed
> successfully; not able to aggregate error messages, and not able to
> guarantee that all other processes were killed!
> srun: error: test-vmp1244: task 0: Exited with exit code 1
> srun: error: test-vmp1245: task 1: Exited with exit code 1
> 
> Any suggestion?
> Thanks!
> 
> Davide
