Hi,

We have an issue on our 32-node Linux cluster regarding the use of Open MPI
in an InfiniBand dual-rail configuration.

Node config:
- Supermicro dual-socket nodes with Xeon E5 v3 6-core CPUs
- 4 Titan X GPUs
- 2 IB ConnectX FDR single-port HCAs (mlx4_0 and mlx4_1)
- CentOS 6.6, OFED 3.1, Open MPI 2.0.0, GCC 5.4, CUDA 7

IB dual-rail configuration: two independent 36-port IB switches; each of the
two single-port IB HCAs is connected to its own IB subnet.

The nodes are additionally connected via Ethernet for admin.
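
As a side note, the two separate rails/subnets can be confirmed from any node
with the standard OFED tools; a minimal sketch (LIDs/GUIDs omitted):

        # each HCA should show its port Active/LinkUp on its own subnet
        ibstat mlx4_0 1
        ibstat mlx4_1 1
        # more detail per device if needed
        ibv_devinfo -d mlx4_0
        ibv_devinfo -d mlx4_1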

------------------------------------------------------------

Consider the node topology below as valid for each of the 32 nodes in the
cluster:

At the PCIe root complex level, each CPU manages two GPUs and a single IB HCA:
CPU0     |    CPU1
mlx4_0   |    mlx4_1
GPU0     |    GPU2
GPU1     |    GPU3
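
The HCA locality can be checked per node with hwloc or via sysfs; a small
sketch (hwloc needs PCI support for lstopo to show the HCAs, and the expected
values below are just what the table above implies):

        # full PCIe/NUMA picture, including the mlx4 devices and the GPUs
        lstopo
        # NUMA node of each HCA: expect 0 for mlx4_0 and 1 for mlx4_1
        cat /sys/class/infiniband/mlx4_0/device/numa_node
        cat /sys/class/infiniband/mlx4_1/device/numa_node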

MPI ranks are bound to a socket via a rankfile and are distributed across the
two sockets of each node:
rank 0=node01 slot=0:2
rank 1=node01 slot=1:2
rank 2=node02 slot=0:2
...
rank n=nodeNN slot=0,1:2
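
As a sanity check, the actual binding of every rank can be printed at launch;
the "MCW rank ... bound to socket ..." lines in the outputs below look like
exactly that kind of report. A sketch for case 1:

        # report each rank's binding mask at startup
        mpirun -rf rankfile --report-bindings \
               --mca btl_openib_if_include mlx4_0 --mca btl self,openib a.out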


case 1: with a single IB HCA in use (either one of the two), all ranks can
        communicate with each other via openib only, independently of their
        relative socket binding. The tcp BTL can be explicitly disabled, as
        there is no tcp traffic.

        "mpirun -rf rankfile --mca btl_openib_if_include mlx4_0 --mca btl 
self,openib a.out"

case 2: in some rare cases, the topology of our MPI job is such that processes
        on socket 0 communicate only with other processes on socket 0, and the
        same is true for processes on socket 1. In this context the two IB
        rails are effectively used in parallel and all ranks communicate as
        needed via openib only; no tcp traffic.

        "mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 --mca 
btl self,openib a.out"
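
For cases 1 and 2, a simple way to confirm that the traffic really goes over
the IB rail(s) and not over tcp is to watch the IB port counters while the job
runs; sketch only:

        # cumulative data counters per HCA port; in case 2 both should grow,
        # while the admin Ethernet interface stays quiet
        cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_xmit_data
        cat /sys/class/infiniband/mlx4_1/ports/1/counters/port_xmit_data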

case 3: most of the time we have "cross-socket" communications between ranks
        on different nodes. In this context Open MPI falls back to the tcp BTL
        whenever a communication involves a rank bound to socket 0 on one node
        and a rank bound to socket 1 on another, and this slows down our jobs.

mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 a.out
[node01.octopoda:16129] MCW rank 0 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[node02.octopoda:12061] MCW rank 1 bound to socket 1[core 10[hwt 0]]: [./././././.][././././B/.]
[node02.octopoda:12062] [rank=1] openib: skipping device mlx4_0; it is too far away
[node01.octopoda:16130] [rank=0] openib: skipping device mlx4_1; it is too far away
[node02.octopoda:12062] [rank=1] openib: using port mlx4_1:1
[node01.octopoda:16130] [rank=0] openib: using port mlx4_0:1
[node02.octopoda:12062] mca: bml: Using self btl to [[11337,1],1] on node node02
[node01.octopoda:16130] mca: bml: Using self btl to [[11337,1],0] on node node01
[node02.octopoda:12062] mca: bml: Using tcp btl to [[11337,1],0] on node node01
[node02.octopoda:12062] mca: bml: Using tcp btl to [[11337,1],0] on node node01
[node02.octopoda:12062] mca: bml: Using tcp btl to [[11337,1],0] on node node01
[node01.octopoda:16130] mca: bml: Using tcp btl to [[11337,1],1] on node node02
[node01.octopoda:16130] mca: bml: Using tcp btl to [[11337,1],1] on node node02
[node01.octopoda:16130] mca: bml: Using tcp btl to [[11337,1],1] on node node02
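
(The device-selection and "mca: bml: Using ... btl" lines above come from
running with BTL verbosity turned up; roughly something like:)

        # same command line, with verbose BTL selection output
        mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 \
               --mca btl_base_verbose 100 a.out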


        Trying to force the use of both IB HCAs while disabling the tcp BTL
        results in the following error:

mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 --mca btl self,openib a.out
[node02.octopoda:11818] MCW rank 1 bound to socket 1[core 10[hwt 0]]: [./././././.][././././B/.]
[node01.octopoda:15886] MCW rank 0 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[node01.octopoda:15887] [rank=0] openib: skipping device mlx4_1; it is too far away
[node02.octopoda:11819] [rank=1] openib: skipping device mlx4_0; it is too far away
[node01.octopoda:15887] [rank=0] openib: using port mlx4_0:1
[node02.octopoda:11819] [rank=1] openib: using port mlx4_1:1
[node02.octopoda:11819] mca: bml: Using self btl to [[25017,1],1] on node node02
[node01.octopoda:15887] mca: bml: Using self btl to [[25017,1],0] on node node01
-------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[25017,1],1]) is on host: node02
  Process 2 ([[25017,1],0]) is on host: node01
  BTLs attempted: self openib

Your MPI job is now going to abort; sorry.
-------------------------------------
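
For reference, the openib BTL parameters available in this Open MPI 2.0.0
build (including the btl_openib_if_include used above) can be listed with:

        # dump all openib BTL parameters at every MCA level
        ompi_info --param btl openib --level 9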
