Hi,

We have an issue on our 32-node Linux cluster regarding the use of Open MPI in an InfiniBand dual-rail configuration.
Node config:
- Supermicro dual-socket Xeon E5 v3 6-core CPUs
- 4 Titan X GPUs
- 2 IB ConnectX FDR single-port HCAs (mlx4_0 and mlx4_1)
- CentOS 6.6, OFED 3.1, Open MPI 2.0.0, gcc 5.4, CUDA 7

IB dual-rail configuration: two independent 36-port IB switches; each of the two single-port IB HCAs is connected to its own IB subnet. The nodes are additionally connected via Ethernet for administration.

------------------------------------------------------------

Consider the node topology below as valid for each of the 32 nodes of the cluster. At the PCIe root complex level, each CPU manages two GPUs and a single IB card:

    CPU0   | CPU1
    mlx4_0 | mlx4_1
    GPU0   | GPU2
    GPU1   | GPU3

MPI ranks are bound to a socket via a rankfile and are distributed across the 2 sockets of each node:

    rank 0=node01 slot=0:2
    rank 1=node01 slot=1:2
    rank 2=node02 slot=0:2
    ...
    rank n=nodeNN slot=0,1:2

Case 1: with a single IB HCA used (either one of the two), all ranks can communicate with each other via openib only, independently of their socket binding. The tcp BTL can be explicitly disabled since there is no tcp traffic:

    mpirun -rf rankfile --mca btl_openib_if_include mlx4_0 --mca btl self,openib a.out

Case 2: in some rare cases, the communication pattern of our MPI job is such that processes on socket 0 communicate only with other processes on socket 0, and likewise for socket 1. In this situation the two IB rails are effectively used in parallel and all ranks communicate as needed via openib only, with no tcp traffic:

    mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 --mca btl self,openib a.out

Case 3: most of the time we have "cross-socket" communications between ranks on different nodes. In this situation Open MPI falls back to the tcp BTL whenever a communication involves a rank bound to socket 0 on one node and a rank bound to socket 1 on another node, and this slows down our jobs:

    mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 a.out

    [node01.octopoda:16129] MCW rank 0 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
    [node02.octopoda:12061] MCW rank 1 bound to socket 1[core 10[hwt 0]]: [./././././.][././././B/.]
    [node02.octopoda:12062] [rank=1] openib: skipping device mlx4_0; it is too far away
    [node01.octopoda:16130] [rank=0] openib: skipping device mlx4_1; it is too far away
    [node02.octopoda:12062] [rank=1] openib: using port mlx4_1:1
    [node01.octopoda:16130] [rank=0] openib: using port mlx4_0:1
    [node02.octopoda:12062] mca: bml: Using self btl to [[11337,1],1] on node node02
    [node01.octopoda:16130] mca: bml: Using self btl to [[11337,1],0] on node node01
    [node02.octopoda:12062] mca: bml: Using tcp btl to [[11337,1],0] on node node01
    [node02.octopoda:12062] mca: bml: Using tcp btl to [[11337,1],0] on node node01
    [node02.octopoda:12062] mca: bml: Using tcp btl to [[11337,1],0] on node node01
    [node01.octopoda:16130] mca: bml: Using tcp btl to [[11337,1],1] on node node02
    [node01.octopoda:16130] mca: bml: Using tcp btl to [[11337,1],1] on node node02
    [node01.octopoda:16130] mca: bml: Using tcp btl to [[11337,1],1] on node node02

Trying to force the use of the two IB HCAs while disabling the tcp BTL results in the following error:

    mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 --mca btl self,openib a.out

    [node02.octopoda:11818] MCW rank 1 bound to socket 1[core 10[hwt 0]]: [./././././.][././././B/.]
    [node01.octopoda:15886] MCW rank 0 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
    [node01.octopoda:15887] [rank=0] openib: skipping device mlx4_1; it is too far away
    [node02.octopoda:11819] [rank=1] openib: skipping device mlx4_0; it is too far away
    [node01.octopoda:15887] [rank=0] openib: using port mlx4_0:1
    [node02.octopoda:11819] [rank=1] openib: using port mlx4_1:1
    [node02.octopoda:11819] mca: bml: Using self btl to [[25017,1],1] on node node02
    [node01.octopoda:15887] mca: bml: Using self btl to [[25017,1],0] on node node01
    -------------------------------------
    At least one pair of MPI processes are unable to reach each other for
    MPI communications.  This means that no Open MPI device has indicated
    that it can be used to communicate between these processes.  This is
    an error; Open MPI requires that all MPI processes be able to reach
    each other.  This error can sometimes be the result of forgetting to
    specify the "self" BTL.

      Process 1 ([[25017,1],1]) is on host: node02
      Process 2 ([[25017,1],0]) is on host: node01
      BTLs attempted: self openib

    Your MPI job is now going to abort; sorry.
    -------------------------------------
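Is there a supported way to let every rank keep both HCAs, even when one of them sits on the other socket? From the openib BTL parameter list there seems to be a locality-related knob; the sketch below is only a guess on our side (we have not validated btl_openib_ignore_locality on 2.0.0, so please take the parameter name and its effect as assumptions):

    # check whether our build actually exposes the locality parameter
    ompi_info --param btl openib --level 9 | grep -i locality

    # tentative run: keep both HCAs visible to every rank and forbid tcp
    mpirun -rf rankfile \
           --mca btl self,openib \
           --mca btl_openib_if_include mlx4_0,mlx4_1 \
           --mca btl_openib_ignore_locality 1 \
           a.out

If that parameter does what its name suggests, each rank should report "using port" for both mlx4_0:1 and mlx4_1:1 instead of skipping the far device, at the cost of some cross-socket PCIe/QPI traffic. Is this the recommended approach for a dual-rail setup like ours, or is there a better way?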