Hello,

Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250 
microseconds with 256 MPI ranks on super-computer A (colosse).

The same test gives a latency of 10 microseconds with MVAPICH2 on QLogic 
Infiniband hardware with 512 MPI ranks on super-computer B (guillimin).


Here is the relevant information requested at 
http://www.open-mpi.org/community/help/


1. Check the FAQ first.

Done!


2. The version of Open MPI that you're using.

Open-MPI 1.4.3


3. The config.log file from the top-level Open MPI directory, if available 
(please compress!).

See below.

Command file: http://pastebin.com/mW32ntSJ


4. The output of the "ompi_info --all" command from the node where you're 
invoking mpirun. 

ompi_info -a on colosse: http://pastebin.com/RPyY9s24


5. If running on more than one node -- especially if you're having problems 
launching Open MPI processes -- also include the output of the "ompi_info -v 
ompi full --parsable" command from each node on which you're trying to run. 

I am not having problems launching Open-MPI processes.


6. A detailed description of what is failing.

Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250 
microseconds with 256 MPI ranks on super-computer A (colosse).

The same test gives a latency of 10 microseconds with MVAPICH2 on QLogic 
Infiniband hardware with 512 MPI ranks on super-computer B (guillimin).

Details follow.


I am developing a distributed genome assembler, called Ray, that uses the 
Message Passing Interface (I am a PhD student). 
Link: http://github.com/sebhtml/ray

I recently added the option -test-network-only so that Ray can be used to 
measure the latency. Each MPI rank has to send 100000 messages (4000 bytes 
each), one by one.
The destination of each message is picked at random.
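
Conceptually, the measurement boils down to something like the following 
minimal sketch (this is not Ray's actual code: it reduces the test to a 
blocking ping-pong between pairs of ranks, whereas Ray picks random 
destinations among all ranks and uses nonblocking and persistent 
communication):

  #include <mpi.h>
  #include <cstdio>
  #include <cstring>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      const int MESSAGES = 100000;   /* messages per rank, as in the test */
      const int BYTES    = 4000;     /* message size, as in the test */
      char buffer[BYTES];
      memset(buffer, 0, BYTES);

      /* even ranks ping their odd neighbour; odd ranks echo back */
      int peer = (rank % 2 == 0) ? rank + 1 : rank - 1;

      if (peer >= 0 && peer < size) {
          double start = MPI_Wtime();
          for (int i = 0; i < MESSAGES; i++) {
              if (rank % 2 == 0) {
                  MPI_Send(buffer, BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
                  MPI_Recv(buffer, BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);
              } else {
                  MPI_Recv(buffer, BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);
                  MPI_Send(buffer, BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
              }
          }
          double elapsed = MPI_Wtime() - start;
          if (rank % 2 == 0)
              printf("Rank %d: %.1f microseconds per round trip\n",
                     rank, 1e6 * elapsed / MESSAGES);
      }

      MPI_Finalize();
      return 0;
  }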


On colosse, a super-computer located at Laval University, I get an average 
latency of 250 microseconds with the test done in Ray.

See http://pastebin.com/9nyjSy5z

On colosse, the hardware is Mellanox Infiniband QDR ConnectX and the MPI 
middleware is Open-MPI 1.4.3 compiled with gcc 4.4.2.

colosse has 8 compute cores per node (Intel Nehalem).


Testing the latency with ibv_rc_pingpong on colosse gives 11 microseconds. 

  local address:  LID 0x048e, QPN 0x1c005c, PSN 0xf7c66b
  remote address: LID 0x018c, QPN 0x2c005c, PSN 0x5428e6
8192000 bytes in 0.01 seconds = 5776.64 Mbit/sec
1000 iters in 0.01 seconds = 11.35 usec/iter

So, based on the output of ibv_rc_pingpong, I know that the Infiniband fabric 
has the expected latency between two HCAs.
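
For reference, this is the standard two-node usage of the tool (the node 
names below are placeholders, not the exact nodes used):

  node1$ ibv_rc_pingpong          # start the server side on the first node
  node2$ ibv_rc_pingpong node1    # connect to it from the second node

The output corresponds to 1000 iterations of a 4096-byte ping-pong 
(8192000 bytes = 2 x 4096 x 1000).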



Adding the parameter --mca btl_openib_verbose 1 to mpirun shows that Open-MPI 
detects the hardware correctly:

[r107-n57][[59764,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] 
Querying INI files for vendor 0x02c9, part ID 26428
[r107-n57][[59764,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found 
corresponding INI values: Mellanox Hermon

see http://pastebin.com/pz03f0B3
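
For reference, the flag is passed on the mpirun command line like this 
(illustrative invocation only; the rank count and the other Ray arguments 
are placeholders, not the exact command used):

  mpirun --mca btl_openib_verbose 1 -np 256 ./Ray -test-network-only ...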


So I don't think this is the problem described in the FAQ ( 
http://www.open-mpi.org/faq/?category=openfabrics#mellanox-connectx-poor-latency
 )
and on the mailing list ( 
http://www.open-mpi.org/community/lists/users/2007/10/4238.php ) because the 
INI values are found.




Running the network test implemented in Ray on 32 MPI ranks, I get an average 
latency of 65 microseconds. 

See http://pastebin.com/nWDmGhvM


Thus, with 256 MPI ranks I get an average latency of 250 microseconds and with 
32 MPI ranks I get 65 microseconds.


Running the network test on 32 MPI ranks again, but allowing only MPI rank 0 
to send messages, gives a latency of 10 microseconds for that rank.
See http://pastebin.com/dWMXsHpa



Because I get 10 microseconds in Ray's network test when only MPI rank 0 
sends messages, I would say that there may be some I/O contention.

To test this hypothesis, I re-ran the test, but allowed only 1 MPI rank per 
node to send messages (there are 8 MPI ranks per node and a total of 32 MPI 
ranks).
Ranks 0, 8, 16 and 24 all reported 13 microseconds. See 
http://pastebin.com/h84Fif3g

The next test was to allow 2 MPI ranks on each node to send messages. Ranks 0, 
1, 8, 9, 16, 17, 24, and 25 reported 15 microseconds. 
See http://pastebin.com/REdhJXkS

With 3 MPI ranks per node that can send messages, ranks 0, 1, 2, 8, 9, 10, 16, 
17, 18, 24, 25, 26 reported 20 microseconds. See http://pastebin.com/TCd6xpuC

Finally, with 4 MPI ranks per node that can send messages, I got 23 
microseconds. See http://pastebin.com/V8zjae7s
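
In these runs, whether a rank is allowed to send is derived from its position 
on the node (ranks are placed on nodes in consecutive blocks of 8), roughly 
like this (a sketch, not Ray's actual code; the helper name is made up):

  // Sketch: decide whether a rank may send messages in the test,
  // assuming ranks are placed on nodes in consecutive blocks of 8.
  bool rankCanSend(int rank, int ranksPerNode, int sendersPerNode) {
      return (rank % ranksPerNode) < sendersPerNode;
  }
  // rankCanSend(rank, 8, 1) is true for ranks 0, 8, 16, 24;
  // rankCanSend(rank, 8, 2) adds ranks 1, 9, 17, 25; and so on.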


So the MPI ranks on a given node seem to fight for access to the HCA port.

Each colosse node has 1 port (ibv_devinfo) and the max_mtu is 2048 bytes. See 
http://pastebin.com/VXMAZdeZ






At this point, some may think that there is a bug in the network test 
itself. So I tested the same code on another super-computer.

On guillimin, a super-computer located at McGill University, I get an average 
latency (with Ray -test-network-only) of 10 microseconds when running Ray on 
512 MPI ranks.

See http://pastebin.com/nCKF8Xg6

On guillimin, the hardware is QLogic Infiniband QDR and the MPI middleware is 
MVAPICH2 1.6.

Thus, I know that the network test in Ray works as expected because results on 
guillimin show a latency of 10 microseconds for 512 MPI ranks.

guillimin also has 8 compute cores per node (Intel Nehalem).

On guillimin, each node has one port (ibv_devinfo) and the max_mtu of HCAs is 
4096 bytes. See http://pastebin.com/35T8N5t8








In Ray, only the following MPI functions are utilised (a sketch of how they 
fit together follows the list):

- MPI_Init
- MPI_Comm_rank
- MPI_Comm_size
- MPI_Finalize

- MPI_Isend

- MPI_Request_free
- MPI_Test
- MPI_Get_count
- MPI_Start
- MPI_Recv_init
- MPI_Cancel

- MPI_Get_processor_name
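
To give an idea of how these calls fit together, here is a minimal sketch of 
the persistent-receive lifecycle (not Ray's actual code; the buffer size, the 
tag and the MPI_Wait on the send side are illustration choices):

  #include <mpi.h>
  #include <cstdio>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      if (size < 2) { MPI_Finalize(); return 0; }

      char buffer[4000] = {0};

      if (rank == 1) {
          /* post a persistent receive once, then arm it */
          MPI_Request receive;
          MPI_Recv_init(buffer, 4000, MPI_BYTE, MPI_ANY_SOURCE,
                        MPI_ANY_TAG, MPI_COMM_WORLD, &receive);
          MPI_Start(&receive);

          /* poll the request from the main loop */
          int flag = 0;
          MPI_Status status;
          while (!flag)
              MPI_Test(&receive, &flag, &status);

          int count = 0;
          MPI_Get_count(&status, MPI_BYTE, &count);
          printf("Rank 1 received %d bytes from rank %d\n",
                 count, status.MPI_SOURCE);

          /* re-arming would be another MPI_Start(&receive); here we stop.
             If the request were still armed at shutdown, MPI_Cancel(&receive)
             would be called before freeing it. */
          MPI_Request_free(&receive);
      } else if (rank == 0) {
          /* send one 4000-byte message to rank 1 */
          MPI_Request send;
          MPI_Isend(buffer, 4000, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &send);
          MPI_Wait(&send, MPI_STATUS_IGNORE);
      }

      MPI_Finalize();
      return 0;
  }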




7. Please include information about your network: 
http://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot

Type: Infiniband

  7.1. Which OpenFabrics version are you running?


ofed-scripts-1.4.2-0_sunhpc1

libibverbs-1.1.3-2.el5
libibverbs-utils-1.1.3-2.el5
libibverbs-devel-1.1.3-2.el5


  7.2. What distro and version of Linux are you running? What is your kernel 
version?


CentOS release 5.6 (Final)

Linux colosse1 2.6.18-238.19.1.el5 #1 SMP Fri Jul 15 07:31:24 EDT 2011 x86_64 
x86_64 x86_64 GNU/Linux


  7.3. Which subnet manager are you running? (e.g., OpenSM, a vendor-specific 
subnet manager, etc.)

opensm-libs-3.3.3-1.el5_6.1

  7.4. What is the output of the ibv_devinfo command

    hca_id: mlx4_0
            fw_ver:                         2.7.000
            node_guid:                      5080:0200:008d:8f88
            sys_image_guid:                 5080:0200:008d:8f8b
            vendor_id:                      0x02c9
            vendor_part_id:                 26428
            hw_ver:                         0xA0
            board_id:                       X6275_QDR_IB_2.5
            phys_port_cnt:                  1
                    port:   1
                            state:                  active (4)
                            max_mtu:                2048 (4)
                            active_mtu:             2048 (4)
                            sm_lid:                 1222
                            port_lid:               659
                            port_lmc:               0x00



  7.5. What is the output of the ifconfig command

  Not using IPoIB.

  7.6. If running under Bourne shells, what is the output of the "ulimit -l" 
command? 

[sboisver12@colosse1 ~]$ ulimit -l
6000000







The two differences I see between guillimin and colosse are 

- Open-MPI 1.4.3 (colosse) v. MVAPICH2 1.6 (guillimin)
- Mellanox (colosse) v. QLogic (guillimin)


Has anyone experienced such high latency with Open-MPI 1.4.3 on Mellanox 
HCAs?






Thank you for your time.


                Sébastien Boisvert
