Hi Yevgeny,
You are right about comparing apples with apples. But MVAPICH2 is not installed on colosse, which is in the CLUMEQ consortium, a part of Compute Canada.

Meanwhile, I contacted some people at SciNet, which is also part of Compute Canada. They told me to try Open-MPI 1.4.3 with the Intel compiler and --mca btl self,ofud, which selects the ofud BTL instead of openib for OpenFabrics transport. This worked quite well -- I got a low latency of 35 microseconds. Yay!

See http://pastebin.com/VpAd1NrK for the Grid Engine submission script and the Ray latency output.

With Open-MPI 1.4.3, gcc 4.4.2 and --mca btl self,ofud, the job hangs somewhere before Ray starts, I presume, because there is nothing in standard output and nothing in standard error. One thing I noticed is that the load on a given node is 7, not 8, which is strange because there are, in theory, 8 instances of Ray on each node. See http://pastebin.com/gVMjQ9Ra

According to the Open-MPI mailing list, ofud "was never really finished". See http://www.open-mpi.org/community/lists/users/2010/12/14977.php

Could that unfinished status explain why it works with the Intel compiler but not with the GNU compiler? libibverbs is utilised on colosse, if that matters.

Sébastien
http://github.com/sebhtml/ray

> ________________________________________
> From: Yevgeny Kliteynik [klit...@dev.mellanox.co.il]
> Sent: September 20, 2011 08:14
> To: Open MPI Users
> Cc: Sébastien Boisvert
> Subject: Re: [OMPI users] Latency of 250 microseconds with Open-MPI 1.4.3, Mellanox Infiniband and 256 MPI ranks
>
> Hi Sébastien,
>
> If I understand you correctly, you are running your application on two different MPIs on two different clusters with two different IB vendors.
>
> Could you make a comparison more "apples to apples"-ish?
> For instance:
> - run the same version of Open MPI on both clusters
> - run the same version of MVAPICH on both clusters
>
> -- YK
>
> On 18-Sep-11 1:59 AM, Sébastien Boisvert wrote:
>> Hello,
>>
>> Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250 microseconds with 256 MPI ranks on super-computer A (name is colosse).
>>
>> The same software gives a latency of 10 microseconds with MVAPICH2 and QLogic Infiniband hardware with 512 MPI ranks on super-computer B (name is guillimin).
>>
>> Here is the relevant information listed in http://www.open-mpi.org/community/help/
>>
>> 1. Check the FAQ first.
>>
>> Done!
>>
>> 2. The version of Open MPI that you're using.
>>
>> Open-MPI 1.4.3
>>
>> 3. The config.log file from the top-level Open MPI directory, if available (please compress!).
>>
>> See below.
>>
>> Command file: http://pastebin.com/mW32ntSJ
>>
>> 4. The output of the "ompi_info --all" command from the node where you're invoking mpirun.
>>
>> ompi_info -a on colosse: http://pastebin.com/RPyY9s24
>>
>> 5. If running on more than one node -- especially if you're having problems launching Open MPI processes -- also include the output of the "ompi_info -v ompi full --parsable" command from each node on which you're trying to run.
>>
>> I am not having problems launching Open-MPI processes.
>>
>> 6. A detailed description of what is failing.
>>
>> Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250 microseconds with 256 MPI ranks on super-computer A (name is colosse).
>>
>> The same software gives a latency of 10 microseconds with MVAPICH2 and QLogic Infiniband hardware with 512 MPI ranks on super-computer B (name is guillimin).
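[Note added in this reply: to make the methodology quoted below easier to follow, here is a simplified, self-contained sketch of a random-destination round-trip latency test of the kind -test-network-only performs. It is illustrative only, not the actual Ray code; the tags, the echo protocol and the missing termination phase are simplifications.]

// Simplified sketch of a random-destination round-trip latency test.
// Not the actual -test-network-only code: tags, echo protocol and the
// missing termination phase are assumptions made for brevity.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int MESSAGE_BYTES = 4000;  // message size used in the test
    const int MESSAGES = 100000;     // messages sent by each rank
    const int TAG_REQUEST = 0;
    const int TAG_REPLY = 1;

    std::vector<char> outgoing(MESSAGE_BYTES);
    std::vector<char> incoming(MESSAGE_BYTES);
    srand(rank);

    double totalSeconds = 0.0;
    double sentAt = 0.0;
    int completed = 0;
    bool waitingForReply = false;

    while (completed < MESSAGES) {
        if (!waitingForReply) {
            // Pick a random destination and send one 4000-byte message.
            int destination = rand() % size;
            sentAt = MPI_Wtime();
            MPI_Request request;
            MPI_Isend(outgoing.data(), MESSAGE_BYTES, MPI_BYTE, destination,
                      TAG_REQUEST, MPI_COMM_WORLD, &request);
            MPI_Request_free(&request);  // fire and forget
            waitingForReply = true;
        }

        // Service at most one incoming message per iteration.
        int flag = 0;
        MPI_Status status;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        if (!flag)
            continue;

        MPI_Recv(incoming.data(), MESSAGE_BYTES, MPI_BYTE, status.MPI_SOURCE,
                 status.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (status.MPI_TAG == TAG_REQUEST) {
            // Echo the request back to its sender. A blocking send is
            // acceptable here: 4000 bytes is below typical eager limits.
            MPI_Send(incoming.data(), MESSAGE_BYTES, MPI_BYTE,
                     status.MPI_SOURCE, TAG_REPLY, MPI_COMM_WORLD);
        } else {
            // Our own request came back: one round trip is complete.
            totalSeconds += MPI_Wtime() - sentAt;
            completed++;
            waitingForReply = false;
        }
    }

    printf("Rank %d: average round-trip latency: %f microseconds\n",
           rank, 1e6 * totalSeconds / MESSAGES);

    // A real test also needs a termination phase so that every rank keeps
    // echoing requests until all ranks are done; it is omitted here.
    MPI_Finalize();
    return 0;
}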
>> Details follow.
>>
>> I am developing a distributed genome assembler that runs with the message-passing interface (I am a PhD student). It is called Ray. Link: http://github.com/sebhtml/ray
>>
>> I recently added the option -test-network-only so that Ray can be used to test the latency. Each MPI rank has to send 100000 messages (4000 bytes each), one by one. The destination of each message is picked at random.
>>
>> On colosse, a super-computer located at Laval University, I get an average latency of 250 microseconds with the test done in Ray.
>>
>> See http://pastebin.com/9nyjSy5z
>>
>> On colosse, the hardware is Mellanox Infiniband QDR ConnectX and the MPI middleware is Open-MPI 1.4.3 compiled with gcc 4.4.2.
>>
>> colosse has 8 compute cores per node (Intel Nehalem).
>>
>> Testing the latency with ibv_rc_pingpong on colosse gives 11 microseconds.
>>
>> local address: LID 0x048e, QPN 0x1c005c, PSN 0xf7c66b
>> remote address: LID 0x018c, QPN 0x2c005c, PSN 0x5428e6
>> 8192000 bytes in 0.01 seconds = 5776.64 Mbit/sec
>> 1000 iters in 0.01 seconds = 11.35 usec/iter
>>
>> So I know that the Infiniband fabric has a correct latency between two HCAs because of the output of ibv_rc_pingpong.
>>
>> Adding the parameter --mca btl_openib_verbose 1 to mpirun shows that Open-MPI detects the hardware correctly:
>>
>> [r107-n57][[59764,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 26428
>> [r107-n57][[59764,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Mellanox Hermon
>>
>> See http://pastebin.com/pz03f0B3
>>
>> So I don't think this is the problem described in the FAQ ( http://www.open-mpi.org/faq/?category=openfabrics#mellanox-connectx-poor-latency ) and on the mailing list ( http://www.open-mpi.org/community/lists/users/2007/10/4238.php ), because the INI values are found.
>>
>> Running the network test implemented in Ray on 32 MPI ranks, I get an average latency of 65 microseconds.
>>
>> See http://pastebin.com/nWDmGhvM
>>
>> Thus, with 256 MPI ranks I get an average latency of 250 microseconds and with 32 MPI ranks I get 65 microseconds.
>>
>> Running the network test on 32 MPI ranks again, but only allowing MPI rank 0 to send messages, gives a latency of 10 microseconds for this rank. See http://pastebin.com/dWMXsHpa
>>
>> Because I get 10 microseconds in the network test in Ray when only MPI rank 0 sends messages, I would say that there may be some I/O contention.
>>
>> To test this hypothesis, I re-ran the test, but allowed only 1 MPI rank per node to send messages (there are 8 MPI ranks per node and a total of 32 MPI ranks). Ranks 0, 8, 16 and 24 all reported 13 microseconds. See http://pastebin.com/h84Fif3g
>>
>> The next test was to allow 2 MPI ranks on each node to send messages. Ranks 0, 1, 8, 9, 16, 17, 24, and 25 reported 15 microseconds. See http://pastebin.com/REdhJXkS
>>
>> With 3 MPI ranks per node that can send messages, ranks 0, 1, 2, 8, 9, 10, 16, 17, 18, 24, 25, and 26 reported 20 microseconds. See http://pastebin.com/TCd6xpuC
>>
>> Finally, with 4 MPI ranks per node that can send messages, I got 23 microseconds. See http://pastebin.com/V8zjae7s
>>
>> So the MPI ranks on a given node seem to fight for access to the HCA port.
>>
>> Each colosse node has 1 port (ibv_devinfo) and the max_mtu is 2048 bytes. See http://pastebin.com/VXMAZdeZ
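[Note added in this reply: the sending ranks reported in the runs quoted above (0, 8, 16, 24, then 0, 1, 8, 9, 16, 17, 24, 25, and so on) are consistent with a simple gate on the rank number, assuming ranks are placed on nodes in consecutive blocks of 8. A hypothetical sketch of such a gate follows; it is not the code Ray actually uses.]

// Hypothetical gate deciding which ranks send in the contention runs,
// assuming block placement of ranks on nodes (ranks 0-7 on node 0,
// ranks 8-15 on node 1, and so on). Not the code Ray actually uses.
#include <cstdio>

bool rankMaySendMessages(int rank, int ranksPerNode, int sendersPerNode) {
    // The first sendersPerNode ranks of each node send; the others only reply.
    return (rank % ranksPerNode) < sendersPerNode;
}

int main() {
    const int totalRanks = 32;
    const int ranksPerNode = 8;
    for (int sendersPerNode = 1; sendersPerNode <= 4; ++sendersPerNode) {
        printf("%d sender(s) per node:", sendersPerNode);
        for (int rank = 0; rank < totalRanks; ++rank)
            if (rankMaySendMessages(rank, ranksPerNode, sendersPerNode))
                printf(" %d", rank);
        printf("\n");  // prints 0 8 16 24, then 0 1 8 9 16 17 24 25, ...
    }
    return 0;
}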
>> At this point, some may think that there may be a bug in the network test itself. So I tested the same code on another super-computer.
>>
>> On guillimin, a super-computer located at McGill University, I get an average latency (with Ray -test-network-only) of 10 microseconds when running Ray on 512 MPI ranks.
>>
>> See http://pastebin.com/nCKF8Xg6
>>
>> On guillimin, the hardware is QLogic Infiniband QDR and the MPI middleware is MVAPICH2 1.6.
>>
>> Thus, I know that the network test in Ray works as expected, because the results on guillimin show a latency of 10 microseconds for 512 MPI ranks.
>>
>> guillimin also has 8 compute cores per node (Intel Nehalem).
>>
>> On guillimin, each node has one port (ibv_devinfo) and the max_mtu of the HCAs is 4096 bytes. See http://pastebin.com/35T8N5t8
>>
>> In Ray, only the following MPI functions are utilised:
>>
>> - MPI_Init
>> - MPI_Comm_rank
>> - MPI_Comm_size
>> - MPI_Finalize
>> - MPI_Isend
>> - MPI_Request_free
>> - MPI_Test
>> - MPI_Get_count
>> - MPI_Start
>> - MPI_Recv_init
>> - MPI_Cancel
>> - MPI_Get_processor_name
>>
>> 7. Please include information about your network: http://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot
>>
>> Type: Infiniband
>>
>> 7.1. Which OpenFabrics version are you running?
>>
>> ofed-scripts-1.4.2-0_sunhpc1
>> libibverbs-1.1.3-2.el5
>> libibverbs-utils-1.1.3-2.el5
>> libibverbs-devel-1.1.3-2.el5
>>
>> 7.2. What distro and version of Linux are you running? What is your kernel version?
>>
>> CentOS release 5.6 (Final)
>>
>> Linux colosse1 2.6.18-238.19.1.el5 #1 SMP Fri Jul 15 07:31:24 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
>>
>> 7.3. Which subnet manager are you running? (e.g., OpenSM, a vendor-specific subnet manager, etc.)
>>
>> opensm-libs-3.3.3-1.el5_6.1
>>
>> 7.4. What is the output of the ibv_devinfo command?
>>
>> hca_id: mlx4_0
>>     fw_ver: 2.7.000
>>     node_guid: 5080:0200:008d:8f88
>>     sys_image_guid: 5080:0200:008d:8f8b
>>     vendor_id: 0x02c9
>>     vendor_part_id: 26428
>>     hw_ver: 0xA0
>>     board_id: X6275_QDR_IB_2.5
>>     phys_port_cnt: 1
>>     port: 1
>>         state: active (4)
>>         max_mtu: 2048 (4)
>>         active_mtu: 2048 (4)
>>         sm_lid: 1222
>>         port_lid: 659
>>         port_lmc: 0x00
>>
>> 7.5. What is the output of the ifconfig command?
>>
>> Not using IPoIB.
>>
>> 7.6. If running under Bourne shells, what is the output of the "ulimit -l" command?
>>
>> [sboisver12@colosse1 ~]$ ulimit -l
>> 6000000
>>
>> The two differences I see between guillimin and colosse are:
>>
>> - Open-MPI 1.4.3 (colosse) vs. MVAPICH2 1.6 (guillimin)
>> - Mellanox (colosse) vs. QLogic (guillimin)
>>
>> Has anyone experienced such a high latency with Open-MPI 1.4.3 on Mellanox HCAs?
>>
>> Thank you for your time.
>>
>> Sébastien Boisvert
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> Sébastien
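P.S. For anyone who wants to reproduce this outside of Ray: the MPI function list quoted above (MPI_Recv_init, MPI_Start, MPI_Isend, MPI_Test, MPI_Get_count, MPI_Cancel, MPI_Request_free) suggests a message loop built on a persistent receive and fire-and-forget non-blocking sends. The sketch below is a minimal illustration of how those calls can fit together; it is not the actual Ray implementation, and the neighbour-to-neighbour exchange is only there to make the example self-contained.

// Minimal sketch of a persistent-receive plus fire-and-forget MPI_Isend
// pattern, using only the MPI calls listed in the quoted message.
// Illustrative only; not the actual Ray implementation.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int MESSAGE_BYTES = 4000;
    std::vector<char> receiveBuffer(MESSAGE_BYTES);
    std::vector<char> sendBuffer(MESSAGE_BYTES);

    // Create one persistent receive and arm it.
    MPI_Request receiveRequest;
    MPI_Recv_init(receiveBuffer.data(), MESSAGE_BYTES, MPI_BYTE,
                  MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &receiveRequest);
    MPI_Start(&receiveRequest);

    // Fire-and-forget send to the right neighbour: free the request
    // immediately and let the library complete the send in the background.
    int destination = (rank + 1) % size;
    MPI_Request sendRequest;
    MPI_Isend(sendBuffer.data(), MESSAGE_BYTES, MPI_BYTE, destination,
              0, MPI_COMM_WORLD, &sendRequest);
    MPI_Request_free(&sendRequest);

    // Poll the persistent receive until the left neighbour's message arrives.
    int flag = 0;
    MPI_Status status;
    while (!flag)
        MPI_Test(&receiveRequest, &flag, &status);

    int count = 0;
    MPI_Get_count(&status, MPI_BYTE, &count);
    printf("Rank %d received %d bytes from rank %d\n",
           rank, count, status.MPI_SOURCE);

    // Re-arm the persistent receive, as a real message loop would after
    // every received message.
    MPI_Start(&receiveRequest);

    // Shut down: cancel the pending receive, complete the cancellation,
    // then release the persistent request.
    MPI_Cancel(&receiveRequest);
    flag = 0;
    while (!flag)
        MPI_Test(&receiveRequest, &flag, MPI_STATUS_IGNORE);
    MPI_Request_free(&receiveRequest);

    MPI_Finalize();
    return 0;
}

The MPI_Request_free right after MPI_Isend is what makes the send fire-and-forget: the library completes the send in the background and the sender never waits on it.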