Hello,

Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250 microseconds with 256 MPI ranks on super-computer A (its name is colosse).
The same software gives a latency of 10 microseconds with MVAPICH2 and QLogic Infiniband hardware with 512 MPI ranks on super-computer B (its name is guillimin).

Here is the relevant information listed at http://www.open-mpi.org/community/help/ :

1. Check the FAQ first.

Done!

2. The version of Open MPI that you're using.

Open-MPI 1.4.3

3. The config.log file from the top-level Open MPI directory, if available (please compress!).

See below. Command file: http://pastebin.com/mW32ntSJ

4. The output of the "ompi_info --all" command from the node where you're invoking mpirun.

ompi_info -a on colosse: http://pastebin.com/RPyY9s24

5. If running on more than one node -- especially if you're having problems launching Open MPI processes -- also include the output of the "ompi_info -v ompi full --parsable" command from each node on which you're trying to run.

I am not having problems launching Open-MPI processes.

6. A detailed description of what is failing.

Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250 microseconds with 256 MPI ranks on super-computer A (its name is colosse). The same software gives a latency of 10 microseconds with MVAPICH2 and QLogic Infiniband hardware with 512 MPI ranks on super-computer B (its name is guillimin).

Details follow.

I am developing a distributed genome assembler, called Ray, that runs with the message-passing interface (I am a PhD student). Link: http://github.com/sebhtml/ray

I recently added the option -test-network-only so that Ray can be used to test the latency. Each MPI rank has to send 100000 messages (4000 bytes each), one by one. The destination of each message is picked at random. (A simplified sketch of this kind of test is shown further below.)

On colosse, a super-computer located at Laval University, I get an average latency of 250 microseconds with the test done in Ray. See http://pastebin.com/9nyjSy5z

On colosse, the hardware is Mellanox Infiniband QDR ConnectX and the MPI middleware is Open-MPI 1.4.3 compiled with gcc 4.4.2. colosse has 8 compute cores per node (Intel Nehalem).

Testing the latency with ibv_rc_pingpong on colosse gives 11 microseconds:

  local address:  LID 0x048e, QPN 0x1c005c, PSN 0xf7c66b
  remote address: LID 0x018c, QPN 0x2c005c, PSN 0x5428e6
  8192000 bytes in 0.01 seconds = 5776.64 Mbit/sec
  1000 iters in 0.01 seconds = 11.35 usec/iter

So I know that the Infiniband fabric has a correct latency between two HCAs, because of the output of ibv_rc_pingpong.

Adding the parameter --mca btl_openib_verbose 1 to mpirun shows that Open-MPI detects the hardware correctly:

  [r107-n57][[59764,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 26428
  [r107-n57][[59764,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Mellanox Hermon

See http://pastebin.com/pz03f0B3

So I don't think this is the problem described in the FAQ ( http://www.open-mpi.org/faq/?category=openfabrics#mellanox-connectx-poor-latency ) and on the mailing list ( http://www.open-mpi.org/community/lists/users/2007/10/4238.php ), because the INI values are found.

Running the network test implemented in Ray on 32 MPI ranks, I get an average latency of 65 microseconds. See http://pastebin.com/nWDmGhvM

Thus, with 256 MPI ranks I get an average latency of 250 microseconds, and with 32 MPI ranks I get 65 microseconds.

Running the network test on 32 MPI ranks again, but only allowing MPI rank 0 to send messages, gives a latency of 10 microseconds for this rank. See http://pastebin.com/dWMXsHpa
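To make the test concrete, here is a simplified, self-contained sketch of this kind of round-trip test. This is not Ray's actual code: the file name, message tags, and the termination protocol are invented for illustration; only the message size (4000 bytes), the number of exchanges (100000 per rank), and the random choice of destination are taken from the description above.

// latency_test_sketch.cpp -- a simplified stand-alone sketch (NOT Ray's code).
// Each rank sends NUM_EXCHANGES messages of MESSAGE_BYTES bytes, one at a
// time, to randomly chosen destinations, answers incoming pings, and reports
// its average round-trip time.
// Build: mpicxx -O2 latency_test_sketch.cpp -o latency_test_sketch
// Run:   mpirun -np 32 ./latency_test_sketch

#include <mpi.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

static const int NUM_EXCHANGES = 100000; // messages sent by each rank
static const int MESSAGE_BYTES = 4000;   // payload size used in the test
enum { TAG_PING = 1, TAG_PONG = 2, TAG_DONE = 3 };

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 0; }
    srand(rank + 1);

    std::vector<char> outgoing(MESSAGE_BYTES, 'x'); // never modified again
    std::vector<char> incoming(MESSAGE_BYTES);
    std::vector<MPI_Request> doneRequests;

    int sent = 0;                 // pings issued by this rank
    int doneRanksSeen = 0;        // other ranks that finished their pings
    bool waitingForPong = false;  // one ping is in flight
    bool announcedDone = false;
    double pingStart = 0.0, totalRoundTrip = 0.0;

    while (true) {
        // Issue the next ping once the previous round trip has completed.
        if (!waitingForPong && sent < NUM_EXCHANGES) {
            int destination = rand() % size;
            if (destination == rank)              // never ping ourselves here
                destination = (rank + 1) % size;
            pingStart = MPI_Wtime();
            MPI_Request request;
            MPI_Isend(&outgoing[0], MESSAGE_BYTES, MPI_BYTE, destination,
                      TAG_PING, MPI_COMM_WORLD, &request);
            MPI_Request_free(&request); // buffer is constant; delivery is
                                        // confirmed by the pong that follows
            waitingForPong = true;
            sent++;
        }

        // Service one incoming message, if any: answer pings, record pongs.
        int flag = 0;
        MPI_Status status;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        if (flag) {
            if (status.MPI_TAG == TAG_PING) {
                MPI_Recv(&incoming[0], MESSAGE_BYTES, MPI_BYTE,
                         status.MPI_SOURCE, TAG_PING, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Request request;
                MPI_Isend(&outgoing[0], MESSAGE_BYTES, MPI_BYTE,
                          status.MPI_SOURCE, TAG_PONG, MPI_COMM_WORLD, &request);
                MPI_Request_free(&request);
            } else if (status.MPI_TAG == TAG_PONG) {
                MPI_Recv(&incoming[0], MESSAGE_BYTES, MPI_BYTE,
                         status.MPI_SOURCE, TAG_PONG, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                totalRoundTrip += MPI_Wtime() - pingStart;
                waitingForPong = false;
            } else { // TAG_DONE
                MPI_Recv(&incoming[0], 0, MPI_BYTE, status.MPI_SOURCE,
                         TAG_DONE, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                doneRanksSeen++;
            }
        }

        // Tell the other ranks once all of our own pings have been answered.
        if (!announcedDone && sent == NUM_EXCHANGES && !waitingForPong) {
            for (int i = 0; i < size; i++) {
                if (i == rank) continue;
                MPI_Request request;
                MPI_Isend(&outgoing[0], 0, MPI_BYTE, i, TAG_DONE,
                          MPI_COMM_WORLD, &request);
                doneRequests.push_back(request);
            }
            announcedDone = true;
        }

        // Keep servicing pings until every other rank has also finished.
        if (announcedDone && doneRanksSeen == size - 1)
            break;
    }

    MPI_Waitall((int)doneRequests.size(), &doneRequests[0], MPI_STATUSES_IGNORE);
    printf("Rank %d: average round trip: %.1f microseconds\n",
           rank, 1e6 * totalRoundTrip / NUM_EXCHANGES);
    MPI_Finalize();
    return 0;
}

The important property for the numbers reported below is that each rank has only one message in flight at a time, so what is measured is the per-message round trip, not throughput.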
Because I get 10 microseconds in the network test in Ray when only MPI rank 0 sends messages, I would say that there may be some I/O contention.

To test this hypothesis, I re-ran the test, but allowed only 1 MPI rank per node to send messages (there are 8 MPI ranks per node and a total of 32 MPI ranks). Ranks 0, 8, 16 and 24 all reported 13 microseconds. See http://pastebin.com/h84Fif3g

The next test was to allow 2 MPI ranks on each node to send messages. Ranks 0, 1, 8, 9, 16, 17, 24 and 25 reported 15 microseconds. See http://pastebin.com/REdhJXkS

With 3 MPI ranks per node allowed to send messages, ranks 0, 1, 2, 8, 9, 10, 16, 17, 18, 24, 25 and 26 reported 20 microseconds. See http://pastebin.com/TCd6xpuC

Finally, with 4 MPI ranks per node allowed to send messages, I got 23 microseconds. See http://pastebin.com/V8zjae7s (A sketch of how this per-node gating can be computed is given in the P.S. at the end of this message.)

So the MPI ranks on a given node seem to fight for access to the HCA port. Each colosse node has 1 port (ibv_devinfo) and the max_mtu is 2048 bytes. See http://pastebin.com/VXMAZdeZ

At this point, some may think that there may be a bug in the network test itself. So I tested the same code on another super-computer.

On guillimin, a super-computer located at McGill University, I get an average latency (with Ray -test-network-only) of 10 microseconds when running Ray on 512 MPI ranks. See http://pastebin.com/nCKF8Xg6

On guillimin, the hardware is QLogic Infiniband QDR and the MPI middleware is MVAPICH2 1.6. Thus, I know that the network test in Ray works as expected, because the results on guillimin show a latency of 10 microseconds for 512 MPI ranks. guillimin also has 8 compute cores per node (Intel Nehalem).

On guillimin, each node has one port (ibv_devinfo) and the max_mtu of the HCAs is 4096 bytes. See http://pastebin.com/35T8N5t8

In Ray, only the following MPI functions are utilised:

- MPI_Init
- MPI_Comm_rank
- MPI_Comm_size
- MPI_Finalize
- MPI_Isend
- MPI_Request_free
- MPI_Test
- MPI_Get_count
- MPI_Start
- MPI_Recv_init
- MPI_Cancel
- MPI_Get_processor_name

7. Please include information about your network: http://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot

Type: Infiniband

7.1. Which OpenFabrics version are you running?

ofed-scripts-1.4.2-0_sunhpc1
libibverbs-1.1.3-2.el5
libibverbs-utils-1.1.3-2.el5
libibverbs-devel-1.1.3-2.el5

7.2. What distro and version of Linux are you running? What is your kernel version?

CentOS release 5.6 (Final)
Linux colosse1 2.6.18-238.19.1.el5 #1 SMP Fri Jul 15 07:31:24 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

7.3. Which subnet manager are you running? (e.g., OpenSM, a vendor-specific subnet manager, etc.)

opensm-libs-3.3.3-1.el5_6.1

7.4. What is the output of the ibv_devinfo command?

hca_id: mlx4_0
        fw_ver:            2.7.000
        node_guid:         5080:0200:008d:8f88
        sys_image_guid:    5080:0200:008d:8f8b
        vendor_id:         0x02c9
        vendor_part_id:    26428
        hw_ver:            0xA0
        board_id:          X6275_QDR_IB_2.5
        phys_port_cnt:     1
                port:   1
                        state:       active (4)
                        max_mtu:     2048 (4)
                        active_mtu:  2048 (4)
                        sm_lid:      1222
                        port_lid:    659
                        port_lmc:    0x00

7.5. What is the output of the ifconfig command?

Not using IPoIB.

7.6. If running under Bourne shells, what is the output of the "ulimit -l" command?

[sboisver12@colosse1 ~]$ ulimit -l
6000000

The two differences I see between guillimin and colosse are:

- Open-MPI 1.4.3 (colosse) vs. MVAPICH2 1.6 (guillimin)
- Mellanox (colosse) vs. QLogic (guillimin)

Has anyone experienced such high latency with Open-MPI 1.4.3 on Mellanox HCAs?

Thank you for your time.

Sébastien Boisvert
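P.S. For the runs where only some MPI ranks per node were allowed to send messages, the gating can be derived from MPI_Get_processor_name. Below is a small, hypothetical helper (again, not Ray's code; the function name indexOnNode is mine) that gives each rank its index among the ranks running on its own node, so that "allow k senders per node" becomes indexOnNode(MPI_COMM_WORLD) < k.

// index_on_node.cpp -- hypothetical helper (not Ray's code).
// Computes a rank's index among the ranks of its own node.
// Build: mpicxx -O2 index_on_node.cpp -o index_on_node
// Run:   mpirun -np 32 ./index_on_node

#include <mpi.h>
#include <cstdio>
#include <cstring>
#include <vector>

int indexOnNode(MPI_Comm comm) {
    int rank, size, length;
    char name[MPI_MAX_PROCESSOR_NAME];
    memset(name, 0, sizeof(name));
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    MPI_Get_processor_name(name, &length);

    // Gather every rank's processor name, then count how many lower-numbered
    // ranks report the same name as ours.
    std::vector<char> allNames(size * MPI_MAX_PROCESSOR_NAME, 0);
    MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                  &allNames[0], MPI_MAX_PROCESSOR_NAME, MPI_CHAR, comm);

    int index = 0;
    for (int i = 0; i < rank; i++)
        if (strcmp(&allNames[i * MPI_MAX_PROCESSOR_NAME], name) == 0)
            index++;
    return index;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Rank %d has index %d on its node\n", rank, indexOnNode(MPI_COMM_WORLD));
    MPI_Finalize();
    return 0;
}

In the latency sketch earlier in this message, the excluded ranks would simply start with sent already equal to NUM_EXCHANGES, so they announce themselves done immediately and only answer incoming pings.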