Hi Yevgeny,
You are right about comparing apples with apples. But MVAPICH2 is not installed on colosse, which is in the CLUMEQ consortium, a part of Compute Canada.

Meanwhile, I contacted some people at SciNet, which is also part of Compute Canada. They told me to try Open-MPI 1.4.3 with the Intel compiler and --mca btl self,ofud, which selects the ofud BTL instead of openib for OpenFabrics transport. This worked quite well -- I got a low latency of 35 microseconds. Yay!

See http://pastebin.com/VpAd1NrK for the Grid Engine submission script and the Ray latency output.

With Open-MPI 1.4.3, gcc 4.4.2 and --mca btl self,ofud, the job hangs somewhere before Ray starts, I presume, because there is nothing in standard output and nothing in standard error. One thing I noticed is that the load on a given node is 7, not 8, which is strange because there are, in theory, 8 instances of Ray on each node. See http://pastebin.com/gVMjQ9Ra

According to the Open-MPI mailing list, ofud "was never really finished". See http://www.open-mpi.org/community/lists/users/2010/12/14977.php

Could that unfinished status explain why it works with the Intel compiler but not with the GNU compiler? libibverbs is utilised on colosse, if that matters.

Sébastien
http://github.com/sebhtml/ray

> ________________________________________
> From: Yevgeny Kliteynik [klit...@dev.mellanox.co.il]
> Sent: September 20, 2011 08:14
> To: Open MPI Users
> Cc: Sébastien Boisvert
> Subject: Re: [OMPI users] Latency of 250 microseconds with Open-MPI 1.4.3, Mellanox Infiniband and 256 MPI ranks
>
> Hi Sébastien,
>
> If I understand you correctly, you are running your application on two different MPIs on two different clusters with two different IB vendors.
>
> Could you make a comparison more "apples to apples"-ish?
> For instance:
> - run the same version of Open MPI on both clusters
> - run the same version of MVAPICH on both clusters
>
> -- YK
>
> On 18-Sep-11 1:59 AM, Sébastien Boisvert wrote:
>> Hello,
>>
>> Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250 microseconds with 256 MPI ranks on super-computer A (name is colosse).
>>
>> The same software gives a latency of 10 microseconds with MVAPICH2 and QLogic Infiniband hardware with 512 MPI ranks on super-computer B (name is guillimin).
>>
>> Here is the relevant information listed in http://www.open-mpi.org/community/help/
>>
>> 1. Check the FAQ first.
>>
>> Done!
>>
>> 2. The version of Open MPI that you're using.
>>
>> Open-MPI 1.4.3
>>
>> 3. The config.log file from the top-level Open MPI directory, if available (please compress!).
>>
>> See below.
>>
>> Command file: http://pastebin.com/mW32ntSJ
>>
>> 4. The output of the "ompi_info --all" command from the node where you're invoking mpirun.
>>
>> ompi_info -a on colosse: http://pastebin.com/RPyY9s24
>>
>> 5. If running on more than one node -- especially if you're having problems launching Open MPI processes -- also include the output of the "ompi_info -v ompi full --parsable" command from each node on which you're trying to run.
>>
>> I am not having problems launching Open-MPI processes.
>>
>> 6. A detailed description of what is failing.
>>
>> Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250 microseconds with 256 MPI ranks on super-computer A (name is colosse).
>>
>> The same software gives a latency of 10 microseconds with MVAPICH2 and QLogic Infiniband hardware with 512 MPI ranks on super-computer B (name is guillimin).
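[Note added in this reply: to make the methodology quoted below easier to follow, here is a simplified, self-contained sketch of a random-destination round-trip latency test of the kind -test-network-only performs. It is illustrative only, not the actual Ray code; the tags, the echo protocol and the missing termination phase are simplifications.]

// Simplified sketch of a random-destination round-trip latency test.
// Not the actual -test-network-only code: tags, echo protocol and the
// missing termination phase are assumptions made for brevity.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int MESSAGE_BYTES = 4000;  // message size used in the test
    const int MESSAGES = 100000;     // messages sent by each rank
    const int TAG_REQUEST = 0;
    const int TAG_REPLY = 1;

    std::vector<char> outgoing(MESSAGE_BYTES);
    std::vector<char> incoming(MESSAGE_BYTES);
    srand(rank);

    double totalSeconds = 0.0;
    double sentAt = 0.0;
    int completed = 0;
    bool waitingForReply = false;

    while (completed < MESSAGES) {
        if (!waitingForReply) {
            // Pick a random destination and send one 4000-byte message.
            int destination = rand() % size;
            sentAt = MPI_Wtime();
            MPI_Request request;
            MPI_Isend(outgoing.data(), MESSAGE_BYTES, MPI_BYTE, destination,
                      TAG_REQUEST, MPI_COMM_WORLD, &request);
            MPI_Request_free(&request);  // fire and forget
            waitingForReply = true;
        }

        // Service at most one incoming message per iteration.
        int flag = 0;
        MPI_Status status;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        if (!flag)
            continue;

        MPI_Recv(incoming.data(), MESSAGE_BYTES, MPI_BYTE, status.MPI_SOURCE,
                 status.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (status.MPI_TAG == TAG_REQUEST) {
            // Echo the request back to its sender. A blocking send is
            // acceptable here: 4000 bytes is below typical eager limits.
            MPI_Send(incoming.data(), MESSAGE_BYTES, MPI_BYTE,
                     status.MPI_SOURCE, TAG_REPLY, MPI_COMM_WORLD);
        } else {
            // Our own request came back: one round trip is complete.
            totalSeconds += MPI_Wtime() - sentAt;
            completed++;
            waitingForReply = false;
        }
    }

    printf("Rank %d: average round-trip latency: %f microseconds\n",
           rank, 1e6 * totalSeconds / MESSAGES);

    // A real test also needs a termination phase so that every rank keeps
    // echoing requests until all ranks are done; it is omitted here.
    MPI_Finalize();
    return 0;
}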
>> Details follow.
>>
>> I am developing a distributed genome assembler that runs with the message-passing interface (I am a PhD student). It is called Ray. Link: http://github.com/sebhtml/ray
>>
>> I recently added the option -test-network-only so that Ray can be used to test the latency. Each MPI rank has to send 100000 messages (4000 bytes each), one by one. The destination of each message is picked at random.
>>
>> On colosse, a super-computer located at Laval University, I get an average latency of 250 microseconds with the test done in Ray.
>>
>> See http://pastebin.com/9nyjSy5z
>>
>> On colosse, the hardware is Mellanox Infiniband QDR ConnectX and the MPI middleware is Open-MPI 1.4.3 compiled with gcc 4.4.2.
>>
>> colosse has 8 compute cores per node (Intel Nehalem).
>>
>> Testing the latency with ibv_rc_pingpong on colosse gives 11 microseconds.
>>
>> local address: LID 0x048e, QPN 0x1c005c, PSN 0xf7c66b
>> remote address: LID 0x018c, QPN 0x2c005c, PSN 0x5428e6
>> 8192000 bytes in 0.01 seconds = 5776.64 Mbit/sec
>> 1000 iters in 0.01 seconds = 11.35 usec/iter
>>
>> So I know that the Infiniband fabric has a correct latency between two HCAs because of the output of ibv_rc_pingpong.
>>
>> Adding the parameter --mca btl_openib_verbose 1 to mpirun shows that Open-MPI detects the hardware correctly:
>>
>> [r107-n57][[59764,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 26428
>> [r107-n57][[59764,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Mellanox Hermon
>>
>> See http://pastebin.com/pz03f0B3
>>
>> So I don't think this is the problem described in the FAQ ( http://www.open-mpi.org/faq/?category=openfabrics#mellanox-connectx-poor-latency ) and on the mailing list ( http://www.open-mpi.org/community/lists/users/2007/10/4238.php ), because the INI values are found.
>>
>> Running the network test implemented in Ray on 32 MPI ranks, I get an average latency of 65 microseconds.
>>
>> See http://pastebin.com/nWDmGhvM
>>
>> Thus, with 256 MPI ranks I get an average latency of 250 microseconds and with 32 MPI ranks I get 65 microseconds.
>>
>> Running the network test on 32 MPI ranks again, but only allowing MPI rank 0 to send messages, gives a latency of 10 microseconds for this rank. See http://pastebin.com/dWMXsHpa
>>
>> Because I get 10 microseconds in the network test in Ray when only MPI rank 0 sends messages, I would say that there may be some I/O contention.
>>
>> To test this hypothesis, I re-ran the test, but allowed only 1 MPI rank per node to send messages (there are 8 MPI ranks per node and a total of 32 MPI ranks). Ranks 0, 8, 16 and 24 all reported 13 microseconds. See http://pastebin.com/h84Fif3g
>>
>> The next test was to allow 2 MPI ranks on each node to send messages. Ranks 0, 1, 8, 9, 16, 17, 24, and 25 reported 15 microseconds. See http://pastebin.com/REdhJXkS
>>
>> With 3 MPI ranks per node that can send messages, ranks 0, 1, 2, 8, 9, 10, 16, 17, 18, 24, 25, and 26 reported 20 microseconds. See http://pastebin.com/TCd6xpuC
>>
>> Finally, with 4 MPI ranks per node that can send messages, I got 23 microseconds. See http://pastebin.com/V8zjae7s
>>
>> So the MPI ranks on a given node seem to fight for access to the HCA port.
>>
>> Each colosse node has 1 port (ibv_devinfo) and the max_mtu is 2048 bytes. See http://pastebin.com/VXMAZdeZ
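[Note added in this reply: the sending ranks reported in the runs quoted above (0, 8, 16, 24, then 0, 1, 8, 9, 16, 17, 24, 25, and so on) are consistent with a simple gate on the rank number, assuming ranks are placed on nodes in consecutive blocks of 8. A hypothetical sketch of such a gate follows; it is not the code Ray actually uses.]

// Hypothetical gate deciding which ranks send in the contention runs,
// assuming block placement of ranks on nodes (ranks 0-7 on node 0,
// ranks 8-15 on node 1, and so on). Not the code Ray actually uses.
#include <cstdio>

bool rankMaySendMessages(int rank, int ranksPerNode, int sendersPerNode) {
    // The first sendersPerNode ranks of each node send; the others only reply.
    return (rank % ranksPerNode) < sendersPerNode;
}

int main() {
    const int totalRanks = 32;
    const int ranksPerNode = 8;
    for (int sendersPerNode = 1; sendersPerNode <= 4; ++sendersPerNode) {
        printf("%d sender(s) per node:", sendersPerNode);
        for (int rank = 0; rank < totalRanks; ++rank)
            if (rankMaySendMessages(rank, ranksPerNode, sendersPerNode))
                printf(" %d", rank);
        printf("\n");  // prints 0 8 16 24, then 0 1 8 9 16 17 24 25, ...
    }
    return 0;
}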
>> At this point, some may think that there may be a bug in the network test itself. So I tested the same code on another super-computer.
>>
>> On guillimin, a super-computer located at McGill University, I get an average latency (with Ray -test-network-only) of 10 microseconds when running Ray on 512 MPI ranks.
>>
>> See http://pastebin.com/nCKF8Xg6
>>
>> On guillimin, the hardware is QLogic Infiniband QDR and the MPI middleware is MVAPICH2 1.6.
>>
>> Thus, I know that the network test in Ray works as expected, because the results on guillimin show a latency of 10 microseconds for 512 MPI ranks.
>>
>> guillimin also has 8 compute cores per node (Intel Nehalem).
>>
>> On guillimin, each node has one port (ibv_devinfo) and the max_mtu of the HCAs is 4096 bytes. See http://pastebin.com/35T8N5t8
>>
>> In Ray, only the following MPI functions are utilised:
>>
>> - MPI_Init
>> - MPI_Comm_rank
>> - MPI_Comm_size
>> - MPI_Finalize
>> - MPI_Isend
>> - MPI_Request_free
>> - MPI_Test
>> - MPI_Get_count
>> - MPI_Start
>> - MPI_Recv_init
>> - MPI_Cancel
>> - MPI_Get_processor_name
>>
>> 7. Please include information about your network: http://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot
>>
>> Type: Infiniband
>>
>> 7.1. Which OpenFabrics version are you running?
>>
>> ofed-scripts-1.4.2-0_sunhpc1
>> libibverbs-1.1.3-2.el5
>> libibverbs-utils-1.1.3-2.el5
>> libibverbs-devel-1.1.3-2.el5
>>
>> 7.2. What distro and version of Linux are you running? What is your kernel version?
>>
>> CentOS release 5.6 (Final)
>>
>> Linux colosse1 2.6.18-238.19.1.el5 #1 SMP Fri Jul 15 07:31:24 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
>>
>> 7.3. Which subnet manager are you running? (e.g., OpenSM, a vendor-specific subnet manager, etc.)
>>
>> opensm-libs-3.3.3-1.el5_6.1
>>
>> 7.4. What is the output of the ibv_devinfo command?
>>
>> hca_id: mlx4_0
>>     fw_ver: 2.7.000
>>     node_guid: 5080:0200:008d:8f88
>>     sys_image_guid: 5080:0200:008d:8f8b
>>     vendor_id: 0x02c9
>>     vendor_part_id: 26428
>>     hw_ver: 0xA0
>>     board_id: X6275_QDR_IB_2.5
>>     phys_port_cnt: 1
>>     port: 1
>>         state: active (4)
>>         max_mtu: 2048 (4)
>>         active_mtu: 2048 (4)
>>         sm_lid: 1222
>>         port_lid: 659
>>         port_lmc: 0x00
>>
>> 7.5. What is the output of the ifconfig command?
>>
>> Not using IPoIB.
>>
>> 7.6. If running under Bourne shells, what is the output of the "ulimit -l" command?
>>
>> [sboisver12@colosse1 ~]$ ulimit -l
>> 6000000
>>
>> The two differences I see between guillimin and colosse are:
>>
>> - Open-MPI 1.4.3 (colosse) vs. MVAPICH2 1.6 (guillimin)
>> - Mellanox (colosse) vs. QLogic (guillimin)
>>
>> Has anyone experienced such a high latency with Open-MPI 1.4.3 on Mellanox HCAs?
>>
>> Thank you for your time.
>>
>> Sébastien Boisvert
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> Sébastien
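P.S. For anyone who wants to reproduce this outside of Ray: the MPI function list quoted above (MPI_Recv_init, MPI_Start, MPI_Isend, MPI_Test, MPI_Get_count, MPI_Cancel, MPI_Request_free) suggests a message loop built on a persistent receive and fire-and-forget non-blocking sends. The sketch below is a minimal illustration of how those calls can fit together; it is not the actual Ray implementation, and the neighbour-to-neighbour exchange is only there to make the example self-contained.

// Minimal sketch of a persistent-receive plus fire-and-forget MPI_Isend
// pattern, using only the MPI calls listed in the quoted message.
// Illustrative only; not the actual Ray implementation.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int MESSAGE_BYTES = 4000;
    std::vector<char> receiveBuffer(MESSAGE_BYTES);
    std::vector<char> sendBuffer(MESSAGE_BYTES);

    // Create one persistent receive and arm it.
    MPI_Request receiveRequest;
    MPI_Recv_init(receiveBuffer.data(), MESSAGE_BYTES, MPI_BYTE,
                  MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &receiveRequest);
    MPI_Start(&receiveRequest);

    // Fire-and-forget send to the right neighbour: free the request
    // immediately and let the library complete the send in the background.
    int destination = (rank + 1) % size;
    MPI_Request sendRequest;
    MPI_Isend(sendBuffer.data(), MESSAGE_BYTES, MPI_BYTE, destination,
              0, MPI_COMM_WORLD, &sendRequest);
    MPI_Request_free(&sendRequest);

    // Poll the persistent receive until the left neighbour's message arrives.
    int flag = 0;
    MPI_Status status;
    while (!flag)
        MPI_Test(&receiveRequest, &flag, &status);

    int count = 0;
    MPI_Get_count(&status, MPI_BYTE, &count);
    printf("Rank %d received %d bytes from rank %d\n",
           rank, count, status.MPI_SOURCE);

    // Re-arm the persistent receive, as a real message loop would after
    // every received message.
    MPI_Start(&receiveRequest);

    // Shut down: cancel the pending receive, complete the cancellation,
    // then release the persistent request.
    MPI_Cancel(&receiveRequest);
    flag = 0;
    while (!flag)
        MPI_Test(&receiveRequest, &flag, MPI_STATUS_IGNORE);
    MPI_Request_free(&receiveRequest);

    MPI_Finalize();
    return 0;
}

The MPI_Request_free right after MPI_Isend is what makes the send fire-and-forget: the library completes the send in the background and the sender never waits on it.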