Hello, I recently tried running HPLinpack, compiled with Open MPI (OMPI), over the Myrinet MX interconnect. A simple hello-world program runs fine, but xhpl fails with an error when it calls MPI_Send:
# mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME --mca btl mx,self /opt/hpl/openmpi-hpl/bin/xhpl
[l0-0.local:04707] *** An error occurred in MPI_Send
[l0-0.local:04707] *** on communicator MPI_COMM_WORLD
[l0-0.local:04707] *** MPI_ERR_INTERN: internal error
[l0-0.local:04707] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 4706 on node "l0-0" exited on signal 15.
3 additional processes aborted (not shown)

The hello-world program, launched over the same hosts and btl selection, works:

# mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME --mca btl mx,self ~/atumanov/hello
Hello from Alex' MPI test program
Process 1 on compute-0-2.local out of 4
Hello from Alex' MPI test program
Hello from Alex' MPI test program
Process 0 on l0-0.local out of 4
Process 3 on compute-0-2.local out of 4
Hello from Alex' MPI test program
Process 2 on l0-0.local out of 4

The output from mx_info is as follows:

-------------------------------------------------------------------------------
MX Version: 1.2.0g
MX Build: r...@blackopt.sw.myri.com:/home/install/rocks/src/roll/myrinet_mx10g/BUILD/mx-1.2.0g Wed Jan 17 18:51:12 PST 2007
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===================================================================
Instance #0: 299.8 MHz LANai, PCI-E x8, 2 MB SRAM
  Status:         Running, P0: Link up
  MAC Address:    00:60:dd:47:7d:73
  Product code:   10G-PCIE-8A-C
  Part number:    09-03362
  Serial number:  314581
  Mapper:         00:60:dd:47:7d:73, version = 0x591b1c74, configured
  Mapped hosts:   2

                                               ROUTE COUNT
INDEX   MAC ADDRESS        HOST NAME            P0
-----   -----------        ---------            ---
  0)    00:60:dd:47:7d:73  compute-0-2.local:0  D 0,0
  1)    00:60:dd:47:7d:72  l0-0.local:0         1,0
-------------------------------------------------------------------------------

I have several questions:

1. Can I launch OMPI-over-MX jobs from the headnode, to be executed on the two compute nodes, even though the headnode itself has no MX hardware?
2. In the next-to-last line of the mx_info output, what does the letter 'D' stand for?
3. Does the MX support in OMPI cover only MX-2G, or is MX-10G supported as well?

If anybody has encountered a similar problem and was able to work around it, please let me know. Many thanks for your time and for bringing the community together.

Sincerely,
Alex.
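P.S. To help narrow this down, I am thinking of running a bare-bones MPI_Send/MPI_Recv test over the same hosts and btl selection, to see whether any point-to-point traffic over MX fails or only HPL's communication pattern does. Below is a rough sketch of such a test (only standard MPI calls, nothing specific to my cluster); I would build it with mpicc and launch it with the same mpirun line as above.

/* Minimal point-to-point test: rank 0 sends one int to rank 1.
 * This exercises the same MPI_Send path that xhpl aborts in. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0)
            fprintf(stderr, "Run with at least 2 processes.\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {
        token = 42;
        /* Send a single int to rank 1 over whichever btl gets selected. */
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("Rank 0 sent token %d to rank 1\n", token);
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Rank 1 received token %d from rank 0\n", token);
    }

    MPI_Finalize();
    return 0;
}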