On Feb 14, 2007, at 12:33 PM, Alex Tumanov wrote:
Hello,
I recently tried running HPLinpack, compiled with OMPI, over the Myrinet MX interconnect. A simple hello world program runs fine, but XHPL fails with an error when it calls MPI_Send:
# mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME --mca btl mx,self /opt/hpl/openmpi-hpl/bin/xhpl
[l0-0.local:04707] *** An error occurred in MPI_Send
[l0-0.local:04707] *** on communicator MPI_COMM_WORLD
[l0-0.local:04707] *** MPI_ERR_INTERN: internal error
[l0-0.local:04707] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 4706 on node "l0-0" exited on signal 15.
3 additional processes aborted (not shown)

If you are running more than one process per node, you may need to add the shared memory BTL ("sm") to mx,self. OMPI also offers another MX path via the PML; performance was better using the PML, but George may be getting the BTL closer. In addition, try with and without MX_RCACHE=1 (or MX_RCACHE=2 for the PML) in your environment.
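For reference, those suggestions would look something like the following (untested sketches; they assume your OMPI build includes the sm BTL and the cm PML / MX MTL components):

  # BTL path, with the shared memory BTL added for ranks on the same node
  mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME --mca btl mx,sm,self /opt/hpl/openmpi-hpl/bin/xhpl

  # PML path: select the cm PML so MX is driven through the MTL instead of the BTL
  mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME --mca pml cm /opt/hpl/openmpi-hpl/bin/xhpl

  # export the MX registration-cache setting to the remote nodes with -x
  mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME -x MX_RCACHE=1 --mca btl mx,sm,self /opt/hpl/openmpi-hpl/bin/xhpl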
# mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME --mca btl mx,self ~/atumanov/hello
Hello from Alex' MPI test program
Process 1 on compute-0-2.local out of 4
Hello from Alex' MPI test program
Hello from Alex' MPI test program
Process 0 on l0-0.local out of 4
Process 3 on compute-0-2.local out of 4
Hello from Alex' MPI test program
Process 2 on l0-0.local out of 4
The output from mx_info is as follows:
-------------------------------------------------------------------------------------------------
MX Version: 1.2.0g
We have a new version, 1.2.0h, that we recommend all users upgrade to.
MX Build: r...@blackopt.sw.myri.com:/home/install/rocks/src/roll/myrinet_mx10g/BUILD/mx-1.2.0g  Wed Jan 17 18:51:12 PST 2007
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===================================================================
Instance #0: 299.8 MHz LANai, PCI-E x8, 2 MB SRAM
Status: Running, P0: Link up
MAC Address: 00:60:dd:47:7d:73
Product code: 10G-PCIE-8A-C
Part number: 09-03362
Serial number: 314581
Mapper: 00:60:dd:47:7d:73, version = 0x591b1c74, configured
Mapped hosts: 2
                                              ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME            P0
-----    -----------     ---------            ---
    0) 00:60:dd:47:7d:73 compute-0-2.local:0  D 0,0
    1) 00:60:dd:47:7d:72 l0-0.local:0           1,0
-------------------------------------------------------------------------------------------------
There are several questions. First of all, can I launch OMPI-over-MX jobs from the head node for execution on the two compute nodes, even though the head node itself has no MX hardware?
Any OMPI people have comments?
Secondly, looking at the next-to-last line in the mx_info output, what does the letter 'D' stand for?
It means that while a route to this node was loaded at some point in the past, the most recent batch of route loads came from a map that did not contain this node. This could be caused by the node going down, losing connectivity, or simply having its fma crash or be killed. Note that in the last case the node is still on the fabric and the old routes likely still work; it just has no fma running.
Third, regarding the MX interconnect support that OMPI provides: does it cover only MX-2G, or is MX-10G supported as well?
Both. If you build OMPI with shared library support, you can change
between MX-10G and MX-2G via LD_LIBRARY_PATH.
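For example, something along these lines (the install prefixes below are hypothetical; point LD_LIBRARY_PATH at wherever your MX-2G and MX-10G libraries actually live, and use -x to forward it to the remote nodes):

  # hypothetical locations for the two MX installs -- adjust to your system
  export LD_LIBRARY_PATH=/opt/mx-10g/lib:$LD_LIBRARY_PATH   # or /opt/mx-2g/lib for MX-2G
  mpirun -np 4 -H l0-0,c0-2 --prefix $MPIHOME -x LD_LIBRARY_PATH --mca btl mx,sm,self /opt/hpl/openmpi-hpl/bin/xhpl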
Scott
If anybody has encountered a similar problem and was able to circumvent it, please do let me know.
Many thanks for your time and for bringing the community together.
Sincerely,
Alex.
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users