Hi, all.

We've got a couple of clusters running RHEL 6.2, with several
centrally-installed versions/compilations of OpenMPI.  Some of the nodes
have 4x QDR InfiniBand, and all of the nodes have 1-gigabit Ethernet.  I
was gathering some bandwidth and latency numbers using the OSU
Micro-Benchmarks (OMB), and noticed some weird behavior.

When I run a simple "mpirun ./osu_bw" on a couple of IB-enabled nodes,
I get numbers consistent with our IB speed (up to about 3800 MB/s), and
when I run the same thing on two nodes with only Ethernet, I get speeds
consistent with that (up to about 120 MB/s).  So far, so good.
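For concreteness, those runs used nothing more exotic than something
like this (the node names are just placeholders for two of ours; in
practice the host list varies):

    mpirun -np 2 -host node01,node02 ./osu_bw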

The trouble is that when I try to add some "--mca" parameters to force
it to use TCP/Ethernet, the program seems to hang.  I get the headers of
the "osu_bw" output, but no results, not even for the first case (1-byte
payload per packet).  This is occurring both on the IB-enabled nodes and
on the Ethernet-only nodes.  The specific syntax I was using was:

    mpirun --mca btl ^openib --mca btl_tcp_if_exclude ib0 ./osu_bw
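In case it helps, the next thing I was going to try is selecting the
TCP BTL explicitly instead of excluding openib, something like the
following (eth0 is just my guess at the right interface name on our
nodes):

    mpirun --mca btl tcp,self,sm --mca btl_tcp_if_include eth0 ./osu_bw

but I haven't verified yet whether that behaves any differently.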

The problem occurs at least with OpenMPI 1.6.3 compiled with GNU 4.4
compilers, with 1.6.3 compiled with Intel 13.0.1 compilers, and with
1.6.5 compiled with Intel 13.0.1 compilers.  I haven't tested any other
combinations yet.

Any ideas?  It's very possible this is a system-configuration problem,
but I don't know where to look.  At this point I'd welcome any
suggestions, either about this specific situation or general pointers
on mpirun debugging flags to use.  I can't find much in the docs yet on
run-time debugging of OpenMPI itself, as opposed to debugging the
application.  Maybe I'm just looking in the wrong place.
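For instance, I was guessing that something like the following might
show which BTLs and interfaces actually get used, but I don't know
whether those are the right parameters or verbosity levels:

    ompi_info --param btl tcp
    mpirun --mca btl ^openib --mca btl_base_verbose 30 ./osu_bw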


Thanks,

-- 
Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu
