Hi, all. We've got a couple of clusters running RHEL 6.2, and have several centrally-installed versions/compilations of OpenMPI. Some of the nodes have 4x QDR InfiniBand, and all of the nodes have 1-gigabit Ethernet. I was gathering some bandwidth and latency numbers using the OSU/OMB tests, and noticed some weird behavior.
When I run a simple "mpirun ./osu_bw" on a couple of IB-enabled nodes, I get numbers consistent with our IB speed (up to about 3800 MB/s), and when I run the same thing on two nodes with only Ethernet, I get speeds consistent with that (up to about 120 MB/s). So far, so good.

The trouble starts when I try to add some "--mca" parameters to force it to use TCP/Ethernet: the program seems to hang. I get the header of the "osu_bw" output, but no results, not even for the first case (1-byte payload per packet). This happens on both the IB-enabled nodes and the Ethernet-only nodes. The specific syntax I was using was:

  mpirun --mca btl ^openib --mca btl_tcp_if_exclude ib0 ./osu_bw

The problem occurs at least with OpenMPI 1.6.3 compiled with the GNU 4.4 compilers, with 1.6.3 compiled with the Intel 13.0.1 compilers, and with 1.6.5 compiled with the Intel 13.0.1 compilers. I haven't tested any other combinations yet.

Any ideas? It's quite possible this is a system configuration problem, but I don't know where to look. At this point any suggestions would be welcome, either about this specific situation or general pointers on mpirun debugging flags to use. I can't find much in the docs on run-time debugging for OpenMPI itself, as opposed to debugging the application; maybe I'm just looking in the wrong place.

Thanks,

--
Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu
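P.S. For reference, here is roughly the set of invocations I've been comparing. The hostnames are just placeholders (in practice the node lists come from our scheduler), so treat these as a sketch rather than the literal commands:

  # IB-enabled pair: runs fine, peaks around 3800 MB/s
  mpirun -np 2 -host ib-node1,ib-node2 ./osu_bw

  # Ethernet-only pair: runs fine, peaks around 120 MB/s
  mpirun -np 2 -host eth-node1,eth-node2 ./osu_bw

  # Forcing TCP on the IB-enabled pair: prints the header, then hangs
  mpirun -np 2 -host ib-node1,ib-node2 \
         --mca btl ^openib --mca btl_tcp_if_exclude ib0 ./osu_bw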