I have things working now. I needed to limit OpenMPI to the interfaces that
actually work (thanks for the tip), although it still seems like OpenMPI
should figure that out correctly on its own. Now I've moved on to stress
testing with the bandwidth testing app I posted earlier in the InfiniBand thread:
mpirun -mca btl_tcp_if_include eth0 -mca btl tcp -np 2 -hostfile /u/mhouston/mpihosts mpi_bandwidth 3750 262144
262144 109.697279 (MillionBytes/sec) 104.615478(MegaBytes/sec)
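The two rates on that output line appear to be the same measurement in different units: dividing the MillionBytes/sec figure (10^6 bytes) by 1.048576 gives the MegaBytes/sec figure, which suggests the app's MegaBytes are 2^20 bytes. A quick check under that assumption (`to_mib` is a hypothetical helper, not part of the test app):

```shell
#!/bin/sh
# Hypothetical helper: convert a rate in decimal megabytes (10^6 bytes)
# per second into binary megabytes (2^20 bytes) per second.
to_mib() {
  awk -v mb="$1" 'BEGIN { printf "%.6f\n", mb * 1000000 / 1048576 }'
}

# Rate reported by the 3750-message run; prints the MegaBytes/sec value.
to_mib 109.697279
```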
mpirun -mca btl_tcp_if_include eth0 -mca btl tcp -np 2 -hostfile /u/mhouston/mpihosts mpi_bandwidth 4000 262144
[spire-2.Stanford.EDU:06645] mca_btl_tcp_frag_send: writev failed with errno=104
mpirun noticed that job rank 1 with PID 21281 on node "spire-3.stanford.edu" exited on signal 11.
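Cranking up the number of messages in flight from 3750 to 4000 makes the run
fail: spire-2 reports writev failing with errno 104 (ECONNRESET on Linux,
"connection reset by peer"), and rank 1 on spire-3 exits on signal 11
(SIGSEGV). I haven't seen this behavior with LAM or MPICH, so I thought
I'd mention it.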
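For scale, a rough sketch of how much payload is outstanding in each run. It assumes the first mpi_bandwidth argument is the number of messages in flight and the second is the message size in bytes (inferred from the commands and output, not confirmed), and that all messages are posted before any complete. `in_flight_mib` is a hypothetical helper:

```shell
#!/bin/sh
# Hypothetical helper: total payload in MiB (2^20 bytes) if every message
# of a run is outstanding at once.
#   $1 = number of messages (assumed first mpi_bandwidth argument)
#   $2 = message size in bytes (assumed second mpi_bandwidth argument)
in_flight_mib() {
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n * s / 1048576 }'
}

in_flight_mib 3750 262144   # run that completed
in_flight_mib 4000 262144   # run that crashed
```

Under those assumptions, the run that completed has 937.5 MiB outstanding and the failing run has 1000.0 MiB (4000 messages of 256 KiB each).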
Thanks!
-Mike