I have things working now. I needed to limit OpenMPI to the interfaces that actually work (thanks for the tip). It still seems like something OpenMPI should figure out correctly on its own... Now I've moved on to stress testing with the bandwidth testing app I posted earlier in the Infiniband thread:

mpirun -mca btl_tcp_if_include eth0 -mca btl tcp -np 2 -hostfile /u/mhouston/mpihosts mpi_bandwidth 3750 262144

262144  109.697279 (MillionBytes/sec)   104.615478(MegaBytes/sec)
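For anyone comparing the two figures on that output line: they appear to be the same rate expressed in decimal (10^6 bytes) and binary (2^20 bytes) units. A quick sketch of the conversion, assuming that is what mpi_bandwidth means by "MillionBytes" and "MegaBytes" (the function name here is just for illustration):

```python
# Assumption: "MillionBytes" = 10**6 bytes, "MegaBytes" = 2**20 bytes.
BYTES_PER_MEBIBYTE = 1024 * 1024


def million_bytes_to_mebibytes(rate_million_bytes_per_sec):
    """Convert a rate in 10**6 bytes/sec to a rate in 2**20 bytes/sec."""
    return rate_million_bytes_per_sec * 1_000_000 / BYTES_PER_MEBIBYTE


reported = 109.697279  # MillionBytes/sec, from the run above
converted = million_bytes_to_mebibytes(reported)
print(f"{converted:.4f} MegaBytes/sec")
```

The result agrees with the second figure in the output to within rounding.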

mpirun -mca btl_tcp_if_include eth0 -mca btl tcp -np 2 -hostfile /u/mhouston/mpihosts mpi_bandwidth 4000 262144

[spire-2.Stanford.EDU:06645] mca_btl_tcp_frag_send: writev failed with errno=104
mpirun noticed that job rank 1 with PID 21281 on node "spire-3.stanford.edu" exited on signal 11.

Cranking up the number of messages in flight makes things really unhappy. I haven't seen this behavior with LAM or MPICH so I thought I'd mention it.
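To put rough numbers on the two runs, here is a small sketch of the total data volume each one implies. It assumes the first argument to mpi_bandwidth is the message count and the second is the message size in bytes; the post does not spell that out, so treat both as guesses, and the helper name is made up for this example.

```python
# Assumption: mpi_bandwidth <message_count> <message_size_bytes>
BYTES_PER_MEBIBYTE = 1024 * 1024
MESSAGE_SIZE = 262144  # bytes, taken from the command lines above


def total_mib(message_count, message_size=MESSAGE_SIZE):
    """Total payload in MiB if every message is outstanding at once."""
    return message_count * message_size / BYTES_PER_MEBIBYTE


for count in (3750, 4000):
    print(f"{count} messages x {MESSAGE_SIZE} bytes = {total_mib(count)} MiB")
```

Under those assumptions the working run has 937.5 MiB queued and the failing run has 1000 MiB, so the failure shows up somewhere between those two totals.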

Thanks!

-Mike
