Somehow I missed this e-mail, sorry... Can you send all the information listed on this web page:

    http://www.open-mpi.org/community/help/


On Jun 26, 2007, at 10:34 AM, Yuan Wan wrote:


Hi all,

I'm benchmarking our new cluster with HPL. I picked Open MPI as the parallel
environment because, in low-level benchmarks, I found that Open MPI can make
use of both gigabit Ethernet TCP networks on our cluster
(bandwidth could reach up to 250 MB/s).

The HPL code builds cleanly and runs well for small problem sizes.
However, when I run it on 32 nodes (128-way), it crashes partway
through with the following error messages:

---------------------------------------------
[node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104
[node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104
[node073:10234] mca_btl_tcp_frag_send: writev failed with errno=104
[node073:10234] mca_btl_tcp_frag_send: writev failed with errno=104
[node089:29190] mca_btl_tcp_frag_send: writev failed with errno=104
[node090:27881] mca_btl_tcp_frag_send: writev failed with errno=104
[node072:02729] mca_btl_tcp_frag_send: writev failed with errno=104
[node071:03029] mca_btl_tcp_frag_send: writev failed with errno=104
.....
[node084:06044] mca_btl_tcp_frag_send: writev failed with errno=104
[node086:01346] mca_btl_tcp_frag_send: writev failed with errno=104
[node069:16372] mca_btl_tcp_frag_send: writev failed with errno=104
[node100:23294] mca_btl_tcp_frag_send: writev failed with errno=104
[node069:16372] mca_btl_tcp_frag_send: writev failed with errno=104
[node085:04347] mca_btl_tcp_frag_send: writev failed with errno=104
[node087:31391] mca_btl_tcp_frag_send: writev failed with errno=104
---------------------------------------------
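
For reference, on Linux errno 104 is ECONNRESET ("Connection reset by peer"), which means the remote end of the TCP connection went away while the sender was still writing. A minimal sketch for tallying these messages per node and decoding the errno is below. The helper names (summarize_writev_failures, describe_errno) are hypothetical and not part of Open MPI, the regular expression only assumes the line format shown in the log above, and the errno text is whatever the local platform reports.

```python
import errno
import os
import re
from collections import Counter

# Matches the TCP BTL lines quoted above, e.g.
# [node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104
LINE_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+mca_btl_tcp_frag_send: "
    r"writev failed with errno=(?P<errno>\d+)"
)


def summarize_writev_failures(log_text):
    """Count writev failures per (host, errno) pair in captured mpirun output."""
    counts = Counter()
    for match in LINE_RE.finditer(log_text):
        counts[(match["host"], int(match["errno"]))] += 1
    return counts


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on the local platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


# A few lines copied from the log above; a real run would read the saved output.
sample = """\
[node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104
[node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104
[node073:10234] mca_btl_tcp_frag_send: writev failed with errno=104
"""

for (host, code), n in sorted(summarize_writev_failures(sample).items()):
    name, message = describe_errno(code)
    print(f"{host}: {n} x errno={code} ({name}: {message})")
```

Errno numbers are platform-specific, so the decoding is only meaningful when run on the same kind of system that produced the log. A per-node tally like this can help show whether the resets cluster on a few hosts or are spread across the whole job.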

Following the FAQ entry linked below, I explicitly specified the
interface names of the two TCP networks, but the code still breaks.

mpirun --mca btl_tcp_if_include eth0,eth1 -np 128 -bynode -machinefile hostfile ./xhpl

http://icl.cs.utk.edu/open-mpi/faq/?category=tcp#tcp-selection

If I include only one TCP network, the code doesn't break, but the
performance is not desirable.


Does anyone know how to fix it?

--Yuan


Yuan Wan
---
Unix Section
Information Services Infrastructure Division
University of Edinburgh

tel: 0131 650 4985
email: y...@ed.ac.uk

2032 Computing Services, JCMB
The King's Buildings,
Edinburgh, EH9 3JZ

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems
