Hi all,

I'm benchmarking our new cluster with HPL. I chose Open MPI as the parallel environment because, in low-level benchmarks, it was able to use both Gigabit Ethernet TCP networks on our cluster (bandwidth of up to 250 MB/s).

HPL builds cleanly and runs well for small problem sizes.
However, when I run it on 32 nodes (128-way), it crashes partway through with the following error messages:

---------------------------------------------
[node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104
[node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104
[node073:10234] mca_btl_tcp_frag_send: writev failed with errno=104
[node073:10234] mca_btl_tcp_frag_send: writev failed with errno=104
[node089:29190] mca_btl_tcp_frag_send: writev failed with errno=104
[node090:27881] mca_btl_tcp_frag_send: writev failed with errno=104
[node072:02729] mca_btl_tcp_frag_send: writev failed with errno=104
[node071:03029] mca_btl_tcp_frag_send: writev failed with errno=104
.....
[node084:06044] mca_btl_tcp_frag_send: writev failed with errno=104
[node086:01346] mca_btl_tcp_frag_send: writev failed with errno=104
[node069:16372] mca_btl_tcp_frag_send: writev failed with errno=104
[node100:23294] mca_btl_tcp_frag_send: writev failed with errno=104
[node069:16372] mca_btl_tcp_frag_send: writev failed with errno=104
[node085:04347] mca_btl_tcp_frag_send: writev failed with errno=104
[node087:31391] mca_btl_tcp_frag_send: writev failed with errno=104
---------------------------------------------
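
For reference, errno numbers are platform-specific; on Linux, 104 is ECONNRESET ("Connection reset by peer"), i.e. the remote end of the TCP connection dropped it. A minimal sketch for decoding the number on the node that reported it, assuming Python is available there (the helper name is only for illustration):

```python
import errno
import os

def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value on the current platform."""
    # errno.errorcode maps numbers to symbolic names for this platform only.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"

# Decode the value from the log above; the result depends on the OS it runs on.
print(describe_errno(104))
```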

Following the FAQ entry linked below, I explicitly specified the interface names of the two TCP networks, but the run still fails.

mpirun --mca btl_tcp_if_include eth0,eth1 -np 128 -bynode -machinefile hostfile ./xhpl

http://icl.cs.utk.edu/open-mpi/faq/?category=tcp#tcp-selection

If I include only one TCP network, the run does not crash, but the performance is not desirable.
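
To compare the cases, the same run can be repeated with each interface on its own and then with both. The sketch below only prints the command lines and launches nothing; the flags, process count, hostfile, and binary are copied from the command above, and the helper name `mpirun_command` is hypothetical.

```python
import shlex

def mpirun_command(interfaces, nprocs=128, hostfile="hostfile", binary="./xhpl"):
    """Build the mpirun argument list from the post for a set of TCP interfaces."""
    return [
        "mpirun",
        "--mca", "btl_tcp_if_include", ",".join(interfaces),
        "-np", str(nprocs),
        "-bynode",
        "-machinefile", hostfile,
        binary,
    ]

# Print one command per case: eth0 alone, eth1 alone, then both together.
for ifaces in (["eth0"], ["eth1"], ["eth0", "eth1"]):
    print(shlex.join(mpirun_command(ifaces)))
```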


Does anyone know how to fix this?

--Yuan


Yuan Wan
--- Unix Section
Information Services Infrastructure Division
University of Edinburgh

tel: 0131 650 4985
email: y...@ed.ac.uk

2032 Computing Services, JCMB
The King's Buildings,
Edinburgh, EH9 3JZ
