Hi all,
I'm benchmarking our new cluster with HPL. I picked Open MPI as the parallel
environment because, in low-level benchmarks, it was able to make use of both
Gigabit Ethernet TCP networks on our cluster (bandwidth of up to 250 MB/s).
The HPL code builds cleanly and runs well for small problem sizes.
However, when I run it on 32 nodes (128-way), it crashes partway through
with the following error messages:
---------------------------------------------
[node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104
[node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104
[node073:10234] mca_btl_tcp_frag_send: writev failed with errno=104
[node073:10234] mca_btl_tcp_frag_send: writev failed with errno=104
[node089:29190] mca_btl_tcp_frag_send: writev failed with errno=104
[node090:27881] mca_btl_tcp_frag_send: writev failed with errno=104
[node072:02729] mca_btl_tcp_frag_send: writev failed with errno=104
[node071:03029] mca_btl_tcp_frag_send: writev failed with errno=104
.....
[node084:06044] mca_btl_tcp_frag_send: writev failed with errno=104
[node086:01346] mca_btl_tcp_frag_send: writev failed with errno=104
[node069:16372] mca_btl_tcp_frag_send: writev failed with errno=104
[node100:23294] mca_btl_tcp_frag_send: writev failed with errno=104
[node069:16372] mca_btl_tcp_frag_send: writev failed with errno=104
[node085:04347] mca_btl_tcp_frag_send: writev failed with errno=104
[node087:31391] mca_btl_tcp_frag_send: writev failed with errno=104
---------------------------------------------
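A minimal Python sketch for triaging output like the above: it tallies the failures per node and decodes the errno value. The regular expression is based only on the lines shown here, and the names `summarize` and `describe` are hypothetical helpers, not part of Open MPI. On Linux (x86), errno 104 is ECONNRESET ("Connection reset by peer"); errno numbering differs between platforms, so the decoding is only meaningful when run on the same OS as the cluster nodes.

```python
import errno
import os
import re
from collections import Counter

# Matches lines such as:
# [node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104
LINE_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"
    r"mca_btl_tcp_frag_send: writev failed with errno=(?P<code>\d+)"
)


def summarize(lines):
    """Count writev failures per (host, errno) pair; other lines are ignored."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            counts[(m.group("host"), int(m.group("code")))] += 1
    return counts


def describe(code):
    """Return 'NAME (message)' for an errno value on the local platform."""
    name = errno.errorcode.get(code, "unknown")
    return f"{name} ({os.strerror(code)})"


# A few lines copied from the output above, used as sample input.
SAMPLE = [
    "[node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104",
    "[node074:09973] mca_btl_tcp_frag_send: writev failed with errno=104",
    "[node073:10234] mca_btl_tcp_frag_send: writev failed with errno=104",
]

for (host, code), n in sorted(summarize(SAMPLE).items()):
    print(f"{host}: {n} x errno={code} {describe(code)}")
```

In practice the input would be the captured stderr of the mpirun job rather than the hard-coded sample.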
Following the FAQ entry linked below, I explicitly specified the interface
names of the two TCP networks, but the run still crashes:
mpirun --mca btl_tcp_if_include eth0,eth1 -np 128 -bynode -machinefile hostfile ./xhpl
http://icl.cs.utk.edu/open-mpi/faq/?category=tcp#tcp-selection
If I include only one TCP network, the code doesn't crash, but the
performance is not satisfactory.
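To keep the single-network and dual-network runs comparable, the command lines can be generated from one place. This is only a sketch: `build_mpirun` is a hypothetical helper, it uses only the options that appear in the command quoted in this post, and trying `eth0` and `eth1` separately is an assumption about which single-network runs are of interest.

```python
import shlex


def build_mpirun(interfaces, np=128, machinefile="hostfile", binary="./xhpl"):
    """Assemble the mpirun argument list from this post for a given interface set."""
    if not interfaces:
        raise ValueError("at least one interface name is required")
    return [
        "mpirun",
        "--mca", "btl_tcp_if_include", ",".join(interfaces),
        "-np", str(np),
        "-bynode",
        "-machinefile", machinefile,
        binary,
    ]


# Print one command per interface set; nothing is executed here.
for ifaces in (["eth0"], ["eth1"], ["eth0", "eth1"]):
    print(shlex.join(build_mpirun(ifaces)))
```

The argument list can be passed directly to `subprocess.run` on a node where `mpirun` is available.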
Does anyone know how to fix it?
--Yuan
Yuan Wan
---
Unix Section
Information Services Infrastructure Division
University of Edinburgh
tel: 0131 650 4985
email: y...@ed.ac.uk
2032 Computing Services, JCMB
The King's Buildings,
Edinburgh, EH9 3JZ