One of my users recently reported random hangs of his OpenMPI application. I've run some tests using multiple 2-node 16-core runs of the IMB benchmark and can occasionally replicate the problem. Looking through the mail archive, a previous occurrence of this error seems to have been down to suspect application code, but as it's IMB failing here, I suspect the problem lies elsewhere. The full set of errors generated by a failed run is:
[lancs2-015][[37376,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],6][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],8][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] [lancs2-015][[37376,1],14][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],14][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],6][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],12][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],12][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],10][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],8][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],6][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

I'm used to OpenMPI terminating cleanly, but that's not happening in this case: all the MPI processes on one node terminate, while the processes on the other simply spin at 100% CPU utilisation. I've run this 2-node test a number of times and I'm not seeing any pattern (i.e. I can't pin it down to a single node - a subsequent run using the two nodes involved above ran fine). Can anyone provide any pointers in tracking down this problem?

System details are as follows:

- OpenMPI 1.3.3, compiled with gcc 4.1.2 20080704 (Red Hat 4.1.2-44), configured with only the --prefix and --with-sge options.
- OS is Scientific Linux SL release 5.3.
- CPUs are 2.3GHz Opteron 2356.

Regards,
Mike.

-----
Dr Mike Pacey,                       Email: m.pa...@lancaster.ac.uk
High Performance Systems Support,    Phone: 01524 593543
Information Systems Services,        Fax:   01524 594459
Lancaster University,
Lancaster LA1 4YW
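P.S. In case it helps to rule out IMB itself, below is a rough sketch of a minimal cross-node ping-pong I intend to try next on the affected nodes. It's untested here, so treat it purely as a sketch; the peer pairing (rank r with rank r + size/2) assumes a fill-up/by-slot process layout so that each pair straddles the two nodes.

/* pingpong.c - minimal MPI ping-pong between rank pairs, meant as a
 * sanity check for point-to-point traffic independent of IMB.
 * Build with: mpicc pingpong.c -o pingpong */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_BYTES  (1 << 20)   /* 1 MiB payload per message */
#define ITERATIONS 1000        /* round trips per rank pair */

int main(int argc, char **argv)
{
    int rank, size, peer, i;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size % 2 != 0) {
        if (rank == 0)
            fprintf(stderr, "Please run with an even number of ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Pair rank r with rank r + size/2; with a fill-up (by-slot) layout
     * across two nodes this should force every pair over the network. */
    peer = (rank + size / 2) % size;

    buf = malloc(MSG_BYTES);
    memset(buf, rank, MSG_BYTES);

    for (i = 0; i < ITERATIONS; i++) {
        if (rank < size / 2) {
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        }
        if (rank == 0 && (i + 1) % 100 == 0)
            printf("completed %d of %d iterations\n", i + 1, ITERATIONS);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

I'd run it with something like "mpirun --mca btl tcp,self -np 16 ./pingpong" so that even on-node messages go through the TCP BTL rather than shared memory, which should exercise the same readv path that's failing above.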