One of my users recently reported random hangs of his OpenMPI application. I've run some tests using multiple 2-node 16-core runs of the IMB benchmark and can occasionally replicate the problem. Looking through the mail archive, a previous occurrence of this error seems to have been down to suspect application code, but as it's IMB failing here, I suspect the problem lies elsewhere. The full set of errors generated by a failed run is:
[lancs2-015][[37376,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],6][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],8][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] [lancs2-015][[37376,1],14][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],14][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],6][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],12][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],12][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],10][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],8][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],6][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

I'm used to OpenMPI terminating cleanly, but that's not happening in this case: all the MPI processes on one node terminate, while the processes on the other simply spin at 100% CPU utilisation. I've run this 2-node test a number of times and I'm not seeing any pattern (i.e. I can't pin it down to a single node - a subsequent run using the two nodes involved above ran fine). Can anyone provide any pointers in tracking down this problem?

System details are as follows:

- OpenMPI 1.3.3, compiled with gcc 4.1.2 20080704 (Red Hat 4.1.2-44), configured with only the --prefix and --with-sge options.
- OS is Scientific Linux SL release 5.3.
- CPUs are 2.3GHz Opteron 2356.

Regards,
Mike.

-----
Dr Mike Pacey,                       Email: m.pa...@lancaster.ac.uk
High Performance Systems Support,    Phone: 01524 593543
Information Systems Services,        Fax:   01524 594459
Lancaster University,
Lancaster LA1 4YW
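P.S. In case it helps to rule out IMB itself, below is a rough sketch of a minimal cross-node ping-pong I intend to try next on the affected nodes. It's untested here, so treat it purely as a sketch; the peer pairing (rank r with rank r + size/2) assumes a fill-up/by-slot process layout so that each pair straddles the two nodes.

/* pingpong.c - minimal MPI ping-pong between rank pairs, meant as a
 * sanity check for point-to-point traffic independent of IMB.
 * Build with: mpicc pingpong.c -o pingpong */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_BYTES  (1 << 20)   /* 1 MiB payload per message */
#define ITERATIONS 1000        /* round trips per rank pair */

int main(int argc, char **argv)
{
    int rank, size, peer, i;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size % 2 != 0) {
        if (rank == 0)
            fprintf(stderr, "Please run with an even number of ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Pair rank r with rank r + size/2; with a fill-up (by-slot) layout
     * across two nodes this should force every pair over the network. */
    peer = (rank + size / 2) % size;

    buf = malloc(MSG_BYTES);
    memset(buf, rank, MSG_BYTES);

    for (i = 0; i < ITERATIONS; i++) {
        if (rank < size / 2) {
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        }
        if (rank == 0 && (i + 1) % 100 == 0)
            printf("completed %d of %d iterations\n", i + 1, ITERATIONS);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

I'd run it with something like "mpirun --mca btl tcp,self -np 16 ./pingpong" so that even on-node messages go through the TCP BTL rather than shared memory, which should exercise the same readv path that's failing above.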