Hi, Yeah, I understand that would be handy, but it's a bit difficult; I'll see if I can put together a simple test case. The problem is (sorry, I forgot to mention this) that the segmentation fault only seems to happen after the code has been running for a couple of hours (on 10-20 8-core nodes). And for exactly the same code (no different random seed or anything), it sometimes gives a segmentation fault and sometimes not (after resubmission).
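In case it helps while I work on a real reproducer, a stress test along these lines is roughly what I have in mind (only a sketch; the ring exchange pattern, message size, and iteration count are my own assumptions, not taken from the user's application, which just calls MPI_Wait on non-blocking requests as the backtrace shows):

/* Hypothetical reproducer sketch: hammer non-blocking point-to-point
 * traffic and MPI_Wait over the TCP BTL for a long time.
 * Message size and iteration count are assumptions, not values from
 * the failing application. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int count = 1 << 20;   /* ~4 MB of ints per message (assumed) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc(count * sizeof(int));
    int *recvbuf = malloc(count * sizeof(int));

    for (i = 0; i < 1000000; i++) {   /* long enough to run for hours (assumed) */
        MPI_Request sreq, rreq;
        int dest = (rank + 1) % size;
        int src  = (rank + size - 1) % size;

        /* ring exchange: post receive and send, then wait, as in the backtrace */
        MPI_Irecv(recvbuf, count, MPI_INT, src,  0, MPI_COMM_WORLD, &rreq);
        MPI_Isend(sendbuf, count, MPI_INT, dest, 0, MPI_COMM_WORLD, &sreq);
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
        MPI_Wait(&sreq, MPI_STATUS_IGNORE);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}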
Thx,

Werner Van Geit

On 15 Apr 2010, at 19:41, Jeff Squyres (jsquyres) wrote:

> Can you send a small program that reproduces the problem, perchance?
>
> -jms
> Sent from my PDA. No type good.
>
> ----- Original Message -----
> From: users-boun...@open-mpi.org <users-boun...@open-mpi.org>
> To: us...@open-mpi.org <us...@open-mpi.org>
> Sent: Thu Apr 15 01:57:10 2010
> Subject: [OMPI users] Segmentation fault in mca_btl_tcp
>
> Hi,
>
> We are using openmpi 1.4.1 on our cluster computer (in conjunction with
> Torque). One of our users has a problem with his jobs generating a
> segmentation fault on one of the slaves; this is the backtrace:
>
> [cstone-00613:28461] *** Process received signal ***
> [cstone-00613:28461] Signal: Segmentation fault (11)
> [cstone-00613:28461] Signal code: (128)
> [cstone-00613:28461] Failing at address: (nil)
> [cstone-00613:28462] *** Process received signal ***
> [cstone-00613:28462] Signal: Segmentation fault (11)
> [cstone-00613:28462] Signal code: Address not mapped (1)
> [cstone-00613:28462] Failing at address: (nil)
> [cstone-00613:28461] [ 0] /lib64/libc.so.6 [0x2ba1933dce20]
> [cstone-00613:28461] [ 1] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so [0x2ba19530ec7a]
> [cstone-00613:28461] [ 2] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so [0x2ba19530d860]
> [cstone-00613:28461] [ 3] /opt/openmpi/lib/libopen-pal.so.0 [0x2ba1938eb16b]
> [cstone-00613:28461] [ 4] /opt/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2ba1938e072e]
> [cstone-00613:28461] [ 5] /opt/openmpi/lib/libmpi.so.0 [0x2ba193621b38]
> [cstone-00613:28461] [ 6] /opt/openmpi/lib/libmpi.so.0(PMPI_Wait+0x5b) [0x2ba19364c63b]
> [cstone-00613:28461] [ 7] /opt/openmpi/lib/libmpi_f77.so.0(mpi_wait_+0x3a) [0x2ba192e98b8a]
> [cstone-00613:28461] [ 8] ./roms [0x44976c]
> [cstone-00613:28461] [ 9] ./roms [0x449d96]
> [cstone-00613:28461] [10] ./roms [0x422708]
> [cstone-00613:28461] [11] ./roms [0x402908]
> [cstone-00613:28461] [12] ./roms [0x402467]
> [cstone-00613:28461] [13] ./roms [0x46d20e]
> [cstone-00613:28461] [14] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ba1933ca164]
> [cstone-00613:28461] [15] ./roms [0x401dd9]
> [cstone-00613:28461] *** End of error message ***
> [cstone-00613:28462] [ 0] /lib64/libc.so.6 [0x2b5d57db6e20]
> [cstone-00613:28462] *** End of error message ***
>
> The other slaves crash with:
> [cstone-00612][[21785,1],35][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>
> Since this problem seems to be happening in the network part of MPI, my guess is that there is either something wrong with the network or a bug in Open MPI. The same problem also appeared back when we were using openmpi 1.3.
>
> How could this problem be solved?
>
> (for more info about the system see attachments)
>
> Thx,
>
> Werner Van Geit
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users