> We sometimes see mysterious crashes like this one. At least some of them
> are caused by port scanners, i.e. unexpected non-MPI related packets
> coming in on the sockets will sometimes cause havoc.
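A minimal sketch of what such an unexpected, non-MPI connection could look like, and of how one might try it by hand: a small C program that connects to a running job's btl_tcp listening port, writes a few junk bytes (HTTP-like, as in the stray traffic mentioned below), and drops the connection. The host name and port are only placeholders, not values from this thread; the actual BTL port would have to be looked up (e.g. with netstat) on the compute node while the job is running.

/* probe.c - connect to a host/port and send a few junk bytes, roughly
 * what a port scanner or stray HTTP client would do to a btl_tcp
 * listening socket.  Host and port below are placeholders only. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/types.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    const char *host = (argc > 1) ? argv[1] : "compute-node";  /* placeholder */
    const char *port = (argc > 2) ? argv[2] : "46000";         /* placeholder */
    struct addrinfo hints, *res;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, port, &hints, &res) != 0) {
        fprintf(stderr, "cannot resolve %s:%s\n", host, port);
        return 1;
    }

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        perror("connect");
        return 1;
    }

    /* Send bytes that are not a valid Open MPI handshake, then hang up
     * abruptly, the way a scanner would. */
    const char junk[] = "GET / HTTP/1.0\r\n\r\n";
    if (write(fd, junk, sizeof(junk) - 1) < 0) {
        perror("write");
    }

    close(fd);
    freeaddrinfo(res);
    return 0;
}

Compile with gcc probe.c -o probe and run it against a node and port of a live job to see whether it provokes the same kind of crash.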
I don't really see port scanners etc. happening on our cluster, since the nodes are well shielded from the outside, but of course there might be some internal processes that are causing this. At least I can try it by hand, to see whether it generates the same kind of problem.

Werner Van Geit

> We've been getting http traffic in the jobs' stdout/err sometimes. That
> really confuses the users :-)
>
> And yes, we are going to block this, but we haven't had time...
>
> --
> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
> Internet: a...@hpc2n.umu.se  Phone: +46 90 7866134  Fax: +46 90 7866126
> Mobile: +46 70 7716134  WWW: http://www.hpc2n.umu.se
>
> On Thu, 2010-04-15 at 15:57 +0900, Werner Van Geit wrote:
>> Hi,
>>
>> We are using Open MPI 1.4.1 on our cluster (in conjunction with
>> Torque). One of our users has a problem with his jobs generating a
>> segmentation fault on one of the slaves; this is the backtrace:
>>
>> [cstone-00613:28461] *** Process received signal ***
>> [cstone-00613:28461] Signal: Segmentation fault (11)
>> [cstone-00613:28461] Signal code: (128)
>> [cstone-00613:28461] Failing at address: (nil)
>> [cstone-00613:28462] *** Process received signal ***
>> [cstone-00613:28462] Signal: Segmentation fault (11)
>> [cstone-00613:28462] Signal code: Address not mapped (1)
>> [cstone-00613:28462] Failing at address: (nil)
>> [cstone-00613:28461] [ 0] /lib64/libc.so.6 [0x2ba1933dce20]
>> [cstone-00613:28461] [ 1] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so [0x2ba19530ec7a]
>> [cstone-00613:28461] [ 2] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so [0x2ba19530d860]
>> [cstone-00613:28461] [ 3] /opt/openmpi/lib/libopen-pal.so.0 [0x2ba1938eb16b]
>> [cstone-00613:28461] [ 4] /opt/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2ba1938e072e]
>> [cstone-00613:28461] [ 5] /opt/openmpi/lib/libmpi.so.0 [0x2ba193621b38]
>> [cstone-00613:28461] [ 6] /opt/openmpi/lib/libmpi.so.0(PMPI_Wait+0x5b) [0x2ba19364c63b]
>> [cstone-00613:28461] [ 7] /opt/openmpi/lib/libmpi_f77.so.0(mpi_wait_+0x3a) [0x2ba192e98b8a]
>> [cstone-00613:28461] [ 8] ./roms [0x44976c]
>> [cstone-00613:28461] [ 9] ./roms [0x449d96]
>> [cstone-00613:28461] [10] ./roms [0x422708]
>> [cstone-00613:28461] [11] ./roms [0x402908]
>> [cstone-00613:28461] [12] ./roms [0x402467]
>> [cstone-00613:28461] [13] ./roms [0x46d20e]
>> [cstone-00613:28461] [14] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ba1933ca164]
>> [cstone-00613:28461] [15] ./roms [0x401dd9]
>> [cstone-00613:28461] *** End of error message ***
>> [cstone-00613:28462] [ 0] /lib64/libc.so.6 [0x2b5d57db6e20]
>> [cstone-00613:28462] *** End of error message ***
>>
>> The other slaves crash with:
>> [cstone-00612][[21785,1],35][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>>
>> Since this problem seems to be happening in the network part of MPI, my guess
>> is that there is either something wrong with the network or a bug in Open MPI.
>> The same problem also appeared back when we were using Open MPI 1.3.
>>
>> How could this problem be solved?
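As for actually blocking this kind of traffic: one common approach is to pin the TCP BTL to the cluster-internal interface and to a fixed port range, and then let a firewall drop anything else that tries to reach those ports. A sketch, assuming the internal interface is called eth0 and that the btl_tcp_if_include / btl_tcp_port_min_v4 / btl_tcp_port_range_v4 MCA parameters are available in this Open MPI build (ompi_info --param btl tcp will show what the installed version actually supports; all values below are examples only):

# Per-user MCA parameter file, e.g. $HOME/.openmpi/mca-params.conf
# (the same values can be passed with --mca on the mpirun command line)

# Only let the TCP BTL use the cluster-internal interface
btl_tcp_if_include = eth0

# Restrict the TCP BTL to a known port range so that a firewall rule
# can drop connections to it from non-cluster hosts
btl_tcp_port_min_v4 = 46000
btl_tcp_port_range_v4 = 100

With something like that in place, a firewall rule on the compute nodes (or at the cluster border) can reject connections to that range from anything that is not a compute node, which should keep scanners and stray http clients away from the BTL sockets.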