> 
> We sometimes see mysterious crashes like this one. At least some of them
> are caused by port scanners, i.e. unexpected non-mpi related packets
> coming in on the sockets will sometimes cause havoc.
> 

I don't really see port scanners etc. happening on our cluster, since the nodes 
are well shielded from the outside, but of course there might be some internal 
processes that are causing this. At least I can try it by hand, to see whether it 
generates the same kind of problem.
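
To make concrete what I mean by trying it by hand: roughly the sketch below, 
which just opens a TCP connection to one of the nodes while an MPI job is 
running and writes some junk bytes to it. This is only a rough test, and the 
node name and port number are placeholders; I would first have to look up the 
port the MPI process is actually listening on (e.g. with netstat on the slave).

  import socket

  # Rough hand test, not a definitive reproducer: node and port are assumed
  # values, the real BTL TCP port has to be looked up on the slave first.
  node, port = "cstone-00613", 40000
  s = socket.create_connection((node, port), timeout=5)
  s.sendall(b"GET / HTTP/1.0\r\n\r\n")  # junk / HTTP-like bytes, as a stray scanner might send
  try:
      print(s.recv(1024))               # see if anything comes back before the job dies
  finally:
      s.close()

If a single stray connection like that is enough to make the tcp btl segfault, 
that would at least show that the crash can be triggered from outside the 
application.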

Werner Van Geit

> We've been getting http traffic in the jobs stdout/err sometimes. That
> really makes the users confused :-)
> 
> And yes, we are going to block this but we haven't had time...
> 
> -- 
> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
> Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
> 


> On Thu, 2010-04-15 at 15:57 +0900, Werner Van Geit wrote:
>> Hi,
>> 
>> We are using openmpi 1.4.1 on our cluster (in conjunction with Torque). 
>> One of our users has a problem with his jobs generating a segmentation 
>> fault on one of the slaves; this is the backtrace:
>> 
>> [cstone-00613:28461] *** Process received signal ***
>> [cstone-00613:28461] Signal: Segmentation fault (11)
>> [cstone-00613:28461] Signal code:  (128)
>> [cstone-00613:28461] Failing at address: (nil)
>> [cstone-00613:28462] *** Process received signal ***
>> [cstone-00613:28462] Signal: Segmentation fault (11)
>> [cstone-00613:28462] Signal code: Address not mapped (1)
>> [cstone-00613:28462] Failing at address: (nil)
>> [cstone-00613:28461] [ 0] /lib64/libc.so.6 [0x2ba1933dce20]
>> [cstone-00613:28461] [ 1] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so 
>> [0x2ba19530ec7a]
>> [cstone-00613:28461] [ 2] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so 
>> [0x2ba19530d860]
>> [cstone-00613:28461] [ 3] /opt/openmpi/lib/libopen-pal.so.0 [0x2ba1938eb16b]
>> [cstone-00613:28461] [ 4] 
>> /opt/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2ba1938e072e]
>> [cstone-00613:28461] [ 5] /opt/openmpi/lib/libmpi.so.0 [0x2ba193621b38]
>> [cstone-00613:28461] [ 6] /opt/openmpi/lib/libmpi.so.0(PMPI_Wait+0x5b) 
>> [0x2ba19364c63b]
>> [cstone-00613:28461] [ 7] /opt/openmpi/lib/libmpi_f77.so.0(mpi_wait_+0x3a) 
>> [0x2ba192e98b8a]
>> [cstone-00613:28461] [ 8] ./roms [0x44976c]
>> [cstone-00613:28461] [ 9] ./roms [0x449d96]
>> [cstone-00613:28461] [10] ./roms [0x422708]
>> [cstone-00613:28461] [11] ./roms [0x402908]
>> [cstone-00613:28461] [12] ./roms [0x402467]
>> [cstone-00613:28461] [13] ./roms [0x46d20e]
>> [cstone-00613:28461] [14] /lib64/libc.so.6(__libc_start_main+0xf4) 
>> [0x2ba1933ca164]
>> [cstone-00613:28461] [15] ./roms [0x401dd9]
>> [cstone-00613:28461] *** End of error message ***
>> [cstone-00613:28462] [ 0] /lib64/libc.so.6 [0x2b5d57db6e20]
>> [cstone-00613:28462] *** End of error message ***
>> 
>> The other slaves crash with:
>> [cstone-00612][[21785,1],35][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
>> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> 
>> Since this problem seems to be happening in the network part of MPI, my guess 
>> is that there is either something wrong with the network or a bug in OpenMPI. 
>> The same problem also appeared back when we were using openmpi 1.3.
>> 
>> How could this problem be solved?

