Hi all,

We have a code built with OpenMPI (v1.4.3) and the Intel v12.0 compiler that 
has been tested successfully on tens to hundreds of cores on our cluster. We 
recently ran the same code on 1020 cores and received the following runtime 
error:

> [d6cneh042:28543] *** Process received signal ***
> [d6cneh061:29839] Signal: Segmentation fault (11)
> [d6cneh061:29839] Signal code: Address not mapped (1)
> [d6cneh061:29839] Failing at address: 0x10
> [d6cneh030:26800] Signal: Segmentation fault (11)
> [d6cneh030:26800] Signal code: Address not mapped (1)
> [d6cneh030:26800] Failing at address: 0x21
> [d6cneh042:28543] Signal: Segmentation fault (11)
> [d6cneh042:28543] Signal code: Address not mapped (1)
> [d6cneh042:28543] Failing at address: 0x10
> [d6cneh021:27646] [ 0] /lib64/libpthread.so.0 [0x39aee0eb10]
> [d6cneh021:27646] [ 1] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c8bca8]
> [d6cneh021:27646] [ 2] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c8a1ef]
> [d6cneh021:27646] [ 3] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c16246]
> [d6cneh021:27646] [ 4] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libopen-pal.so.0(opal_progress+0x86) [0x2af8b22a6a26]
> [d6cneh021:27646] [ 5] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c879e7]
> [d6cneh021:27646] [ 6] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c1f701]
> [d6cneh021:27646] [ 7] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0 [0x2af8b1c1aec9]
> [d6cneh021:27646] [ 8] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi.so.0(MPI_Allreduce+0x73) [0x2af8b1be6203]
> [d6cneh021:27646] [ 9] /opt/crc/openmpi/1.4.3/intel-12.0/lib/libmpi_f77.so.0(MPI_ALLREDUCE+0xc5) [0x2af8b1977715]
> [d6cneh021:27646] [10] openmd_MPI [0x5e0b94]
> [d6cneh021:27646] [11] openmd_MPI [0x599877]
> [d6cneh021:27646] [12] openmd_MPI [0x5746e8]
> [d6cneh021:27646] [13] openmd_MPI [0x4f18b8]

Can anyone offer some insight into what might be going wrong? From the 
backtrace, the crash appears to occur inside MPI_Allreduce (reached through 
the Fortran MPI_ALLREDUCE wrapper). I should note, as it may be relevant, that 
this job was run across a heterogeneous cluster of Intel Nehalem servers with 
a mixture of InfiniBand and Ethernet interconnects. Our OpenMPI installation 
was built with no IB libraries (so I am assuming everything defaults to the 
TCP transport?).
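
In case it helps narrow things down, I can also try a bare MPI_Allreduce at 
the same scale to see whether the collective alone trips up over TCP. A 
minimal sketch of the throwaway test I have in mind (not our production code) 
is:

    /* allreduce_test.c -- minimal MPI_Allreduce check (throwaway sketch) */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double local = 1.0, global = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Sum one double per rank; the result should equal the rank count. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("ranks = %d, allreduce sum = %f\n", size, global);

        MPI_Finalize();
        return 0;
    }

I would launch it with the TCP BTL selected explicitly, e.g. 
"mpirun --mca btl tcp,self -np 1020 ./allreduce_test", and perhaps add 
"--mca btl_base_verbose 30" to confirm which transport is actually being 
used (assuming those are still the right knobs for 1.4.3).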

Thanks in advance for any suggestions that might help us track this down.

Regards.

Tim.

Tim Stitt PhD (User Support Manager).
Center for Research Computing | University of Notre Dame |
P.O. Box 539, Notre Dame, IN 46556 | Phone: 574-631-5287 | Email: tst...@nd.edu
