I'm having a strange problem with OpenMPI 1.8.6. If I run my
OpenMPI test code (compiled against the OpenMPI 1.8.6 libraries)
on 131 slots or fewer I get no issues; anything above 131 errors
out:

mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/ \
  --hostfile hostfile-single --mca btl_tcp_if_include eth0 \
  --hetero-nodes --use-hwthread-cpus /hpc/home/lanew/mpi/openmpi/ProcessColors3

In the hostfile, slots is restricted to the number of physical
cores, while max-slots includes the hyperthreads (e.g.
csclprd3-0-0 slots=6 max-slots=12).
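
For reference, the hostfile entries all follow the same pattern
(only csclprd3-0-0's counts appear above; the slot counts on the
other nodes vary with their core counts):

csclprd3-0-0 slots=6 max-slots=12
# ...one line per node, down to csclprd3-0-13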

The nodes are a mix of IBM x3550 servers; some are Sandy Bridge
and others are older Xeons.

I would like to add that the submit node from which I am
launching mpirun has its open-files soft limit (ulimit -Sn) set
to 1024, while its hard limit (ulimit -Hn) is set to 4096. I know
open-file limits were an issue with an older version of OpenMPI.
The compute nodes all have both their soft and hard open-files
limits set to 4096.
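
A quick way to double-check the limits on every node (a sketch,
assuming bash is installed on all the compute nodes; --pernode
launches one process per host):

mpirun --hostfile hostfile-single --pernode \
  bash -c 'echo "$(hostname): soft=$(ulimit -Sn) hard=$(ulimit -Hn)"'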

Here's the output (csclprd3-0-13 is the last node listed in
hostfile-single):

[csclprd3-0-13:28765] Signal: Bus error (7)
[csclprd3-0-13:28765] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28765] Failing at address: 0x7f30002a8980
[csclprd3-0-13:28766] *** Process received signal ***
[csclprd3-0-13:28766] Signal: Bus error (7)
[csclprd3-0-13:28766] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28766] Failing at address: 0x7fe137662880
[csclprd3-0-13:28768] *** Process received signal ***
[csclprd3-0-13:28768] Signal: Bus error (7)
[csclprd3-0-13:28768] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28768] Failing at address: 0x7f9b40228a80
[csclprd3-0-13:28770] *** Process received signal ***
[csclprd3-0-13:28770] Signal: Bus error (7)
[csclprd3-0-13:28770] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28770] Failing at address: 0x7f0de7f2bb00
[csclprd3-0-13:28767] *** Process received signal ***
[csclprd3-0-13:28767] Signal: Bus error (7)
[csclprd3-0-13:28767] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28767] Failing at address: 0x7f9b6c2e8980
[csclprd3-0-13:28764] *** Process received signal ***
[csclprd3-0-13:28764] Signal: Bus error (7)
[csclprd3-0-13:28764] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28768] [ 3] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f9b513ad110]
[csclprd3-0-13:28768] [ 4] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7f0df77b6009]
[csclprd3-0-13:28770] [ 3] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f0df77b6110]
[csclprd3-0-13:28770] [ 4] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b5141d68e]
[csclprd3-0-13:28768] [ 5] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b514f1715]
[csclprd3-0-13:28768] [ 6] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f30115ea68e]
[csclprd3-0-13:28765] [ 5] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f30116be715]
[csclprd3-0-13:28765] [ 6] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b7bb3b68e]
[csclprd3-0-13:28767] [ 5] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b7bc0f715]
[csclprd3-0-13:28767] [ 6] [csclprd3-0-13:28764] [ 4] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fa946bb768e]
[csclprd3-0-13:28764] [ 5] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fe146d4068e]
[csclprd3-0-13:28766] [ 5] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f0df782668e]
[csclprd3-0-13:28770] [ 5] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f0df78fa715]
[csclprd3-0-13:28770] [ 6] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f0df77d0ad6]
[csclprd3-0-13:28770] [ 7] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fe146e14715]
[csclprd3-0-13:28766] [ 6] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fe146ceaad6]
[csclprd3-0-13:28766] [ 7] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b513c7ad6]
[csclprd3-0-13:28768] [ 7] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b513e6c60]
[csclprd3-0-13:28768] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28768] [ 9] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b50dc7cdd]
[csclprd3-0-13:28768] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28768] *** End of error message ***
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f3011594ad6]
[csclprd3-0-13:28765] [ 7] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f30115b3c60]
[csclprd3-0-13:28765] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28765] [ 9] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f3010f94cdd]
[csclprd3-0-13:28765] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28765] *** End of error message ***
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b7bae5ad6]
[csclprd3-0-13:28767] [ 7] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b7bb04c60]
[csclprd3-0-13:28767] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28767] [ 9] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b7b4e5cdd]
[csclprd3-0-13:28767] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28767] *** End of error message ***
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fa946c8b715]
[csclprd3-0-13:28764] [ 6] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fa946b61ad6]
[csclprd3-0-13:28764] [ 7] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f0df77efc60]
[csclprd3-0-13:28770] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28770] [ 9] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0df71d0cdd]
[csclprd3-0-13:28770] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28770] *** End of error message ***
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fe146d09c60]
[csclprd3-0-13:28766] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28766] [ 9] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7fe1466eacdd]
[csclprd3-0-13:28766] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28766] *** End of error message ***
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fa946b80c60]
[csclprd3-0-13:28764] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28764] [ 9] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa946561cdd]
[csclprd3-0-13:28764] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28764] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 126 with PID 0 on node csclprd3-0-13 exited on 
signal 7 (Bus error).

Could a lack of the necessary NUMA libraries, or the wrong
version of them, be contributing to this?
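
One way I could check whether NUMA support is even in play (a
sketch; the grep may come up empty when hwloc's NUMA support is
compiled in statically, and numactl may not be installed on
every node):

ldd /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1 | grep -i numa
numactl --hardware   # prints the node's NUMA topology if installed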