I'm having a strange problem with OpenMPI 1.8.6. If I run my OpenMPI test code (compiled against the OpenMPI 1.8.6 libraries) on fewer than 131 slots, I get no issues. Anything over 131 errors out:

mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/ --hostfile hostfile-single --mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus /hpc/home/lanew/mpi/openmpi/ProcessColors3

In the hostfile, slots is set to the number of cores on each node, while max-slots includes the hyperthreaded cores (e.g. csclprd3-0-0 slots=6 max-slots=12). The nodes are a mix of IBM x3550 systems: some are Sandy Bridge and others are older Xeons.

I would like to add that the submit node from which I am launching mpirun has the open files soft limit (ulimit -a) set to 1024, while the hard limit (ulimit -Ha) is set to 4096. I know open file limits were an issue with an older version of OpenMPI. The compute nodes all have both their hard and soft open files limits set to 4096.

Here's the output, with lines from several processes interleaved (csclprd3-0-13 is the last node listed in the hostfile hostfile-single):

[csclprd3-0-13:28765] Signal: Bus error (7)
[csclprd3-0-13:28765] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28765] Failing at address: 0x7f30002a8980
[csclprd3-0-13:28766] *** Process received signal ***
[csclprd3-0-13:28766] Signal: Bus error (7)
[csclprd3-0-13:28766] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28766] Failing at address: 0x7fe137662880
[csclprd3-0-13:28768] *** Process received signal ***
[csclprd3-0-13:28768] Signal: Bus error (7)
[csclprd3-0-13:28768] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28768] Failing at address: 0x7f9b40228a80
[csclprd3-0-13:28770] *** Process received signal ***
[csclprd3-0-13:28770] Signal: Bus error (7)
[csclprd3-0-13:28770] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28770] Failing at address: 0x7f0de7f2bb00
[csclprd3-0-13:28767] *** Process received signal ***
[csclprd3-0-13:28767] Signal: Bus error (7)
[csclprd3-0-13:28767] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28767] Failing at address: 0x7f9b6c2e8980
[csclprd3-0-13:28764] *** Process received signal ***
[csclprd3-0-13:28764] Signal: Bus error (7)
[csclprd3-0-13:28764] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28768] [ 3] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f9b513ad110]
[csclprd3-0-13:28768] [ 4] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7f0df77b6009]
[csclprd3-0-13:28770] [ 3] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f0df77b6110]
[csclprd3-0-13:28770] [ 4] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b5141d68e]
[csclprd3-0-13:28768] [ 5] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b514f1715]
[csclprd3-0-13:28768] [ 6] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f30115ea68e]
[csclprd3-0-13:28765] [ 5] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f30116be715]
[csclprd3-0-13:28765] [ 6] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b7bb3b68e]
[csclprd3-0-13:28767] [ 5] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b7bc0f715]
[csclprd3-0-13:28767] [ 6]
[csclprd3-0-13:28764] [ 4] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fa946bb768e]
[csclprd3-0-13:28764] [ 5] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fe146d4068e]
[csclprd3-0-13:28766] [ 5] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f0df782668e]
[csclprd3-0-13:28770] [ 5] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f0df78fa715]
[csclprd3-0-13:28770] [ 6] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f0df77d0ad6]
[csclprd3-0-13:28770] [ 7] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fe146e14715]
[csclprd3-0-13:28766] [ 6] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fe146ceaad6]
[csclprd3-0-13:28766] [ 7] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b513c7ad6]
[csclprd3-0-13:28768] [ 7] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b513e6c60]
[csclprd3-0-13:28768] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28768] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b50dc7cdd]
[csclprd3-0-13:28768] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28768] *** End of error message ***
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f3011594ad6]
[csclprd3-0-13:28765] [ 7] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f30115b3c60]
[csclprd3-0-13:28765] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28765] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f3010f94cdd]
[csclprd3-0-13:28765] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28765] *** End of error message ***
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b7bae5ad6]
[csclprd3-0-13:28767] [ 7] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b7bb04c60]
[csclprd3-0-13:28767] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28767] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b7b4e5cdd]
[csclprd3-0-13:28767] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28767] *** End of error message ***
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fa946c8b715]
[csclprd3-0-13:28764] [ 6] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fa946b61ad6]
[csclprd3-0-13:28764] [ 7] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f0df77efc60]
[csclprd3-0-13:28770] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28770] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0df71d0cdd]
[csclprd3-0-13:28770] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28770] *** End of error message ***
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fe146d09c60]
[csclprd3-0-13:28766] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28766] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fe1466eacdd]
[csclprd3-0-13:28766] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28766] *** End of error message ***
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fa946b80c60]
[csclprd3-0-13:28764] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28764] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa946561cdd]
[csclprd3-0-13:28764] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28764] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 126 with PID 0 on node csclprd3-0-13 exited on signal 7 (Bus error).

Could missing NUMA libraries, or the wrong version of the NUMA libraries, be contributing to this?
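
For anyone trying to reproduce this, here is a rough way to total the slots and max-slots entries in a hostfile such as hostfile-single, so the sums can be compared with the -np value passed to mpirun. It is only a sketch: the function name count_slots is made up for illustration, and it assumes one host per line followed by key=value fields in the form shown in the example entry (csclprd3-0-0 slots=6 max-slots=12).

```shell
#!/bin/sh
# Sketch: sum the slots= and max-slots= values in a hostfile.
# count_slots is a hypothetical helper name, not part of OpenMPI.
count_slots() {
    awk '
        # Skip comment lines and blank lines.
        /^[[:space:]]*#/ || NF == 0 { next }
        {
            # Field 1 is the host name; the rest are key=value pairs.
            for (i = 2; i <= NF; i++) {
                n = split($i, kv, "=")
                if (n != 2) continue
                if (kv[1] == "slots") slots += kv[2]
                else if (kv[1] == "max-slots") max += kv[2]
            }
            hosts++
        }
        END { printf "hosts=%d slots=%d max-slots=%d\n", hosts, slots, max }
    ' "$1"
}

# When run as a script with a hostfile argument, print the totals.
if [ "$#" -gt 0 ]; then
    count_slots "$1"
fi
```

Saved as, say, count_slots.sh, it would be run as: sh count_slots.sh hostfile-single. If the slots total is below the requested process count, the run has to go beyond the listed slots toward the max-slots values.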