FWIW: I don’t think this actually has anything to do with the number of procs you are 
trying to run. Instead, I expect it has to do with confusion over how many 
cores it can bind across. When you tell it --use-hwthread-cpus, you are asking 
us to map processes to hwthreads rather than cores. I don’t know which nodes are 
which, but it could be that we are getting incorrect topology info somewhere.

Given that you are limiting the number of procs to the number of cores, is 
there some reason why you are asking us to --use-hwthread-cpus? Why not just 
leave it at the default core level?

I also suspect that you would have no problems if you ran with --bind-to none. 
Does that in fact work?
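
For example, something along these lines (same prefix, hostfile, and binary as 
your original command, just dropping --use-hwthread-cpus and adding 
--bind-to none; an untested sketch, adjust as needed):

mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/ \
    --hostfile hostfile-single --mca btl_tcp_if_include eth0 --hetero-nodes \
    --bind-to none /hpc/home/lanew/mpi/openmpi/ProcessColors3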


> On Jun 18, 2015, at 4:54 PM, Lane, William <william.l...@cshs.org> wrote:
> 
> I'm having a strange problem w/OpenMPI 1.8.6. If I run
> my OpenMPI test code (compiled against OpenMPI 1.8.6
> libraries) on < 131 slots I get no issues. Anything over 131
> errors out:
> 
> mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/ 
> --hostfile hostfile-single --mca btl_tcp_if_include eth0 --hetero-nodes 
> --use-hwthread-cpus /hpc/home/lanew/mpi/openmpi/ProcessColors3
> 
> The hostfile has the number of slots restricted
> to the number of cores, while the max-slots includes
> the hyperthreading cores (e.g. csclprd3-0-0 slots=6 
> max-slots=12).
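> 
> For reference, a minimal hostfile in that format might look like the sketch
> below; the csclprd3-0-0 entry is the one I described above, while the other
> node names and slot counts are just placeholders (they vary by node type):
> 
> csclprd3-0-0 slots=6 max-slots=12
> csclprd3-0-1 slots=6 max-slots=12
> csclprd3-0-13 slots=6 max-slots=12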
> 
> The nodes are a mix of IBM x3550 nodes; some
> are Sandy Bridge and others are older Xeons.
> 
> I would like to add that the submit node from
> which I am launching mpirun has its soft open-files
> limit (ulimit -a) set to 1024, while the hard limit
> (ulimit -Ha) is set to 4096. I know open-file limits
> were an issue with an older version of OpenMPI. The
> compute nodes all have their hard and soft open-files
> limits set to 4096.
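> 
> (A quick way to double-check those limits on any of the nodes, in case it
> helps; these are standard bash builtins, nothing Open MPI specific:
> 
> ulimit -Sn   # soft open-files limit
> ulimit -Hn   # hard open-files limit
> )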
> 
> Here's the output (csclprd3-0-13 is the last node
> listed in the hostfile hostfile-single):
> 
> [csclprd3-0-13:28765] Signal: Bus error (7)
> [csclprd3-0-13:28765] Signal code: Non-existant physical address (2)
> [csclprd3-0-13:28765] Failing at address: 0x7f30002a8980
> [csclprd3-0-13:28766] *** Process received signal ***
> [csclprd3-0-13:28766] Signal: Bus error (7)
> [csclprd3-0-13:28766] Signal code: Non-existant physical address (2)
> [csclprd3-0-13:28766] Failing at address: 0x7fe137662880
> [csclprd3-0-13:28768] *** Process received signal ***
> [csclprd3-0-13:28768] Signal: Bus error (7)
> [csclprd3-0-13:28768] Signal code: Non-existant physical address (2)
> [csclprd3-0-13:28768] Failing at address: 0x7f9b40228a80
> [csclprd3-0-13:28770] *** Process received signal ***
> [csclprd3-0-13:28770] Signal: Bus error (7)
> [csclprd3-0-13:28770] Signal code: Non-existant physical address (2)
> [csclprd3-0-13:28770] Failing at address: 0x7f0de7f2bb00
> [csclprd3-0-13:28767] *** Process received signal ***
> [csclprd3-0-13:28767] Signal: Bus error (7)
> [csclprd3-0-13:28767] Signal code: Non-existant physical address (2)
> [csclprd3-0-13:28767] Failing at address: 0x7f9b6c2e8980
> [csclprd3-0-13:28764] *** Process received signal ***
> [csclprd3-0-13:28764] Signal: Bus error (7)
> [csclprd3-0-13:28764] Signal code: Non-existant physical address (2)
> [csclprd3-0-13:28768] [ 3] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f9b513ad110]
> [csclprd3-0-13:28768] [ 4] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7f0df77b6009]
> [csclprd3-0-13:28770] [ 3] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f0df77b6110]
> [csclprd3-0-13:28770] [ 4] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b5141d68e]
> [csclprd3-0-13:28768] [ 5] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b514f1715]
> [csclprd3-0-13:28768] [ 6] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f30115ea68e]
> [csclprd3-0-13:28765] [ 5] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f30116be715]
> [csclprd3-0-13:28765] [ 6] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b7bb3b68e]
> [csclprd3-0-13:28767] [ 5] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b7bc0f715]
> [csclprd3-0-13:28767] [ 6] [csclprd3-0-13:28764] [ 4] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fa946bb768e]
> [csclprd3-0-13:28764] [ 5] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fe146d4068e]
> [csclprd3-0-13:28766] [ 5] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f0df782668e]
> [csclprd3-0-13:28770] [ 5] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f0df78fa715]
> [csclprd3-0-13:28770] [ 6] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f0df77d0ad6]
> [csclprd3-0-13:28770] [ 7] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fe146e14715]
> [csclprd3-0-13:28766] [ 6] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fe146ceaad6]
> [csclprd3-0-13:28766] [ 7] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b513c7ad6]
> [csclprd3-0-13:28768] [ 7] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b513e6c60]
> [csclprd3-0-13:28768] [ 8] 
> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
> [csclprd3-0-13:28768] [ 9] 
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b50dc7cdd]
> [csclprd3-0-13:28768] [10] 
> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
> [csclprd3-0-13:28768] *** End of error message ***
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f3011594ad6]
> [csclprd3-0-13:28765] [ 7] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f30115b3c60]
> [csclprd3-0-13:28765] [ 8] 
> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
> [csclprd3-0-13:28765] [ 9] 
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f3010f94cdd]
> [csclprd3-0-13:28765] [10] 
> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
> [csclprd3-0-13:28765] *** End of error message ***
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b7bae5ad6]
> [csclprd3-0-13:28767] [ 7] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b7bb04c60]
> [csclprd3-0-13:28767] [ 8] 
> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
> [csclprd3-0-13:28767] [ 9] 
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b7b4e5cdd]
> [csclprd3-0-13:28767] [10] 
> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
> [csclprd3-0-13:28767] *** End of error message ***
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fa946c8b715]
> [csclprd3-0-13:28764] [ 6] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fa946b61ad6]
> [csclprd3-0-13:28764] [ 7] 
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f0df77efc60]
> [csclprd3-0-13:28770] [ 8] 
> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
> [csclprd3-0-13:28770] [ 9] 
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0df71d0cdd]
> [csclprd3-0-13:28770] [10] 
> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
> [csclprd3-0-13:28770] *** End of error message ***
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fe146d09c60]
> [csclprd3-0-13:28766] [ 8] 
> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
> [csclprd3-0-13:28766] [ 9] 
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fe1466eacdd]
> [csclprd3-0-13:28766] [10] 
> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
> [csclprd3-0-13:28766] *** End of error message ***
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fa946b80c60]
> [csclprd3-0-13:28764] [ 8] 
> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
> [csclprd3-0-13:28764] [ 9] 
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa946561cdd]
> [csclprd3-0-13:28764] [10] 
> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
> [csclprd3-0-13:28764] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 126 with PID 0 on node csclprd3-0-13 exited 
> on signal 7 (Bus error).
> 
> Could a lack of the necessary NUMA libraries or the wrong version of NUMA
> libraries be contributing to this?
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/06/27159.php
