Good point

William: can you rebuild OMPI with —enable-debug and run this again so we can 
see where the code is breaking?

Thanks
Ralph


> On Jun 19, 2015, at 6:11 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
> Ralph,
> 
> I got that, but I cannot read the stack trace (optimized build)
> my best bet is to reproduce the issue, and then find how and why 
> ompi_free_list_t is segfault'ing.
> that's why I requested info about the environment
> 
> iirc, ompi_free_list_t are different between master and v1.8, so an incorrect 
> back port could be the root cause.
> 
> Cheers,
> 
> Gilles
> 
> On Friday, June 19, 2015, Ralph Castain <r...@open-mpi.org 
> <mailto:r...@open-mpi.org>> wrote:
> Gilles
> 
> I was fooled too, but that isn’t the issue. The problem is that 
> ompi_free_list is segfaulting:
> 
>> [csclprd3-0-13:30901] *** Process received signal ***
>> [csclprd3-0-13:30901] Signal: Bus error (7)
>> [csclprd3-0-13:30901] Signal code: Non-existant physical address (2)
>> [csclprd3-0-13:30901] Failing at address: 0x7ff404351d80
>> [csclprd3-0-13:30901] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ff41453c500]
>> [csclprd3-0-13:30901] [ 1] 
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xd4fea)[0x7ff41481efea]
>> [csclprd3-0-13:30901] [ 2] 
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7ff41479f009]
>> [csclprd3-0-13:30901] [ 3] 
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7ff41479f110]
>> [csclprd3-0-13:30901] [ 4] 
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7ff41480f68e]
>> [csclprd3-0-13:30901] [ 5] 
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7ff4148e3715]
>> [csclprd3-0-13:30901] [ 6] 
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7ff4147b9ad6]
>> [csclprd3-0-13:30901] [ 7] 
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7ff4147d8c60]
>> [csclprd3-0-13:30901] [ 8] 
>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
>> [csclprd3-0-13:30901] [ 9] 
>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff4141b9cdd]
>> [csclprd3-0-13:30901] [10] 
>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
>> [csclprd3-0-13:30901] *** End of error message ***
> 
> 
> 
>> On Jun 19, 2015, at 5:52 AM, Gilles Gouaillardet 
>> <gilles.gouaillar...@gmail.com 
>> <javascript:_e(%7B%7D,'cvml','gilles.gouaillar...@gmail.com');>> wrote:
>> 
>> Lane,
>> 
>> could you please describe your configuration ?
>> how many sockets per node ?
>> how many cores per socket ?
>> how many threads per core ?
>> what is the minimum number of nodes needed to reproduce the issue ?
>> do all the nodes have the same configuration ?
>> if yes, what happens without --hetero-nodes ?
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Friday, June 19, 2015, Lane, William <william.l...@cshs.org 
>> <javascript:_e(%7B%7D,'cvml','william.l...@cshs.org');>> wrote:
>> Ralph,
>> 
>> I created a hostfile that just has the names of the hosts while
>> specifying no slot information whatsoever (e.g. csclprd3-0-0)
>> and received the following errors:
>> 
>> mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/ 
>> --hostfile hostfile-noslots --mca btl_tcp_if_include eth0 --hetero-nodes 
>> /hpc/home/lanew/mpi/openmpi/ProcessColors3
>> 
>> [csclprd3-6-5:14770] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 
>> 0[core 1[hwt 0]]: [B/B][./.]
>> [csclprd3-6-5:14770] MCW rank 5 bound to socket 1[core 2[hwt 0]], socket 
>> 1[core 3[hwt 0]]: [./.][B/B]
>> [csclprd3-6-5:14770] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket 
>> 0[core 1[hwt 0]]: [B/B][./.]
>> [csclprd3-6-5:14770] MCW rank 7 bound to socket 1[core 2[hwt 0]], socket 
>> 1[core 3[hwt 0]]: [./.][B/B]
>> [csclprd3-0-1:16437] MCW rank 24 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-5:18925] MCW rank 48 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-1:16437] MCW rank 25 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-5:18925] MCW rank 49 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-1:16437] MCW rank 20 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-1:16437] MCW rank 21 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-5:18925] MCW rank 44 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-5:18925] MCW rank 45 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-1:16437] MCW rank 22 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-1:16437] MCW rank 23 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-5:18925] MCW rank 46 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-5:18925] MCW rank 47 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-3:15946] MCW rank 36 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-3:15946] MCW rank 37 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-3:15946] MCW rank 32 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-3:15946] MCW rank 33 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-3:15946] MCW rank 34 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-3:15946] MCW rank 35 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-12:09165] MCW rank 124 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-12:09165] MCW rank 125 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-12:09165] MCW rank 120 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-12:09165] MCW rank 121 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-12:09165] MCW rank 122 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-12:09165] MCW rank 123 is not bound (or bound to all available 
>> processors)
>> [csclprd3-6-1:27030] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 
>> 0[core 1[hwt 0]]: [B/B][./.]
>> [csclprd3-6-1:27030] MCW rank 1 bound to socket 1[core 2[hwt 0]], socket 
>> 1[core 3[hwt 0]]: [./.][B/B]
>> [csclprd3-6-1:27030] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 
>> 0[core 1[hwt 0]]: [B/B][./.]
>> [csclprd3-6-1:27030] MCW rank 3 bound to socket 1[core 2[hwt 0]], socket 
>> 1[core 3[hwt 0]]: [./.][B/B]
>> [csclprd3-0-2:07944] MCW rank 30 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-6:32510] MCW rank 54 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-2:07944] MCW rank 31 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-6:32510] MCW rank 55 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-2:07944] MCW rank 26 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-6:32510] MCW rank 50 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-6:32510] MCW rank 51 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-2:07944] MCW rank 27 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-2:07944] MCW rank 28 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-6:32510] MCW rank 52 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-6:32510] MCW rank 53 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-2:07944] MCW rank 29 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-0:00453] MCW rank 11 bound to socket 1[core 6[hwt 0]], socket 
>> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], 
>> socket1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
>> [csclprd3-0-0:00453] MCW rank 12 bound to socket 0[core 0[hwt 0]], socket 
>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 
>> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
>> [csclprd3-0-0:00453] MCW rank 13 bound to socket 1[core 6[hwt 0]], socket 
>> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 
>> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
>> [csclprd3-0-0:00453] MCW rank 14 bound to socket 0[core 0[hwt 0]], socket 
>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 
>> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
>> [csclprd3-0-0:00453] MCW rank 15 bound to socket 1[core 6[hwt 0]], socket 
>> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 
>> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
>> [csclprd3-0-0:00453] MCW rank 16 bound to socket 0[core 0[hwt 0]], socket 
>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 
>> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
>> [csclprd3-0-7:22146] MCW rank 64 bound to socket 0[core 0[hwt 0-1]], socket 
>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
>> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
>> 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-7:22146] MCW rank 65 bound to socket 1[core 8[hwt 0-1]], socket 
>> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
>> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>> 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-0:00453] MCW rank 17 bound to socket 1[core 6[hwt 0]], socket 
>> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 
>> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
>> [csclprd3-0-0:00453] MCW rank 18 bound to socket 0[core 0[hwt 0]], socket 
>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 
>> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
>> [csclprd3-0-11:00885] MCW rank 116 bound to socket 0[core 0[hwt 0-1]], 
>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 
>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 
>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-11:00885] MCW rank 117 bound to socket 1[core 8[hwt 0-1]], 
>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 
>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 
>> 14[hwt 0-1]],socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-10:20752] MCW rank 100 bound to socket 0[core 0[hwt 0-1]], 
>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 
>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 
>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-10:20752] MCW rank 101 bound to socket 1[core 8[hwt 0-1]], 
>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 
>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 
>> 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-0:00453] MCW rank 19 bound to socket 1[core 6[hwt 0]], socket 
>> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 
>> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
>> [csclprd3-0-7:22146] MCW rank 66 bound to socket 0[core 0[hwt 0-1]], socket 
>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
>> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
>> 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-11:00885] MCW rank 118 bound to socket 0[core 0[hwt 0-1]], 
>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 
>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 
>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-0:00453] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 
>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 
>> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
>> [csclprd3-0-10:20752] MCW rank 102 bound to socket 0[core 0[hwt 0-1]], 
>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 
>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 
>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-0:00453] MCW rank 9 bound to socket 1[core 6[hwt 0]], socket 
>> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 
>> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
>> [csclprd3-0-0:00453] MCW rank 10 bound to socket 0[core 0[hwt 0]], socket 
>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 
>> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
>> [csclprd3-0-4:32449] MCW rank 42 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-4:32449] MCW rank 43 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-4:32449] MCW rank 38 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-4:32449] MCW rank 39 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-4:32449] MCW rank 40 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-4:32449] MCW rank 41 is not bound (or bound to all available 
>> processors)
>> [csclprd3-0-13:30897] MCW rank 126 bound to socket 0[core 0[hwt 0-1]], 
>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 
>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB][../../../../../..]
>> [csclprd3-0-8:17159] MCW rank 80 bound to socket 0[core 0[hwt 0-1]], socket 
>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
>> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
>> 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-13:30897] MCW rank 127 bound to socket 1[core 6[hwt 0-1]], 
>> socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 
>> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: 
>> [../../../../../..][BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-8:17159] MCW rank 81 bound to socket 1[core 8[hwt 0-1]], socket 
>> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
>> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>> 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 
>> 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: 
>> [../../../../../..][BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-8:17159] MCW rank 81 bound to socket 1[core 8[hwt 0-1]], socket 
>> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
>> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>> 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-13:30897] MCW rank 128 bound to socket 0[core 0[hwt 0-1]], 
>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 
>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB][../../../../../..]
>> [csclprd3-0-8:17159] MCW rank 82 bound to socket 0[core 0[hwt 0-1]], socket 
>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
>> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
>> 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-13:30897] MCW rank 129 bound to socket 1[core 6[hwt 0-1]], 
>> socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 
>> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: 
>> [../../../../../..][BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-8:17159] MCW rank 83 bound to socket 1[core 8[hwt 0-1]], socket 
>> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
>> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>> 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-13:30897] MCW rank 130 bound to socket 0[core 0[hwt 0-1]], 
>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 
>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB][../../../../../..]
>> [csclprd3-0-13:30897] MCW rank 131 bound to socket 1[core 6[hwt 0-1]], 
>> socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 
>> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: 
>> [../../../../../..][BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-8:17159] MCW rank 84 bound to socket 0[core 0[hwt 0-1]], socket 
>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
>> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
>> 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-8:17159] MCW rank 85 bound to socket 1[core 8[hwt 0-1]], socket 
>> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
>> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>> 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-11:00885] MCW rank 119 bound to socket 1[core 8[hwt 0-1]], 
>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 
>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 
>> 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-10:20752] MCW rank 103 bound to socket 1[core 8[hwt 0-1]], 
>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 
>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 
>> 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-8:17159] MCW rank 86 bound to socket 0[core 0[hwt 0-1]], socket 
>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
>> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
>> 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-7:22146] MCW rank 67 bound to socket 1[core 8[hwt 0-1]], socket 
>> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1], 
>> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>> 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-11:00885] MCW rank 104 bound to socket 0[core 0[hwt 0-1]], 
>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 
>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 
>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..][csclprd3-0-10:20752] MCW 
>> rank 88 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], 
>> socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 
>> 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 
>> 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-8:17159] MCW rank 87 bound to socket 1[core 8[hwt 0-1]], socket 
>> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
>> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>> 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-11:00885] MCW rank 105 bound to socket 1[core 8[hwt 0-1]], 
>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 
>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 
>> 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-10:20752] MCW rank 89 bound to socket 1[core 8[hwt 0-1]], socket 
>> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
>> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>> 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-8:17159] MCW rank 72 bound to socket 0[core 0[hwt 0-1]], socket 
>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
>> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
>> 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-7:22146] MCW rank 68 bound to socket 0[core 0[hwt 0-1]], socket 
>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
>> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
>> 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-11:00885] MCW rank 106 bound to socket 0[core 0[hwt 0-1]], 
>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 
>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 
>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-10:20752] MCW rank 90 bound to socket 0[core 0[hwt 0-1]], socket 
>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
>> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
>> 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-8:17159] MCW rank 73 bound to socket 1[core 8[hwt 0-1]], socket 
>> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
>> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>> 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-11:00885] MCW rank 107 bound to socket 1[core 8[hwt 0-1]], 
>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 
>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 
>> 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-7:22146] MCW rank 69 bound to socket 1[core 8[hwt 0-1]], socket 
>> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
>> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>> 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-8:17159] MCW rank 74 bound to socket 0[core 0[hwt 0-1]], socket 
>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
>> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
>> 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-11:00885] MCW rank 108 bound to socket 0[core 0[hwt 0-1]], 
>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 
>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 
>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-7:22146] MCW rank 57 bound to socket 1[core 8[hwt 0-1]], socket 
>> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
>> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>> 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-11:00885] MCW rank 114 bound to socket 0[core 0[hwt 0-1]], 
>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 
>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 
>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-10:20752] MCW rank 98 bound to socket 0[core 0[hwt 0-1]], socket 
>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
>> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
>> 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-11:00885] MCW rank 115 bound to socket 1[core 8[hwt 0-1]], 
>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 
>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 
>> 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-7:22146] MCW rank 58 bound to socket 0[core 0[hwt 0-1]], socket 
>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
>> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
>> 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-10:20752] MCW rank 99 bound to socket 1[core 8[hwt 0-1]], socket 
>> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
>> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>> 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-7:22146] MCW rank 59 bound to socket 1[core 8[hwt 0-1]], socket 
>> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
>> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>> 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-7:22146] MCW rank 60 bound to socket 0[core 0[hwt 0-1]], socket 
>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
>> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
>> 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-7:22146] MCW rank 61 bound to socket 1[core 8[hwt 0-1]], socket 
>> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
>> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>> 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-7:22146] MCW rank 62 bound to socket 0[core 0[hwt 0-1]], socket 
>> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
>> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
>> 0-1]], socket 0[core 7[hwt 0-1]]: 
>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-7:22146] MCW rank 63 bound to socket 1[core 8[hwt 0-1]], socket 
>> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
>> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
>> 0-1]], socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-13:30901] *** Process received signal ***
>> [csclprd3-0-13:30901] Signal: Bus error (7)
>> [csclprd3-0-13:30901] Signal code: Non-existant physical address (2)
>> [csclprd3-0-13:30901] Failing at address: 0x7ff404351d80
>> [csclprd3-0-13:30901] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ff41453c500]
>> [csclprd3-0-13:30901] [ 1] 
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xd4fea)[0x7ff41481efea]
>> [csclprd3-0-13:30901] [ 2] 
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7ff41479f009]
>> [csclprd3-0-13:30901] [ 3] 
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7ff41479f110]
>> [csclprd3-0-13:30901] [ 4] 
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7ff41480f68e]
>> [csclprd3-0-13:30901] [ 5] 
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7ff4148e3715]
>> [csclprd3-0-13:30901] [ 6] 
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7ff4147b9ad6]
>> [csclprd3-0-13:30901] [ 7] 
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7ff4147d8c60]
>> [csclprd3-0-13:30901] [ 8] 
>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
>> [csclprd3-0-13:30901] [ 9] 
>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff4141b9cdd]
>> [csclprd3-0-13:30901] [10] 
>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
>> [csclprd3-0-13:30901] *** End of error message ***
>> 
>> From: users [users-boun...@open-mpi.org <>] on behalf of Ralph Castain 
>> [r...@open-mpi.org <>]
>> Sent: Thursday, June 18, 2015 5:26 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI 1.8.6, CentOS 6.3, too many slots = crash
>> 
>> FWIW: I don’t think this actually has anything to do with the #procs you are 
>> trying to run. Instead, I expect it has to do with confusion over how many 
>> cores it can bind across. When you tell it to use-hwthread-cpus, you are 
>> asking us to map processes to hwthreads, and not cores. I don’t know which 
>> nodes are which, but it could be that we are getting incorrect info 
>> somewhere.
>> 
>> Given that you are limiting the number of procs to the number of cores, is 
>> there some reason why you are asking us to use-hwthread-cpus? Why not just 
>> leave it at the default core level?
>> 
>> I also suspect that you would have no problems if you —bind-to none - does 
>> that in fact work?
>> 
>> 
>>> On Jun 18, 2015, at 4:54 PM, Lane, William <william.l...@cshs.org <>> wrote:
>>> 
>>> I'm having a strange problem w/OpenMPI 1.8.6. If I run
>>> my OpenMPI test code (compiled against OpenMPI 1.8.6
>>> libraries) on < 131 slots I get no issues. Anything over 131
>>> errors out:
>>> 
>>> mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/ 
>>> --hostfile hostfile-single --mca btl_tcp_if_include eth0 --hetero-nodes 
>>> --use-hwthread-cpus /hpc/home/lanew/mpi/openmpi/ProcessColors3
>>> 
>>> The hostfile has the number of slots restricted
>>> to the number of cores, while the max-slots includes
>>> the hyperthreading cores (e.g. csclprd3-0-0 slots=6 
>>> max-slots=12).
>>> 
>>> The nodes are a mix of IBM x3550 nodes some
>>> are Sandybridges and others are older Xeons.
>>> 
>>> I would like to add that the submit node from
>>> which I am launching mpirun has the open files
>>> soft limit (ulimit -a) set to 1024, while the hard limit
>>> (ulimit -Ha) is set to 4096. I know open file limits
>>> were an issue w/an older version of OpenMPI. The
>>> compute nodes all have their hard open files limit
>>> and soft open files limits set to 4096.
>>> 
>>> Here's the output (csclprd3-0-13 is the last node
>>> listed in the hostfile hostfile-single):
>>> 
>>> [csclprd3-0-13:28765] Signal: Bus error (7)
>>> [csclprd3-0-13:28765] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28765] Failing at address: 0x7f30002a8980
>>> [csclprd3-0-13:28766] *** Process received signal ***
>>> [csclprd3-0-13:28766] Signal: Bus error (7)
>>> [csclprd3-0-13:28766] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28766] Failing at address: 0x7fe137662880
>>> [csclprd3-0-13:28768] *** Process received signal ***
>>> [csclprd3-0-13:28768] Signal: Bus error (7)
>>> [csclprd3-0-13:28768] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28768] Failing at address: 0x7f9b40228a80
>>> [csclprd3-0-13:28770] *** Process received signal ***
>>> [csclprd3-0-13:28770] Signal: Bus error (7)
>>> [csclprd3-0-13:28770] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28770] Failing at address: 0x7f0de7f2bb00
>>> [csclprd3-0-13:28767] *** Process received signal ***
>>> [csclprd3-0-13:28767] Signal: Bus error (7)
>>> [csclprd3-0-13:28767] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28767] Failing at address: 0x7f9b6c2e8980
>>> [csclprd3-0-13:28764] *** Process received signal ***
>>> [csclprd3-0-13:28764] Signal: Bus error (7)
>>> [csclprd3-0-13:28764] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28765] Signal: Bus error (7)
>>> [csclprd3-0-13:28765] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28765] Failing at address: 0x7f30002a8980
>>> [csclprd3-0-13:28766] *** Process received signal ***
>>> [csclprd3-0-13:28766] Signal: Bus error (7)
>>> [csclprd3-0-13:28766] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28766] Failing at address: 0x7fe137662880
>>> [csclprd3-0-13:28768] *** Process received signal ***
>>> [csclprd3-0-13:28768] Signal: Bus error (7)
>>> [csclprd3-0-13:28768] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28768] Failing at address: 0x7f9b40228a80
>>> [csclprd3-0-13:28770] *** Process received signal ***
>>> [csclprd3-0-13:28770] Signal: Bus error (7)
>>> [csclprd3-0-13:28770] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28770] Failing at address: 0x7f0de7f2bb00
>>> [csclprd3-0-13:28767] *** Process received signal ***
>>> [csclprd3-0-13:28767] Signal: Bus error (7)
>>> [csclprd3-0-13:28767] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28767] Failing at address: 0x7f9b6c2e8980
>>> [csclprd3-0-13:28764] *** Process received signal ***
>>> [csclprd3-0-13:28764] Signal: Bus error (7)
>>> [csclprd3-0-13:28764] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28768] [ 3] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f9b513ad110]
>>> [csclprd3-0-13:28768] [ 4] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7f0df77b6009]
>>> [csclprd3-0-13:28770] [ 3] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f0df77b6110]
>>> [csclprd3-0-13:28770] [ 4] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b5141d68e]
>>> [csclprd3-0-13:28768] [ 5] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b514f1715]
>>> [csclprd3-0-13:28768] [ 6] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f30115ea68e]
>>> [csclprd3-0-13:28765] [ 5] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f30116be715]
>>> [csclprd3-0-13:28765] [ 6] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b7bb3b68e]
>>> [csclprd3-0-13:28767] [ 5] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b7bc0f715]
>>> [csclprd3-0-13:28767] [ 6] [csclprd3-0-13:28764] [ 4] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fa946bb768e]
>>> [csclprd3-0-13:28764] [ 5] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fe146d4068e]
>>> [csclprd3-0-13:28766] [ 5] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f0df782668e]
>>> [csclprd3-0-13:28770] [ 5] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f0df78fa715]
>>> [csclprd3-0-13:28770] [ 6] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f0df77d0ad6]
>>> [csclprd3-0-13:28770] [ 7] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fe146e14715]
>>> [csclprd3-0-13:28766] [ 6] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fe146ceaad6]
>>> [csclprd3-0-13:28766] [ 7] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b513c7ad6]
>>> [csclprd3-0-13:28768] [ 7] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b513e6c60]
>>> [csclprd3-0-13:28768] [ 8] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
>>> [csclprd3-0-13:28768] [ 9] 
>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b50dc7cdd]
>>> [csclprd3-0-13:28768] [10] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
>>> [csclprd3-0-13:28768] *** End of error message ***
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f3011594ad6]
>>> [csclprd3-0-13:28765] [ 7] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f30115b3c60]
>>> [csclprd3-0-13:28765] [ 8] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
>>> [csclprd3-0-13:28765] [ 9] 
>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f3010f94cdd]
>>> [csclprd3-0-13:28765] [10] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
>>> [csclprd3-0-13:28765] *** End of error message ***
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b7bae5ad6]
>>> [csclprd3-0-13:28767] [ 7] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b7bb04c60]
>>> [csclprd3-0-13:28767] [ 8] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
>>> [csclprd3-0-13:28767] [ 9] 
>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b7b4e5cdd]
>>> [csclprd3-0-13:28767] [10] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
>>> [csclprd3-0-13:28767] *** End of error message ***
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fa946c8b715]
>>> [csclprd3-0-13:28764] [ 6] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fa946b61ad6]
>>> [csclprd3-0-13:28764] [ 7] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f0df77efc60]
>>> [csclprd3-0-13:28770] [ 8] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
>>> [csclprd3-0-13:28770] [ 9] 
>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0df71d0cdd]
>>> [csclprd3-0-13:28770] [10] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
>>> [csclprd3-0-13:28770] *** End of error message ***
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fe146d09c60]
>>> [csclprd3-0-13:28766] [ 8] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
>>> [csclprd3-0-13:28766] [ 9] 
>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fe1466eacdd]
>>> [csclprd3-0-13:28767] *** End of error message ***
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fa946c8b715]
>>> [csclprd3-0-13:28764] [ 6] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fa946b61ad6]
>>> [csclprd3-0-13:28764] [ 7] 
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f0df77efc60]
>>> [csclprd3-0-13:28770] [ 8] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
>>> [csclprd3-0-13:28770] [ 9] 
>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0df71d0cdd]
>>> [csclprd3-0-13:28770] [10] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
>>> [csclprd3-0-13:28770] *** End of error message ***
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fe146d09c60]
>>> [csclprd3-0-13:28766] [ 8] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
>>> [csclprd3-0-13:28766] [ 9] 
>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fe1466eacdd]
>>> [csclprd3-0-13:28766] [10] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
>>> [csclprd3-0-13:28766] *** End of error message ***
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fa946b80c60]
>>> [csclprd3-0-13:28764] [ 8] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
>>> [csclprd3-0-13:28764] [ 9] 
>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa946561cdd]
>>> [csclprd3-0-13:28764] [10] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
>>> [csclprd3-0-13:28764] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 126 with PID 0 on node csclprd3-0-13 
>>> exited on signal 7 (Bus error).
>>> 
>>> Could a lack of the necessary NUMA libraries or the wrong version of NUMA
>>> libraries be contributing to this?
>>> IMPORTANT WARNING: This message is intended for the use of the person or 
>>> entity to which it is addressed and may contain information that is 
>>> privileged and confidential, the disclosure of which is governed by 
>>> applicable law. If the reader of this message is not the intended 
>>> recipient, or the employee or agent responsible for delivering it to the 
>>> intended recipient, you are hereby notified that any dissemination, 
>>> distribution or copying of this information is strictly prohibited. Thank 
>>> you for your cooperation. _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org <>
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>>> <http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2015/06/27159.php 
>>> <http://www.open-mpi.org/community/lists/users/2015/06/27159.php>
>> IMPORTANT WARNING: This message is intended for the use of the person or 
>> entity to which it is addressed and may contain information that is 
>> privileged and confidential, the disclosure of which is governed by 
>> applicable law. If the reader of this message is not the intended recipient, 
>> or the employee or agent responsible for delivering it to the intended 
>> recipient, you are hereby notified that any dissemination, distribution or 
>> copying of this information is strictly prohibited. Thank you for your 
>> cooperation.
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org <javascript:_e(%7B%7D,'cvml','us...@open-mpi.org');>
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>> <http://www.open-mpi.org/mailman/listinfo.cgi/users>
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2015/06/27164.php 
>> <http://www.open-mpi.org/community/lists/users/2015/06/27164.php>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/06/27166.php

Reply via email to