Ralph, I got that, but I cannot read the stack trace (optimized build). My best bet is to reproduce the issue and then find out how and why ompi_free_list_t is segfaulting. That is why I requested info about the environment.
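As an aside for readers trying to make sense of the raw frames quoted in this thread: the sketch below splits a glibc-style backtrace line into module, symbol, offset and address, and builds an `addr2line` call for frames that only carry a module offset, such as `libmpi.so.1(+0xd4fea)`. The helper names `parse_frame` and `addr2line_command` are hypothetical and not part of Open MPI, and on an optimized build without debug info `addr2line` may well print only `??`, so treat this as a starting point rather than a fix.

```python
import re

# One frame of a glibc-style backtrace, for example:
#   /path/libmpi.so.1(ompi_free_list_grow+0x219)[0x7ff41479f009]
#   /path/libmpi.so.1(+0xd4fea)[0x7ff41481efea]
#   /path/ProcessColors3[0x400ad0]
FRAME_RE = re.compile(
    r"(?P<module>/\S+?)"                                  # binary or library path
    r"(?:\((?P<symbol>[^+)]*)"                            # optional symbol name
    r"(?:\+(?P<offset>0x[0-9a-fA-F]+))?\))?"              # optional +offset
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"                    # runtime address
)


def parse_frame(line):
    """Return module, symbol, offset and address for one frame, or None."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    frame = match.groupdict()
    frame["symbol"] = frame["symbol"] or None  # "" means no symbol was resolved
    return frame


def addr2line_command(frame):
    """Build an addr2line call for frames that only carry a module offset."""
    if frame["symbol"] is None and frame["offset"] is not None:
        return ["addr2line", "-f", "-e", frame["module"], frame["offset"]]
    return None


if __name__ == "__main__":
    sample = ("[csclprd3-0-13:30901] [ 1] "
              "/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xd4fea)[0x7ff41481efea]")
    print(addr2line_command(parse_frame(sample)))
```

The returned list can be passed to `subprocess.run` on a machine that has the same `libmpi.so.1` installed. Frames that already name a symbol are left alone, since `addr2line` expects an address rather than a symbol plus offset.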
IIRC, ompi_free_list_t differs between master and v1.8, so an incorrect backport could be the root cause.

Cheers,

Gilles

On Friday, June 19, 2015, Ralph Castain <r...@open-mpi.org> wrote:

> Gilles
>
> I was fooled too, but that isn’t the issue. The problem is that
> ompi_free_list is segfaulting:
>
> [csclprd3-0-13:30901] *** Process received signal ***
>> [csclprd3-0-13:30901] Signal: Bus error (7)
>> [csclprd3-0-13:30901] Signal code: Non-existant physical address (2)
>> [csclprd3-0-13:30901] Failing at address: 0x7ff404351d80
>> [csclprd3-0-13:30901] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ff41453c500]
>> [csclprd3-0-13:30901] [ 1]
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xd4fea)[0x7ff41481efea]
>> [csclprd3-0-13:30901] [ 2]
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7ff41479f009]
>> [csclprd3-0-13:30901] [ 3]
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7ff41479f110]
>> [csclprd3-0-13:30901] [ 4]
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7ff41480f68e]
>> [csclprd3-0-13:30901] [ 5]
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7ff4148e3715]
>> [csclprd3-0-13:30901] [ 6]
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7ff4147b9ad6]
>> [csclprd3-0-13:30901] [ 7]
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7ff4147d8c60]
>> [csclprd3-0-13:30901] [ 8]
>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
>> [csclprd3-0-13:30901] [ 9]
>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff4141b9cdd]
>> [csclprd3-0-13:30901] [10]
>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
>> [csclprd3-0-13:30901] *** End of error message ***
>>
>
> On Jun 19, 2015, at 5:52 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> Lane,
>
> could you please describe your configuration ?
> how many sockets per node ?
> how many cores per socket ?
> how many threads per core ?
> what is the minimum number of nodes needed to reproduce the issue ?
> do all the nodes have the same configuration ?
> if yes, what happens without --hetero-nodes ?
>
> Cheers,
>
> Gilles
>
> On Friday, June 19, 2015, Lane, William <william.l...@cshs.org> wrote:
>
>> Ralph,
>>
>> I created a hostfile that just has the names of the hosts while
>> specifying no slot information whatsoever (e.g. csclprd3-0-0)
>> and received the following errors:
>>
>> mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/
>> --hostfile hostfile-noslots --mca btl_tcp_if_include eth0 --hetero-nodes
>> /hpc/home/lanew/mpi/openmpi/ProcessColors3
>>
>> [csclprd3-6-5:14770] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket
>> 0[core 1[hwt 0]]: [B/B][./.]
>> [csclprd3-6-5:14770] MCW rank 5 bound to socket 1[core 2[hwt 0]], socket
>> 1[core 3[hwt 0]]: [./.][B/B]
>> [csclprd3-6-5:14770] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket
>> 0[core 1[hwt 0]]: [B/B][./.]
>> [csclprd3-6-5:14770] MCW rank 7 bound to socket 1[core 2[hwt 0]], socket >> 1[core 3[hwt 0]]: [./.][B/B] >> [csclprd3-0-1:16437] MCW rank 24 is not bound (or bound to all available >> processors) >> [csclprd3-0-5:18925] MCW rank 48 is not bound (or bound to all available >> processors) >> [csclprd3-0-1:16437] MCW rank 25 is not bound (or bound to all available >> processors) >> [csclprd3-0-5:18925] MCW rank 49 is not bound (or bound to all available >> processors) >> [csclprd3-0-1:16437] MCW rank 20 is not bound (or bound to all available >> processors) >> [csclprd3-0-1:16437] MCW rank 21 is not bound (or bound to all available >> processors) >> [csclprd3-0-5:18925] MCW rank 44 is not bound (or bound to all available >> processors) >> [csclprd3-0-5:18925] MCW rank 45 is not bound (or bound to all available >> processors) >> [csclprd3-0-1:16437] MCW rank 22 is not bound (or bound to all available >> processors) >> [csclprd3-0-1:16437] MCW rank 23 is not bound (or bound to all available >> processors) >> [csclprd3-0-5:18925] MCW rank 46 is not bound (or bound to all available >> processors) >> [csclprd3-0-5:18925] MCW rank 47 is not bound (or bound to all available >> processors) >> [csclprd3-0-3:15946] MCW rank 36 is not bound (or bound to all available >> processors) >> [csclprd3-0-3:15946] MCW rank 37 is not bound (or bound to all available >> processors) >> [csclprd3-0-3:15946] MCW rank 32 is not bound (or bound to all available >> processors) >> [csclprd3-0-3:15946] MCW rank 33 is not bound (or bound to all available >> processors) >> [csclprd3-0-3:15946] MCW rank 34 is not bound (or bound to all available >> processors) >> [csclprd3-0-3:15946] MCW rank 35 is not bound (or bound to all available >> processors) >> [csclprd3-0-12:09165] MCW rank 124 is not bound (or bound to all >> available processors) >> [csclprd3-0-12:09165] MCW rank 125 is not bound (or bound to all >> available processors) >> [csclprd3-0-12:09165] MCW rank 120 is not bound (or bound to all 
>> available processors) >> [csclprd3-0-12:09165] MCW rank 121 is not bound (or bound to all >> available processors) >> [csclprd3-0-12:09165] MCW rank 122 is not bound (or bound to all >> available processors) >> [csclprd3-0-12:09165] MCW rank 123 is not bound (or bound to all >> available processors) >> [csclprd3-6-1:27030] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]]: [B/B][./.] >> [csclprd3-6-1:27030] MCW rank 1 bound to socket 1[core 2[hwt 0]], socket >> 1[core 3[hwt 0]]: [./.][B/B] >> [csclprd3-6-1:27030] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]]: [B/B][./.] >> [csclprd3-6-1:27030] MCW rank 3 bound to socket 1[core 2[hwt 0]], socket >> 1[core 3[hwt 0]]: [./.][B/B] >> [csclprd3-0-2:07944] MCW rank 30 is not bound (or bound to all available >> processors) >> [csclprd3-0-6:32510] MCW rank 54 is not bound (or bound to all available >> processors) >> [csclprd3-0-2:07944] MCW rank 31 is not bound (or bound to all available >> processors) >> [csclprd3-0-6:32510] MCW rank 55 is not bound (or bound to all available >> processors) >> [csclprd3-0-2:07944] MCW rank 26 is not bound (or bound to all available >> processors) >> [csclprd3-0-6:32510] MCW rank 50 is not bound (or bound to all available >> processors) >> [csclprd3-0-6:32510] MCW rank 51 is not bound (or bound to all available >> processors) >> [csclprd3-0-2:07944] MCW rank 27 is not bound (or bound to all available >> processors) >> [csclprd3-0-2:07944] MCW rank 28 is not bound (or bound to all available >> processors) >> [csclprd3-0-6:32510] MCW rank 52 is not bound (or bound to all available >> processors) >> [csclprd3-0-6:32510] MCW rank 53 is not bound (or bound to all available >> processors) >> [csclprd3-0-2:07944] MCW rank 29 is not bound (or bound to all available >> processors) >> [csclprd3-0-0:00453] MCW rank 11 bound to socket 1[core 6[hwt 0]], socket >> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], >> socket1[core 10[hwt 
0]], socket 1[core 11[hwt 0]]: >> [./././././.][B/B/B/B/B/B] >> [csclprd3-0-0:00453] MCW rank 12 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket >> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >> [csclprd3-0-0:00453] MCW rank 13 bound to socket 1[core 6[hwt 0]], socket >> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket >> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] >> [csclprd3-0-0:00453] MCW rank 14 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket >> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >> [csclprd3-0-0:00453] MCW rank 15 bound to socket 1[core 6[hwt 0]], socket >> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket >> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] >> [csclprd3-0-0:00453] MCW rank 16 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket >> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >> [csclprd3-0-7:22146] MCW rank 64 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-7:22146] MCW rank 65 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-0:00453] MCW rank 17 bound to socket 1[core 6[hwt 0]], socket >> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket >> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] >> [csclprd3-0-0:00453] MCW rank 18 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket >> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >> [csclprd3-0-11:00885] MCW rank 116 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-11:00885] MCW rank 117 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]],socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-10:20752] MCW rank 100 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-10:20752] MCW rank 101 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-0:00453] MCW rank 19 bound to socket 1[core 6[hwt 0]], socket >> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket >> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] >> [csclprd3-0-7:22146] MCW rank 66 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-11:00885] MCW rank 118 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-0:00453] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket >> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >> [csclprd3-0-10:20752] MCW rank 102 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-0:00453] MCW rank 9 bound to socket 1[core 6[hwt 0]], socket >> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket >> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] >> [csclprd3-0-0:00453] MCW rank 10 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket >> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >> [csclprd3-0-4:32449] MCW rank 42 is not bound (or bound to all available >> processors) >> [csclprd3-0-4:32449] MCW rank 43 is not bound (or bound to all available >> processors) >> [csclprd3-0-4:32449] MCW rank 38 is not bound (or bound to all available >> processors) >> [csclprd3-0-4:32449] MCW rank 39 is not bound (or bound to all available >> processors) >> [csclprd3-0-4:32449] MCW rank 40 is not bound (or bound to all available >> processors) >> [csclprd3-0-4:32449] MCW rank 41 is not bound (or bound to all available >> processors) >> [csclprd3-0-13:30897] MCW rank 126 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB][../../../../../..] >> [csclprd3-0-8:17159] MCW rank 80 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-13:30897] MCW rank 127 bound to socket 1[core 6[hwt 0-1]], >> socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt >> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: >> [../../../../../..][BB/BB/BB/BB/BB/BB] >> [csclprd3-0-8:17159] MCW rank 81 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-13:30897] MCW rank 128 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB][../../../../../..] >> [csclprd3-0-8:17159] MCW rank 82 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-13:30897] MCW rank 129 bound to socket 1[core 6[hwt 0-1]], >> socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt >> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: >> [../../../../../..][BB/BB/BB/BB/BB/BB] >> [csclprd3-0-8:17159] MCW rank 83 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-13:30897] MCW rank 130 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB][../../../../../..] >> [csclprd3-0-13:30897] MCW rank 131 bound to socket 1[core 6[hwt 0-1]], >> socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt >> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: >> [../../../../../..][BB/BB/BB/BB/BB/BB] >> [csclprd3-0-8:17159] MCW rank 84 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-8:17159] MCW rank 85 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-11:00885] MCW rank 119 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-10:20752] MCW rank 103 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-8:17159] MCW rank 86 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-7:22146] MCW rank 67 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core >> 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-11:00885] MCW rank 104 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..][csclprd3-0-10:20752] MCW >> rank 88 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], >> socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt >> 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core >> 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-8:17159] MCW rank 87 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-11:00885] MCW rank 105 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-10:20752] MCW rank 89 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-8:17159] MCW 
rank 72 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-7:22146] MCW rank 68 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-11:00885] MCW rank 106 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-10:20752] MCW rank 90 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-8:17159] MCW rank 73 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-11:00885] MCW rank 107 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-7:22146] MCW rank 69 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-8:17159] MCW rank 74 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-11:00885] MCW rank 108 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-7:22146] MCW rank 57 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-11:00885] MCW rank 114 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-10:20752] MCW rank 98 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-11:00885] MCW rank 115 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-7:22146] MCW rank 58 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-10:20752] MCW rank 99 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-7:22146] MCW rank 59 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-7:22146] MCW rank 60 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-7:22146] MCW rank 61 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-7:22146] MCW rank 62 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-7:22146] MCW rank 63 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-13:30901] *** Process received signal *** >> [csclprd3-0-13:30901] Signal: Bus error (7) >> [csclprd3-0-13:30901] Signal code: Non-existant physical address (2) >> [csclprd3-0-13:30901] Failing at address: 0x7ff404351d80 >> [csclprd3-0-13:30901] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ff41453c500] >> [csclprd3-0-13:30901] [ 1] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xd4fea)[0x7ff41481efea] >> [csclprd3-0-13:30901] [ 2] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7ff41479f009] >> [csclprd3-0-13:30901] [ 3] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7ff41479f110] >> [csclprd3-0-13:30901] [ 4] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7ff41480f68e] >> [csclprd3-0-13:30901] [ 5] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7ff4148e3715] >> [csclprd3-0-13:30901] [ 6] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7ff4147b9ad6] >> [csclprd3-0-13:30901] [ 7] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7ff4147d8c60] >> [csclprd3-0-13:30901] [ 8] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >> [csclprd3-0-13:30901] [ 9] >> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff4141b9cdd] >> [csclprd3-0-13:30901] [10] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >> [csclprd3-0-13:30901] *** End of error message *** >> >> ------------------------------ >> *From:* users [users-boun...@open-mpi.org] on behalf of Ralph Castain [ >> r...@open-mpi.org] >> *Sent:* Thursday, June 18, 2015 5:26 PM >> *To:* Open MPI Users >> *Subject:* Re: [OMPI users] OpenMPI 
1.8.6, CentOS 6.3, too many slots = crash
>>
>> FWIW: I don’t think this actually has anything to do with the #procs you
>> are trying to run. Instead, I expect it has to do with confusion over how
>> many cores it can bind across. When you pass --use-hwthread-cpus, you
>> are asking us to map processes to hwthreads, and not cores. I don’t know
>> which nodes are which, but it could be that we are getting incorrect info
>> somewhere.
>>
>> Given that you are limiting the number of procs to the number of cores,
>> is there some reason why you are passing --use-hwthread-cpus? Why not
>> just leave it at the default core level?
>>
>> I also suspect that you would have no problems if you used --bind-to none;
>> does that in fact work?
>>
>>
>> On Jun 18, 2015, at 4:54 PM, Lane, William <william.l...@cshs.org> wrote:
>>
>> I'm having a strange problem w/OpenMPI 1.8.6. If I run
>> my OpenMPI test code (compiled against OpenMPI 1.8.6
>> libraries) on < 131 slots I get no issues. Anything over 131
>> errors out:
>>
>> mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/
>> --hostfile hostfile-single --mca btl_tcp_if_include eth0 --hetero-nodes
>> --use-hwthread-cpus /hpc/home/lanew/mpi/openmpi/ProcessColors3
>>
>> The hostfile has the number of slots restricted
>> to the number of cores, while the max-slots includes
>> the hyperthreading cores (e.g. csclprd3-0-0 slots=6
>> max-slots=12).
>>
>> The nodes are a mix of IBM x3550 nodes; some
>> are Sandy Bridge and others are older Xeons.
>>
>> I would like to add that the submit node from
>> which I am launching mpirun has the open files
>> soft limit (ulimit -a) set to 1024, while the hard limit
>> (ulimit -Ha) is set to 4096. I know open file limits
>> were an issue w/an older version of OpenMPI. The
>> compute nodes all have their hard open files limit
>> and soft open files limits set to 4096.
>> >> Here's the output (csclprd3-0-13 is the last node >> listed in the hostfile hostfile-single): >> >> [csclprd3-0-13:28765] Signal: Bus error (7) >> [csclprd3-0-13:28765] Signal code: Non-existant physical address (2) >> [csclprd3-0-13:28765] Failing at address: 0x7f30002a8980 >> [csclprd3-0-13:28766] *** Process received signal *** >> [csclprd3-0-13:28766] Signal: Bus error (7) >> [csclprd3-0-13:28766] Signal code: Non-existant physical address (2) >> [csclprd3-0-13:28766] Failing at address: 0x7fe137662880 >> [csclprd3-0-13:28768] *** Process received signal *** >> [csclprd3-0-13:28768] Signal: Bus error (7) >> [csclprd3-0-13:28768] Signal code: Non-existant physical address (2) >> [csclprd3-0-13:28768] Failing at address: 0x7f9b40228a80 >> [csclprd3-0-13:28770] *** Process received signal *** >> [csclprd3-0-13:28770] Signal: Bus error (7) >> [csclprd3-0-13:28770] Signal code: Non-existant physical address (2) >> [csclprd3-0-13:28770] Failing at address: 0x7f0de7f2bb00 >> [csclprd3-0-13:28767] *** Process received signal *** >> [csclprd3-0-13:28767] Signal: Bus error (7) >> [csclprd3-0-13:28767] Signal code: Non-existant physical address (2) >> [csclprd3-0-13:28767] Failing at address: 0x7f9b6c2e8980 >> [csclprd3-0-13:28764] *** Process received signal *** >> [csclprd3-0-13:28764] Signal: Bus error (7) >> [csclprd3-0-13:28764] Signal code: Non-existant physical address (2) >> 
[csclprd3-0-13:28768] [ 3] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f9b513ad110] >> [csclprd3-0-13:28768] [ 4] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7f0df77b6009] >> [csclprd3-0-13:28770] [ 3] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f0df77b6110] >> [csclprd3-0-13:28770] [ 4] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b5141d68e] >> [csclprd3-0-13:28768] [ 5] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b514f1715] >> [csclprd3-0-13:28768] [ 6] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f30115ea68e] >> [csclprd3-0-13:28765] [ 5] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f30116be715] >> [csclprd3-0-13:28765] [ 6] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b7bb3b68e] >> [csclprd3-0-13:28767] [ 5] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b7bc0f715] >> [csclprd3-0-13:28767] [ 6] [csclprd3-0-13:28764] [ 4] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fa946bb768e] >> [csclprd3-0-13:28764] [ 5] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fe146d4068e] >> 
[csclprd3-0-13:28766] [ 5] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f0df782668e] >> [csclprd3-0-13:28770] [ 5] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f0df78fa715] >> [csclprd3-0-13:28770] [ 6] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f0df77d0ad6] >> [csclprd3-0-13:28770] [ 7] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fe146e14715] >> [csclprd3-0-13:28766] [ 6] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fe146ceaad6] >> [csclprd3-0-13:28766] [ 7] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b513c7ad6] >> [csclprd3-0-13:28768] [ 7] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b513e6c60] >> [csclprd3-0-13:28768] [ 8] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >> [csclprd3-0-13:28768] [ 9] >> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b50dc7cdd] >> [csclprd3-0-13:28768] [10] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >> [csclprd3-0-13:28768] *** End of error message *** >> >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f3011594ad6] >> [csclprd3-0-13:28765] [ 7] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f30115b3c60] >> [csclprd3-0-13:28765] [ 8] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >> [csclprd3-0-13:28765] [ 9] >> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f3010f94cdd] >> [csclprd3-0-13:28765] [10] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >> [csclprd3-0-13:28765] *** End of error message *** >> >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b7bae5ad6] >> [csclprd3-0-13:28767] [ 7] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b7bb04c60] >> [csclprd3-0-13:28767] [ 8] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >> [csclprd3-0-13:28767] [ 9] >> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b7b4e5cdd] >> 
[csclprd3-0-13:28767] [10] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >> [csclprd3-0-13:28767] *** End of error message *** >> >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fa946c8b715] >> [csclprd3-0-13:28764] [ 6] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fa946b61ad6] >> [csclprd3-0-13:28764] [ 7] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f0df77efc60] >> [csclprd3-0-13:28770] [ 8] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >> [csclprd3-0-13:28770] [ 9] >> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0df71d0cdd] >> [csclprd3-0-13:28770] [10] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >> [csclprd3-0-13:28770] *** End of error message *** >> >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fe146d09c60] >> [csclprd3-0-13:28766] [ 8] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >> [csclprd3-0-13:28766] [ 9] >> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fe1466eacdd] >> [csclprd3-0-13:28766] 
[10] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >> [csclprd3-0-13:28766] *** End of error message *** >> >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fa946b80c60] >> [csclprd3-0-13:28764] [ 8] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >> [csclprd3-0-13:28764] [ 9] >> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa946561cdd] >> [csclprd3-0-13:28764] [10] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >> [csclprd3-0-13:28764] *** End of error message *** >> -------------------------------------------------------------------------- >> mpirun noticed that process rank 126 with PID 0 on node csclprd3-0-13 >> exited on signal 7 (Bus error). >> >> Could a lack of the necessary NUMA libraries or the wrong version of NUMA >> libraries be contributing to this? >> IMPORTANT WARNING: This message is intended for the use of the person or >> entity to which it is addressed and may contain information that is >> privileged and confidential, the disclosure of which is governed by >> applicable law. If the reader of this message is not the intended >> recipient, or the employee or agent responsible for delivering it to the >> intended recipient, you are hereby notified that any dissemination, >> distribution or copying of this information is strictly prohibited. Thank >> you for your cooperation. _______________________________________________ >> users mailing list >> us...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >> Link to this post: >> http://www.open-mpi.org/community/lists/users/2015/06/27159.php >> >>
> _______________________________________________ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2015/06/27164.php
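
A note for anyone trying to read the interleaved traces quoted in this thread: several processes wrote their backtraces at the same time, so the `[host:pid] [ n]` tags and the frames that follow them do not always belong together. One rough way to see the common call path anyway is to ignore the tags and tally the symbol names. The following is only a minimal sketch in Python; `FRAME_RE` and `count_symbols` are hypothetical names chosen for illustration, and the regex assumes frames look like `path(symbol+0xOFFSET)[0xADDRESS]` as they do in the output above.

```python
import re
from collections import Counter

# Matches one symbolized frame of the form "(symbol+0xOFFSET)[0xADDRESS]".
# Frames without a symbol name, e.g. "libmpi.so.1(+0xc568e)[...]", do not match.
FRAME_RE = re.compile(r"\((\w+)\+0x[0-9a-f]+\)\[0x[0-9a-f]+\]")


def count_symbols(trace_text):
    """Count how often each symbol name appears in the backtrace text."""
    return Counter(FRAME_RE.findall(trace_text))


# Three frames copied from the output quoted in this thread.
sample = (
    "/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f9b513ad110] "
    "/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b5141d68e] "
    "/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b514f1715]"
)

counts = count_symbols(sample)
for name, n in counts.most_common():
    print(f"{n:3d}  {name}")
```

Fed the full mpirun output instead of `sample`, this should show the same few symbols repeated once per crashing process, which is consistent with every process failing on the same path during MPI_Init. It does not recover which frame belongs to which PID.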