Good point. William: can you rebuild OMPI with --enable-debug and run this again so we can see where the code is breaking?
Thanks
Ralph

> On Jun 19, 2015, at 6:11 AM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> I got that, but I cannot read the stack trace (optimized build)
> my best bet is to reproduce the issue, and then find how and why
> ompi_free_list_t is segfault'ing.
> that's why I requested info about the environment
>
> iirc, ompi_free_list_t are different between master and v1.8, so an incorrect
> back port could be the root cause.
>
> Cheers,
>
> Gilles
>
> On Friday, June 19, 2015, Ralph Castain <r...@open-mpi.org> wrote:
> Gilles
>
> I was fooled too, but that isn’t the issue. The problem is that
> ompi_free_list is segfaulting:
>
>> [csclprd3-0-13:30901] *** Process received signal ***
>> [csclprd3-0-13:30901] Signal: Bus error (7)
>> [csclprd3-0-13:30901] Signal code: Non-existant physical address (2)
>> [csclprd3-0-13:30901] Failing at address: 0x7ff404351d80
>> [csclprd3-0-13:30901] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ff41453c500]
>> [csclprd3-0-13:30901] [ 1] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xd4fea)[0x7ff41481efea]
>> [csclprd3-0-13:30901] [ 2] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7ff41479f009]
>> [csclprd3-0-13:30901] [ 3] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7ff41479f110]
>> [csclprd3-0-13:30901] [ 4] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7ff41480f68e]
>> [csclprd3-0-13:30901] [ 5] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7ff4148e3715]
>> [csclprd3-0-13:30901] [ 6] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7ff4147b9ad6]
>> [csclprd3-0-13:30901] [ 7] /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7ff4147d8c60]
>> [csclprd3-0-13:30901] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
>> [csclprd3-0-13:30901] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff4141b9cdd]
>> [csclprd3-0-13:30901] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
>> [csclprd3-0-13:30901] *** End of error message ***
>
>
>
>> On Jun 19, 2015, at 5:52 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@gmail.com> wrote:
>>
>> Lane,
>>
>> could you please describe your configuration ?
>> how many sockets per node ?
>> how many cores per socket ?
>> how many threads per core ?
>> what is the minimum number of nodes needed to reproduce the issue ?
>> do all the nodes have the same configuration ?
>> if yes, what happens without --hetero-nodes ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Friday, June 19, 2015, Lane, William <william.l...@cshs.org> wrote:
>> Ralph,
>>
>> I created a hostfile that just has the names of the hosts while
>> specifying no slot information whatsoever (e.g. csclprd3-0-0)
>> and received the following errors:
>>
>> mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/
>> --hostfile hostfile-noslots --mca btl_tcp_if_include eth0 --hetero-nodes
>> /hpc/home/lanew/mpi/openmpi/ProcessColors3
>>
>> [csclprd3-6-5:14770] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
>> [csclprd3-6-5:14770] MCW rank 5 bound to socket 1[core 2[hwt 0]], socket 1[core 3[hwt 0]]: [./.][B/B]
>> [csclprd3-6-5:14770] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
>> [csclprd3-6-5:14770] MCW rank 7 bound to socket 1[core 2[hwt 0]], socket >> 1[core 3[hwt 0]]: [./.][B/B] >> [csclprd3-0-1:16437] MCW rank 24 is not bound (or bound to all available >> processors) >> [csclprd3-0-5:18925] MCW rank 48 is not bound (or bound to all available >> processors) >> [csclprd3-0-1:16437] MCW rank 25 is not bound (or bound to all available >> processors) >> [csclprd3-0-5:18925] MCW rank 49 is not bound (or bound to all available >> processors) >> [csclprd3-0-1:16437] MCW rank 20 is not bound (or bound to all available >> processors) >> [csclprd3-0-1:16437] MCW rank 21 is not bound (or bound to all available >> processors) >> [csclprd3-0-5:18925] MCW rank 44 is not bound (or bound to all available >> processors) >> [csclprd3-0-5:18925] MCW rank 45 is not bound (or bound to all available >> processors) >> [csclprd3-0-1:16437] MCW rank 22 is not bound (or bound to all available >> processors) >> [csclprd3-0-1:16437] MCW rank 23 is not bound (or bound to all available >> processors) >> [csclprd3-0-5:18925] MCW rank 46 is not bound (or bound to all available >> processors) >> [csclprd3-0-5:18925] MCW rank 47 is not bound (or bound to all available >> processors) >> [csclprd3-0-3:15946] MCW rank 36 is not bound (or bound to all available >> processors) >> [csclprd3-0-3:15946] MCW rank 37 is not bound (or bound to all available >> processors) >> [csclprd3-0-3:15946] MCW rank 32 is not bound (or bound to all available >> processors) >> [csclprd3-0-3:15946] MCW rank 33 is not bound (or bound to all available >> processors) >> [csclprd3-0-3:15946] MCW rank 34 is not bound (or bound to all available >> processors) >> [csclprd3-0-3:15946] MCW rank 35 is not bound (or bound to all available >> processors) >> [csclprd3-0-12:09165] MCW rank 124 is not bound (or bound to all available >> processors) >> [csclprd3-0-12:09165] MCW rank 125 is not bound (or bound to all available >> processors) >> [csclprd3-0-12:09165] MCW rank 120 is not bound (or bound to all 
available >> processors) >> [csclprd3-0-12:09165] MCW rank 121 is not bound (or bound to all available >> processors) >> [csclprd3-0-12:09165] MCW rank 122 is not bound (or bound to all available >> processors) >> [csclprd3-0-12:09165] MCW rank 123 is not bound (or bound to all available >> processors) >> [csclprd3-6-1:27030] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]]: [B/B][./.] >> [csclprd3-6-1:27030] MCW rank 1 bound to socket 1[core 2[hwt 0]], socket >> 1[core 3[hwt 0]]: [./.][B/B] >> [csclprd3-6-1:27030] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]]: [B/B][./.] >> [csclprd3-6-1:27030] MCW rank 3 bound to socket 1[core 2[hwt 0]], socket >> 1[core 3[hwt 0]]: [./.][B/B] >> [csclprd3-0-2:07944] MCW rank 30 is not bound (or bound to all available >> processors) >> [csclprd3-0-6:32510] MCW rank 54 is not bound (or bound to all available >> processors) >> [csclprd3-0-2:07944] MCW rank 31 is not bound (or bound to all available >> processors) >> [csclprd3-0-6:32510] MCW rank 55 is not bound (or bound to all available >> processors) >> [csclprd3-0-2:07944] MCW rank 26 is not bound (or bound to all available >> processors) >> [csclprd3-0-6:32510] MCW rank 50 is not bound (or bound to all available >> processors) >> [csclprd3-0-6:32510] MCW rank 51 is not bound (or bound to all available >> processors) >> [csclprd3-0-2:07944] MCW rank 27 is not bound (or bound to all available >> processors) >> [csclprd3-0-2:07944] MCW rank 28 is not bound (or bound to all available >> processors) >> [csclprd3-0-6:32510] MCW rank 52 is not bound (or bound to all available >> processors) >> [csclprd3-0-6:32510] MCW rank 53 is not bound (or bound to all available >> processors) >> [csclprd3-0-2:07944] MCW rank 29 is not bound (or bound to all available >> processors) >> [csclprd3-0-0:00453] MCW rank 11 bound to socket 1[core 6[hwt 0]], socket >> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], >> socket1[core 10[hwt 
0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] >> [csclprd3-0-0:00453] MCW rank 12 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket >> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >> [csclprd3-0-0:00453] MCW rank 13 bound to socket 1[core 6[hwt 0]], socket >> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket >> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] >> [csclprd3-0-0:00453] MCW rank 14 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket >> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >> [csclprd3-0-0:00453] MCW rank 15 bound to socket 1[core 6[hwt 0]], socket >> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket >> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] >> [csclprd3-0-0:00453] MCW rank 16 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket >> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >> [csclprd3-0-7:22146] MCW rank 64 bound to socket 0[core 0[hwt 0-1]], socket >> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], >> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt >> 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-7:22146] MCW rank 65 bound to socket 1[core 8[hwt 0-1]], socket >> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], >> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt >> 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-0:00453] MCW rank 17 bound to socket 1[core 6[hwt 0]], socket >> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket >> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] >> [csclprd3-0-0:00453] MCW rank 18 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket >> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >> [csclprd3-0-11:00885] MCW rank 116 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-11:00885] MCW rank 117 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core >> 14[hwt 0-1]],socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-10:20752] MCW rank 100 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-10:20752] MCW rank 101 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core >> 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-0:00453] MCW rank 19 bound to socket 1[core 6[hwt 0]], socket >> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket >> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] >> [csclprd3-0-7:22146] MCW rank 66 bound to socket 0[core 0[hwt 0-1]], socket >> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], >> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt >> 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-11:00885] MCW rank 118 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-0:00453] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket >> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >> [csclprd3-0-10:20752] MCW rank 102 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-0:00453] MCW rank 9 bound to socket 1[core 6[hwt 0]], socket >> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket >> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] >> [csclprd3-0-0:00453] MCW rank 10 bound to socket 0[core 0[hwt 0]], socket >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket >> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >> [csclprd3-0-4:32449] MCW rank 42 is not bound (or bound to all available >> processors) >> [csclprd3-0-4:32449] MCW rank 43 is not bound (or bound to all available >> processors) >> [csclprd3-0-4:32449] MCW rank 38 is not bound (or bound to all available >> processors) >> [csclprd3-0-4:32449] MCW rank 39 is not bound (or bound to all available >> processors) >> [csclprd3-0-4:32449] MCW rank 40 is not bound (or bound to all available >> processors) >> [csclprd3-0-4:32449] MCW rank 41 is not bound (or bound to all available >> processors) >> [csclprd3-0-13:30897] MCW rank 126 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB][../../../../../..] >> [csclprd3-0-8:17159] MCW rank 80 bound to socket 0[core 0[hwt 0-1]], socket >> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], >> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt >> 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-13:30897] MCW rank 127 bound to socket 1[core 6[hwt 0-1]], >> socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt >> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: >> [../../../../../..][BB/BB/BB/BB/BB/BB] >> [csclprd3-0-8:17159] MCW rank 81 bound to socket 1[core 8[hwt 0-1]], socket >> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], >> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt >> 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core >> 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: >> [../../../../../..][BB/BB/BB/BB/BB/BB] >> [csclprd3-0-8:17159] MCW rank 81 bound to socket 1[core 8[hwt 0-1]], socket >> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], >> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt >> 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-13:30897] MCW rank 128 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB][../../../../../..] >> [csclprd3-0-8:17159] MCW rank 82 bound to socket 0[core 0[hwt 0-1]], socket >> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], >> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt >> 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-13:30897] MCW rank 129 bound to socket 1[core 6[hwt 0-1]], >> socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt >> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: >> [../../../../../..][BB/BB/BB/BB/BB/BB] >> [csclprd3-0-8:17159] MCW rank 83 bound to socket 1[core 8[hwt 0-1]], socket >> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], >> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt >> 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-13:30897] MCW rank 130 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB][../../../../../..] >> [csclprd3-0-13:30897] MCW rank 131 bound to socket 1[core 6[hwt 0-1]], >> socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt >> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: >> [../../../../../..][BB/BB/BB/BB/BB/BB] >> [csclprd3-0-8:17159] MCW rank 84 bound to socket 0[core 0[hwt 0-1]], socket >> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], >> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt >> 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-8:17159] MCW rank 85 bound to socket 1[core 8[hwt 0-1]], socket >> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], >> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt >> 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-11:00885] MCW rank 119 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core >> 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-10:20752] MCW rank 103 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core >> 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-8:17159] MCW rank 86 bound to socket 0[core 0[hwt 0-1]], socket >> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], >> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt >> 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-7:22146] MCW rank 67 bound to socket 1[core 8[hwt 0-1]], socket >> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1], >> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt >> 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-11:00885] MCW rank 104 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..][csclprd3-0-10:20752] MCW >> rank 88 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], >> socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt >> 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core >> 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-8:17159] MCW rank 87 bound to socket 1[core 8[hwt 0-1]], socket >> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], >> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt >> 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-11:00885] MCW rank 105 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core >> 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-10:20752] MCW rank 89 bound to socket 1[core 8[hwt 0-1]], socket >> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], >> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt >> 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-8:17159] MCW 
rank 72 bound to socket 0[core 0[hwt 0-1]], socket >> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], >> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt >> 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-7:22146] MCW rank 68 bound to socket 0[core 0[hwt 0-1]], socket >> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], >> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt >> 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-11:00885] MCW rank 106 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-10:20752] MCW rank 90 bound to socket 0[core 0[hwt 0-1]], socket >> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], >> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt >> 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-8:17159] MCW rank 73 bound to socket 1[core 8[hwt 0-1]], socket >> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], >> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt >> 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-11:00885] MCW rank 107 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core >> 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-7:22146] MCW rank 69 bound to socket 1[core 8[hwt 0-1]], socket >> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], >> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt >> 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-8:17159] MCW rank 74 bound to socket 0[core 0[hwt 0-1]], socket >> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], >> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt >> 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-11:00885] MCW rank 108 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-7:22146] MCW rank 57 bound to socket 1[core 8[hwt 0-1]], socket >> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], >> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt >> 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-11:00885] MCW rank 114 bound to socket 0[core 0[hwt 0-1]], >> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-10:20752] MCW rank 98 bound to socket 0[core 0[hwt 0-1]], socket >> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], >> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt >> 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-11:00885] MCW rank 115 bound to socket 1[core 8[hwt 0-1]], >> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core >> 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-7:22146] MCW rank 58 bound to socket 0[core 0[hwt 0-1]], socket >> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], >> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt >> 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-10:20752] MCW rank 99 bound to socket 1[core 8[hwt 0-1]], socket >> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], >> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt >> 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-7:22146] MCW rank 59 bound to socket 1[core 8[hwt 0-1]], socket >> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], >> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt >> 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-7:22146] MCW rank 60 bound to socket 0[core 0[hwt 0-1]], socket >> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], >> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt >> 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> [csclprd3-0-7:22146] MCW rank 61 bound to socket 1[core 8[hwt 0-1]], socket >> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], >> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt >> 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-7:22146] MCW rank 62 bound to socket 0[core 0[hwt 0-1]], socket >> 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], >> socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt >> 0-1]], socket 0[core 7[hwt 0-1]]: >> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>> [csclprd3-0-7:22146] MCW rank 63 bound to socket 1[core 8[hwt 0-1]], socket >> 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], >> socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt >> 0-1]], socket 1[core 15[hwt 0-1]]: >> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> [csclprd3-0-13:30901] *** Process received signal *** >> [csclprd3-0-13:30901] Signal: Bus error (7) >> [csclprd3-0-13:30901] Signal code: Non-existant physical address (2) >> [csclprd3-0-13:30901] Failing at address: 0x7ff404351d80 >> [csclprd3-0-13:30901] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ff41453c500] >> [csclprd3-0-13:30901] [ 1] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xd4fea)[0x7ff41481efea] >> [csclprd3-0-13:30901] [ 2] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7ff41479f009] >> [csclprd3-0-13:30901] [ 3] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7ff41479f110] >> [csclprd3-0-13:30901] [ 4] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7ff41480f68e] >> [csclprd3-0-13:30901] [ 5] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7ff4148e3715] >> [csclprd3-0-13:30901] [ 6] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7ff4147b9ad6] >> [csclprd3-0-13:30901] [ 7] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7ff4147d8c60] >> [csclprd3-0-13:30901] [ 8] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >> [csclprd3-0-13:30901] [ 9] >> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff4141b9cdd] >> [csclprd3-0-13:30901] [10] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >> [csclprd3-0-13:30901] *** End of error message *** >> >> From: users [users-boun...@open-mpi.org <>] on behalf of Ralph Castain >> [r...@open-mpi.org <>] >> Sent: Thursday, June 18, 2015 5:26 PM >> To: Open MPI Users >> Subject: Re: [OMPI users] OpenMPI 1.8.6, CentOS 6.3, too many slots = crash 
>>
>> FWIW: I don’t think this actually has anything to do with the #procs you are
>> trying to run. Instead, I expect it has to do with confusion over how many
>> cores it can bind across. When you tell it to --use-hwthread-cpus, you are
>> asking us to map processes to hwthreads, and not cores. I don’t know which
>> nodes are which, but it could be that we are getting incorrect info
>> somewhere.
>>
>> Given that you are limiting the number of procs to the number of cores, is
>> there some reason why you are asking us to --use-hwthread-cpus? Why not just
>> leave it at the default core level?
>>
>> I also suspect that you would have no problems if you use --bind-to none - does
>> that in fact work?
>>
>>
>>> On Jun 18, 2015, at 4:54 PM, Lane, William <william.l...@cshs.org> wrote:
>>>
>>> I'm having a strange problem w/OpenMPI 1.8.6. If I run
>>> my OpenMPI test code (compiled against OpenMPI 1.8.6
>>> libraries) on < 131 slots I get no issues. Anything over 131
>>> errors out:
>>>
>>> mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/
>>> --hostfile hostfile-single --mca btl_tcp_if_include eth0 --hetero-nodes
>>> --use-hwthread-cpus /hpc/home/lanew/mpi/openmpi/ProcessColors3
>>>
>>> The hostfile has the number of slots restricted
>>> to the number of cores, while the max-slots includes
>>> the hyperthreading cores (e.g. csclprd3-0-0 slots=6
>>> max-slots=12).
>>>
>>> The nodes are a mix of IBM x3550 nodes; some
>>> are Sandy Bridge and others are older Xeons.
>>>
>>> I would like to add that the submit node from
>>> which I am launching mpirun has the open files
>>> soft limit (ulimit -a) set to 1024, while the hard limit
>>> (ulimit -Ha) is set to 4096. I know open file limits
>>> were an issue w/an older version of OpenMPI. The
>>> compute nodes all have their hard open files limit
>>> and soft open files limits set to 4096.
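As a rough illustration of the hostfile layout described above (slots= set to the core count, max-slots= set to the hardware-thread count), the sketch below tallies both values and compares them with the -np 132 request. Everything in it is an assumption made for illustration: the names `HostEntry`, `parse_hostfile` and `slot_totals` are hypothetical, the second sample host line is invented, and counting a host without slots= as 1 is a choice of this sketch rather than a statement about how mpirun behaves.

```python
from typing import List, NamedTuple, Tuple


class HostEntry(NamedTuple):
    name: str
    slots: int       # value of slots= (cores, in the setup described above)
    max_slots: int   # value of max-slots= (hardware threads, in that setup)


def parse_hostfile(text: str) -> List[HostEntry]:
    """Parse lines such as 'csclprd3-0-0 slots=6 max-slots=12'."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop '#' comments and blank lines
        if not line:
            continue
        name, *fields = line.split()
        opts = dict(field.split("=", 1) for field in fields if "=" in field)
        # Assumption of this sketch: a missing slots= counts as 1,
        # and a missing max-slots= falls back to the slots value.
        slots = int(opts.get("slots", 1))
        max_slots = int(opts.get("max-slots", slots))
        entries.append(HostEntry(name, slots, max_slots))
    return entries


def slot_totals(entries: List[HostEntry]) -> Tuple[int, int]:
    """Return (sum of slots, sum of max-slots) over all hosts."""
    return sum(e.slots for e in entries), sum(e.max_slots for e in entries)


if __name__ == "__main__":
    # First line is the example from the message; the second is hypothetical.
    sample = (
        "csclprd3-0-0 slots=6 max-slots=12\n"
        "csclprd3-0-7 slots=16 max-slots=32\n"
    )
    slots, max_slots = slot_totals(parse_hostfile(sample))
    requested = 132  # the -np value used in the failing run
    print(f"slots={slots} max-slots={max_slots} requested={requested}")
    if requested > slots:
        print("requested process count exceeds the summed slots= values")
```

Running it against the real hostfile-single would show whether 132 sits above or below the summed slots= figure, which may help separate an oversubscription question from the binding question raised in the reply.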
>>>
>>> Here's the output (csclprd3-0-13 is the last node
>>> listed in the hostfile hostfile-single):
>>>
>>> [csclprd3-0-13:28765] Signal: Bus error (7)
>>> [csclprd3-0-13:28765] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28765] Failing at address: 0x7f30002a8980
>>> [csclprd3-0-13:28766] *** Process received signal ***
>>> [csclprd3-0-13:28766] Signal: Bus error (7)
>>> [csclprd3-0-13:28766] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28766] Failing at address: 0x7fe137662880
>>> [csclprd3-0-13:28768] *** Process received signal ***
>>> [csclprd3-0-13:28768] Signal: Bus error (7)
>>> [csclprd3-0-13:28768] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28768] Failing at address: 0x7f9b40228a80
>>> [csclprd3-0-13:28770] *** Process received signal ***
>>> [csclprd3-0-13:28770] Signal: Bus error (7)
>>> [csclprd3-0-13:28770] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28770] Failing at address: 0x7f0de7f2bb00
>>> [csclprd3-0-13:28767] *** Process received signal ***
>>> [csclprd3-0-13:28767] Signal: Bus error (7)
>>> [csclprd3-0-13:28767] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28767] Failing at address: 0x7f9b6c2e8980
>>> [csclprd3-0-13:28764] *** Process received signal ***
>>> [csclprd3-0-13:28764] Signal: Bus error (7)
>>> [csclprd3-0-13:28764] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-13:28768] [ 3]
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f9b513ad110]
>>> [csclprd3-0-13:28768] [ 4]
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7f0df77b6009]
>>> [csclprd3-0-13:28770] [ 3]
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f0df77b6110]
>>> [csclprd3-0-13:28770] [ 4]
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b5141d68e]
>>> [csclprd3-0-13:28768] [ 5]
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b514f1715]
>>> [csclprd3-0-13:28768] [ 6]
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f30115ea68e]
>>> [csclprd3-0-13:28765] [ 5]
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f30116be715]
>>> [csclprd3-0-13:28765] [ 6]
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b7bb3b68e]
>>> [csclprd3-0-13:28767] [ 5]
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b7bc0f715]
>>> [csclprd3-0-13:28767] [ 6] [csclprd3-0-13:28764] [ 4]
>>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fa946bb768e]
>>> [csclprd3-0-13:28764] [ 5]
>>>
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fe146d4068e] >>> [csclprd3-0-13:28766] [ 5] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f0df782668e] >>> [csclprd3-0-13:28770] [ 5] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f0df78fa715] >>> [csclprd3-0-13:28770] [ 6] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f0df77d0ad6] >>> [csclprd3-0-13:28770] [ 7] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fe146e14715] >>> [csclprd3-0-13:28766] [ 6] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fe146ceaad6] >>> [csclprd3-0-13:28766] [ 7] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b513c7ad6] >>> [csclprd3-0-13:28768] [ 7] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b513e6c60] >>> [csclprd3-0-13:28768] [ 8] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> [csclprd3-0-13:28768] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b50dc7cdd] >>> [csclprd3-0-13:28768] [10] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >>> [csclprd3-0-13:28768] *** End of error message *** >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f3011594ad6] >>> [csclprd3-0-13:28765] [ 7] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f30115b3c60] >>> [csclprd3-0-13:28765] [ 8] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> [csclprd3-0-13:28765] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f3010f94cdd] >>> [csclprd3-0-13:28765] [10] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >>> [csclprd3-0-13:28765] *** End of error message *** >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b7bae5ad6] >>> [csclprd3-0-13:28767] [ 7] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b7bb04c60] >>> [csclprd3-0-13:28767] [ 8] >>> 
/hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> [csclprd3-0-13:28767] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b7b4e5cdd] >>> [csclprd3-0-13:28767] [10] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >>> [csclprd3-0-13:28767] *** End of error message *** >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fa946c8b715] >>> [csclprd3-0-13:28764] [ 6] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fa946b61ad6] >>> [csclprd3-0-13:28764] [ 7] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f0df77efc60] >>> [csclprd3-0-13:28770] [ 8] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> [csclprd3-0-13:28770] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0df71d0cdd] >>> [csclprd3-0-13:28770] [10] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >>> [csclprd3-0-13:28770] *** End of error message *** >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fe146d09c60] >>> [csclprd3-0-13:28766] [ 8] 
>>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> [csclprd3-0-13:28766] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fe1466eacdd] >>> [csclprd3-0-13:28766] [10] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >>> [csclprd3-0-13:28766] *** End of error message *** >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fa946b80c60] >>> [csclprd3-0-13:28764] [ 8] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> [csclprd3-0-13:28764] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa946561cdd] >>> [csclprd3-0-13:28764] [10] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >>> [csclprd3-0-13:28764] *** End of error message *** >>> -------------------------------------------------------------------------- >>> mpirun noticed that process rank 126 with PID 0 on node csclprd3-0-13 >>> exited on signal 7 (Bus error). >>> >>> Could a lack of the necessary NUMA libraries or the wrong version of NUMA >>> libraries be contributing to this? >>> IMPORTANT WARNING: This message is intended for the use of the person or >>> entity to which it is addressed and may contain information that is >>> privileged and confidential, the disclosure of which is governed by >>> applicable law. If the reader of this message is not the intended >>> recipient, or the employee or agent responsible for delivering it to the >>> intended recipient, you are hereby notified that any dissemination, >>> distribution or copying of this information is strictly prohibited. Thank >>> you for your cooperation. 
_______________________________________________ >>> users mailing list >>> us...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>> Link to this post: >>> http://www.open-mpi.org/community/lists/users/2015/06/27159.php >> _______________________________________________ >> users mailing list >> us...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >> Link to this post: >> http://www.open-mpi.org/community/lists/users/2015/06/27164.php > _______________________________________________ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2015/06/27166.php
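Because six processes on csclprd3-0-13 were writing their backtraces at the same time, the frames in the quoted output are interleaved and hard to follow. A rough way to read one process at a time is to filter on the `[host:pid]` prefix. The sketch below assumes the raw mpirun output was captured to a file with one record per line (the mail wrapping above splits each frame across two lines); the `trace_for` helper and the temporary file are hypothetical, and the sample lines are a few frames copied from the trace and rejoined.

```shell
#!/bin/sh
# Hypothetical capture file; in practice something like: mpirun ... 2>&1 | tee mpirun.log
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[csclprd3-0-13:28768] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28765] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28768] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b50dc7cdd]
[csclprd3-0-13:28765] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f3010f94cdd]
EOF

# Print only the lines emitted by one process.
# $1 = "host:pid" as it appears inside the brackets, $2 = log file
trace_for() {
    grep -F "[$1]" "$2"
}

ONE=$(trace_for "csclprd3-0-13:28768" "$LOG")
printf '%s\n' "$ONE"
rm -f "$LOG"
```

`grep -F` is used so the square brackets are matched literally and not treated as a character class.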