You shouldn't need any special flags for mpicc or mpirun to replicate the problem. The --enable-debug build will just let us see the line numbers associated with the crash so we can narrow down the problem. Once we get that, we may need to rerun with specific params to narrow it down further.
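For reference, here is a minimal sketch of one way to capture the crash location and per-thread backtraces, assuming core dumps can be enabled on the compute nodes; the source-file name and core-file name below are illustrative, not taken from the thread:

    mpicc -g -O0 ProcessColors3.c -o ProcessColors3   # optional: debug symbols in the test code itself
    ulimit -c unlimited                               # allow core files on the nodes that hit the bus error
    mpirun -np 132 --prefix /hpc/apps/mpi/openmpi/1.8.6/ --hostfile hostfile-noslots ./ProcessColors3
    gdb ./ProcessColors3 core.<pid>                   # on the node that dumped core, e.g. csclprd3-0-13
    (gdb) bt                                          # backtrace of the faulting thread, with line numbers from the --enable-debug build
    (gdb) thread apply all bt                         # backtraces of all threads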
BTW: when you get the backtrace, could you check the other threads as well? There are several threads running underneath now, and it would help to get the backtrace for each of them just to ensure there isn't something funny going on.

Thanks
Ralph

On Tue, Jun 23, 2015 at 12:19 PM, Lane, William <william.l...@cshs.org> wrote:

> Ralph,
>
> I've had OpenMPI 1.8.6 installed on our cluster w/the --enable-debug
> option. Here's what I think are the relevant flags returned from ompi_info:
>
> openMPI 1.8.6 build info
> Fort MPI_SIZEOF: no
> C profiling: yes
> C++ profiling: yes
> Fort mpif.h profiling: yes
> Fort use mpi profiling: yes
> Fort use mpi_f08 prof: no
> C++ exceptions: no
> Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
> Sparse Groups: no
> Internal debug support: yes
> MPI interface warnings: yes
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> dl support: yes
> Heterogeneous support: no
> mpirun default --prefix: no
>
> Do I need to compile my OpenMPI C test code w/any special switches passed to mpicc?
>
> Are there any special switches I should use with mpirun to run my job?
>
> Thanks for your help w/these issues.
>
> -Bill L.
> ------------------------------
> *From:* users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> *Sent:* Friday, June 19, 2015 6:40 AM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] OpenMPI 1.8.6, CentOS 6.3, too many slots = crash
>
> Good point
>
> William: can you rebuild OMPI with --enable-debug and run this again so we can see where the code is breaking?
>
> Thanks
> Ralph
>
>
> On Jun 19, 2015, at 6:11 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> I got that, but I cannot read the stack trace (optimized build).
> My best bet is to reproduce the issue, and then find how and why ompi_free_list_t is segfaulting.
> That's why I requested info about the environment.
>
> IIRC, ompi_free_list_t is different between master and v1.8, so an incorrect back port could be the root cause.
>
> Cheers,
>
> Gilles
>
> On Friday, June 19, 2015, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Gilles
>>
>> I was fooled too, but that isn't the issue.
The problem is that >> ompi_free_list is segfaulting: >> >> [csclprd3-0-13:30901] *** Process received signal *** >>> [csclprd3-0-13:30901] Signal: Bus error (7) >>> [csclprd3-0-13:30901] Signal code: Non-existant physical address (2) >>> [csclprd3-0-13:30901] Failing at address: 0x7ff404351d80 >>> [csclprd3-0-13:30901] [ 0] >>> /lib64/libpthread.so.0(+0xf500)[0x7ff41453c500] >>> [csclprd3-0-13:30901] [ 1] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xd4fea)[0x7ff41481efea] >>> [csclprd3-0-13:30901] [ 2] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7ff41479f009] >>> [csclprd3-0-13:30901] [ 3] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7ff41479f110] >>> [csclprd3-0-13:30901] [ 4] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7ff41480f68e] >>> [csclprd3-0-13:30901] [ 5] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7ff4148e3715] >>> [csclprd3-0-13:30901] [ 6] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7ff4147b9ad6] >>> [csclprd3-0-13:30901] [ 7] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7ff4147d8c60] >>> [csclprd3-0-13:30901] [ 8] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> [csclprd3-0-13:30901] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff4141b9cdd] >>> [csclprd3-0-13:30901] [10] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >>> [csclprd3-0-13:30901] *** End of error message *** >>> >> >> >> >> On Jun 19, 2015, at 5:52 AM, Gilles Gouaillardet < >> gilles.gouaillar...@gmail.com <http://UrlBlockedError.aspx>> wrote: >> >> Lane, >> >> could you please describe your configuration ? >> how many sockets per node ? >> how many cores per socket ? >> how many threads per core ? >> what is the minimum number of nodes needed to reproduce the issue ? >> do all the nodes have the same configuration ? >> if yes, what happens without --hetero-nodes ? >> >> Cheers, >> >> Gilles >> >> On Friday, June 19, 2015, Lane, William <william.l...@cshs.org >> <http://UrlBlockedError.aspx>> wrote: >> >>> Ralph, >>> >>> I created a hostfile that just has the names of the hosts while >>> specifying no slot information whatsoever (e.g. csclprd3-0-0) >>> and received the following errors: >>> >>> mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/ >>> --hostfile hostfile-noslots --mca btl_tcp_if_include eth0 --hetero-nodes >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3 >>> >>> [csclprd3-6-5:14770] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket >>> 0[core 1[hwt 0]]: [B/B][./.] >>> [csclprd3-6-5:14770] MCW rank 5 bound to socket 1[core 2[hwt 0]], socket >>> 1[core 3[hwt 0]]: [./.][B/B] >>> [csclprd3-6-5:14770] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket >>> 0[core 1[hwt 0]]: [B/B][./.] 
>>> [csclprd3-6-5:14770] MCW rank 7 bound to socket 1[core 2[hwt 0]], socket >>> 1[core 3[hwt 0]]: [./.][B/B] >>> [csclprd3-0-1:16437] MCW rank 24 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-5:18925] MCW rank 48 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-1:16437] MCW rank 25 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-5:18925] MCW rank 49 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-1:16437] MCW rank 20 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-1:16437] MCW rank 21 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-5:18925] MCW rank 44 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-5:18925] MCW rank 45 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-1:16437] MCW rank 22 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-1:16437] MCW rank 23 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-5:18925] MCW rank 46 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-5:18925] MCW rank 47 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-3:15946] MCW rank 36 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-3:15946] MCW rank 37 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-3:15946] MCW rank 32 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-3:15946] MCW rank 33 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-3:15946] MCW rank 34 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-3:15946] MCW rank 35 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-12:09165] MCW rank 124 is not bound (or bound to all >>> available processors) >>> [csclprd3-0-12:09165] MCW rank 125 is not bound (or bound to all >>> available processors) >>> [csclprd3-0-12:09165] MCW rank 120 is not bound (or bound to all >>> available processors) >>> [csclprd3-0-12:09165] MCW rank 121 is not bound (or bound to all >>> available processors) >>> [csclprd3-0-12:09165] MCW rank 122 is not bound (or bound to all >>> available processors) >>> [csclprd3-0-12:09165] MCW rank 123 is not bound (or bound to all >>> available processors) >>> [csclprd3-6-1:27030] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket >>> 0[core 1[hwt 0]]: [B/B][./.] >>> [csclprd3-6-1:27030] MCW rank 1 bound to socket 1[core 2[hwt 0]], socket >>> 1[core 3[hwt 0]]: [./.][B/B] >>> [csclprd3-6-1:27030] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket >>> 0[core 1[hwt 0]]: [B/B][./.] 
>>> [csclprd3-6-1:27030] MCW rank 3 bound to socket 1[core 2[hwt 0]], socket >>> 1[core 3[hwt 0]]: [./.][B/B] >>> [csclprd3-0-2:07944] MCW rank 30 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-6:32510] MCW rank 54 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-2:07944] MCW rank 31 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-6:32510] MCW rank 55 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-2:07944] MCW rank 26 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-6:32510] MCW rank 50 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-6:32510] MCW rank 51 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-2:07944] MCW rank 27 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-2:07944] MCW rank 28 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-6:32510] MCW rank 52 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-6:32510] MCW rank 53 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-2:07944] MCW rank 29 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-0:00453] MCW rank 11 bound to socket 1[core 6[hwt 0]], >>> socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], >>> socket1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: >>> [./././././.][B/B/B/B/B/B] >>> [csclprd3-0-0:00453] MCW rank 12 bound to socket 0[core 0[hwt 0]], >>> socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], >>> socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >>> [csclprd3-0-0:00453] MCW rank 13 bound to socket 1[core 6[hwt 0]], >>> socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], >>> socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: >>> [./././././.][B/B/B/B/B/B] >>> [csclprd3-0-0:00453] MCW rank 14 bound to socket 0[core 0[hwt 0]], >>> socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], >>> socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >>> [csclprd3-0-0:00453] MCW rank 15 bound to socket 1[core 6[hwt 0]], >>> socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], >>> socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: >>> [./././././.][B/B/B/B/B/B] >>> [csclprd3-0-0:00453] MCW rank 16 bound to socket 0[core 0[hwt 0]], >>> socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], >>> socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >>> [csclprd3-0-7:22146] MCW rank 64 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>>> [csclprd3-0-7:22146] MCW rank 65 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-0:00453] MCW rank 17 bound to socket 1[core 6[hwt 0]], >>> socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], >>> socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: >>> [./././././.][B/B/B/B/B/B] >>> [csclprd3-0-0:00453] MCW rank 18 bound to socket 0[core 0[hwt 0]], >>> socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], >>> socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >>> [csclprd3-0-11:00885] MCW rank 116 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-11:00885] MCW rank 117 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]],socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-10:20752] MCW rank 100 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-10:20752] MCW rank 101 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-0:00453] MCW rank 19 bound to socket 1[core 6[hwt 0]], >>> socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], >>> socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: >>> [./././././.][B/B/B/B/B/B] >>> [csclprd3-0-7:22146] MCW rank 66 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-11:00885] MCW rank 118 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-0:00453] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket >>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket >>> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] 
>>> [csclprd3-0-10:20752] MCW rank 102 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-0:00453] MCW rank 9 bound to socket 1[core 6[hwt 0]], socket >>> 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket >>> 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] >>> [csclprd3-0-0:00453] MCW rank 10 bound to socket 0[core 0[hwt 0]], >>> socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], >>> socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] >>> [csclprd3-0-4:32449] MCW rank 42 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-4:32449] MCW rank 43 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-4:32449] MCW rank 38 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-4:32449] MCW rank 39 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-4:32449] MCW rank 40 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-4:32449] MCW rank 41 is not bound (or bound to all available >>> processors) >>> [csclprd3-0-13:30897] MCW rank 126 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB][../../../../../..] >>> [csclprd3-0-8:17159] MCW rank 80 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-13:30897] MCW rank 127 bound to socket 1[core 6[hwt 0-1]], >>> socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt >>> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: >>> [../../../../../..][BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-8:17159] MCW rank 81 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], >>> socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: >>> [../../../../../..][BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-8:17159] MCW rank 81 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-13:30897] MCW rank 128 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB][../../../../../..] >>> [csclprd3-0-8:17159] MCW rank 82 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>>> [csclprd3-0-13:30897] MCW rank 129 bound to socket 1[core 6[hwt 0-1]], >>> socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt >>> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: >>> [../../../../../..][BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-8:17159] MCW rank 83 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-13:30897] MCW rank 130 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB][../../../../../..] >>> [csclprd3-0-13:30897] MCW rank 131 bound to socket 1[core 6[hwt 0-1]], >>> socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt >>> 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: >>> [../../../../../..][BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-8:17159] MCW rank 84 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-8:17159] MCW rank 85 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-11:00885] MCW rank 119 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-10:20752] MCW rank 103 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-8:17159] MCW rank 86 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>>> [csclprd3-0-7:22146] MCW rank 67 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core >>> 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-11:00885] MCW rank 104 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..][csclprd3-0-10:20752] MCW >>> rank 88 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], >>> socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt >>> 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core >>> 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-8:17159] MCW rank 87 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-11:00885] MCW rank 105 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-10:20752] MCW rank 89 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-8:17159] MCW rank 72 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-7:22146] MCW rank 68 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-11:00885] MCW rank 106 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-10:20752] MCW rank 90 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>>> [csclprd3-0-8:17159] MCW rank 73 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-11:00885] MCW rank 107 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-7:22146] MCW rank 69 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-8:17159] MCW rank 74 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-11:00885] MCW rank 108 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-7:22146] MCW rank 57 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-11:00885] MCW rank 114 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-10:20752] MCW rank 98 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-11:00885] MCW rank 115 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-7:22146] MCW rank 58 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
>>> [csclprd3-0-10:20752] MCW rank 99 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-7:22146] MCW rank 59 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-7:22146] MCW rank 60 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-7:22146] MCW rank 61 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-7:22146] MCW rank 62 bound to socket 0[core 0[hwt 0-1]], >>> socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt >>> 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core >>> 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: >>> [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >>> [csclprd3-0-7:22146] MCW rank 63 bound to socket 1[core 8[hwt 0-1]], >>> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt >>> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket >>> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: >>> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >>> [csclprd3-0-13:30901] *** Process received signal *** >>> [csclprd3-0-13:30901] Signal: Bus error (7) >>> [csclprd3-0-13:30901] Signal code: Non-existant physical address (2) >>> [csclprd3-0-13:30901] Failing at address: 0x7ff404351d80 >>> [csclprd3-0-13:30901] [ 0] >>> /lib64/libpthread.so.0(+0xf500)[0x7ff41453c500] >>> [csclprd3-0-13:30901] [ 1] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xd4fea)[0x7ff41481efea] >>> [csclprd3-0-13:30901] [ 2] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7ff41479f009] >>> [csclprd3-0-13:30901] [ 3] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7ff41479f110] >>> [csclprd3-0-13:30901] [ 4] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7ff41480f68e] >>> [csclprd3-0-13:30901] [ 5] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7ff4148e3715] >>> [csclprd3-0-13:30901] [ 6] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7ff4147b9ad6] >>> [csclprd3-0-13:30901] [ 7] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7ff4147d8c60] >>> [csclprd3-0-13:30901] [ 8] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> [csclprd3-0-13:30901] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff4141b9cdd] >>> [csclprd3-0-13:30901] [10] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >>> [csclprd3-0-13:30901] *** End of error message *** >>> >>> ------------------------------ >>> *From:* users [users-boun...@open-mpi.org] on behalf of 
Ralph Castain [r...@open-mpi.org]
>>> *Sent:* Thursday, June 18, 2015 5:26 PM
>>> *To:* Open MPI Users
>>> *Subject:* Re: [OMPI users] OpenMPI 1.8.6, CentOS 6.3, too many slots = crash
>>>
>>> FWIW: I don't think this actually has anything to do with the #procs you are trying to run. Instead, I expect it has to do with confusion over how many cores it can bind across. When you tell it to --use-hwthread-cpus, you are asking us to map processes to hwthreads, and not cores. I don't know which nodes are which, but it could be that we are getting incorrect info somewhere.
>>>
>>> Given that you are limiting the number of procs to the number of cores, is there some reason why you are asking us to --use-hwthread-cpus? Why not just leave it at the default core level?
>>>
>>> I also suspect that you would have no problems if you --bind-to none - does that in fact work?
>>>
>>>
>>> On Jun 18, 2015, at 4:54 PM, Lane, William <william.l...@cshs.org> wrote:
>>>
>>> I'm having a strange problem w/OpenMPI 1.8.6. If I run my OpenMPI test code (compiled against OpenMPI 1.8.6 libraries) on < 131 slots I get no issues. Anything over 131 errors out:
>>>
>>> mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/ --hostfile hostfile-single --mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus /hpc/home/lanew/mpi/openmpi/ProcessColors3
>>>
>>> The hostfile has the number of slots restricted to the number of cores, while the max-slots includes the hyperthreading cores (e.g. csclprd3-0-0 slots=6 max-slots=12).
>>>
>>> The nodes are a mix of IBM x3550 nodes; some are Sandy Bridges and others are older Xeons.
>>>
>>> I would like to add that the submit node from which I am launching mpirun has the open files soft limit (ulimit -a) set to 1024, while the hard limit (ulimit -Ha) is set to 4096. I know open file limits were an issue w/an older version of OpenMPI. The compute nodes all have their hard open files limit and soft open files limits set to 4096.
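As an aside, a quick sketch of how the open-files limits mentioned above can be checked and, for the current shell, raised on the submit node; whether the 1024 soft limit actually matters for this crash is left open in the thread:

    ulimit -n        # current soft limit on open files
    ulimit -Hn       # hard limit on open files
    ulimit -n 4096   # raise the soft limit (up to the hard limit) for this shell and its children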
>>> >>> Here's the output (csclprd3-0-13 is the last node >>> listed in the hostfile hostfile-single): >>> >>> [csclprd3-0-13:28765] Signal: Bus error (7) >>> [csclprd3-0-13:28765] Signal code: Non-existant physical address (2) >>> [csclprd3-0-13:28765] Failing at address: 0x7f30002a8980 >>> [csclprd3-0-13:28766] *** Process received signal *** >>> [csclprd3-0-13:28766] Signal: Bus error (7) >>> [csclprd3-0-13:28766] Signal code: Non-existant physical address (2) >>> [csclprd3-0-13:28766] Failing at address: 0x7fe137662880 >>> [csclprd3-0-13:28768] *** Process received signal *** >>> [csclprd3-0-13:28768] Signal: Bus error (7) >>> [csclprd3-0-13:28768] Signal code: Non-existant physical address (2) >>> [csclprd3-0-13:28768] Failing at address: 0x7f9b40228a80 >>> [csclprd3-0-13:28770] *** Process received signal *** >>> [csclprd3-0-13:28770] Signal: Bus error (7) >>> [csclprd3-0-13:28770] Signal code: Non-existant physical address (2) >>> [csclprd3-0-13:28770] Failing at address: 0x7f0de7f2bb00 >>> [csclprd3-0-13:28767] *** Process received signal *** >>> [csclprd3-0-13:28767] Signal: Bus error (7) >>> [csclprd3-0-13:28767] Signal code: Non-existant physical address (2) >>> [csclprd3-0-13:28767] Failing at address: 0x7f9b6c2e8980 >>> [csclprd3-0-13:28764] *** Process received signal *** >>> [csclprd3-0-13:28764] Signal: Bus error (7) >>> [csclprd3-0-13:28764] Signal code: Non-existant physical address (2) >>> [csclprd3-0-13:28765] Signal: Bus error (7) >>> [csclprd3-0-13:28765] Signal code: Non-existant physical address (2) >>> [csclprd3-0-13:28765] Failing at address: 0x7f30002a8980 >>> [csclprd3-0-13:28766] *** Process received signal *** >>> [csclprd3-0-13:28766] Signal: Bus error (7) >>> [csclprd3-0-13:28766] Signal code: Non-existant physical address (2) >>> [csclprd3-0-13:28766] Failing at address: 0x7fe137662880 >>> [csclprd3-0-13:28768] *** Process received signal *** >>> [csclprd3-0-13:28768] Signal: Bus error (7) >>> [csclprd3-0-13:28768] Signal code: Non-existant physical address (2) >>> [csclprd3-0-13:28768] Failing at address: 0x7f9b40228a80 >>> [csclprd3-0-13:28770] *** Process received signal *** >>> [csclprd3-0-13:28770] Signal: Bus error (7) >>> [csclprd3-0-13:28770] Signal code: Non-existant physical address (2) >>> [csclprd3-0-13:28770] Failing at address: 0x7f0de7f2bb00 >>> [csclprd3-0-13:28767] *** Process received signal *** >>> [csclprd3-0-13:28767] Signal: Bus error (7) >>> [csclprd3-0-13:28767] Signal code: Non-existant physical address (2) >>> [csclprd3-0-13:28767] Failing at address: 0x7f9b6c2e8980 >>> [csclprd3-0-13:28764] *** Process received signal *** >>> [csclprd3-0-13:28764] Signal: Bus error (7) >>> [csclprd3-0-13:28764] Signal code: Non-existant physical address (2) >>> [csclprd3-0-13:28768] [ 3] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f9b513ad110] >>> [csclprd3-0-13:28768] [ 4] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7f0df77b6009] >>> [csclprd3-0-13:28770] [ 3] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f0df77b6110] >>> [csclprd3-0-13:28770] [ 4] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b5141d68e] >>> [csclprd3-0-13:28768] [ 5] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b514f1715] >>> [csclprd3-0-13:28768] [ 6] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f30115ea68e] >>> [csclprd3-0-13:28765] [ 5] >>> 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f30116be715] >>> [csclprd3-0-13:28765] [ 6] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b7bb3b68e] >>> [csclprd3-0-13:28767] [ 5] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b7bc0f715] >>> [csclprd3-0-13:28767] [ 6] [csclprd3-0-13:28764] [ 4] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fa946bb768e] >>> [csclprd3-0-13:28764] [ 5] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fe146d4068e] >>> [csclprd3-0-13:28766] [ 5] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f0df782668e] >>> [csclprd3-0-13:28770] [ 5] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f0df78fa715] >>> [csclprd3-0-13:28770] [ 6] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f0df77d0ad6] >>> [csclprd3-0-13:28770] [ 7] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fe146e14715] >>> [csclprd3-0-13:28766] [ 6] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fe146ceaad6] >>> [csclprd3-0-13:28766] [ 7] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b513c7ad6] >>> [csclprd3-0-13:28768] [ 7] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b513e6c60] >>> [csclprd3-0-13:28768] [ 8] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> [csclprd3-0-13:28768] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b50dc7cdd] >>> [csclprd3-0-13:28768] [10] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >>> [csclprd3-0-13:28768] *** End of error message *** >>> >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f3011594ad6] >>> [csclprd3-0-13:28765] [ 7] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f30115b3c60] >>> [csclprd3-0-13:28765] [ 8] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> [csclprd3-0-13:28765] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f3010f94cdd] >>> [csclprd3-0-13:28765] [10] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >>> [csclprd3-0-13:28765] *** End of error message *** >>> >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b7bae5ad6] >>> [csclprd3-0-13:28767] [ 7] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b7bb04c60] >>> [csclprd3-0-13:28767] [ 8] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> [csclprd3-0-13:28767] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b7b4e5cdd] >>> [csclprd3-0-13:28767] [10] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >>> [csclprd3-0-13:28767] *** End of error message *** >>> >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fa946c8b715] >>> [csclprd3-0-13:28764] [ 6] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fa946b61ad6] >>> [csclprd3-0-13:28764] [ 7] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f0df77efc60] >>> [csclprd3-0-13:28770] [ 8] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> [csclprd3-0-13:28770] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0df71d0cdd] >>> [csclprd3-0-13:28770] [10] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >>> [csclprd3-0-13:28770] *** End of error message *** >>> >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fe146d09c60] >>> [csclprd3-0-13:28766] [ 8] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> 
[csclprd3-0-13:28766] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fe1466eacdd] >>> [csclprd3-0-13:28767] *** End of error message *** >>> >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fa946c8b715] >>> [csclprd3-0-13:28764] [ 6] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fa946b61ad6] >>> [csclprd3-0-13:28764] [ 7] >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f0df77efc60] >>> [csclprd3-0-13:28770] [ 8] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> [csclprd3-0-13:28770] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0df71d0cdd] >>> [csclprd3-0-13:28770] [10] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >>> [csclprd3-0-13:28770] *** End of error message *** >>> >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fe146d09c60] >>> [csclprd3-0-13:28766] [ 8] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> [csclprd3-0-13:28766] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fe1466eacdd] >>> [csclprd3-0-13:28766] [10] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >>> [csclprd3-0-13:28766] *** End of error message *** >>> >>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fa946b80c60] >>> [csclprd3-0-13:28764] [ 8] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >>> [csclprd3-0-13:28764] [ 9] >>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa946561cdd] >>> [csclprd3-0-13:28764] [10] >>> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >>> [csclprd3-0-13:28764] *** End of error message *** >>> >>> -------------------------------------------------------------------------- >>> mpirun noticed that process rank 126 with PID 0 on node csclprd3-0-13 >>> exited on signal 7 (Bus error). >>> >>> Could a lack of the necessary NUMA libraries or the wrong version of NUMA >>> libraries be contributing to this? >>> IMPORTANT WARNING: This message is intended for the use of the person >>> or entity to which it is addressed and may contain information that is >>> privileged and confidential, the disclosure of which is governed by >>> applicable law. If the reader of this message is not the intended >>> recipient, or the employee or agent responsible for delivering it to the >>> intended recipient, you are hereby notified that any dissemination, >>> distribution or copying of this information is strictly prohibited. Thank >>> you for your cooperation. _______________________________________________ >>> users mailing list >>> us...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>> Link to this post: >>> http://www.open-mpi.org/community/lists/users/2015/06/27159.php >>> >>> >>> IMPORTANT WARNING: This message is intended for the use of the person >>> or entity to which it is addressed and may contain information that is >>> privileged and confidential, the disclosure of which is governed by >>> applicable law. If the reader of this message is not the intended >>> recipient, or the employee or agent responsible for delivering it to the >>> intended recipient, you are hereby notified that any dissemination, >>> distribution or copying of this information is strictly prohibited. Thank >>> you for your cooperation. 