Gilles,

The nodes do not all have the same configuration. There are probably 6 different
hardware configurations (as to memory, number of sockets populated, types of 
CPU).

Some of the systems are older dual-core Xeons (5160 and L5240 CPUs) installed
in a blade chassis. Some of these blades have as little as 4 GiB of memory and
others have 16 GiB. They all have two Xeon CPUs per blade (4 cores across 2
separate sockets).

The newer systems are IBM x3550 servers. Some of these systems have a single
6-core Intel Xeon E5645, while others feature the server version of the Intel
Sandy Bridge CPU. Some of them have only a single socket populated, while
others have two. All of these systems have 72 GiB of memory or more.

The minimum number of requested slots (-np) needed to reproduce the issue
seems to be 132; anything above 131 fails.

-Bill L.

-------------------------------------------------------------------

From: users [users-boun...@open-mpi.org] on behalf of Gilles Gouaillardet 
[gilles.gouaillar...@gmail.com]
Sent: Friday, June 19, 2015 5:52 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.8.6, CentOS 6.3, too many slots = crash

Lane,

Could you please describe your configuration?
How many sockets per node?
How many cores per socket?
How many threads per core?
What is the minimum number of nodes needed to reproduce the issue?
Do all the nodes have the same configuration?
If yes, what happens without --hetero-nodes?
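
One way to collect the answers to the topology questions above is to run
lscpu on each node and pull out the three relevant fields. The following is
only a sketch: the function name parse_lscpu and the sample text are made up
for illustration, and real values have to come from the nodes themselves.

```python
def parse_lscpu(text):
    """Extract sockets, cores per socket and threads per core from lscpu output."""
    wanted = {
        "Socket(s)": "sockets",
        "Core(s) per socket": "cores_per_socket",
        "Thread(s) per core": "threads_per_core",
    }
    topo = {}
    for line in text.splitlines():
        # lscpu prints "Field name:   value"; split on the first colon only.
        key, sep, value = line.partition(":")
        name = wanted.get(key.strip())
        if sep and name:
            topo[name] = int(value.strip())
    return topo


# Hypothetical sample, not taken from any node in this thread.
SAMPLE = """\
Architecture:          x86_64
CPU(s):                32
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
"""

topo = parse_lscpu(SAMPLE)
print(topo)
```

Running this against the lscpu output of one node per hardware type would
give a small table that answers the first three questions directly.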

Cheers,

Gilles

On Friday, June 19, 2015, Lane, William <william.l...@cshs.org> wrote:
Ralph,

I created a hostfile that just has the names of the hosts while
specifying no slot information whatsoever (e.g. csclprd3-0-0)
and received the following errors:

mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/ 
--hostfile hostfile-noslots --mca btl_tcp_if_include eth0 --hetero-nodes 
/hpc/home/lanew/mpi/openmpi/ProcessColors3

[csclprd3-6-5:14770] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]]: [B/B][./.]
[csclprd3-6-5:14770] MCW rank 5 bound to socket 1[core 2[hwt 0]], socket 1[core 
3[hwt 0]]: [./.][B/B]
[csclprd3-6-5:14770] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]]: [B/B][./.]
[csclprd3-6-5:14770] MCW rank 7 bound to socket 1[core 2[hwt 0]], socket 1[core 
3[hwt 0]]: [./.][B/B]
[csclprd3-0-1:16437] MCW rank 24 is not bound (or bound to all available 
processors)
[csclprd3-0-5:18925] MCW rank 48 is not bound (or bound to all available 
processors)
[csclprd3-0-1:16437] MCW rank 25 is not bound (or bound to all available 
processors)
[csclprd3-0-5:18925] MCW rank 49 is not bound (or bound to all available 
processors)
[csclprd3-0-1:16437] MCW rank 20 is not bound (or bound to all available 
processors)
[csclprd3-0-1:16437] MCW rank 21 is not bound (or bound to all available 
processors)
[csclprd3-0-5:18925] MCW rank 44 is not bound (or bound to all available 
processors)
[csclprd3-0-5:18925] MCW rank 45 is not bound (or bound to all available 
processors)
[csclprd3-0-1:16437] MCW rank 22 is not bound (or bound to all available 
processors)
[csclprd3-0-1:16437] MCW rank 23 is not bound (or bound to all available 
processors)
[csclprd3-0-5:18925] MCW rank 46 is not bound (or bound to all available 
processors)
[csclprd3-0-5:18925] MCW rank 47 is not bound (or bound to all available 
processors)
[csclprd3-0-3:15946] MCW rank 36 is not bound (or bound to all available 
processors)
[csclprd3-0-3:15946] MCW rank 37 is not bound (or bound to all available 
processors)
[csclprd3-0-3:15946] MCW rank 32 is not bound (or bound to all available 
processors)
[csclprd3-0-3:15946] MCW rank 33 is not bound (or bound to all available 
processors)
[csclprd3-0-3:15946] MCW rank 34 is not bound (or bound to all available 
processors)
[csclprd3-0-3:15946] MCW rank 35 is not bound (or bound to all available 
processors)
[csclprd3-0-12:09165] MCW rank 124 is not bound (or bound to all available 
processors)
[csclprd3-0-12:09165] MCW rank 125 is not bound (or bound to all available 
processors)
[csclprd3-0-12:09165] MCW rank 120 is not bound (or bound to all available 
processors)
[csclprd3-0-12:09165] MCW rank 121 is not bound (or bound to all available 
processors)
[csclprd3-0-12:09165] MCW rank 122 is not bound (or bound to all available 
processors)
[csclprd3-0-12:09165] MCW rank 123 is not bound (or bound to all available 
processors)
[csclprd3-6-1:27030] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]]: [B/B][./.]
[csclprd3-6-1:27030] MCW rank 1 bound to socket 1[core 2[hwt 0]], socket 1[core 
3[hwt 0]]: [./.][B/B]
[csclprd3-6-1:27030] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]]: [B/B][./.]
[csclprd3-6-1:27030] MCW rank 3 bound to socket 1[core 2[hwt 0]], socket 1[core 
3[hwt 0]]: [./.][B/B]
[csclprd3-0-2:07944] MCW rank 30 is not bound (or bound to all available 
processors)
[csclprd3-0-6:32510] MCW rank 54 is not bound (or bound to all available 
processors)
[csclprd3-0-2:07944] MCW rank 31 is not bound (or bound to all available 
processors)
[csclprd3-0-6:32510] MCW rank 55 is not bound (or bound to all available 
processors)
[csclprd3-0-2:07944] MCW rank 26 is not bound (or bound to all available 
processors)
[csclprd3-0-6:32510] MCW rank 50 is not bound (or bound to all available 
processors)
[csclprd3-0-6:32510] MCW rank 51 is not bound (or bound to all available 
processors)
[csclprd3-0-2:07944] MCW rank 27 is not bound (or bound to all available 
processors)
[csclprd3-0-2:07944] MCW rank 28 is not bound (or bound to all available 
processors)
[csclprd3-0-6:32510] MCW rank 52 is not bound (or bound to all available 
processors)
[csclprd3-0-6:32510] MCW rank 53 is not bound (or bound to all available 
processors)
[csclprd3-0-2:07944] MCW rank 29 is not bound (or bound to all available 
processors)
[csclprd3-0-0:00453] MCW rank 11 bound to socket 1[core 6[hwt 0]], socket 
1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], 
socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[csclprd3-0-0:00453] MCW rank 12 bound to socket 0[core 0[hwt 0]], socket 
0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 
0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[csclprd3-0-0:00453] MCW rank 13 bound to socket 1[core 6[hwt 0]], socket 
1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 
1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[csclprd3-0-0:00453] MCW rank 14 bound to socket 0[core 0[hwt 0]], socket 
0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 
0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[csclprd3-0-0:00453] MCW rank 15 bound to socket 1[core 6[hwt 0]], socket 
1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 
1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[csclprd3-0-0:00453] MCW rank 16 bound to socket 0[core 0[hwt 0]], socket 
0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 
0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[csclprd3-0-7:22146] MCW rank 64 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-7:22146] MCW rank 65 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-0:00453] MCW rank 17 bound to socket 1[core 6[hwt 0]], socket 
1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 
1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[csclprd3-0-0:00453] MCW rank 18 bound to socket 0[core 0[hwt 0]], socket 
0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 
0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[csclprd3-0-11:00885] MCW rank 116 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-11:00885] MCW rank 117 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-10:20752] MCW rank 100 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-10:20752] MCW rank 101 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-0:00453] MCW rank 19 bound to socket 1[core 6[hwt 0]], socket 
1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 
1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[csclprd3-0-7:22146] MCW rank 66 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-11:00885] MCW rank 118 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-0:00453] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 0[core 
1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 
4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[csclprd3-0-10:20752] MCW rank 102 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-0:00453] MCW rank 9 bound to socket 1[core 6[hwt 0]], socket 1[core 
7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 
10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
[csclprd3-0-0:00453] MCW rank 10 bound to socket 0[core 0[hwt 0]], socket 
0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 
0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
[csclprd3-0-4:32449] MCW rank 42 is not bound (or bound to all available 
processors)
[csclprd3-0-4:32449] MCW rank 43 is not bound (or bound to all available 
processors)
[csclprd3-0-4:32449] MCW rank 38 is not bound (or bound to all available 
processors)
[csclprd3-0-4:32449] MCW rank 39 is not bound (or bound to all available 
processors)
[csclprd3-0-4:32449] MCW rank 40 is not bound (or bound to all available 
processors)
[csclprd3-0-4:32449] MCW rank 41 is not bound (or bound to all available 
processors)
[csclprd3-0-13:30897] MCW rank 126 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB][../../../../../..]
[csclprd3-0-8:17159] MCW rank 80 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-13:30897] MCW rank 127 bound to socket 1[core 6[hwt 0-1]], socket 
1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], 
socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: 
[../../../../../..][BB/BB/BB/BB/BB/BB]
[csclprd3-0-8:17159] MCW rank 81 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-13:30897] MCW rank 128 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB][../../../../../..]
[csclprd3-0-8:17159] MCW rank 82 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-13:30897] MCW rank 129 bound to socket 1[core 6[hwt 0-1]], socket 
1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], 
socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: 
[../../../../../..][BB/BB/BB/BB/BB/BB]
[csclprd3-0-8:17159] MCW rank 83 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-13:30897] MCW rank 130 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB][../../../../../..]
[csclprd3-0-13:30897] MCW rank 131 bound to socket 1[core 6[hwt 0-1]], socket 
1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], 
socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: 
[../../../../../..][BB/BB/BB/BB/BB/BB]
[csclprd3-0-8:17159] MCW rank 84 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-8:17159] MCW rank 85 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-11:00885] MCW rank 119 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-10:20752] MCW rank 103 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-8:17159] MCW rank 86 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-7:22146] MCW rank 67 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-11:00885] MCW rank 104 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-10:20752] MCW rank 88 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-8:17159] MCW rank 87 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-11:00885] MCW rank 105 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-10:20752] MCW rank 89 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-8:17159] MCW rank 72 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-7:22146] MCW rank 68 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-11:00885] MCW rank 106 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-10:20752] MCW rank 90 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-8:17159] MCW rank 73 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-11:00885] MCW rank 107 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-7:22146] MCW rank 69 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-8:17159] MCW rank 74 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-11:00885] MCW rank 108 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-7:22146] MCW rank 57 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-11:00885] MCW rank 114 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-10:20752] MCW rank 98 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-11:00885] MCW rank 115 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-7:22146] MCW rank 58 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-10:20752] MCW rank 99 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-7:22146] MCW rank 59 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-7:22146] MCW rank 60 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-7:22146] MCW rank 61 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-7:22146] MCW rank 62 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], 
socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 
0-1]], socket 0[core 7[hwt 0-1]]: 
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[csclprd3-0-7:22146] MCW rank 63 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], 
socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 
0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[csclprd3-0-13:30901] *** Process received signal ***
[csclprd3-0-13:30901] Signal: Bus error (7)
[csclprd3-0-13:30901] Signal code: Non-existant physical address (2)
[csclprd3-0-13:30901] Failing at address: 0x7ff404351d80
[csclprd3-0-13:30901] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ff41453c500]
[csclprd3-0-13:30901] [ 1] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xd4fea)[0x7ff41481efea]
[csclprd3-0-13:30901] [ 2] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7ff41479f009]
[csclprd3-0-13:30901] [ 3] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7ff41479f110]
[csclprd3-0-13:30901] [ 4] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7ff41480f68e]
[csclprd3-0-13:30901] [ 5] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7ff4148e3715]
[csclprd3-0-13:30901] [ 6] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7ff4147b9ad6]
[csclprd3-0-13:30901] [ 7] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7ff4147d8c60]
[csclprd3-0-13:30901] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:30901] [ 9] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff4141b9cdd]
[csclprd3-0-13:30901] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:30901] *** End of error message ***
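
The -report-bindings output above is easier to compare across nodes once it
is grouped by host. The following is a minimal sketch in Python, assuming the
log has been saved as plain text. The names BINDING and summarize_bindings are
made up for illustration and are not part of Open MPI; the sample lines are
copied from the output above.

```python
import re
from collections import defaultdict

# Matches "[host:pid] MCW rank N bound to ..." and "... is not bound ...".
# \s+ between tokens tolerates the hard line wraps added by mail clients.
BINDING = re.compile(
    r"\[(?P<host>[^:\[\]\s]+):\d+\]\s+MCW\s+rank\s+(?P<rank>\d+)\s+"
    r"(?P<state>is\s+not\s+bound|bound\s+to)"
)


def summarize_bindings(log_text):
    """Return {host: {"bound": [ranks], "unbound": [ranks]}} with sorted ranks."""
    summary = defaultdict(lambda: {"bound": [], "unbound": []})
    for m in BINDING.finditer(log_text):
        key = "unbound" if m.group("state").startswith("is") else "bound"
        summary[m.group("host")][key].append(int(m.group("rank")))
    for groups in summary.values():
        groups["bound"].sort()
        groups["unbound"].sort()
    return dict(summary)


SAMPLE = """\
[csclprd3-6-5:14770] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core
1[hwt 0]]: [B/B][./.]
[csclprd3-0-1:16437] MCW rank 24 is not bound (or bound to all available
processors)
[csclprd3-0-1:16437] MCW rank 25 is not bound (or bound to all available
processors)
"""

for host, groups in sorted(summarize_bindings(SAMPLE).items()):
    print(host, "bound:", groups["bound"], "unbound:", groups["unbound"])
```

Applied to the full log, a summary like this shows at a glance which hosts
report ranks as not bound and how many ranks landed on each host.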

________________________________
From: users [users-boun...@open-mpi.org] on behalf of 
Ralph Castain [r...@open-mpi.org]
Sent: Thursday, June 18, 2015 5:26 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.8.6, CentOS 6.3, too many slots = crash

FWIW: I don't think this actually has anything to do with the number of procs
you are trying to run. Instead, I expect it has to do with confusion over how
many cores it can bind across. When you specify --use-hwthread-cpus, you are
asking us to map processes to hwthreads, and not cores. I don't know which
nodes are which, but it could be that we are getting incorrect info somewhere.

Given that you are limiting the number of procs to the number of cores, is
there some reason why you are asking us to --use-hwthread-cpus? Why not just
leave it at the default core level?

I also suspect that you would have no problems if you -bind-to none - does that 
in fact work?
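
The three runs discussed above (the original command, the same command at the
default core level, and the same command with binding disabled) can be laid
out side by side. This is only a sketch that prints the command lines and
does not launch anything; the helper name mpirun_variant is made up for
illustration, and the paths and hostfile name are the ones quoted in this
thread.

```python
import shlex

# Base command as posted in the thread; paths and hostfile are site-specific.
BASE = [
    "mpirun", "-np", "132", "-report-bindings",
    "--prefix", "/hpc/apps/mpi/openmpi/1.8.6/",
    "--hostfile", "hostfile-single",
    "--mca", "btl_tcp_if_include", "eth0",
    "--hetero-nodes",
]
PROGRAM = "/hpc/home/lanew/mpi/openmpi/ProcessColors3"


def mpirun_variant(extra_flags):
    """Return the argv list for the base command plus extra mpirun flags."""
    return BASE + list(extra_flags) + [PROGRAM]


variants = {
    "original (map to hwthreads)": mpirun_variant(["--use-hwthread-cpus"]),
    "default core level": mpirun_variant([]),
    "no binding": mpirun_variant(["--bind-to", "none"]),
}

for label, argv in variants.items():
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(f"# {label}\n{shlex.join(argv)}\n")
```

Running the variants one at a time with the same -np value would show whether
the crash follows the hwthread mapping, the binding, or neither.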


On Jun 18, 2015, at 4:54 PM, Lane, William <william.l...@cshs.org> wrote:

I'm having a strange problem with OpenMPI 1.8.6. If I run
my OpenMPI test code (compiled against OpenMPI 1.8.6
libraries) on 131 slots or fewer I get no issues. Anything
over 131 errors out:

mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/ 
--hostfile hostfile-single --mca btl_tcp_if_include eth0 --hetero-nodes 
--use-hwthread-cpus /hpc/home/lanew/mpi/openmpi/ProcessColors3

The hostfile has the number of slots restricted
to the number of cores, while the max-slots includes
the hyperthreading cores (e.g. csclprd3-0-0 slots=6
max-slots=12).
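
The hostfile convention described above (slots equal to the physical core
count, max-slots equal to the hardware thread count) can be generated rather
than typed by hand. The following is a sketch under assumptions: the helper
name hostfile_line and the NODES inventory are hypothetical, and only the
csclprd3-0-0 entry mirrors the example quoted in this message.

```python
def hostfile_line(host, cores, threads_per_core=1):
    """Format one hostfile entry: slots = cores, max-slots = hardware threads."""
    if cores < 1 or threads_per_core < 1:
        raise ValueError("cores and threads_per_core must be positive")
    return f"{host} slots={cores} max-slots={cores * threads_per_core}"


# Hypothetical inventory: (host, physical cores, hardware threads per core).
NODES = [
    ("csclprd3-0-0", 6, 2),
    ("csclprd3-6-1", 4, 1),
]

hostfile = "\n".join(hostfile_line(*node) for node in NODES)
print(hostfile)
```

Generating the file from a per-node inventory keeps slots and max-slots
consistent when the cluster mixes nodes with and without hyperthreading.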

The nodes are a mix of IBM x3550 nodes; some
are Sandy Bridge systems and others are older Xeons.

I would like to add that the submit node from
which I am launching mpirun has the open files
soft limit (ulimit -a) set to 1024, while the hard limit
(ulimit -Ha) is set to 4096. I know open file limits
were an issue with an older version of OpenMPI. The
compute nodes all have both their hard and soft
open files limits set to 4096.
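
The open files limits mentioned above can also be read from inside a process,
which helps confirm what a launched job actually inherits as opposed to what
an interactive shell reports. This is a minimal sketch using Python's
standard resource module (Unix only); the helper names nofile_limits and fmt
are made up for illustration.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def fmt(limit):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = nofile_limits()
print(f"open files: soft={fmt(soft)} hard={fmt(hard)}")
```

Running the same script on the submit node and on a compute node gives a
quick comparison of the soft and hard values on each.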

Here's the output (csclprd3-0-13 is the last node
listed in the hostfile hostfile-single):

[csclprd3-0-13:28765] Signal: Bus error (7)
[csclprd3-0-13:28765] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28765] Failing at address: 0x7f30002a8980
[csclprd3-0-13:28766] *** Process received signal ***
[csclprd3-0-13:28766] Signal: Bus error (7)
[csclprd3-0-13:28766] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28766] Failing at address: 0x7fe137662880
[csclprd3-0-13:28768] *** Process received signal ***
[csclprd3-0-13:28768] Signal: Bus error (7)
[csclprd3-0-13:28768] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28768] Failing at address: 0x7f9b40228a80
[csclprd3-0-13:28770] *** Process received signal ***
[csclprd3-0-13:28770] Signal: Bus error (7)
[csclprd3-0-13:28770] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28770] Failing at address: 0x7f0de7f2bb00
[csclprd3-0-13:28767] *** Process received signal ***
[csclprd3-0-13:28767] Signal: Bus error (7)
[csclprd3-0-13:28767] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28767] Failing at address: 0x7f9b6c2e8980
[csclprd3-0-13:28764] *** Process received signal ***
[csclprd3-0-13:28764] Signal: Bus error (7)
[csclprd3-0-13:28764] Signal code: Non-existant physical address (2)
[csclprd3-0-13:28768] [ 3] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f9b513ad110]
[csclprd3-0-13:28768] [ 4] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7f0df77b6009]
[csclprd3-0-13:28770] [ 3] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f0df77b6110]
[csclprd3-0-13:28770] [ 4] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b5141d68e]
[csclprd3-0-13:28768] [ 5] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b514f1715]
[csclprd3-0-13:28768] [ 6] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f30115ea68e]
[csclprd3-0-13:28765] [ 5] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f30116be715]
[csclprd3-0-13:28765] [ 6] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b7bb3b68e]
[csclprd3-0-13:28767] [ 5] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b7bc0f715]
[csclprd3-0-13:28767] [ 6] [csclprd3-0-13:28764] [ 4] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fa946bb768e]
[csclprd3-0-13:28764] [ 5] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fe146d4068e]
[csclprd3-0-13:28766] [ 5] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f0df782668e]
[csclprd3-0-13:28770] [ 5] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f0df78fa715]
[csclprd3-0-13:28770] [ 6] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f0df77d0ad6]
[csclprd3-0-13:28770] [ 7] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fe146e14715]
[csclprd3-0-13:28766] [ 6] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fe146ceaad6]
[csclprd3-0-13:28766] [ 7] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b513c7ad6]
[csclprd3-0-13:28768] [ 7] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b513e6c60]
[csclprd3-0-13:28768] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28768] [ 9] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b50dc7cdd]
[csclprd3-0-13:28768] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28768] *** End of error message ***
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f3011594ad6]
[csclprd3-0-13:28765] [ 7] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f30115b3c60]
[csclprd3-0-13:28765] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28765] [ 9] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f3010f94cdd]
[csclprd3-0-13:28765] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28765] *** End of error message ***
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b7bae5ad6]
[csclprd3-0-13:28767] [ 7] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b7bb04c60]
[csclprd3-0-13:28767] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28767] [ 9] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b7b4e5cdd]
[csclprd3-0-13:28767] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28767] *** End of error message ***
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fa946c8b715]
[csclprd3-0-13:28764] [ 6] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fa946b61ad6]
[csclprd3-0-13:28764] [ 7] 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f0df77efc60]
[csclprd3-0-13:28770] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28770] [ 9] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0df71d0cdd]
[csclprd3-0-13:28770] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28770] *** End of error message ***
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fe146d09c60]
[csclprd3-0-13:28766] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28766] [ 9] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7fe1466eacdd]
[csclprd3-0-13:28766] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28766] *** End of error message ***
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fa946b80c60]
[csclprd3-0-13:28764] [ 8] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
[csclprd3-0-13:28764] [ 9] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa946561cdd]
[csclprd3-0-13:28764] [10] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
[csclprd3-0-13:28764] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 126 with PID 0 on node csclprd3-0-13 exited on 
signal 7 (Bus error).

Could a lack of the necessary NUMA libraries or the wrong version of NUMA
libraries be contributing to this?
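
One quick way to check the first half of that question on a given node is
sketched below in Python. It only reports whether a libnuma shared library can
be found and loaded, and whether libnuma's own numa_available() call says NUMA
is usable. It does not show whether Open MPI was built against libnuma, and it
says nothing about whether NUMA is related to the bus error. The function name
check_libnuma and its three return strings are made up for this illustration.

```python
import ctypes
import ctypes.util


def check_libnuma():
    """Return 'missing', 'unavailable', or 'available' for this node.

    'missing'     -> no loadable libnuma shared library was found
    'unavailable' -> libnuma loaded, but numa_available() returned < 0
    'available'   -> libnuma loaded and reports NUMA support
    """
    name = ctypes.util.find_library("numa")
    if name is None:
        return "missing"
    try:
        lib = ctypes.CDLL(name)
        numa_available = lib.numa_available  # int numa_available(void)
    except (OSError, AttributeError):
        return "missing"
    numa_available.restype = ctypes.c_int
    return "unavailable" if numa_available() < 0 else "available"


if __name__ == "__main__":
    print("libnuma status:", check_libnuma())
```

Running this on csclprd3-0-13 and on a node that does not crash would show
whether the two differ in this respect.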
IMPORTANT WARNING: This message is intended for the use of the person or entity 
to which it is addressed and may contain information that is privileged and 
confidential, the disclosure of which is governed by applicable law. If the 
reader of this message is not the intended recipient, or the employee or agent 
responsible for delivering it to the intended recipient, you are hereby 
notified that any dissemination, distribution or copying of this information is 
strictly prohibited. Thank you for your cooperation. 
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/06/27159.php

