Lane,

Could you please describe your configuration?
How many sockets per node?
How many cores per socket?
How many threads per core?
What is the minimum number of nodes needed to reproduce the issue?
Do all the nodes have the same configuration?
If yes, what happens without --hetero-nodes?
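One way to collect the socket, core, and thread counts is to condense the lscpu output of each node into a single line. The following is only a minimal sketch under stated assumptions: the helper name summarize_topology is hypothetical (it is not part of Open MPI), and it assumes lscpu from util-linux is installed and prints the usual "Socket(s)", "Core(s) per socket", and "Thread(s) per core" fields (some older releases label the first one "CPU socket(s)").

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): condense lscpu-style text
# read from stdin into one line with the three counts of interest.
summarize_topology() {
    awk -F: '
        # Newer lscpu prints "Socket(s)"; some older ones "CPU socket(s)".
        /^(CPU )?[Ss]ocket\(s\)/ { gsub(/[ \t]/, "", $2); sockets = $2 }
        /^Core\(s\) per socket/  { gsub(/[ \t]/, "", $2); cores = $2 }
        /^Thread\(s\) per core/  { gsub(/[ \t]/, "", $2); threads = $2 }
        END {
            printf "sockets=%s cores_per_socket=%s threads_per_core=%s\n",
                   sockets, cores, threads
        }'
}

# Example use on one node (assumes lscpu is available there):
#   lscpu | summarize_topology
# Across the hosts of a hostfile that lists one bare host name per line
# (assumes passwordless ssh to every node):
#   while read -r host; do
#       printf '%s: ' "$host"; ssh -n "$host" lscpu | summarize_topology
#   done < hostfile-noslots
```

If every node prints the same line, the nodes are likely homogeneous as far as binding is concerned; differing lines would show which hosts have hyperthreading enabled or a different core count.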
Cheers, Gilles On Friday, June 19, 2015, Lane, William <william.l...@cshs.org> wrote: > Ralph, > > I created a hostfile that just has the names of the hosts while > specifying no slot information whatsoever (e.g. csclprd3-0-0) > and received the following errors: > > mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/ > --hostfile hostfile-noslots --mca btl_tcp_if_include eth0 --hetero-nodes > /hpc/home/lanew/mpi/openmpi/ProcessColors3 > > [csclprd3-6-5:14770] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket > 0[core 1[hwt 0]]: [B/B][./.] > [csclprd3-6-5:14770] MCW rank 5 bound to socket 1[core 2[hwt 0]], socket > 1[core 3[hwt 0]]: [./.][B/B] > [csclprd3-6-5:14770] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket > 0[core 1[hwt 0]]: [B/B][./.] > [csclprd3-6-5:14770] MCW rank 7 bound to socket 1[core 2[hwt 0]], socket > 1[core 3[hwt 0]]: [./.][B/B] > [csclprd3-0-1:16437] MCW rank 24 is not bound (or bound to all available > processors) > [csclprd3-0-5:18925] MCW rank 48 is not bound (or bound to all available > processors) > [csclprd3-0-1:16437] MCW rank 25 is not bound (or bound to all available > processors) > [csclprd3-0-5:18925] MCW rank 49 is not bound (or bound to all available > processors) > [csclprd3-0-1:16437] MCW rank 20 is not bound (or bound to all available > processors) > [csclprd3-0-1:16437] MCW rank 21 is not bound (or bound to all available > processors) > [csclprd3-0-5:18925] MCW rank 44 is not bound (or bound to all available > processors) > [csclprd3-0-5:18925] MCW rank 45 is not bound (or bound to all available > processors) > [csclprd3-0-1:16437] MCW rank 22 is not bound (or bound to all available > processors) > [csclprd3-0-1:16437] MCW rank 23 is not bound (or bound to all available > processors) > [csclprd3-0-5:18925] MCW rank 46 is not bound (or bound to all available > processors) > [csclprd3-0-5:18925] MCW rank 47 is not bound (or bound to all available > processors) > [csclprd3-0-3:15946] MCW rank 36 is not bound 
(or bound to all available > processors) > [csclprd3-0-3:15946] MCW rank 37 is not bound (or bound to all available > processors) > [csclprd3-0-3:15946] MCW rank 32 is not bound (or bound to all available > processors) > [csclprd3-0-3:15946] MCW rank 33 is not bound (or bound to all available > processors) > [csclprd3-0-3:15946] MCW rank 34 is not bound (or bound to all available > processors) > [csclprd3-0-3:15946] MCW rank 35 is not bound (or bound to all available > processors) > [csclprd3-0-12:09165] MCW rank 124 is not bound (or bound to all available > processors) > [csclprd3-0-12:09165] MCW rank 125 is not bound (or bound to all available > processors) > [csclprd3-0-12:09165] MCW rank 120 is not bound (or bound to all available > processors) > [csclprd3-0-12:09165] MCW rank 121 is not bound (or bound to all available > processors) > [csclprd3-0-12:09165] MCW rank 122 is not bound (or bound to all available > processors) > [csclprd3-0-12:09165] MCW rank 123 is not bound (or bound to all available > processors) > [csclprd3-6-1:27030] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket > 0[core 1[hwt 0]]: [B/B][./.] > [csclprd3-6-1:27030] MCW rank 1 bound to socket 1[core 2[hwt 0]], socket > 1[core 3[hwt 0]]: [./.][B/B] > [csclprd3-6-1:27030] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket > 0[core 1[hwt 0]]: [B/B][./.] 
> [csclprd3-6-1:27030] MCW rank 3 bound to socket 1[core 2[hwt 0]], socket > 1[core 3[hwt 0]]: [./.][B/B] > [csclprd3-0-2:07944] MCW rank 30 is not bound (or bound to all available > processors) > [csclprd3-0-6:32510] MCW rank 54 is not bound (or bound to all available > processors) > [csclprd3-0-2:07944] MCW rank 31 is not bound (or bound to all available > processors) > [csclprd3-0-6:32510] MCW rank 55 is not bound (or bound to all available > processors) > [csclprd3-0-2:07944] MCW rank 26 is not bound (or bound to all available > processors) > [csclprd3-0-6:32510] MCW rank 50 is not bound (or bound to all available > processors) > [csclprd3-0-6:32510] MCW rank 51 is not bound (or bound to all available > processors) > [csclprd3-0-2:07944] MCW rank 27 is not bound (or bound to all available > processors) > [csclprd3-0-2:07944] MCW rank 28 is not bound (or bound to all available > processors) > [csclprd3-0-6:32510] MCW rank 52 is not bound (or bound to all available > processors) > [csclprd3-0-6:32510] MCW rank 53 is not bound (or bound to all available > processors) > [csclprd3-0-2:07944] MCW rank 29 is not bound (or bound to all available > processors) > [csclprd3-0-0:00453] MCW rank 11 bound to socket 1[core 6[hwt 0]], socket > 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], > socket1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: > [./././././.][B/B/B/B/B/B] > [csclprd3-0-0:00453] MCW rank 12 bound to socket 0[core 0[hwt 0]], socket > 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket > 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] 
> [csclprd3-0-0:00453] MCW rank 13 bound to socket 1[core 6[hwt 0]], socket > 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket > 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] > [csclprd3-0-0:00453] MCW rank 14 bound to socket 0[core 0[hwt 0]], socket > 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket > 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] > [csclprd3-0-0:00453] MCW rank 15 bound to socket 1[core 6[hwt 0]], socket > 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket > 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] > [csclprd3-0-0:00453] MCW rank 16 bound to socket 0[core 0[hwt 0]], socket > 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket > 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] > [csclprd3-0-7:22146] MCW rank 64 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] > [csclprd3-0-7:22146] MCW rank 65 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-0:00453] MCW rank 17 bound to socket 1[core 6[hwt 0]], socket > 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket > 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] > [csclprd3-0-0:00453] MCW rank 18 bound to socket 0[core 0[hwt 0]], socket > 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket > 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] 
> [csclprd3-0-11:00885] MCW rank 116 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] > [csclprd3-0-11:00885] MCW rank 117 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]],socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-10:20752] MCW rank 100 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] > [csclprd3-0-10:20752] MCW rank 101 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-0:00453] MCW rank 19 bound to socket 1[core 6[hwt 0]], socket > 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket > 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] > [csclprd3-0-7:22146] MCW rank 66 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
> [csclprd3-0-11:00885] MCW rank 118 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] > [csclprd3-0-0:00453] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket > 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket > 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] > [csclprd3-0-10:20752] MCW rank 102 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] > [csclprd3-0-0:00453] MCW rank 9 bound to socket 1[core 6[hwt 0]], socket > 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket > 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B] > [csclprd3-0-0:00453] MCW rank 10 bound to socket 0[core 0[hwt 0]], socket > 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket > 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] 
> [csclprd3-0-4:32449] MCW rank 42 is not bound (or bound to all available > processors) > [csclprd3-0-4:32449] MCW rank 43 is not bound (or bound to all available > processors) > [csclprd3-0-4:32449] MCW rank 38 is not bound (or bound to all available > processors) > [csclprd3-0-4:32449] MCW rank 39 is not bound (or bound to all available > processors) > [csclprd3-0-4:32449] MCW rank 40 is not bound (or bound to all available > processors) > [csclprd3-0-4:32449] MCW rank 41 is not bound (or bound to all available > processors) > [csclprd3-0-13:30897] MCW rank 126 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB][../../../../../..] > [csclprd3-0-8:17159] MCW rank 80 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
> [csclprd3-0-13:30897] MCW rank 127 bound to socket 1[core 6[hwt 0-1]], > socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt > 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: > [../../../../../..][BB/BB/BB/BB/BB/BB] > [csclprd3-0-8:17159] MCW rank 81 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], > socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: > [../../../../../..][BB/BB/BB/BB/BB/BB] > [csclprd3-0-8:17159] MCW rank 81 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-13:30897] MCW rank 128 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB][../../../../../..] > [csclprd3-0-8:17159] MCW rank 82 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
> [csclprd3-0-13:30897] MCW rank 129 bound to socket 1[core 6[hwt 0-1]], > socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt > 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: > [../../../../../..][BB/BB/BB/BB/BB/BB] > [csclprd3-0-8:17159] MCW rank 83 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-13:30897] MCW rank 130 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB][../../../../../..] > [csclprd3-0-13:30897] MCW rank 131 bound to socket 1[core 6[hwt 0-1]], > socket 1[core 7[hwt 0-1]], socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt > 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: > [../../../../../..][BB/BB/BB/BB/BB/BB] > [csclprd3-0-8:17159] MCW rank 84 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
> [csclprd3-0-8:17159] MCW rank 85 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-11:00885] MCW rank 119 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-10:20752] MCW rank 103 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-8:17159] MCW rank 86 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
> [csclprd3-0-7:22146] MCW rank 67 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core > 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-11:00885] MCW rank 104 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..][csclprd3-0-10:20752] MCW > rank 88 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], > socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt > 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core > 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] > [csclprd3-0-8:17159] MCW rank 87 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-11:00885] MCW rank 105 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-10:20752] MCW rank 89 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-8:17159] MCW rank 72 bound to socket 0[core 
0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] > [csclprd3-0-7:22146] MCW rank 68 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] > [csclprd3-0-11:00885] MCW rank 106 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] > [csclprd3-0-10:20752] MCW rank 90 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
> [csclprd3-0-8:17159] MCW rank 73 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-11:00885] MCW rank 107 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-7:22146] MCW rank 69 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-8:17159] MCW rank 74 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
> [csclprd3-0-11:00885] MCW rank 108 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-7:22146] MCW rank 57 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-11:00885] MCW rank 114 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] > [csclprd3-0-10:20752] MCW rank 98 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] > [csclprd3-0-11:00885] MCW rank 115 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-7:22146] MCW rank 58 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
> [csclprd3-0-10:20752] MCW rank 99 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-7:22146] MCW rank 59 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-7:22146] MCW rank 60 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] > [csclprd3-0-7:22146] MCW rank 61 bound to socket 1[core 8[hwt 0-1]], > socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt > 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket > 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: > [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] > [csclprd3-0-7:22146] MCW rank 62 bound to socket 0[core 0[hwt 0-1]], > socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt > 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core > 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: > [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] 
> [csclprd3-0-7:22146] MCW rank 63 bound to socket 1[core 8[hwt 0-1]],
> socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt
> 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket
> 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]:
> [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
> [csclprd3-0-13:30901] *** Process received signal ***
> [csclprd3-0-13:30901] Signal: Bus error (7)
> [csclprd3-0-13:30901] Signal code: Non-existant physical address (2)
> [csclprd3-0-13:30901] Failing at address: 0x7ff404351d80
> [csclprd3-0-13:30901] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ff41453c500]
> [csclprd3-0-13:30901] [ 1]
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xd4fea)[0x7ff41481efea]
> [csclprd3-0-13:30901] [ 2]
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7ff41479f009]
> [csclprd3-0-13:30901] [ 3]
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7ff41479f110]
> [csclprd3-0-13:30901] [ 4]
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7ff41480f68e]
> [csclprd3-0-13:30901] [ 5]
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7ff4148e3715]
> [csclprd3-0-13:30901] [ 6]
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7ff4147b9ad6]
> [csclprd3-0-13:30901] [ 7]
> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7ff4147d8c60]
> [csclprd3-0-13:30901] [ 8]
> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
> [csclprd3-0-13:30901] [ 9]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff4141b9cdd]
> [csclprd3-0-13:30901] [10]
> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
> [csclprd3-0-13:30901] *** End of error message ***
>
> ------------------------------
> *From:* users [users-boun...@open-mpi.org] on behalf of Ralph Castain
> [r...@open-mpi.org]
> *Sent:* Thursday, June 18, 2015 5:26 PM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] OpenMPI 1.8.6, CentOS 6.3, too many slots = crash
>
> FWIW: I don’t think this actually has anything to do with the #procs you
> are trying to run. Instead, I expect it has to do with confusion over how
> many cores it can bind across. When you tell it to use-hwthread-cpus, you
> are asking us to map processes to hwthreads, and not cores. I don’t know
> which nodes are which, but it could be that we are getting incorrect info
> somewhere.
>
> Given that you are limiting the number of procs to the number of cores,
> is there some reason why you are asking us to use-hwthread-cpus? Why not
> just leave it at the default core level?
>
> I also suspect that you would have no problems if you --bind-to none -
> does that in fact work?
>
>
> On Jun 18, 2015, at 4:54 PM, Lane, William <william.l...@cshs.org> wrote:
>
> I'm having a strange problem w/OpenMPI 1.8.6. If I run
> my OpenMPI test code (compiled against OpenMPI 1.8.6
> libraries) on < 131 slots I get no issues. Anything over 131
> errors out:
>
> mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/
> --hostfile hostfile-single --mca btl_tcp_if_include eth0 --hetero-nodes
> --use-hwthread-cpus /hpc/home/lanew/mpi/openmpi/ProcessColors3
>
> The hostfile has the number of slots restricted
> to the number of cores, while the max-slots includes
> the hyperthreading cores (e.g. csclprd3-0-0 slots=6
> max-slots=12).
>
> The nodes are a mix of IBM x3550 nodes some
> are Sandybridges and others are older Xeons.
>
> I would like to add that the submit node from
> which I am launching mpirun has the open files
> soft limit (ulimit -a) set to 1024, while the hard limit
> (ulimit -Ha) is set to 4096. I know open file limits
> were an issue w/an older version of OpenMPI. The
> compute nodes all have their hard open files limit
> and soft open files limits set to 4096.
> > Here's the output (csclprd3-0-13 is the last node > listed in the hostfile hostfile-single): > > [csclprd3-0-13:28765] Signal: Bus error (7) > [csclprd3-0-13:28765] Signal code: Non-existant physical address (2) > [csclprd3-0-13:28765] Failing at address: 0x7f30002a8980 > [csclprd3-0-13:28766] *** Process received signal *** > [csclprd3-0-13:28766] Signal: Bus error (7) > [csclprd3-0-13:28766] Signal code: Non-existant physical address (2) > [csclprd3-0-13:28766] Failing at address: 0x7fe137662880 > [csclprd3-0-13:28768] *** Process received signal *** > [csclprd3-0-13:28768] Signal: Bus error (7) > [csclprd3-0-13:28768] Signal code: Non-existant physical address (2) > [csclprd3-0-13:28768] Failing at address: 0x7f9b40228a80 > [csclprd3-0-13:28770] *** Process received signal *** > [csclprd3-0-13:28770] Signal: Bus error (7) > [csclprd3-0-13:28770] Signal code: Non-existant physical address (2) > [csclprd3-0-13:28770] Failing at address: 0x7f0de7f2bb00 > [csclprd3-0-13:28767] *** Process received signal *** > [csclprd3-0-13:28767] Signal: Bus error (7) > [csclprd3-0-13:28767] Signal code: Non-existant physical address (2) > [csclprd3-0-13:28767] Failing at address: 0x7f9b6c2e8980 > [csclprd3-0-13:28764] *** Process received signal *** > [csclprd3-0-13:28764] Signal: Bus error (7) > [csclprd3-0-13:28764] Signal code: Non-existant physical address (2) > [csclprd3-0-13:28765] Signal: Bus error (7) > [csclprd3-0-13:28765] Signal code: Non-existant physical address (2) > [csclprd3-0-13:28765] Failing at address: 0x7f30002a8980 > [csclprd3-0-13:28766] *** Process received signal *** > [csclprd3-0-13:28766] Signal: Bus error (7) > [csclprd3-0-13:28766] Signal code: Non-existant physical address (2) > [csclprd3-0-13:28766] Failing at address: 0x7fe137662880 > [csclprd3-0-13:28768] *** Process received signal *** > [csclprd3-0-13:28768] Signal: Bus error (7) > [csclprd3-0-13:28768] Signal code: Non-existant physical address (2) > [csclprd3-0-13:28768] Failing at 
address: 0x7f9b40228a80 > [csclprd3-0-13:28770] *** Process received signal *** > [csclprd3-0-13:28770] Signal: Bus error (7) > [csclprd3-0-13:28770] Signal code: Non-existant physical address (2) > [csclprd3-0-13:28770] Failing at address: 0x7f0de7f2bb00 > [csclprd3-0-13:28767] *** Process received signal *** > [csclprd3-0-13:28767] Signal: Bus error (7) > [csclprd3-0-13:28767] Signal code: Non-existant physical address (2) > [csclprd3-0-13:28767] Failing at address: 0x7f9b6c2e8980 > [csclprd3-0-13:28764] *** Process received signal *** > [csclprd3-0-13:28764] Signal: Bus error (7) > [csclprd3-0-13:28764] Signal code: Non-existant physical address (2) > [csclprd3-0-13:28768] [ 3] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f9b513ad110] > [csclprd3-0-13:28768] [ 4] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x219)[0x7f0df77b6009] > [csclprd3-0-13:28770] [ 3] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0x40)[0x7f0df77b6110] > [csclprd3-0-13:28770] [ 4] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b5141d68e] > [csclprd3-0-13:28768] [ 5] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b514f1715] > [csclprd3-0-13:28768] [ 6] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f30115ea68e] > [csclprd3-0-13:28765] [ 5] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f30116be715] > [csclprd3-0-13:28765] [ 6] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f9b7bb3b68e] > [csclprd3-0-13:28767] [ 5] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f9b7bc0f715] > [csclprd3-0-13:28767] [ 6] [csclprd3-0-13:28764] [ 4] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fa946bb768e] > [csclprd3-0-13:28764] [ 5] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7fe146d4068e] > [csclprd3-0-13:28766] [ 5] > 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0xc568e)[0x7f0df782668e] > [csclprd3-0-13:28770] [ 5] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7f0df78fa715] > [csclprd3-0-13:28770] [ 6] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f0df77d0ad6] > [csclprd3-0-13:28770] [ 7] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fe146e14715] > [csclprd3-0-13:28766] [ 6] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fe146ceaad6] > [csclprd3-0-13:28766] [ 7] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b513c7ad6] > [csclprd3-0-13:28768] [ 7] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b513e6c60] > [csclprd3-0-13:28768] [ 8] > /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] > [csclprd3-0-13:28768] [ 9] > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b50dc7cdd] > [csclprd3-0-13:28768] [10] > /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] > [csclprd3-0-13:28768] *** End of error message *** > > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f3011594ad6] > [csclprd3-0-13:28765] [ 7] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f30115b3c60] > [csclprd3-0-13:28765] [ 8] > /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] > [csclprd3-0-13:28765] [ 9] > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f3010f94cdd] > [csclprd3-0-13:28765] [10] > /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] > [csclprd3-0-13:28765] *** End of error message *** > > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7f9b7bae5ad6] > [csclprd3-0-13:28767] [ 7] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f9b7bb04c60] > [csclprd3-0-13:28767] [ 8] > /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] > [csclprd3-0-13:28767] [ 9] > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9b7b4e5cdd] > [csclprd3-0-13:28767] [10] > 
/hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] > [csclprd3-0-13:28767] *** End of error message *** > > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xd5)[0x7fa946c8b715] > [csclprd3-0-13:28764] [ 6] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0x8d6)[0x7fa946b61ad6] > [csclprd3-0-13:28764] [ 7] > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7f0df77efc60] > [csclprd3-0-13:28770] [ 8] > /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] > [csclprd3-0-13:28770] [ 9] > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0df71d0cdd] > [csclprd3-0-13:28770] [10] > /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] > [csclprd3-0-13:28770] *** End of error message *** > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fe146d09c60] > [csclprd3-0-13:28766] [ 8] > /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] > [csclprd3-0-13:28766] [ 9] > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fe1466eacdd] > [csclprd3-0-13:28766] [10] > /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >
[csclprd3-0-13:28766] *** End of error message *** > /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x170)[0x7fa946b80c60] > [csclprd3-0-13:28764] [ 8] > /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] > [csclprd3-0-13:28764] [ 9] > /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa946561cdd] > [csclprd3-0-13:28764] [10] > /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] > [csclprd3-0-13:28764] *** End of error message *** > -------------------------------------------------------------------------- > mpirun noticed that process rank 126 with PID 0 on node csclprd3-0-13 > exited on signal 7 (Bus error). > > Could a lack of the necessary NUMA libraries or the wrong version of NUMA > libraries be contributing to this? > IMPORTANT WARNING: This message is intended for the use of the person or > entity to which it is addressed and may contain information that is > privileged and confidential, the disclosure of which is governed by > applicable law. If the reader of this message is not the intended > recipient, or the employee or agent responsible for delivering it to the > intended recipient, you are hereby notified that any dissemination, > distribution or copying of this information is strictly prohibited. Thank > you for your cooperation. _______________________________________________ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2015/06/27159.php