Hi Ralph, sorry for cutting into this thread.

Your advice about -hetero-nodes in the other thread gave me a hint.

I already put "orte_hetero_nodes = 1" in my mca-params.conf, because
you told me a month ago that my environment would need this option.
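(Just for reference, the setting can be expressed either way; the first line
below is what I have in mca-params.conf, and the others are what I believe
to be the equivalent mpirun forms:

   orte_hetero_nodes = 1                    <- line in mca-params.conf
   mpirun --mca orte_hetero_nodes 1 ...
   mpirun -hetero-nodes ...
)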

After removing this line from mca-params.conf, it works.
In other words, you can replicate the problem by adding -hetero-nodes as
shown below.

qsub: job 8364.manage.cluster completed
[mishima@manage mpi]$ qsub -I -l nodes=2:ppn=8
qsub: waiting for job 8365.manage.cluster to start
qsub: job 8365.manage.cluster ready

[mishima@node11 ~]$ ompi_info --all | grep orte_hetero_nodes
                MCA orte: parameter "orte_hetero_nodes" (current value:
"false", data source: default, level: 9 dev/all,
 type: bool)
[mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
[mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
myprog
[node11.cluster:27895] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
[node11.cluster:27895] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
[node12.cluster:24891] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
[node12.cluster:24891] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
[mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
-hetero-nodes myprog
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:         CORE
   Node:            node12
   #processes:  2
   #cpus:          1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
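(For reference, the override that this message mentions is the one I tried
in my last mail, e.g.:

   mpirun -np 4 -cpus-per-proc 4 -report-bindings \
          -bind-to core:overload-allowed myprog
)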


As far as I checked, data->num_bound seems to become incorrect in
bind_downwards when I add "-hetero-nodes". I hope you can track down the
problem.
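To illustrate what I mean, below is a tiny stand-alone C model of the
overload check that I believe bind_downwards performs. Only the idea of a
per-target num_bound counter comes from the real code; every other name and
detail here is my own guess, not the actual Open MPI implementation:

   /* toy_overload.c - hypothetical sketch of the overload accounting.
    * Compile with: cc toy_overload.c -o toy_overload */
   #include <stdio.h>

   typedef struct {
       const char *name;
       int ncpus;      /* cpus available under this binding target */
       int num_bound;  /* procs already bound here (cf. data->num_bound) */
   } target_t;

   /* Bind one proc to the target; fail when more procs than cpus are
    * bound and overload is not allowed, like the error box above. */
   static int bind_proc(target_t *t, int overload_allowed)
   {
       t->num_bound++;
       if (t->num_bound > t->ncpus && !overload_allowed) {
           fprintf(stderr, "overload: %d processes vs %d cpus on %s\n",
                   t->num_bound, t->ncpus, t->name);
           return -1;
       }
       return 0;
   }

   int main(void)
   {
       /* If num_bound (or the cpu count) is miscomputed for the second
        * node, even 2 procs on an 8-core node trip the check, which
        * would match the "#processes: 2 / #cpus: 1" report above. */
       target_t node12 = { "node12", 1 /* bogus value */, 0 };
       bind_proc(&node12, 0);
       return bind_proc(&node12, 0) ? 1 : 0;
   }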

Regards,
Tetsuya Mishima


> Yes, it's very strange. But I don't think there's any chance that
> I have < 8 actual cores on the node. I guess that you can replicate
> it with SLURM, so please try it again.
>
> I changed to use node10 and node11, then I got the warning against
> node11.
>
> Furthermore, just as information for you, I tried adding
> "-bind-to core:overload-allowed", and then it worked as shown below.
> But I think node11 is never overloaded because it has 8 cores.
>
> qsub: job 8342.manage.cluster completed
> [mishima@manage ~]$ qsub -I -l nodes=node10:ppn=8+node11:ppn=8
> qsub: waiting for job 8343.manage.cluster to start
> qsub: job 8343.manage.cluster ready
>
> [mishima@node10 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> [mishima@node10 demos]$ cat $PBS_NODEFILE
> node10
> node10
> node10
> node10
> node10
> node10
> node10
> node10
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> myprog
>
--------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
> Bind to:         CORE
> Node:            node11
> #processes:  2
> #cpus:          1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
>
--------------------------------------------------------------------------
> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> -bind-to core:overload-allowed myprog
> [node10.cluster:27020] MCW rank 0 bound to socket 0[core 0[hwt 0]],
socket
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> [node10.cluster:27020] MCW rank 1 bound to socket 1[core 4[hwt 0]],
socket
> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node11.cluster:26597] MCW rank 3 bound to socket 1[core 4[hwt 0]],
socket
> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node11.cluster:26597] MCW rank 2 bound to socket 0[core 0[hwt 0]],
socket
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> Hello world from process 1 of 4
> Hello world from process 0 of 4
> Hello world from process 3 of 4
> Hello world from process 2 of 4
>
> Regards,
> Tetsuya Mishima
>
>
> > > Very strange - I can't seem to replicate it. Is there any chance that
> > > you have < 8 actual cores on node12?
> >
> >
> > On Dec 18, 2013, at 4:53 PM, tmish...@jcity.maeda.co.jp wrote:
> >
> > >
> > >
> > > Hi Ralph, sorry for confusing you.
> > >
> > > At that time, I cut and pasted the output of "cat $PBS_NODEFILE".
> > > I guess I dropped the last line by mistake.
> > >
> > > I retried the test, and below is exactly what I got when I ran it.
> > >
> > > [mishima@manage ~]$ qsub -I -l nodes=node11:ppn=8+node12:ppn=8
> > > qsub: waiting for job 8338.manage.cluster to start
> > > qsub: job 8338.manage.cluster ready
> > >
> > > [mishima@node11 ~]$ cat $PBS_NODEFILE
> > > node11
> > > node11
> > > node11
> > > node11
> > > node11
> > > node11
> > > node11
> > > node11
> > > node12
> > > node12
> > > node12
> > > node12
> > > node12
> > > node12
> > > node12
> > > node12
> > > [mishima@node11 ~]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> myprog
> > >
>
--------------------------------------------------------------------------
> > > A request was made to bind to that would result in binding more
> > > processes than cpus on a resource:
> > >
> > >   Bind to:         CORE
> > >   Node:            node12
> > >   #processes:  2
> > >   #cpus:          1
> > >
> > > You can override this protection by adding the "overload-allowed"
> > > option to your binding directive.
> > >
>
--------------------------------------------------------------------------
> > >
> > > Regards,
> > >
> > > Tetsuya Mishima
> > >
> > >> I removed the debug in #2 - thanks for reporting it
> > >>
> > >> For #1, it actually looks to me like this is correct. If you look at
> > >> your allocation, there are only 7 slots being allocated on node12, yet
> > >> you have asked for 8 cpus to be assigned (2 procs with 2 cpus/proc). So
> > >> the warning is in fact correct.
> > >>
> > >>
> > >> On Dec 18, 2013, at 4:04 PM, tmish...@jcity.maeda.co.jp wrote:
> > >>
> > >>>
> > >>>
> > >>> Hi Ralph, I found that openmpi-1.7.4rc1 was already uploaded, so I'd
> > >>> like to report 3 issues, mainly regarding -cpus-per-proc.
> > >>>
> > >>> 1) When I use 2 nodes (node11, node12), which have 8 cores each (= 2
> > >>> sockets x 4 cores/socket), it starts to produce the error again, as
> > >>> shown below. At least openmpi-1.7.4a1r29646 did work well.
> > >>>
> > >>> [mishima@manage ~]$ qsub -I -l nodes=2:ppn=8
> > >>> qsub: waiting for job 8336.manage.cluster to start
> > >>> qsub: job 8336.manage.cluster ready
> > >>>
> > >>> [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > >>> [mishima@node11 demos]$ cat $PBS_NODEFILE
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node12
> > >>> node12
> > >>> node12
> > >>> node12
> > >>> node12
> > >>> node12
> > >>> node12
> > >>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4
> -report-bindings
> > >>> myprog
> > >>>
> > >
>
--------------------------------------------------------------------------
> > >>> A request was made to bind to that would result in binding more
> > >>> processes than cpus on a resource:
> > >>>
> > >>>  Bind to:         CORE
> > >>>  Node:            node12
> > >>>  #processes:  2
> > >>>  #cpus:          1
> > >>>
> > >>> You can override this protection by adding the "overload-allowed"
> > >>> option to your binding directive.
> > >>>
> > >
>
--------------------------------------------------------------------------
> > >>>
> > >>> Of course it works well using only one node.
> > >>>
> > >>> [mishima@node11 demos]$ mpirun -np 2 -cpus-per-proc 4
> -report-bindings
> > >>> myprog
> > >>> [node11.cluster:26238] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> > > socket
> > >>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> > >>> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> > >>> [node11.cluster:26238] MCW rank 1 bound to socket 1[core 4[hwt 0]],
> > > socket
> > >>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
> > >>> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> > >>> Hello world from process 1 of 2
> > >>> Hello world from process 0 of 2
> > >>>
> > >>>
> > >>> 2) Adding "-bind-to numa", it works, but the message "bind:upward
> > >>> target NUMANode type NUMANode" appears.
> > >>> As far as I remember, I didn't see this kind of message before.
> > >>>
> > >>> mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4
-report-bindings
> > >>> -bind-to numa myprog
> > >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode
type
> > >>> NUMANode
> > >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode
type
> > >>> NUMANode
> > >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode
type
> > >>> NUMANode
> > >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode
type
> > >>> NUMANode
> > >>> [node11.cluster:26260] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> > > socket
> > >>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> > >>> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> > >>> [node11.cluster:26260] MCW rank 1 bound to socket 1[core 4[hwt 0]],
> > > socket
> > >>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
> > >>> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> > >>> [node12.cluster:23607] MCW rank 3 bound to socket 1[core 4[hwt 0]],
> > > socket
> > >>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
> > >>> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> > >>> [node12.cluster:23607] MCW rank 2 bound to socket 0[core 0[hwt 0]],
> > > socket
> > >>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> > >>> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> > >>> Hello world from process 1 of 4
> > >>> Hello world from process 0 of 4
> > >>> Hello world from process 3 of 4
> > >>> Hello world from process 2 of 4
> > >>>
> > >>>
> > >>> 3) I use the PGI compiler. It cannot accept the compiler switch
> > >>> "-Wno-variadic-macros", which is included in the configure script.
> > >>>
> > >>>         btl_usnic_CFLAGS="-Wno-variadic-macros"
> > >>>
> > >>> I removed this switch, and then I could continue building 1.7.4rc1.
> > >>>
> > >>> Regards,
> > >>> Tetsuya Mishima
> > >>>
> > >>>
> > >>>> Hmmm...okay, I understand the scenario. Must be something in the algo
> > >>>> when it only has one node, so it shouldn't be too hard to track down.
> > >>>>
> > >>>> I'm off on travel for a few days, but will return to this when I get
> > >>>> back.
> > >>>>
> > >>>> Sorry for delay - will try to look at this while I'm gone, but can't
> > >>>> promise anything :-(
> > >>>>
> > >>>>
> > >>>> On Dec 10, 2013, at 6:58 PM, tmish...@jcity.maeda.co.jp wrote:
> > >>>>
> > >>>>>
> > >>>>>
> > >>>>> Hi Ralph, sorry for the confusion.
> > >>>>>
> > >>>>> We usually log on to "manage", which is our control node.
> > >>>>> From manage, we submit jobs or enter a remote node such as
> > >>>>> node03 via Torque's interactive mode (qsub -I).
> > >>>>>
> > >>>>> At that time, instead of Torque, I just did rsh to node03 from
> > >>>>> manage and ran myprog on the node. I hope you can understand what I
> > >>>>> did.
> > >>>>>
> > >>>>> Now, I retried with "-host node03", which still causes the problem:
> > >>>>> (I confirmed a local run on manage caused the same problem too)
> > >>>>>
> > >>>>> [mishima@manage ~]$ rsh node03
> > >>>>> Last login: Wed Dec 11 11:38:57 from manage
> > >>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > >>>>> [mishima@node03 demos]$
> > >>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03
-report-bindings
> > >>>>> -cpus-per-proc 4 -map-by socket myprog
> > >>>>>
> > >>>
> > >
>
--------------------------------------------------------------------------
> > >>>>> A request was made to bind to that would result in binding more
> > >>>>> processes than cpus on a resource:
> > >>>>>
> > >>>>> Bind to:         CORE
> > >>>>> Node:            node03
> > >>>>> #processes:  2
> > >>>>> #cpus:          1
> > >>>>>
> > >>>>> You can override this protection by adding the "overload-allowed"
> > >>>>> option to your binding directive.
> > >>>>>
> > >>>
> > >
>
--------------------------------------------------------------------------
> > >>>>>
> > >>>>> It's strange, but I have to report that "-map-by socket:span" worked
> > >>>>> well.
> > >>> well.
> > >>>>>
> > >>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03
-report-bindings
> > >>>>> -cpus-per-proc 4 -map-by socket:span myprog
> > >>>>> [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt
0]],
> > >>> socket
> > >>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s
> > >>>>> ocket 1[core 11[hwt 0]]:
> > >>>>>
> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>> [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt
> 0]],
> > >>> socket
> > >>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]],
> > >>>>> socket 1[core 15[hwt 0]]:
> > >>>>>
> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > >>>>> [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt
> 0]],
> > >>> socket
> > >>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]],
> > >>>>> socket 2[core 19[hwt 0]]:
> > >>>>>
> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>> [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt
> 0]],
> > >>> socket
> > >>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]],
> > >>>>> socket 2[core 23[hwt 0]]:
> > >>>>>
> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > >>>>> [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt
> 0]],
> > >>> socket
> > >>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]],
> > >>>>> socket 3[core 27[hwt 0]]:
> > >>>>>
> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > >>>>> [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt
> 0]],
> > >>> socket
> > >>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]],
> > >>>>> socket 3[core 31[hwt 0]]:
> > >>>>>
> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > >>>>> [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt
0]],
> > >>> socket
> > >>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> > >>>>> cket 0[core 3[hwt 0]]:
> > >>>>>
> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>> [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt
0]],
> > >>> socket
> > >>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
> > >>>>> cket 0[core 7[hwt 0]]:
> > >>>>>
> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > >>>>> Hello world from process 2 of 8
> > >>>>> Hello world from process 6 of 8
> > >>>>> Hello world from process 3 of 8
> > >>>>> Hello world from process 7 of 8
> > >>>>> Hello world from process 1 of 8
> > >>>>> Hello world from process 5 of 8
> > >>>>> Hello world from process 0 of 8
> > >>>>> Hello world from process 4 of 8
> > >>>>>
> > >>>>> Regards,
> > >>>>> Tetsuya Mishima
> > >>>>>
> > >>>>>
> > >>>>>> On Dec 10, 2013, at 6:05 PM, tmish...@jcity.maeda.co.jp wrote:
> > >>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Hi Ralph,
> > >>>>>>>
> > >>>>>>> I tried again with -cpus-per-proc 2 as shown below.
> > >>>>>>> Here, I found that "-map-by socket:span" worked well.
> > >>>>>>>
> > >>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings
> > > -cpus-per-proc
> > >>> 2
> > >>>>>>> -map-by socket:span myprog
> > >>>>>>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B
> > >>>>>>> /./././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 2[core 17[hwt 0]]: [./././././././.][./././.
> > >>>>>>> /./././.][B/B/./././././.][./././././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 2[core 19[hwt 0]]: [./././././././.][./././.
> > >>>>>>> /./././.][././B/B/./././.][./././././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 3[core 25[hwt 0]]: [./././././././.][./././.
> > >>>>>>> /./././.][./././././././.][B/B/./././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 3[core 27[hwt 0]]: [./././././././.][./././.
> > >>>>>>> /./././.][./././././././.][././B/B/./././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 1[hwt 0]]: [B/B/./././././.][././././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 3[hwt 0]]: [././B/B/./././.][././././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> Hello world from process 1 of 8
> > >>>>>>> Hello world from process 0 of 8
> > >>>>>>> Hello world from process 4 of 8
> > >>>>>>> Hello world from process 2 of 8
> > >>>>>>> Hello world from process 7 of 8
> > >>>>>>> Hello world from process 6 of 8
> > >>>>>>> Hello world from process 5 of 8
> > >>>>>>> Hello world from process 3 of 8
> > >>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings
> > > -cpus-per-proc
> > >>> 2
> > >>>>>>> -map-by socket myprog
> > >>>>>>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 5[hwt 0]]: [././././B/B/./.][././././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 7[hwt 0]]: [././././././B/B][././././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B
> > >>>>>>> /./././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 1[core 13[hwt 0]]: [./././././././.][./././.
> > >>>>>>> /B/B/./.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 1[core 15[hwt 0]]: [./././././././.][./././.
> > >>>>>>> /././B/B][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 1[hwt 0]]: [B/B/./././././.][././././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 3[hwt 0]]: [././B/B/./././.][././././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> Hello world from process 5 of 8
> > >>>>>>> Hello world from process 1 of 8
> > >>>>>>> Hello world from process 6 of 8
> > >>>>>>> Hello world from process 4 of 8
> > >>>>>>> Hello world from process 2 of 8
> > >>>>>>> Hello world from process 0 of 8
> > >>>>>>> Hello world from process 7 of 8
> > >>>>>>> Hello world from process 3 of 8
> > >>>>>>>
> > >>>>>>> "-np 8" and "-cpus-per-proc 4" just filled all sockets.
> > >>>>>>> In this case, I guess "-map-by socket:span" and "-map-by socket"
> > >>>>>>> have the same meaning.
> > >>>>>>> Therefore, there's no problem with that. Sorry for disturbing you.
> > >>>>>>
> > >>>>>> No problem - glad you could clear that up :-)
> > >>>>>>
> > >>>>>>>
> > >>>>>>> By the way, through this test, I found another problem.
> > >>>>>>> Without the Torque manager, just using rsh, it causes the same
> > >>>>>>> error as below:
> > >>>>>>>
> > >>>>>>> [mishima@manage openmpi-1.7]$ rsh node03
> > >>>>>>> Last login: Wed Dec 11 09:42:02 from manage
> > >>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > >>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings
> > > -cpus-per-proc
> > >>> 4
> > >>>>>>> -map-by socket myprog
> > >>>>>>
> > >>>>>> I don't understand the difference here - you are simply starting it
> > >>>>>> from a different node? It looks like everything is expected to run
> > >>>>>> local to mpirun, yes? So there is no rsh actually involved here.
> > >>>>>> Are you still running in an allocation?
> > >>>>>>
> > >>>>>> If you run this with "-host node03" on the cmd line, do you see
> the
> > >>> same
> > >>>>> problem?
> > >>>>>>
> > >>>>>>
> > >>>>>>>
> > >>>>>
> > >>>
> > >
>
--------------------------------------------------------------------------
> > >>>>>>> A request was made to bind to that would result in binding more
> > >>>>>>> processes than cpus on a resource:
> > >>>>>>>
> > >>>>>>> Bind to:         CORE
> > >>>>>>> Node:            node03
> > >>>>>>> #processes:  2
> > >>>>>>> #cpus:          1
> > >>>>>>>
> > >>>>>>> You can override this protection by adding the
"overload-allowed"
> > >>>>>>> option to your binding directive.
> > >>>>>>>
> > >>>>>
> > >>>
> > >
>
--------------------------------------------------------------------------
> > >>>>>>> [mishima@node03 demos]$
> > >>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings
> > > -cpus-per-proc
> > >>> 4
> > >>>>>>> myprog
> > >>>>>>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s
> > >>>>>>> ocket 1[core 11[hwt 0]]:
> > >>>>>>>
> > > [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]],
> > >>>>>>> socket 1[core 15[hwt 0]]:
> > >>>>>>>
> > > [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]],
> > >>>>>>> socket 2[core 19[hwt 0]]:
> > >>>>>>>
> > > [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]],
> > >>>>>>> socket 2[core 23[hwt 0]]:
> > >>>>>>>
> > > [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]],
> > >>>>>>> socket 3[core 27[hwt 0]]:
> > > [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]],
> > >>>>>>> socket 3[core 31[hwt 0]]:
> > >>>>>>>
> > > [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > >>>>>>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> > >>>>>>> cket 0[core 3[hwt 0]]:
> > >>>>>>>
> > > [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
> > >>>>>>> cket 0[core 7[hwt 0]]:
> > >>>>>>>
> > > [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > >>>>>>> Hello world from process 4 of 8
> > >>>>>>> Hello world from process 2 of 8
> > >>>>>>> Hello world from process 6 of 8
> > >>>>>>> Hello world from process 5 of 8
> > >>>>>>> Hello world from process 3 of 8
> > >>>>>>> Hello world from process 7 of 8
> > >>>>>>> Hello world from process 0 of 8
> > >>>>>>> Hello world from process 1 of 8
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Tetsuya Mishima
> > >>>>>>>
> > >>>>>>>> Hmmm...that's strange. I only have 2 sockets on my system, but
> let
> > >>> me
> > >>>>>>> poke around a bit and see what might be happening.
> > >>>>>>>>
> > >>>>>>>> On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote:
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Hi Ralph,
> > >>>>>>>>>
> > >>>>>>>>> Thanks. I didn't know the meaning of "socket:span".
> > >>>>>>>>>
> > >>>>>>>>> But it still causes the problem, which seems socket:span
> doesn't
> > >>>>> work.
> > >>>>>>>>>
> > >>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32
> > >>>>>>>>> qsub: waiting for job 8265.manage.cluster to start
> > >>>>>>>>> qsub: job 8265.manage.cluster ready
> > >>>>>>>>>
> > >>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > >>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings
> > >>> -cpus-per-proc
> > >>>>> 4
> > >>>>>>>>> -map-by socket:span myprog
> > >>>>>>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8
[hwt
> > > 0]],
> > >>>>>>> socket
> > >>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s
> > >>>>>>>>> ocket 1[core 11[hwt 0]]:
> > >>>>>>>>>
> > >>>
[./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12
[hwt
> > >>> 0]],
> > >>>>>>> socket
> > >>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]],
> > >>>>>>>>> socket 1[core 15[hwt 0]]:
> > >>>>>>>>>
> > >>>
[./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16
[hwt
> > >>> 0]],
> > >>>>>>> socket
> > >>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]],
> > >>>>>>>>> socket 2[core 19[hwt 0]]:
> > >>>>>>>>>
> > >>>
[./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20
[hwt
> > >>> 0]],
> > >>>>>>> socket
> > >>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]],
> > >>>>>>>>> socket 2[core 23[hwt 0]]:
> > >>>>>>>>>
> > >>>
[./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24
[hwt
> > >>> 0]],
> > >>>>>>> socket
> > >>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]],
> > >>>>>>>>> socket 3[core 27[hwt 0]]:
> > >>>>>>>>>
> > >>>
[./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28
[hwt
> > >>> 0]],
> > >>>>>>> socket
> > >>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]],
> > >>>>>>>>> socket 3[core 31[hwt 0]]:
> > >>>>>>>>>
> > >>>
[./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0
[hwt
> > > 0]],
> > >>>>>>> socket
> > >>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> > >>>>>>>>> cket 0[core 3[hwt 0]]:
> > >>>>>>>>>
> > >>>
[B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4
[hwt
> > > 0]],
> > >>>>>>> socket
> > >>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
> > >>>>>>>>> cket 0[core 7[hwt 0]]:
> > >>>>>>>>>
> > >>>
[././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>> Hello world from process 0 of 8
> > >>>>>>>>> Hello world from process 3 of 8
> > >>>>>>>>> Hello world from process 1 of 8
> > >>>>>>>>> Hello world from process 4 of 8
> > >>>>>>>>> Hello world from process 6 of 8
> > >>>>>>>>> Hello world from process 5 of 8
> > >>>>>>>>> Hello world from process 2 of 8
> > >>>>>>>>> Hello world from process 7 of 8
> > >>>>>>>>>
> > >>>>>>>>> Regards,
> > >>>>>>>>> Tetsuya Mishima
> > >>>>>>>>>
> > >>>>>>>>>> No, that is actually correct. We map a socket until full,
then
> > >>> move
> > >>>>> to
> > >>>>>>>>> the next. What you want is --map-by socket:span
> > >>>>>>>>>>
> > >>>>>>>>>> On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp
wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi Ralph,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I had time to try your patch yesterday using
> > >>>>>>>>>>> openmpi-1.7.4a1r29646. It stopped the error, but unfortunately
> > >>>>>>>>>>> "mapping by socket" itself didn't work well, as shown below:
> > >>>>>>>>>>>
> > >>>>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32
> > >>>>>>>>>>> qsub: waiting for job 8260.manage.cluster to start
> > >>>>>>>>>>> qsub: job 8260.manage.cluster ready
> > >>>>>>>>>>>
> > >>>>>>>>>>> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > >>>>>>>>>>> [mishima@node04 demos]$ mpirun -np 8 -report-bindings
> > >>>>> -cpus-per-proc
> > >>>>>>> 4
> > >>>>>>>>>>> -map-by socket myprog
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8
> [hwt
> > >>> 0]],
> > >>>>>>>>> socket
> > >>>>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s
> > >>>>>>>>>>> ocket 1[core 11[hwt 0]]:
> > >>>>>>>>>>>
> > >>>>>
> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12
> [hwt
> > >>>>> 0]],
> > >>>>>>>>> socket
> > >>>>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]],
> > >>>>>>>>>>> socket 1[core 15[hwt 0]]:
> > >>>>>>>>>>>
> > >>>>>
> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16
> [hwt
> > >>>>> 0]],
> > >>>>>>>>> socket
> > >>>>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]],
> > >>>>>>>>>>> socket 2[core 19[hwt 0]]:
> > >>>>>>>>>>>
> > >>>>>
> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20
> [hwt
> > >>>>> 0]],
> > >>>>>>>>> socket
> > >>>>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]],
> > >>>>>>>>>>> socket 2[core 23[hwt 0]]:
> > >>>>>>>>>>>
> > >>>>>
> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24
> [hwt
> > >>>>> 0]],
> > >>>>>>>>> socket
> > >>>>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]],
> > >>>>>>>>>>> socket 3[core 27[hwt 0]]:
> > >>>>>>>>>>>
> > >>>>>
> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28
> [hwt
> > >>>>> 0]],
> > >>>>>>>>> socket
> > >>>>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]],
> > >>>>>>>>>>> socket 3[core 31[hwt 0]]:
> > >>>>>>>>>>>
> > >>>>>
> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0
> [hwt
> > >>> 0]],
> > >>>>>>>>> socket
> > >>>>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> > >>>>>>>>>>> cket 0[core 3[hwt 0]]:
> > >>>>>>>>>>>
> > >>>>>
> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4
> [hwt
> > >>> 0]],
> > >>>>>>>>> socket
> > >>>>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
> > >>>>>>>>>>> cket 0[core 7[hwt 0]]:
> > >>>>>>>>>>>
> > >>>>>
> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>>>> Hello world from process 2 of 8
> > >>>>>>>>>>> Hello world from process 1 of 8
> > >>>>>>>>>>> Hello world from process 3 of 8
> > >>>>>>>>>>> Hello world from process 0 of 8
> > >>>>>>>>>>> Hello world from process 6 of 8
> > >>>>>>>>>>> Hello world from process 5 of 8
> > >>>>>>>>>>> Hello world from process 4 of 8
> > >>>>>>>>>>> Hello world from process 7 of 8
> > >>>>>>>>>>>
> > >>>>>>>>>>> I think this should be like this:
> > >>>>>>>>>>>
> > >>>>>>>>>>> rank 00
> > >>>>>>>>>>>
> > >>>>>
> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>>>> rank 01
> > >>>>>>>>>>>
> > >>>>>
> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>>>>>>>> rank 02
> > >>>>>>>>>>>
> > >>>>>
> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>>>>>>>> ...
> > >>>>>>>>>>>
> > >>>>>>>>>>> Regards,
> > >>>>>>>>>>> Tetsuya Mishima
> > >>>>>>>>>>>
> > >>>>>>>>>>>> I fixed this under the trunk (was an issue regardless of
RM)
> > > and
> > >>>>>>> have
> > >>>>>>>>>>> scheduled it for 1.7.4.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks!
> > >>>>>>>>>>>> Ralph
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp
> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Ralph,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thank you very much for your quick response.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I'm afraid to say that I found one more issue...
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> It's not so serious. Please check it when you have a lot
of
> > >>> time.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The problem is cpus-per-proc with the -map-by option under the
> > >>>>>>>>>>>>> Torque manager. It doesn't work as shown below. I guess you can
> > >>>>>>>>>>>>> get the same behaviour under the Slurm manager.

> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Of course, if I remove -map-by option, it works quite
well.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
> > >>>>>>>>>>>>> qsub: waiting for job 8116.manage.cluster to start
> > >>>>>>>>>>>>> qsub: job 8116.manage.cluster ready
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> [mishima@node03 ~]$ cd ~/Ducom/testbed2
> > >>>>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings
> > >>>>>>>>> -cpus-per-proc
> > >>>>>>>>>>> 4
> > >>>>>>>>>>>>> -map-by socket mPre
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>
> > >
>
--------------------------------------------------------------------------
> > >>>>>>>>>>>>> A request was made to bind to that would result in
binding
> > > more
> > >>>>>>>>>>>>> processes than cpus on a resource:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Bind to:         CORE
> > >>>>>>>>>>>>> Node:            node03
> > >>>>>>>>>>>>> #processes:  2
> > >>>>>>>>>>>>> #cpus:          1
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> You can override this protection by adding the
> > >>> "overload-allowed"
> > >>>>>>>>>>>>> option to your binding directive.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>
> > >
>
--------------------------------------------------------------------------
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings
> > >>>>>>>>> -cpus-per-proc
> > >>>>>>>>>>> 4
> > >>>>>>>>>>>>> mPre
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core
8
> > > [hwt
> > >>>>> 0]],
> > >>>>>>>>>>> socket
> > >>>>>>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s
> > >>>>>>>>>>>>> ocket 1[core 11[hwt 0]]:
> > >>>>>>>>>>>>>
> > >>>>>>>
> > > [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core
12
> > > [hwt
> > >>>>>>> 0]],
> > >>>>>>>>>>> socket
> > >>>>>>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]],
> > >>>>>>>>>>>>> socket 1[core 15[hwt 0]]:
> > >>>>>>>>>>>>>
> > >>>>>>>
> > > [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core
16
> > > [hwt
> > >>>>>>> 0]],
> > >>>>>>>>>>> socket
> > >>>>>>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]],
> > >>>>>>>>>>>>> socket 2[core 19[hwt 0]]:
> > >>>>>>>>>>>>>
> > >>>>>>>
> > > [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core
20
> > > [hwt
> > >>>>>>> 0]],
> > >>>>>>>>>>> socket
> > >>>>>>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]],
> > >>>>>>>>>>>>> socket 2[core 23[hwt 0]]:
> > >>>>>>>>>>>>>
> > >>>>>>>
> > > [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core
24
> > > [hwt
> > >>>>>>> 0]],
> > >>>>>>>>>>> socket
> > >>>>>>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]],
> > >>>>>>>>>>>>> socket 3[core 27[hwt 0]]:
> > >>>>>>>>>>>>>
> > >>>>>>>
> > > [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core
28
> > > [hwt
> > >>>>>>> 0]],
> > >>>>>>>>>>> socket
> > >>>>>>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]],
> > >>>>>>>>>>>>> socket 3[core 31[hwt 0]]:
> > >>>>>>>>>>>>>
> > >>>>>>>
> > >
>
[./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0
> > > [hwt
> > >>>>> 0]],
> > >>>>>>>>>>> socket
> > >>>>>>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> > >>>>>>>>>>>>> cket 0[core 3[hwt 0]]:
> > >>>>>>>>>>>>>
> > >>>>>>>
> > > [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core
4
> > > [hwt
> > >>>>> 0]],
> > >>>>>>>>>>> socket
> > >>>>>>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
> > >>>>>>>>>>>>> cket 0[core 7[hwt 0]]:
> > >>>>>>>>>>>>>
> > >>>>>>>
> > > [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> Regards,
> > >>>>>>>>>>>>> Tetsuya Mishima
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain
> > > <r...@open-mpi.org>
> > >>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks! That's precisely where I was going to look when I had
> > >>>>>>>>>>>>>> time :-)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I'll update tomorrow.
> > >>>>>>>>>>>>>> Ralph
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM,
> > >>>>>>> <tmish...@jcity.maeda.co.jp>wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hi Ralph,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> This is the continuous story of "Segmentation fault in
> > >>> oob_tcp.c
> > >>>>>>> of
> > >>>>>>>>>>>>>> openmpi-1.7.4a1r29646".
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I found the cause.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Firstly, I noticed that your hostfile can work and mine
> can
> > >>> not.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Your host file:
> > >>>>>>>>>>>>>> cat hosts
> > >>>>>>>>>>>>>> bend001 slots=12
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> My host file:
> > >>>>>>>>>>>>>> cat hosts
> > >>>>>>>>>>>>>> node08
> > >>>>>>>>>>>>>> node08
> > >>>>>>>>>>>>>> ...(total 8 lines)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I modified my script file to add "slots=1" to each line of my
> > >>>>>>>>>>>>>> hostfile just before launching mpirun. Then it worked.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> My host file(modified):
> > >>>>>>>>>>>>>> cat hosts
> > >>>>>>>>>>>>>> node08 slots=1
> > >>>>>>>>>>>>>> node08 slots=1
> > >>>>>>>>>>>>>> ...(total 8 lines)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Secondly, I confirmed that there's a slight difference between
> > >>>>>>>>>>>>>> orte/util/hostfile/hostfile.c of 1.7.3 and that of
> > >>>>>>>>>>>>>> 1.7.4a1r29646.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
> > >>>>>>>>>>>>>> 394,401c394,399
> > >>>>>>>>>>>>>> <     if (got_count) {
> > >>>>>>>>>>>>>> <         node->slots_given = true;
> > >>>>>>>>>>>>>> <     } else if (got_max) {
> > >>>>>>>>>>>>>> <         node->slots = node->slots_max;
> > >>>>>>>>>>>>>> <         node->slots_given = true;
> > >>>>>>>>>>>>>> <     } else {
> > >>>>>>>>>>>>>> <         /* should be set by obj_new, but just to be
> clear
> > > */
> > >>>>>>>>>>>>>> <         node->slots_given = false;
> > >>>>>>>>>>>>>> ---
> > >>>>>>>>>>>>>>> if (!got_count) {
> > >>>>>>>>>>>>>>>   if (got_max) {
> > >>>>>>>>>>>>>>>       node->slots = node->slots_max;
> > >>>>>>>>>>>>>>>   } else {
> > >>>>>>>>>>>>>>>       ++node->slots;
> > >>>>>>>>>>>>>>>   }
> > >>>>>>>>>>>>>> ....
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Finally, I added the line 402 below just as a tentative
> > > trial.
> > >>>>>>>>>>>>>> Then, it worked.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> cat -n orte/util/hostfile/hostfile.c:
> > >>>>>>>>>>>>>> ...
> > >>>>>>>>>>>>>> 394      if (got_count) {
> > >>>>>>>>>>>>>> 395          node->slots_given = true;
> > >>>>>>>>>>>>>> 396      } else if (got_max) {
> > >>>>>>>>>>>>>> 397          node->slots = node->slots_max;
> > >>>>>>>>>>>>>> 398          node->slots_given = true;
> > >>>>>>>>>>>>>> 399      } else {
> > >>>>>>>>>>>>>> 400          /* should be set by obj_new, but just to be
> > > clear
> > >>>>> */
> > >>>>>>>>>>>>>> 401          node->slots_given
> > > = false;
> > >>>>>>>>>>>>>> 402          ++node->slots; /* added by tmishima */
> > >>>>>>>>>>>>>> 403      }
> > >>>>>>>>>>>>>> ...
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Please fix the problem properly, because my change is just
> > >>>>>>>>>>>>>> based on a random guess. It's related to the treatment of a
> > >>>>>>>>>>>>>> hostfile where slots information is not given.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>> Tetsuya Mishima
> > >>>>>>>>>>>>>>