Hi Ralph,

Thank you for your fix. It works for me.

Tetsuya Mishima


> Actually, it looks like it would happen with hetero-nodes set - it only
> requires that at least two nodes have the same architecture. So you might
> want to give the trunk a shot, as it may well now be fixed.
>
>
> On Dec 19, 2013, at 8:35 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> > Hmmm...not having any luck tracking this down yet. If anything, based on
> > what I saw in the code, I would have expected it to fail when
> > hetero-nodes was false, not the other way around.
> >
> > I'll keep poking around - just wanted to provide an update.
> >
> > On Dec 19, 2013, at 12:54 AM, tmish...@jcity.maeda.co.jp wrote:
> >
> >>
> >>
> >> Hi Ralph, sorry for the crossing post.
> >>
> >> Your advice about -hetero-nodes in the other thread gave me a hint.
> >>
> >> I already put "orte_hetero_nodes = 1" in my mca-params.conf, because
> >> you told me a month ago that my environment would need this option.
> >>
> >> Removing this line from mca-params.conf makes it work. In other words,
> >> you can replicate the problem by adding -hetero-nodes, as shown below.
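> >>
> >> (For reference, the same MCA setting can be given in several equivalent
> >> ways - a sketch, assuming a default Open MPI install:
> >>
> >>   in $HOME/.openmpi/mca-params.conf:  orte_hetero_nodes = 1
> >>   on the command line:                mpirun --mca orte_hetero_nodes 1 ...
> >>   via the shorthand flag:             mpirun -hetero-nodes ...
> >> )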
> >>
> >> qsub: job 8364.manage.cluster completed
> >> [mishima@manage mpi]$ qsub -I -l nodes=2:ppn=8
> >> qsub: waiting for job 8365.manage.cluster to start
> >> qsub: job 8365.manage.cluster ready
> >>
> >> [mishima@node11 ~]$ ompi_info --all | grep orte_hetero_nodes
> >>               MCA orte: parameter "orte_hetero_nodes" (current value:
> >> "false", data source: default, level: 9 dev/all,
> >> type: bool)
> >> [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> >> myprog
> >> [node11.cluster:27895] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket
> >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> >> [B/B/B/B][./././.]
> >> [node11.cluster:27895] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket
> >> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]:
> >> [./././.][B/B/B/B]
> >> [node12.cluster:24891] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket
> >> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]:
> >> [./././.][B/B/B/B]
> >> [node12.cluster:24891] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket
> >> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> >> [B/B/B/B][./././.]
> >> Hello world from process 0 of 4
> >> Hello world from process 1 of 4
> >> Hello world from process 2 of 4
> >> Hello world from process 3 of 4
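> >>
> >> (myprog here is presumably a minimal MPI hello-world along these lines -
> >> a sketch, since its source is not shown in the thread:
> >>
> >>   #include <stdio.h>
> >>   #include <mpi.h>
> >>
> >>   int main(int argc, char *argv[])
> >>   {
> >>       int rank, size;
> >>       MPI_Init(&argc, &argv);                 /* start up MPI */
> >>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
> >>       MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total process count */
> >>       printf("Hello world from process %d of %d\n", rank, size);
> >>       MPI_Finalize();
> >>       return 0;
> >>   }
> >> )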
> >> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> >> -hetero-nodes myprog
> >> --------------------------------------------------------------------------
> >> A request was made to bind to that would result in binding more
> >> processes than cpus on a resource:
> >>
> >>  Bind to:         CORE
> >>  Node:            node12
> >>  #processes:  2
> >>  #cpus:          1
> >>
> >> You can override this protection by adding the "overload-allowed"
> >> option to your binding directive.
> >> --------------------------------------------------------------------------
> >>
> >>
> >> As far as I checked, data->num_bound seems to go bad in bind_downwards
> >> when I put "-hetero-nodes". I hope you can clear up the problem.
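> >>
> >> (Illustratively, the failing guard presumably has the shape below - a
> >> self-contained toy, not the real bind_downwards() source; the names
> >> obj_data, bind_one, num_cpus and overload_ok are hypothetical, with
> >> num_bound mirroring the data->num_bound counter named above:
> >>
> >>   #include <stdio.h>
> >>
> >>   struct obj_data { int num_bound; };  /* procs bound to one hwloc object */
> >>
> >>   /* bind one more proc; fail if the object would be overloaded */
> >>   static int bind_one(struct obj_data *d, int num_cpus, int overload_ok)
> >>   {
> >>       d->num_bound++;
> >>       if (d->num_bound > num_cpus && !overload_ok) {
> >>           fprintf(stderr, "#processes: %d\n#cpus: %d\n",
> >>                   d->num_bound, num_cpus);
> >>           return -1;   /* the path that prints the error above */
> >>       }
> >>       return 0;
> >>   }
> >>
> >>   int main(void)
> >>   {
> >>       /* a stale count of 1 on a core believed to hold only 1 cpu
> >>          reproduces the reported "#processes: 2 / #cpus: 1" */
> >>       struct obj_data core = { 1 };
> >>       return bind_one(&core, 1, 0) ? 1 : 0;
> >>   }
> >> )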
> >>
> >> Regards,
> >> Tetsuya Mishima
> >>
> >>
> >>> Yes, it's very strange. But I don't think there's any chance that
> >>> I have < 8 actual cores on the node. I guess you can replicate
> >>> it with SLURM - please try it again.
> >>>
> >>> I changed to using node10 and node11, and then I got the warning
> >>> for node11.
> >>>
> >>> Furthermore, just as information for you: I tried adding
> >>> "-bind-to core:overload-allowed", and then it worked as shown below.
> >>> But I think node11 can never be overloaded, because it has 8 cores.
> >>>
> >>> qsub: job 8342.manage.cluster completed
> >>> [mishima@manage ~]$ qsub -I -l nodes=node10:ppn=8+node11:ppn=8
> >>> qsub: waiting for job 8343.manage.cluster to start
> >>> qsub: job 8343.manage.cluster ready
> >>>
> >>> [mishima@node10 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >>> [mishima@node10 demos]$ cat $PBS_NODEFILE
> >>> node10
> >>> node10
> >>> node10
> >>> node10
> >>> node10
> >>> node10
> >>> node10
> >>> node10
> >>> node11
> >>> node11
> >>> node11
> >>> node11
> >>> node11
> >>> node11
> >>> node11
> >>> node11
> >>> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> >>> myprog
> >>> --------------------------------------------------------------------------
> >>> A request was made to bind to that would result in binding more
> >>> processes than cpus on a resource:
> >>>
> >>> Bind to:         CORE
> >>> Node:            node11
> >>> #processes:  2
> >>> #cpus:          1
> >>>
> >>> You can override this protection by adding the "overload-allowed"
> >>> option to your binding directive.
> >>> --------------------------------------------------------------------------
> >>> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> >>> -bind-to core:overload-allowed myprog
> >>> [node10.cluster:27020] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket
> >>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> >>> [B/B/B/B][./././.]
> >>> [node10.cluster:27020] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket
> >>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]:
> >>> [./././.][B/B/B/B]
> >>> [node11.cluster:26597] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket
> >>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]:
> >>> [./././.][B/B/B/B]
> >>> [node11.cluster:26597] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket
> >>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> >>> [B/B/B/B][./././.]
> >>> Hello world from process 1 of 4
> >>> Hello world from process 0 of 4
> >>> Hello world from process 3 of 4
> >>> Hello world from process 2 of 4
> >>>
> >>> Regards,
> >>> Tetsuya Mishima
> >>>
> >>>
> >>>> Very strange - I can't seem to replicate it. Is there any chance that
> >>>> you have < 8 actual cores on node12?
> >>>>
> >>>>
> >>>> On Dec 18, 2013, at 4:53 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>> Hi Ralph, sorry for confusing you.
> >>>>>
> >>>>> At that time, I cut and pasted the part with "cat $PBS_NODEFILE".
> >>>>> I guess I failed to paste the last line by mistake.
> >>>>>
> >>>>> I retried the test, and below is exactly what I got when I ran it.
> >>>>>
> >>>>> [mishima@manage ~]$ qsub -I -l nodes=node11:ppn=8+node12:ppn=8
> >>>>> qsub: waiting for job 8338.manage.cluster to start
> >>>>> qsub: job 8338.manage.cluster ready
> >>>>>
> >>>>> [mishima@node11 ~]$ cat $PBS_NODEFILE
> >>>>> node11
> >>>>> node11
> >>>>> node11
> >>>>> node11
> >>>>> node11
> >>>>> node11
> >>>>> node11
> >>>>> node11
> >>>>> node12
> >>>>> node12
> >>>>> node12
> >>>>> node12
> >>>>> node12
> >>>>> node12
> >>>>> node12
> >>>>> node12
> >>>>> [mishima@node11 ~]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> >>>>> myprog
> >>>>> --------------------------------------------------------------------------
> >>>>> A request was made to bind to that would result in binding more
> >>>>> processes than cpus on a resource:
> >>>>>
> >>>>> Bind to:         CORE
> >>>>> Node:            node12
> >>>>> #processes:  2
> >>>>> #cpus:          1
> >>>>>
> >>>>> You can override this protection by adding the "overload-allowed"
> >>>>> option to your binding directive.
> >>>>> --------------------------------------------------------------------------
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> Tetsuya Mishima
> >>>>>
> >>>>>> I removed the debug in #2 - thanks for reporting it
> >>>>>>
> >>>>>> For #1, it actually looks to me like this is correct. If you look at
> >>>>>> your allocation, there are only 7 slots being allocated on node12,
> >>>>>> yet you have asked for 8 cpus to be assigned (2 procs with 4
> >>>>>> cpus/proc). So the warning is in fact correct
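> >>>>>>
> >>>>>> (As a quick check of the arithmetic: -np 4 with -cpus-per-proc 4 needs
> >>>>>> 4 x 4 = 16 cpus; node11's 8 slots cover ranks 0-1, leaving ranks 2-3
> >>>>>> needing 2 x 4 = 8 cpus on node12, which $PBS_NODEFILE lists only 7
> >>>>>> times.)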
> >>>>>>
> >>>>>>
> >>>>>> On Dec 18, 2013, at 4:04 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi Ralph, I found that openmpi-1.7.4rc1 was already uploaded, so I'd
> >>>>>>> like to report 3 issues, mainly regarding -cpus-per-proc.
> >>>>>>>
> >>>>>>> 1) When I use 2 nodes (node11, node12), which have 8 cores each
> >>>>>>> (= 2 sockets x 4 cores/socket), it starts to produce the error again,
> >>>>>>> as shown below. At least openmpi-1.7.4a1r29646 did work well.
> >>>>>>>
> >>>>>>> [mishima@manage ~]$ qsub -I -l nodes=2:ppn=8
> >>>>>>> qsub: waiting for job 8336.manage.cluster to start
> >>>>>>> qsub: job 8336.manage.cluster ready
> >>>>>>>
> >>>>>>> [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >>>>>>> [mishima@node11 demos]$ cat $PBS_NODEFILE
> >>>>>>> node11
> >>>>>>> node11
> >>>>>>> node11
> >>>>>>> node11
> >>>>>>> node11
> >>>>>>> node11
> >>>>>>> node11
> >>>>>>> node11
> >>>>>>> node12
> >>>>>>> node12
> >>>>>>> node12
> >>>>>>> node12
> >>>>>>> node12
> >>>>>>> node12
> >>>>>>> node12
> >>>>>>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> >>>>>>> myprog
> >>>>>>> --------------------------------------------------------------------------
> >>>>>>> A request was made to bind to that would result in binding more
> >>>>>>> processes than cpus on a resource:
> >>>>>>>
> >>>>>>> Bind to:         CORE
> >>>>>>> Node:            node12
> >>>>>>> #processes:  2
> >>>>>>> #cpus:          1
> >>>>>>>
> >>>>>>> You can override this protection by adding the "overload-allowed"
> >>>>>>> option to your binding directive.
> >>>>>>> --------------------------------------------------------------------------
> >>>>>>>
> >>>>>>> Of course it works well using only one node.
> >>>>>>>
> >>>>>>> [mishima@node11 demos]$ mpirun -np 2 -cpus-per-proc 4 -report-bindings
> >>>>>>> myprog
> >>>>>>> [node11.cluster:26238] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> >>>>>>> socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt
> >>>>>>> 0]]: [B/B/B/B][./././.]
> >>>>>>> [node11.cluster:26238] MCW rank 1 bound to socket 1[core 4[hwt 0]],
> >>>>>>> socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt
> >>>>>>> 0]]: [./././.][B/B/B/B]
> >>>>>>> Hello world from process 1 of 2
> >>>>>>> Hello world from process 0 of 2
> >>>>>>>
> >>>>>>>
> >>>>>>> 2) Adding "-bind-to numa", it works, but the message "bind:upward
> >>>>>>> target NUMANode type NUMANode" appears. As far as I remember, I
> >>>>>>> didn't see this kind of message before.
> >>>>>>>
> >>>>>>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> >>>>>>> -bind-to numa myprog
> >>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type
> >>>>>>> NUMANode
> >>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type
> >>>>>>> NUMANode
> >>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type
> >>>>>>> NUMANode
> >>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type
> >>>>>>> NUMANode
> >>>>>>> [node11.cluster:26260] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> >>>>>>> socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt
> >>>>>>> 0]]: [B/B/B/B][./././.]
> >>>>>>> [node11.cluster:26260] MCW rank 1 bound to socket 1[core 4[hwt 0]],
> >>>>>>> socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt
> >>>>>>> 0]]: [./././.][B/B/B/B]
> >>>>>>> [node12.cluster:23607] MCW rank 3 bound to socket 1[core 4[hwt 0]],
> >>>>>>> socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt
> >>>>>>> 0]]: [./././.][B/B/B/B]
> >>>>>>> [node12.cluster:23607] MCW rank 2 bound to socket 0[core 0[hwt 0]],
> >>>>>>> socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt
> >>>>>>> 0]]: [B/B/B/B][./././.]
> >>>>>>> Hello world from process 1 of 4
> >>>>>>> Hello world from process 0 of 4
> >>>>>>> Hello world from process 3 of 4
> >>>>>>> Hello world from process 2 of 4
> >>>>>>>
> >>>>>>>
> >>>>>>> 3) I use the PGI compiler. It cannot accept the compiler switch
> >>>>>>> "-Wno-variadic-macros", which is included in the configure script:
> >>>>>>>
> >>>>>>>       btl_usnic_CFLAGS="-Wno-variadic-macros"
> >>>>>>>
> >>>>>>> I removed this switch, and then I could continue to build 1.7.4rc1.
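> >>>>>>>
> >>>>>>> (A sketch of that edit in configure - the gcc-style warning flag is
> >>>>>>> simply dropped for compilers such as PGI that reject it:
> >>>>>>>
> >>>>>>>       btl_usnic_CFLAGS=""
> >>>>>>> )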
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Tetsuya Mishima
> >>>>>>>
> >>>>>>>
> >>>>>>>> Hmmm...okay, I understand the scenario. Must be something in the
> >>>>>>>> algo when it only has one node, so it shouldn't be too hard to track
> >>>>>>>> down.
> >>>>>>>>
> >>>>>>>> I'm off on travel for a few days, but will return to this when I get
> >>>>>>>> back.
> >>>>>>>>
> >>>>>>>> Sorry for the delay - will try to look at this while I'm gone, but
> >>>>>>>> can't promise anything :-(
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Dec 10, 2013, at 6:58 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hi Ralph, sorry for the confusion.
> >>>>>>>>>
> >>>>>>>>> We usually log on to "manage", which is our control node.
> >>>>>>>>> From manage, we submit jobs or enter a remote node such as
> >>>>>>>>> node03 via torque's interactive mode (qsub -I).
> >>>>>>>>>
> >>>>>>>>> At that time, instead of torque, I just rsh'ed to node03 from
> >>>>>>>>> manage and ran myprog on the node. I hope you can understand what
> >>>>>>>>> I did.
> >>>>>>>>>
> >>>>>>>>> Now, I retried with "-host node03", which still causes the problem
> >>>>>>>>> (I confirmed that a local run on manage caused the same problem
> >>>>>>>>> too):
> >>>>>>>>>
> >>>>>>>>> [mishima@manage ~]$ rsh node03
> >>>>>>>>> Last login: Wed Dec 11 11:38:57 from manage
> >>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >>>>>>>>> [mishima@node03 demos]$
> >>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings
> >>>>>>>>> -cpus-per-proc 4 -map-by socket myprog
> >>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>> A request was made to bind to that would result in binding more
> >>>>>>>>> processes than cpus on a resource:
> >>>>>>>>>
> >>>>>>>>> Bind to:         CORE
> >>>>>>>>> Node:            node03
> >>>>>>>>> #processes:  2
> >>>>>>>>> #cpus:          1
> >>>>>>>>>
> >>>>>>>>> You can override this protection by adding the "overload-allowed"
> >>>>>>>>> option to your binding directive.
> >>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>
> >>>>>>>>> It's strange, but I have to report that "-map-by socket:span"
> >>>>>>>>> worked well.
> >>>>>>>>>
> >>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings
> >>>>>>>>> -cpus-per-proc 4 -map-by socket:span myprog
> >>>>>>>>> [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt 0]],
> >>>>>>>>> socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core
> >>>>>>>>> 11[hwt 0]]:
> >>>>>>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>>>>>> [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt 0]],
> >>>>>>>>> socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core
> >>>>>>>>> 15[hwt 0]]:
> >>>>>>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> >>>>>>>>> [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt 0]],
> >>>>>>>>> socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core
> >>>>>>>>> 19[hwt 0]]:
> >>>>>>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>>>>>> [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt 0]],
> >>>>>>>>> socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core
> >>>>>>>>> 23[hwt 0]]:
> >>>>>>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> >>>>>>>>> [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt 0]],
> >>>>>>>>> socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core
> >>>>>>>>> 27[hwt 0]]:
> >>>>>>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> >>>>>>>>> [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt 0]],
> >>>>>>>>> socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core
> >>>>>>>>> 31[hwt 0]]:
> >>>>>>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> >>>>>>>>> [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> >>>>>>>>> socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core
> >>>>>>>>> 3[hwt 0]]:
> >>>>>>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>> [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt 0]],
> >>>>>>>>> socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core
> >>>>>>>>> 7[hwt 0]]:
> >>>>>>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >>>>>>>>> Hello world from process 2 of 8
> >>>>>>>>> Hello world from process 6 of 8
> >>>>>>>>> Hello world from process 3 of 8
> >>>>>>>>> Hello world from process 7 of 8
> >>>>>>>>> Hello world from process 1 of 8
> >>>>>>>>> Hello world from process 5 of 8
> >>>>>>>>> Hello world from process 0 of 8
> >>>>>>>>> Hello world from process 4 of 8
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Tetsuya Mishima
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On Dec 10, 2013, at 6:05 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>
> >>>>>>>>>>> I tried again with -cpus-per-proc 2 as shown below.
> >>>>>>>>>>> Here, I found that "-map-by socket:span" worked well.
> >>>>>>>>>>>
> >>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings
> >>>>>>>>>>> -cpus-per-proc 2 -map-by socket:span myprog
> >>>>>>>>>>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt 0]],
> >>>>>>>>>>> socket 1[core 9[hwt 0]]:
> >>>>>>>>>>> [./././././././.][B/B/./././././.][./././././././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt 0]],
> >>>>>>>>>>> socket 1[core 11[hwt 0]]:
> >>>>>>>>>>> [./././././././.][././B/B/./././.][./././././././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt 0]],
> >>>>>>>>>>> socket 2[core 17[hwt 0]]:
> >>>>>>>>>>> [./././././././.][./././././././.][B/B/./././././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt 0]],
> >>>>>>>>>>> socket 2[core 19[hwt 0]]:
> >>>>>>>>>>> [./././././././.][./././././././.][././B/B/./././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt 0]],
> >>>>>>>>>>> socket 3[core 25[hwt 0]]:
> >>>>>>>>>>> [./././././././.][./././././././.][./././././././.][B/B/./././././.]
> >>>>>>>>>>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt 0]],
> >>>>>>>>>>> socket 3[core 27[hwt 0]]:
> >>>>>>>>>>> [./././././././.][./././././././.][./././././././.][././B/B/./././.]
> >>>>>>>>>>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> >>>>>>>>>>> socket 0[core 1[hwt 0]]:
> >>>>>>>>>>> [B/B/./././././.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt 0]],
> >>>>>>>>>>> socket 0[core 3[hwt 0]]:
> >>>>>>>>>>> [././B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>> Hello world from process 1 of 8
> >>>>>>>>>>> Hello world from process 0 of 8
> >>>>>>>>>>> Hello world from process 4 of 8
> >>>>>>>>>>> Hello world from process 2 of 8
> >>>>>>>>>>> Hello world from process 7 of 8
> >>>>>>>>>>> Hello world from process 6 of 8
> >>>>>>>>>>> Hello world from process 5 of 8
> >>>>>>>>>>> Hello world from process 3 of 8
> >>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings
> >>>>>>>>>>> -cpus-per-proc 2 -map-by socket myprog
> >>>>>>>>>>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt 0]],
> >>>>>>>>>>> socket 0[core 5[hwt 0]]:
> >>>>>>>>>>> [././././B/B/./.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt 0]],
> >>>>>>>>>>> socket 0[core 7[hwt 0]]:
> >>>>>>>>>>> [././././././B/B][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt 0]],
> >>>>>>>>>>> socket 1[core 9[hwt 0]]:
> >>>>>>>>>>> [./././././././.][B/B/./././././.][./././././././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt 0]],
> >>>>>>>>>>> socket 1[core 11[hwt 0]]:
> >>>>>>>>>>> [./././././././.][././B/B/./././.][./././././././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt 0]],
> >>>>>>>>>>> socket 1[core 13[hwt 0]]:
> >>>>>>>>>>> [./././././././.][././././B/B/./.][./././././././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt 0]],
> >>>>>>>>>>> socket 1[core 15[hwt 0]]:
> >>>>>>>>>>> [./././././././.][././././././B/B][./././././././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> >>>>>>>>>>> socket 0[core 1[hwt 0]]:
> >>>>>>>>>>> [B/B/./././././.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt 0]],
> >>>>>>>>>>> socket 0[core 3[hwt 0]]:
> >>>>>>>>>>> [././B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>> Hello world from process 5 of 8
> >>>>>>>>>>> Hello world from process 1 of 8
> >>>>>>>>>>> Hello world from process 6 of 8
> >>>>>>>>>>> Hello world from process 4 of 8
> >>>>>>>>>>> Hello world from process 2 of 8
> >>>>>>>>>>> Hello world from process 0 of 8
> >>>>>>>>>>> Hello world from process 7 of 8
> >>>>>>>>>>> Hello world from process 3 of 8
> >>>>>>>>>>>
> >>>>>>>>>>> "-np 8" and "-cpus-per-proc 4" just filled all sockets.
> >>>>>>>>>>> In this case, I guess "-map-by socket:span" and "-map-by socket"
> >>>>>>>>>>> have the same meaning, so there's no problem there. Sorry for
> >>>>>>>>>>> the disturbance.
> >>>>>>>>>>
> >>>>>>>>>> No problem - glad you could clear that up :-)
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> By the way, through this test, I found another problem.
> >>>>>>>>>>> Without the torque manager, just using rsh, it causes the same
> >>>>>>>>>>> error as below:
> >>>>>>>>>>>
> >>>>>>>>>>> [mishima@manage openmpi-1.7]$ rsh node03
> >>>>>>>>>>> Last login: Wed Dec 11 09:42:02 from manage
> >>>>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings
> >>>>>>>>>>> -cpus-per-proc 4 -map-by socket myprog
> >>>>>>>>>>
> >>>>>>>>>> I don't understand the difference here - you are simply starting
> >>>>>>>>>> it from a different node? It looks like everything is expected to
> >>>>>>>>>> run local to mpirun, yes? So there is no rsh actually involved
> >>>>>>>>>> here. Are you still running in an allocation?
> >>>>>>>>>>
> >>>>>>>>>> If you run this with "-host node03" on the cmd line, do you see
> >>>>>>>>>> the same problem?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>>> A request was made to bind to that would result in binding
more
> >>>>>>>>>>> processes than cpus on a resource:
> >>>>>>>>>>>
> >>>>>>>>>>> Bind to:         CORE
> >>>>>>>>>>> Node:            node03
> >>>>>>>>>>> #processes:  2
> >>>>>>>>>>> #cpus:          1
> >>>>>>>>>>>
> >>>>>>>>>>> You can override this protection by adding the "overload-allowed"
> >>>>>>>>>>> option to your binding directive.
> >>>>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>>> [mishima@node03 demos]$
> >>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings
> >>>>>>>>>>> -cpus-per-proc 4 myprog
> >>>>>>>>>>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt 0]],
> >>>>>>>>>>> socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core
> >>>>>>>>>>> 11[hwt 0]]:
> >>>>>>>>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt 0]],
> >>>>>>>>>>> socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core
> >>>>>>>>>>> 15[hwt 0]]:
> >>>>>>>>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt 0]],
> >>>>>>>>>>> socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core
> >>>>>>>>>>> 19[hwt 0]]:
> >>>>>>>>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt 0]],
> >>>>>>>>>>> socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core
> >>>>>>>>>>> 23[hwt 0]]:
> >>>>>>>>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> >>>>>>>>>>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt 0]],
> >>>>>>>>>>> socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core
> >>>>>>>>>>> 27[hwt 0]]:
> >>>>>>>>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> >>>>>>>>>>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt 0]],
> >>>>>>>>>>> socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core
> >>>>>>>>>>> 31[hwt 0]]:
> >>>>>>>>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> >>>>>>>>>>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> >>>>>>>>>>> socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core
> >>>>>>>>>>> 3[hwt 0]]:
> >>>>>>>>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt 0]],
> >>>>>>>>>>> socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core
> >>>>>>>>>>> 7[hwt 0]]:
> >>>>>>>>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>> Hello world from process 4 of 8
> >>>>>>>>>>> Hello world from process 2 of 8
> >>>>>>>>>>> Hello world from process 6 of 8
> >>>>>>>>>>> Hello world from process 5 of 8
> >>>>>>>>>>> Hello world from process 3 of 8
> >>>>>>>>>>> Hello world from process 7 of 8
> >>>>>>>>>>> Hello world from process 0 of 8
> >>>>>>>>>>> Hello world from process 1 of 8
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>
> >>>>>>>>>>>> Hmmm...that's strange. I only have 2 sockets on my system, but
> >>>>>>>>>>>> let me poke around a bit and see what might be happening.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks. I didn't know the meaning of "socket:span".
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> But it still causes the problem; it seems socket:span doesn't
> >>>>>>>>>>>>> work.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32
> >>>>>>>>>>>>> qsub: waiting for job 8265.manage.cluster to start
> >>>>>>>>>>>>> qsub: job 8265.manage.cluster ready
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >>>>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings
> >>>>>>>>>>>>> -cpus-per-proc 4 -map-by socket:span myprog
> >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8[hwt
> >>>>>>>>>>>>> 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket
> >>>>>>>>>>>>> 1[core 11[hwt 0]]:
> >>>>>>>>>>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12[hwt
> >>>>>>>>>>>>> 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket
> >>>>>>>>>>>>> 1[core 15[hwt 0]]:
> >>>>>>>>>>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16[hwt
> >>>>>>>>>>>>> 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket
> >>>>>>>>>>>>> 2[core 19[hwt 0]]:
> >>>>>>>>>>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20[hwt
> >>>>>>>>>>>>> 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket
> >>>>>>>>>>>>> 2[core 23[hwt 0]]:
> >>>>>>>>>>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24[hwt
> >>>>>>>>>>>>> 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket
> >>>>>>>>>>>>> 3[core 27[hwt 0]]:
> >>>>>>>>>>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28[hwt
> >>>>>>>>>>>>> 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket
> >>>>>>>>>>>>> 3[core 31[hwt 0]]:
> >>>>>>>>>>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0[hwt
> >>>>>>>>>>>>> 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket
> >>>>>>>>>>>>> 0[core 3[hwt 0]]:
> >>>>>>>>>>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4[hwt
> >>>>>>>>>>>>> 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket
> >>>>>>>>>>>>> 0[core 7[hwt 0]]:
> >>>>>>>>>>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>>>> Hello world from process 0 of 8
> >>>>>>>>>>>>> Hello world from process 3 of 8
> >>>>>>>>>>>>> Hello world from process 1 of 8
> >>>>>>>>>>>>> Hello world from process 4 of 8
> >>>>>>>>>>>>> Hello world from process 6 of 8
> >>>>>>>>>>>>> Hello world from process 5 of 8
> >>>>>>>>>>>>> Hello world from process 2 of 8
> >>>>>>>>>>>>> Hello world from process 7 of 8
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> No, that is actually correct. We map a socket until full, then
> >>>>>>>>>>>>>> move to the next. What you want is --map-by socket:span
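> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Schematically, what the two policies do for 8 procs with
> >>>>>>>>>>>>>> -cpus-per-proc 2 on one 4-socket x 8-core node (matching the
> >>>>>>>>>>>>>> -cpus-per-proc 2 outputs earlier in this thread):
> >>>>>>>>>>>>>>   -map-by socket      : ranks 0-3 fill socket 0, ranks 4-7
> >>>>>>>>>>>>>>                         fill socket 1, sockets 2-3 stay idle
> >>>>>>>>>>>>>>   -map-by socket:span : ranks 0-1 on socket 0, 2-3 on socket 1,
> >>>>>>>>>>>>>>                         4-5 on socket 2, 6-7 on socket 3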
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Dec 10, 2013, at 3:42 PM, tmishi...@jcity.maeda.co.jp wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I had time to try your patch yesterday, using
> >>>>>>>>>>>>>>> openmpi-1.7.4a1r29646.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It stopped the error, but unfortunately "mapping by socket"
> >>>>>>>>>>>>>>> itself didn't work well, as shown below:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32
> >>>>>>>>>>>>>>> qsub: waiting for job 8260.manage.cluster to start
> >>>>>>>>>>>>>>> qsub: job 8260.manage.cluster ready
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >>>>>>>>>>>>>>> [mishima@node04 demos]$ mpirun -np 8 -report-bindings
> >>>>>>>>>>>>>>> -cpus-per-proc 4 -map-by socket myprog
> >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt
> >>>>>>>>>>>>>>> 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket
> >>>>>>>>>>>>>>> 1[core 11[hwt 0]]:
> >>>>>>>>>>>>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt
> >>>>>>>>>>>>>>> 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket
> >>>>>>>>>>>>>>> 1[core 15[hwt 0]]:
> >>>>>>>>>>>>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt
> >>>>>>>>>>>>>>> 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket
> >>>>>>>>>>>>>>> 2[core 19[hwt 0]]:
> >>>>>>>>>>>>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt
> >>>>>>>>>>>>>>> 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket
> >>>>>>>>>>>>>>> 2[core 23[hwt 0]]:
> >>>>>>>>>>>>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt
> >>>>>>>>>>>>>>> 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket
> >>>>>>>>>>>>>>> 3[core 27[hwt 0]]:
> >>>>>>>>>>>>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt
> >>>>>>>>>>>>>>> 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket
> >>>>>>>>>>>>>>> 3[core 31[hwt 0]]:
> >>>>>>>>>>>>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt
> >>>>>>>>>>>>>>> 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket
> >>>>>>>>>>>>>>> 0[core 3[hwt 0]]:
> >>>>>>>>>>>>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt
> >>>>>>>>>>>>>>> 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket
> >>>>>>>>>>>>>>> 0[core 7[hwt 0]]:
> >>>>>>>>>>>>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>>>>>> Hello world from process 2 of 8
> >>>>>>>>>>>>>>> Hello world from process 1 of 8
> >>>>>>>>>>>>>>> Hello world from process 3 of 8
> >>>>>>>>>>>>>>> Hello world from process 0 of 8
> >>>>>>>>>>>>>>> Hello world from process 6 of 8
> >>>>>>>>>>>>>>> Hello world from process 5 of 8
> >>>>>>>>>>>>>>> Hello world from process 4 of 8
> >>>>>>>>>>>>>>> Hello world from process 7 of 8
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think this should be like this:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> rank 00
> >>>>>>>>>>>>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>>>>>> rank 01
> >>>>>>>>>>>>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>>>>>>>>>>>> rank 02
> >>>>>>>>>>>>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I fixed this under the trunk (it was an issue regardless of
> >>>>>>>>>>>>>>>> RM) and have scheduled it for 1.7.4.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>>>>> Ralph
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp
> >>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thank you very much for your quick response.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I'm afraid to say that I found one more issue...
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> It's not so serious. Please check it when you have a lot
> >>>>>>>>>>>>>>>>> of time.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> The problem is cpus-per-proc with the -map-by option under
> >>>>>>>>>>>>>>>>> the Torque manager. It doesn't work as shown below. I guess
> >>>>>>>>>>>>>>>>> you would get the same behaviour under the Slurm manager.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Of course, if I remove the -map-by option, it works quite
> >>>>>>>>>>>>>>>>> well.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
> >>>>>>>>>>>>>>>>> qsub: waiting for job 8116.manage.cluster to start
> >>>>>>>>>>>>>>>>> qsub: job 8116.manage.cluster ready
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> [mishima@node03 ~]$ cd ~/Ducom/testbed2
> >>>>>>>>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings
> >>>>>>>>>>>>>>>>> -cpus-per-proc 4 -map-by socket mPre
> >>>>>>>>>>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>>>>>>>>> A request was made to bind to that would result in binding
> >>>>>>>>>>>>>>>>> more processes than cpus on a resource:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>  Bind to:         CORE
> >>>>>>>>>>>>>>>>>  Node:            node03
> >>>>>>>>>>>>>>>>>  #processes:  2
> >>>>>>>>>>>>>>>>>  #cpus:          1
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> You can override this protection by adding the
> >>>>>>>>>>>>>>>>> "overload-allowed" option to your binding directive.
> >>>>>>>>>>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings
> >>>>>>>>>>>>>>>>> -cpus-per-proc 4 mPre
> >>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core
> >>>>>>>>>>>>>>>>> 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt
> >>>>>>>>>>>>>>>>> 0]], socket 1[core 11[hwt 0]]:
> >>>>>>>>>>>>>>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core
> >>>>>>>>>>>>>>>>> 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt
> >>>>>>>>>>>>>>>>> 0]], socket 1[core 15[hwt 0]]:
> >>>>>>>>>>>>>>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> >>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core
> >>>>>>>>>>>>>>>>> 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt
> >>>>>>>>>>>>>>>>> 0]], socket 2[core 19[hwt 0]]:
> >>>>>>>>>>>>>>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core
> >>>>>>>>>>>>>>>>> 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt
> >>>>>>>>>>>>>>>>> 0]], socket 2[core 23[hwt 0]]:
> >>>>>>>>>>>>>>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> >>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core
> >>>>>>>>>>>>>>>>> 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt
> >>>>>>>>>>>>>>>>> 0]], socket 3[core 27[hwt 0]]:
> >>>>>>>>>>>>>>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> >>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core
> >>>>>>>>>>>>>>>>> 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt
> >>>>>>>>>>>>>>>>> 0]], socket 3[core 31[hwt 0]]:
> >>>>>>>>>>>>>>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> >>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core
> >>>>>>>>>>>>>>>>> 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt
> >>>>>>>>>>>>>>>>> 0]], socket 0[core 3[hwt 0]]:
> >>>>>>>>>>>>>>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core
> >>>>>>>>>>>>>>>>> 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt
> >>>>>>>>>>>>>>>>> 0]], socket 0[core 7[hwt 0]]:
> >>>>>>>>>>>>>>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again!
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain
> >>>>>>>>>>>>>>>>>> <r...@open-mpi.org> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks! That's precisely where I was going to look when I
> >>>>>>>>>>>>>>>>>> had time :-)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I'll update tomorrow.
> >>>>>>>>>>>>>>>>>> Ralph
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM,
> >>>>>>>>>>>>>>>>>> <tmish...@jcity.maeda.co.jp> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> This is the continuation of "Segmentation fault in
> >>>>>>>>>>>>>>>>>> oob_tcp.c of openmpi-1.7.4a1r29646".
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I found the cause.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> First, I noticed that your hostfile works and mine does
> >>>>>>>>>>>>>>>>>> not.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Your host file:
> >>>>>>>>>>>>>>>>>> cat hosts
> >>>>>>>>>>>>>>>>>> bend001 slots=12
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> My host file:
> >>>>>>>>>>>>>>>>>> cat hosts
> >>>>>>>>>>>>>>>>>> node08
> >>>>>>>>>>>>>>>>>> node08
> >>>>>>>>>>>>>>>>>> ...(total 8 lines)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I modified my script file to add "slots=1" to each line of
> >>>>>>>>>>>>>>>>>> my hostfile just before launching mpirun. Then it worked.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> My host file(modified):
> >>>>>>>>>>>>>>>>>> cat hosts
> >>>>>>>>>>>>>>>>>> node08 slots=1
> >>>>>>>>>>>>>>>>>> node08 slots=1
> >>>>>>>>>>>>>>>>>> ...(total 8 lines)
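> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> (One way to script that edit - a sketch; hosts.slots is a
> >>>>>>>>>>>>>>>>>> hypothetical output name:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>   sed 's/$/ slots=1/' hosts > hosts.slots
> >>>>>>>>>>>>>>>>>>   mpirun -hostfile hosts.slots ...
> >>>>>>>>>>>>>>>>>> )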
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Second, I confirmed that there's a slight difference
> >>>>>>>>>>>>>>>>>> between orte/util/hostfile/hostfile.c of 1.7.3 and that of
> >>>>>>>>>>>>>>>>>> 1.7.4a1r29646.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> $ diff hostfile.c.org \
> >>>>>>>>>>>>>>>>>>        ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
> >>>>>>>>>>>>>>>>>> 394,401c394,399
> >>>>>>>>>>>>>>>>>> <     if (got_count) {
> >>>>>>>>>>>>>>>>>> <         node->slots_given = true;
> >>>>>>>>>>>>>>>>>> <     } else if (got_max) {
> >>>>>>>>>>>>>>>>>> <         node->slots = node->slots_max;
> >>>>>>>>>>>>>>>>>> <         node->slots_given = true;
> >>>>>>>>>>>>>>>>>> <     } else {
> >>>>>>>>>>>>>>>>>> <         /* should be set by obj_new, but just to be clear */
> >>>>>>>>>>>>>>>>>> <         node->slots_given = false;
> >>>>>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>>>> >     if (!got_count) {
> >>>>>>>>>>>>>>>>>> >         if (got_max) {
> >>>>>>>>>>>>>>>>>> >             node->slots = node->slots_max;
> >>>>>>>>>>>>>>>>>> >         } else {
> >>>>>>>>>>>>>>>>>> >             ++node->slots;
> >>>>>>>>>>>>>>>>>> >         }
> >>>>>>>>>>>>>>>>>> ....
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Finally, I added line 402 below, just as a tentative
> >>>>>>>>>>>>>>>>>> trial. Then it worked.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> cat -n orte/util/hostfile/hostfile.c:
> >>>>>>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>>>>> 394      if (got_count) {
> >>>>>>>>>>>>>>>>>> 395          node->slots_given = true;
> >>>>>>>>>>>>>>>>>> 396      } else if (got_max) {
> >>>>>>>>>>>>>>>>>> 397          node->slots = node->slots_max;
> >>>>>>>>>>>>>>>>>> 398          node->slots_given = true;
> >>>>>>>>>>>>>>>>>> 399      } else {
> >>>>>>>>>>>>>>>>>> 400          /* should be set by obj_new, but just to be clear */
> >>>>>>>>>>>>>>>>>> 401          node->slots_given = false;
> >>>>>>>>>>>>>>>>>> 402          ++node->slots; /* added by tmishima */
> >>>>>>>>>>>>>>>>>> 403      }
> >>>>>>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Please fix the problem properly, because my change is just
> >>>>>>>>>>>>>>>>>> based on a random guess. It's related to the treatment of
> >>>>>>>>>>>>>>>>>> hostfiles where slots information is not given.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>>>>>>>>