Hi Ralph,
Thank you for your fix. It works for me.

Tetsuya Mishima


> Actually, it looks like it would happen with hetero-nodes set - it's only
> required that at least two nodes have the same architecture. So you might
> want to give the trunk a shot, as it may well now be fixed.
>
> On Dec 19, 2013, at 8:35 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Hmmm... not having any luck tracking this down yet. If anything, based on
>> what I saw in the code, I would have expected it to fail when
>> hetero-nodes was false, not the other way around.
>>
>> I'll keep poking around - just wanted to provide an update.
>>
>> On Dec 19, 2013, at 12:54 AM, tmish...@jcity.maeda.co.jp wrote:
>>
>>> Hi Ralph, sorry for cutting in with another post.
>>>
>>> Your advice about -hetero-nodes in the other thread gave me a hint.
>>>
>>> I had already put "orte_hetero_nodes = 1" in my mca-params.conf, because
>>> you told me a month ago that my environment would need this option.
>>>
>>> Removing this line from mca-params.conf makes it work. In other words,
>>> you can replicate the problem by adding -hetero-nodes, as shown below.
>>>
>>> qsub: job 8364.manage.cluster completed
>>> [mishima@manage mpi]$ qsub -I -l nodes=2:ppn=8
>>> qsub: waiting for job 8365.manage.cluster to start
>>> qsub: job 8365.manage.cluster ready
>>>
>>> [mishima@node11 ~]$ ompi_info --all | grep orte_hetero_nodes
>>>           MCA orte: parameter "orte_hetero_nodes" (current value: "false", data source: default, level: 9 dev/all, type: bool)
>>> [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
>>> [node11.cluster:27895] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> [node11.cluster:27895] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>> [node12.cluster:24891] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>> [node12.cluster:24891] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> Hello world from process 0 of 4
>>> Hello world from process 1 of 4
>>> Hello world from process 2 of 4
>>> Hello world from process 3 of 4
>>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -hetero-nodes myprog
>>> --------------------------------------------------------------------------
>>> A request was made to bind to that would result in binding more
>>> processes than cpus on a resource:
>>>
>>>    Bind to:     CORE
>>>    Node:        node12
>>>    #processes:  2
>>>    #cpus:       1
>>>
>>> You can override this protection by adding the "overload-allowed"
>>> option to your binding directive.
>>> --------------------------------------------------------------------------
>>>
>>> As far as I checked, data->num_bound seems to become bad in
>>> bind_downwards when I put "-hetero-nodes". I hope you can clear up the
>>> problem.
>>>
>>> Regards,
>>> Tetsuya Mishima
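For reference, a minimal sketch of the equivalent ways to set this parameter. It assumes only the standard Open MPI MCA mechanisms; the file path shown is the usual per-user default:

   # in a parameter file, e.g. ~/.openmpi/mca-params.conf
   orte_hetero_nodes = 1

   # or per run, on the mpirun command line
   mpirun -mca orte_hetero_nodes 1 -np 4 -cpus-per-proc 4 myprog

   # or via the dedicated flag exercised above
   mpirun -hetero-nodes -np 4 -cpus-per-proc 4 myprog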
>>>> Yes, it's very strange. But I don't think there's any chance that
>>>> I have < 8 actual cores on the node. I guess that you can replicate
>>>> it with SLURM, so please try it again.
>>>>
>>>> I changed to use node10 and node11, and then I got the warning against
>>>> node11.
>>>>
>>>> Furthermore, just for your information, I tried to add
>>>> "-bind-to core:overload-allowed", and then it worked as shown below.
>>>> But I think node11 is never overloaded, because it has 8 cores.
>>>>
>>>> qsub: job 8342.manage.cluster completed
>>>> [mishima@manage ~]$ qsub -I -l nodes=node10:ppn=8+node11:ppn=8
>>>> qsub: waiting for job 8343.manage.cluster to start
>>>> qsub: job 8343.manage.cluster ready
>>>>
>>>> [mishima@node10 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>> [mishima@node10 demos]$ cat $PBS_NODEFILE
>>>> node10
>>>> node10
>>>> node10
>>>> node10
>>>> node10
>>>> node10
>>>> node10
>>>> node10
>>>> node11
>>>> node11
>>>> node11
>>>> node11
>>>> node11
>>>> node11
>>>> node11
>>>> node11
>>>> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
>>>> --------------------------------------------------------------------------
>>>> A request was made to bind to that would result in binding more
>>>> processes than cpus on a resource:
>>>>
>>>>    Bind to:     CORE
>>>>    Node:        node11
>>>>    #processes:  2
>>>>    #cpus:       1
>>>>
>>>> You can override this protection by adding the "overload-allowed"
>>>> option to your binding directive.
>>>> --------------------------------------------------------------------------
>>>> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -bind-to core:overload-allowed myprog
>>>> [node10.cluster:27020] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>> [node10.cluster:27020] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>> [node11.cluster:26597] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>> [node11.cluster:26597] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>> Hello world from process 1 of 4
>>>> Hello world from process 0 of 4
>>>> Hello world from process 3 of 4
>>>> Hello world from process 2 of 4
>>>>
>>>> Regards,
>>>> Tetsuya Mishima
>>>>
>>>>> Very strange - I can't seem to replicate it. Is there any chance that
>>>>> you have < 8 actual cores on node12?
>>>>>
>>>>> On Dec 18, 2013, at 4:53 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>
>>>>>> Hi Ralph, sorry for confusing you.
>>>>>>
>>>>>> At that time, I cut and pasted the output of "cat $PBS_NODEFILE".
>>>>>> I guess I didn't paste the last line, by my mistake.
>>>>>>
>>>>>> I retried the test, and the output below is exactly what I got.
>>>>>> [mishima@manage ~]$ qsub -I -l nodes=node11:ppn=8+node12:ppn=8
>>>>>> qsub: waiting for job 8338.manage.cluster to start
>>>>>> qsub: job 8338.manage.cluster ready
>>>>>>
>>>>>> [mishima@node11 ~]$ cat $PBS_NODEFILE
>>>>>> node11
>>>>>> node11
>>>>>> node11
>>>>>> node11
>>>>>> node11
>>>>>> node11
>>>>>> node11
>>>>>> node11
>>>>>> node12
>>>>>> node12
>>>>>> node12
>>>>>> node12
>>>>>> node12
>>>>>> node12
>>>>>> node12
>>>>>> node12
>>>>>> [mishima@node11 ~]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
>>>>>> --------------------------------------------------------------------------
>>>>>> A request was made to bind to that would result in binding more
>>>>>> processes than cpus on a resource:
>>>>>>
>>>>>>    Bind to:     CORE
>>>>>>    Node:        node12
>>>>>>    #processes:  2
>>>>>>    #cpus:       1
>>>>>>
>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>> option to your binding directive.
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Tetsuya Mishima
>>>>>>
>>>>>>> I removed the debug in #2 - thanks for reporting it.
>>>>>>>
>>>>>>> For #1, it actually looks to me like this is correct. If you look at
>>>>>>> your allocation, there are only 7 slots being allocated on node12,
>>>>>>> yet you have asked for 8 cpus to be assigned (2 procs with 4
>>>>>>> cpus/proc). So the warning is in fact correct.
>>>>>>>
>>>>>>> On Dec 18, 2013, at 4:04 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>
>>>>>>>> Hi Ralph, I found that openmpi-1.7.4rc1 was already uploaded, so I'd
>>>>>>>> like to report 3 issues, mainly regarding -cpus-per-proc.
>>>>>>>>
>>>>>>>> 1) When I use 2 nodes (node11, node12), which have 8 cores each
>>>>>>>> (= 2 sockets x 4 cores/socket), it starts to produce the error again
>>>>>>>> as shown below. At least openmpi-1.7.4a1r29646 did work well.
>>>>>>>>
>>>>>>>> [mishima@manage ~]$ qsub -I -l nodes=2:ppn=8
>>>>>>>> qsub: waiting for job 8336.manage.cluster to start
>>>>>>>> qsub: job 8336.manage.cluster ready
>>>>>>>>
>>>>>>>> [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>> [mishima@node11 demos]$ cat $PBS_NODEFILE
>>>>>>>> node11
>>>>>>>> node11
>>>>>>>> node11
>>>>>>>> node11
>>>>>>>> node11
>>>>>>>> node11
>>>>>>>> node11
>>>>>>>> node11
>>>>>>>> node12
>>>>>>>> node12
>>>>>>>> node12
>>>>>>>> node12
>>>>>>>> node12
>>>>>>>> node12
>>>>>>>> node12
>>>>>>>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>> processes than cpus on a resource:
>>>>>>>>
>>>>>>>>    Bind to:     CORE
>>>>>>>>    Node:        node12
>>>>>>>>    #processes:  2
>>>>>>>>    #cpus:       1
>>>>>>>>
>>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>>> option to your binding directive.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> Of course it works well using only one node.
>>>>>>>> [mishima@node11 demos]$ mpirun -np 2 -cpus-per-proc 4 -report-bindings myprog
>>>>>>>> [node11.cluster:26238] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>>>>>> [node11.cluster:26238] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>>>>>> Hello world from process 1 of 2
>>>>>>>> Hello world from process 0 of 2
>>>>>>>>
>>>>>>>> 2) Adding "-bind-to numa", it works, but the message "bind:upward
>>>>>>>> target NUMANode type NUMANode" appears. As far as I remember, I
>>>>>>>> didn't see that kind of message before.
>>>>>>>>
>>>>>>>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -bind-to numa myprog
>>>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>>>> [node11.cluster:26260] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>>>>>> [node11.cluster:26260] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>>>>>> [node12.cluster:23607] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>>>>>> [node12.cluster:23607] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>>>>>> Hello world from process 1 of 4
>>>>>>>> Hello world from process 0 of 4
>>>>>>>> Hello world from process 3 of 4
>>>>>>>> Hello world from process 2 of 4
>>>>>>>>
>>>>>>>> 3) I use the PGI compiler. It cannot accept the compiler switch
>>>>>>>> "-Wno-variadic-macros", which is included in the configure script:
>>>>>>>>
>>>>>>>>    btl_usnic_CFLAGS="-Wno-variadic-macros"
>>>>>>>>
>>>>>>>> I removed this switch, and then I could continue to build 1.7.4rc1.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Tetsuya Mishima
>>>>>>>>
>>>>>>>>> Hmmm... okay, I understand the scenario. Must be something in the
>>>>>>>>> algo when it only has one node, so it shouldn't be too hard to
>>>>>>>>> track down.
>>>>>>>>>
>>>>>>>>> I'm off on travel for a few days, but will return to this when I
>>>>>>>>> get back.
>>>>>>>>>
>>>>>>>>> Sorry for the delay - will try to look at this while I'm gone, but
>>>>>>>>> can't promise anything :-(
>>>>>>>>>
>>>>>>>>> On Dec 10, 2013, at 6:58 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ralph, sorry for the confusion.
>>>>>>>>>>
>>>>>>>>>> We usually log on to "manage", which is our control node. From
>>>>>>>>>> manage, we submit jobs or enter a remote node such as node03 via
>>>>>>>>>> Torque interactive mode (qsub -I).
>>>>>>>>>> At that time, instead of Torque, I just did rsh to node03 from
>>>>>>>>>> manage and ran myprog on the node. I hope you can understand what
>>>>>>>>>> I did.
>>>>>>>>>>
>>>>>>>>>> Now, I retried with "-host node03", which still causes the problem
>>>>>>>>>> (I confirmed a local run on manage caused the same problem too):
>>>>>>>>>>
>>>>>>>>>> [mishima@manage ~]$ rsh node03
>>>>>>>>>> Last login: Wed Dec 11 11:38:57 from manage
>>>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>>>> [mishima@node03 demos]$
>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket myprog
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>>>> processes than cpus on a resource:
>>>>>>>>>>
>>>>>>>>>>    Bind to:     CORE
>>>>>>>>>>    Node:        node03
>>>>>>>>>>    #processes:  2
>>>>>>>>>>    #cpus:       1
>>>>>>>>>>
>>>>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>>>>> option to your binding directive.
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> It's strange, but I have to report that "-map-by socket:span"
>>>>>>>>>> worked well:
>>>>>>>>>>
>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
>>>>>>>>>> [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>> [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>> [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>> [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>> [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>> [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>
>>>>>>>>>>> On Dec 10, 2013, at 6:05 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>
>>>>>>>>>>>> I tried again with -cpus-per-proc 2, as shown below.
>>>>>>>>>>>> Here, I found that "-map-by socket:span" worked well.
>>>>>>>>>>>>
>>>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket:span myprog
>>>>>>>>>>>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]]: [./././././././.][./././././././.][B/B/./././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][././B/B/./././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/./././././.]
>>>>>>>>>>>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][././B/B/./././.]
>>>>>>>>>>>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket myprog
>>>>>>>>>>>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>>>>
>>>>>>>>>>>> "-np 8" and "-cpus-per-proc 4" just filled all the sockets. In
>>>>>>>>>>>> this case, I guess "-map-by socket:span" and "-map-by socket"
>>>>>>>>>>>> have the same meaning. Therefore, there's no problem with that.
>>>>>>>>>>>> Sorry for disturbing you.
>>>>>>>>>>>
>>>>>>>>>>> No problem - glad you could clear that up :-)
>>>>>>>>>>>
>>>>>>>>>>>> By the way, through this test, I found another problem.
>>>>>>>>>>>> Without the Torque manager, just using rsh, it causes the same
>>>>>>>>>>>> error as below:
>>>>>>>>>>>>
>>>>>>>>>>>> [mishima@manage openmpi-1.7]$ rsh node03
>>>>>>>>>>>> Last login: Wed Dec 11 09:42:02 from manage
>>>>>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
>>>>>>>>>>>
>>>>>>>>>>> I don't understand the difference here - you are simply starting
>>>>>>>>>>> it from a different node? It looks like everything is expected to
>>>>>>>>>>> run local to mpirun, yes? So there is no rsh actually involved
>>>>>>>>>>> here. Are you still running in an allocation?
>>>>>>>>>>>
>>>>>>>>>>> If you run this with "-host node03" on the cmd line, do you see
>>>>>>>>>>> the same problem?
>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>>>>>> processes than cpus on a resource:
>>>>>>>>>>>>
>>>>>>>>>>>>    Bind to:     CORE
>>>>>>>>>>>>    Node:        node03
>>>>>>>>>>>>    #processes:  2
>>>>>>>>>>>>    #cpus:       1
>>>>>>>>>>>>
>>>>>>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>>>>>>> option to your binding directive.
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> [mishima@node03 demos]$
>>>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 myprog
>>>>>>>>>>>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>>>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>>>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>>>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>
>>>>>>>>>>>>> Hmmm... that's strange. I only have 2 sockets on my system, but
>>>>>>>>>>>>> let me poke around a bit and see what might be happening.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks. I didn't know the meaning of "socket:span".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But it still causes the problem; it seems socket:span doesn't
>>>>>>>>>>>>>> work.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32
>>>>>>>>>>>>>> qsub: waiting for job 8265.manage.cluster to start
>>>>>>>>>>>>>> qsub: job 8265.manage.cluster ready
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
>>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> No, that is actually correct. We map a socket until full,
>>>>>>>>>>>>>>> then move to the next. What you want is --map-by socket:span.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I had time to try your patch yesterday using
>>>>>>>>>>>>>>>> openmpi-1.7.4a1r29646.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It stopped the error, but unfortunately "mapping by socket"
>>>>>>>>>>>>>>>> itself didn't work well, as shown below:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32
>>>>>>>>>>>>>>>> qsub: waiting for job 8260.manage.cluster to start
>>>>>>>>>>>>>>>> qsub: job 8260.manage.cluster ready
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>>>>>>>>>> [mishima@node04 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
>>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think this should be like this:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> rank 00 [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>>> rank 01 [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>>> rank 02 [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I fixed this under the trunk (it was an issue regardless of
>>>>>>>>>>>>>>>>> RM) and have scheduled it for 1.7.4.
>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you very much for your quick response.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm afraid to say that I found one more issue...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It's not so serious. Please check it when you have time.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The problem is cpus-per-proc with the -map-by option under
>>>>>>>>>>>>>>>>>> the Torque manager. It doesn't work as shown below. I
>>>>>>>>>>>>>>>>>> guess you can get the same behaviour under the Slurm
>>>>>>>>>>>>>>>>>> manager.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Of course, if I remove the -map-by option, it works quite
>>>>>>>>>>>>>>>>>> well.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
>>>>>>>>>>>>>>>>>> qsub: waiting for job 8116.manage.cluster to start
>>>>>>>>>>>>>>>>>> qsub: job 8116.manage.cluster ready
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [mishima@node03 ~]$ cd ~/Ducom/testbed2
>>>>>>>>>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
>>>>>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>>>>>>>>>>>> processes than cpus on a resource:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    Bind to:     CORE
>>>>>>>>>>>>>>>>>>    Node:        node03
>>>>>>>>>>>>>>>>>>    #processes:  2
>>>>>>>>>>>>>>>>>>    #cpus:       1
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You can override this protection by adding the
>>>>>>>>>>>>>>>>>> "overload-allowed" option to your binding directive.
>>>>>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
>>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks! That's precisely where I was going to look when I
>>>>>>>>>>>>>>>>>>> had time :-)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'll update tomorrow.
>>>>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM, <tmish...@jcity.maeda.co.jp> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This is the continuation of "Segmentation fault in
>>>>>>>>>>>>>>>>>>> oob_tcp.c of openmpi-1.7.4a1r29646".
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I found the cause.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Firstly, I noticed that your hostfile works and mine does
>>>>>>>>>>>>>>>>>>> not.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Your hostfile:
>>>>>>>>>>>>>>>>>>> cat hosts
>>>>>>>>>>>>>>>>>>> bend001 slots=12
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> My hostfile:
>>>>>>>>>>>>>>>>>>> cat hosts
>>>>>>>>>>>>>>>>>>> node08
>>>>>>>>>>>>>>>>>>> node08
>>>>>>>>>>>>>>>>>>> ... (total 8 lines)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I modified my script file to add "slots=1" to each line
>>>>>>>>>>>>>>>>>>> of my hostfile just before launching mpirun. Then it
>>>>>>>>>>>>>>>>>>> worked.
>>>>>>>>>>>>>>>>>>> My hostfile (modified):
>>>>>>>>>>>>>>>>>>> cat hosts
>>>>>>>>>>>>>>>>>>> node08 slots=1
>>>>>>>>>>>>>>>>>>> node08 slots=1
>>>>>>>>>>>>>>>>>>> ... (total 8 lines)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Secondly, I confirmed that there's a slight difference
>>>>>>>>>>>>>>>>>>> between orte/util/hostfile/hostfile.c of 1.7.3 and that
>>>>>>>>>>>>>>>>>>> of 1.7.4a1r29646.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
>>>>>>>>>>>>>>>>>>> 394,401c394,399
>>>>>>>>>>>>>>>>>>> <     if (got_count) {
>>>>>>>>>>>>>>>>>>> <         node->slots_given = true;
>>>>>>>>>>>>>>>>>>> <     } else if (got_max) {
>>>>>>>>>>>>>>>>>>> <         node->slots = node->slots_max;
>>>>>>>>>>>>>>>>>>> <         node->slots_given = true;
>>>>>>>>>>>>>>>>>>> <     } else {
>>>>>>>>>>>>>>>>>>> <         /* should be set by obj_new, but just to be clear */
>>>>>>>>>>>>>>>>>>> <         node->slots_given = false;
>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>> >     if (!got_count) {
>>>>>>>>>>>>>>>>>>> >         if (got_max) {
>>>>>>>>>>>>>>>>>>> >             node->slots = node->slots_max;
>>>>>>>>>>>>>>>>>>> >         } else {
>>>>>>>>>>>>>>>>>>> >             ++node->slots;
>>>>>>>>>>>>>>>>>>> >         }
>>>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Finally, I added line 402 below, just as a tentative
>>>>>>>>>>>>>>>>>>> trial. Then it worked.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> cat -n orte/util/hostfile/hostfile.c:
>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>    394      if (got_count) {
>>>>>>>>>>>>>>>>>>>    395          node->slots_given = true;
>>>>>>>>>>>>>>>>>>>    396      } else if (got_max) {
>>>>>>>>>>>>>>>>>>>    397          node->slots = node->slots_max;
>>>>>>>>>>>>>>>>>>>    398          node->slots_given = true;
>>>>>>>>>>>>>>>>>>>    399      } else {
>>>>>>>>>>>>>>>>>>>    400          /* should be set by obj_new, but just to be clear */
>>>>>>>>>>>>>>>>>>>    401          node->slots_given = false;
>>>>>>>>>>>>>>>>>>>    402          ++node->slots;  /* added by tmishima */
>>>>>>>>>>>>>>>>>>>    403      }
>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Please fix the problem properly, because this is just
>>>>>>>>>>>>>>>>>>> based on my rough guess. It's related to the treatment of
>>>>>>>>>>>>>>>>>>> a hostfile where slots information is not given.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Tetsuya Mishima
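To spell out what the tentative change means in hostfile terms: when a line carries neither "slots=" nor "max-slots=", each bare hostname entry contributes one slot, while slots_given stays false. A minimal sketch; the node name comes from the example above, and the resulting counts assume the tentative ++node->slots behavior rather than any official fix:

   # explicit form - the parser records slots_given = true
   node08 slots=8

   # bare form - with the tentative change, eight such lines
   # accumulate to slots = 8, with slots_given left false
   node08
   node08
   ... (8 lines total)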
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users