I removed the debug output in #2 - thanks for reporting it. For #1, it actually looks correct to me. If you look at your allocation, only 7 slots are being allocated on node12, yet you have asked for 8 cpus to be assigned there (2 procs with 4 cpus/proc). So the warning is in fact correct.
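To make the arithmetic concrete: the mapper places 2 of the 4 procs on node12, and 2 procs x 4 cpus/proc = 8 cpus exceeds the 7 slots in the allocation. A minimal sketch of that overload check follows (an illustrative model only, not Open MPI's actual rmaps/binding code):

```python
def would_overload(slots, nprocs, cpus_per_proc):
    """True if binding nprocs, each to cpus_per_proc cpus, needs more
    cpus than the node's allocated slots (illustrative model only)."""
    return nprocs * cpus_per_proc > slots

# node12: 7 slots in $PBS_NODEFILE, 2 procs x 4 cpus/proc -> warning fires
assert would_overload(slots=7, nprocs=2, cpus_per_proc=4)
# node11: all 8 slots allocated, so the same 2 procs fit
assert not would_overload(slots=8, nprocs=2, cpus_per_proc=4)
```

Per the error text below, the "overload-allowed" binding modifier tells the real mapper to bypass this protection instead of aborting.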
On Dec 18, 2013, at 4:04 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph, I found that openmpi-1.7.4rc1 was already uploaded. So I'd like
> to report 3 issues, mainly regarding -cpus-per-proc.
>
> 1) When I use 2 nodes (node11, node12), which have 8 cores each (= 2 sockets X
> 4 cores/socket), it starts to produce the error again as shown below. At least,
> openmpi-1.7.4a1r29646 did work well.
>
> [mishima@manage ~]$ qsub -I -l nodes=2:ppn=8
> qsub: waiting for job 8336.manage.cluster to start
> qsub: job 8336.manage.cluster ready
>
> [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> [mishima@node11 demos]$ cat $PBS_NODEFILE
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node12
> node12
> node12
> node12
> node12
> node12
> node12
> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
>    Bind to:     CORE
>    Node:        node12
>    #processes:  2
>    #cpus:       1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
>
> Of course it works well using only one node.
>
> [mishima@node11 demos]$ mpirun -np 2 -cpus-per-proc 4 -report-bindings myprog
> [node11.cluster:26238] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> [node11.cluster:26238] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> Hello world from process 1 of 2
> Hello world from process 0 of 2
>
> 2) Adding "-bind-to numa", it works, but the message "bind:upward target
> NUMANode type NUMANode" appears.
> As far as I remember, I didn't see such a message before.
>
> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -bind-to numa myprog
> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
> [node11.cluster:26260] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> [node11.cluster:26260] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node12.cluster:23607] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node12.cluster:23607] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> Hello world from process 1 of 4
> Hello world from process 0 of 4
> Hello world from process 3 of 4
> Hello world from process 2 of 4
>
> 3) I use the PGI compiler. It cannot accept the compiler switch
> "-Wno-variadic-macros", which is included in the configure script:
>
>    btl_usnic_CFLAGS="-Wno-variadic-macros"
>
> I removed this switch, and then I could continue to build 1.7.4rc1.
>
> Regards,
> Tetsuya Mishima
>
>> Hmmm...okay, I understand the scenario. Must be something in the algo
>> when it only has one node, so it shouldn't be too hard to track down.
>>
>> I'm off on travel for a few days, but will return to this when I get back.
>> >> Sorry for delay - will try to look at this while I'm gone, but can't > promise anything :-( >> >> >> On Dec 10, 2013, at 6:58 PM, tmish...@jcity.maeda.co.jp wrote: >> >>> >>> >>> Hi Ralph, sorry for confusing. >>> >>> We usually logon to "manage", which is our control node. >>> From manage, we submit job or enter a remote node such as >>> node03 by torque interactive mode(qsub -I). >>> >>> At that time, instead of torque, I just did rsh to node03 from manage >>> and ran myprog on the node. I hope you could understand what I did. >>> >>> Now, I retried with "-host node03", which still causes the problem: >>> (I comfirmed local run on manage caused the same problem too) >>> >>> [mishima@manage ~]$ rsh node03 >>> Last login: Wed Dec 11 11:38:57 from manage >>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>> [mishima@node03 demos]$ >>> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings >>> -cpus-per-proc 4 -map-by socket myprog >>> > -------------------------------------------------------------------------- >>> A request was made to bind to that would result in binding more >>> processes than cpus on a resource: >>> >>> Bind to: CORE >>> Node: node03 >>> #processes: 2 >>> #cpus: 1 >>> >>> You can override this protection by adding the "overload-allowed" >>> option to your binding directive. >>> > -------------------------------------------------------------------------- >>> >>> It' strange, but I have to report that "-map-by socket:span" worked > well. >>> >>> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings >>> -cpus-per-proc 4 -map-by socket:span myprog >>> [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt 0]], > socket >>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s >>> ocket 1[core 11[hwt 0]]: >>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] 
>>> [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt 0]], > socket >>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>> socket 1[core 15[hwt 0]]: >>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] >>> [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt 0]], > socket >>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>> socket 2[core 19[hwt 0]]: >>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>> [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt 0]], > socket >>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>> socket 2[core 23[hwt 0]]: >>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>> [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt 0]], > socket >>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>> socket 3[core 27[hwt 0]]: >>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>> [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt 0]], > socket >>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>> socket 3[core 31[hwt 0]]: >>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>> [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt 0]], > socket >>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>> cket 0[core 3[hwt 0]]: >>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>> [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt 0]], > socket >>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>> cket 0[core 7[hwt 0]]: >>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] 
>>> Hello world from process 2 of 8 >>> Hello world from process 6 of 8 >>> Hello world from process 3 of 8 >>> Hello world from process 7 of 8 >>> Hello world from process 1 of 8 >>> Hello world from process 5 of 8 >>> Hello world from process 0 of 8 >>> Hello world from process 4 of 8 >>> >>> Regards, >>> Tetsuya Mishima >>> >>> >>>> On Dec 10, 2013, at 6:05 PM, tmish...@jcity.maeda.co.jp wrote: >>>> >>>>> >>>>> >>>>> Hi Ralph, >>>>> >>>>> I tried again with -cpus-per-proc 2 as shown below. >>>>> Here, I found that "-map-by socket:span" worked well. >>>>> >>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc > 2 >>>>> -map-by socket:span myprog >>>>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt 0]], >>> socket >>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././. >>>>> /././.][./././././././.][./././././././.] >>>>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt 0]], >>> socket >>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B >>>>> /./././.][./././././././.][./././././././.] >>>>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt 0]], >>> socket >>>>> 2[core 17[hwt 0]]: [./././././././.][./././. >>>>> /./././.][B/B/./././././.][./././././././.] >>>>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt 0]], >>> socket >>>>> 2[core 19[hwt 0]]: [./././././././.][./././. >>>>> /./././.][././B/B/./././.][./././././././.] >>>>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt 0]], >>> socket >>>>> 3[core 25[hwt 0]]: [./././././././.][./././. >>>>> /./././.][./././././././.][B/B/./././././.] >>>>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt 0]], >>> socket >>>>> 3[core 27[hwt 0]]: [./././././././.][./././. >>>>> /./././.][./././././././.][././B/B/./././.] >>>>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt 0]], >>> socket >>>>> 0[core 1[hwt 0]]: [B/B/./././././.][././././. >>>>> /././.][./././././././.][./././././././.] 
>>>>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt 0]], >>> socket >>>>> 0[core 3[hwt 0]]: [././B/B/./././.][././././. >>>>> /././.][./././././././.][./././././././.] >>>>> Hello world from process 1 of 8 >>>>> Hello world from process 0 of 8 >>>>> Hello world from process 4 of 8 >>>>> Hello world from process 2 of 8 >>>>> Hello world from process 7 of 8 >>>>> Hello world from process 6 of 8 >>>>> Hello world from process 5 of 8 >>>>> Hello world from process 3 of 8 >>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc > 2 >>>>> -map-by socket myprog >>>>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt 0]], >>> socket >>>>> 0[core 5[hwt 0]]: [././././B/B/./.][././././. >>>>> /././.][./././././././.][./././././././.] >>>>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt 0]], >>> socket >>>>> 0[core 7[hwt 0]]: [././././././B/B][././././. >>>>> /././.][./././././././.][./././././././.] >>>>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt 0]], >>> socket >>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././. >>>>> /././.][./././././././.][./././././././.] >>>>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt 0]], >>> socket >>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B >>>>> /./././.][./././././././.][./././././././.] >>>>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt 0]], >>> socket >>>>> 1[core 13[hwt 0]]: [./././././././.][./././. >>>>> /B/B/./.][./././././././.][./././././././.] >>>>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt 0]], >>> socket >>>>> 1[core 15[hwt 0]]: [./././././././.][./././. >>>>> /././B/B][./././././././.][./././././././.] >>>>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt 0]], >>> socket >>>>> 0[core 1[hwt 0]]: [B/B/./././././.][././././. >>>>> /././.][./././././././.][./././././././.] 
>>>>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt 0]], >>> socket >>>>> 0[core 3[hwt 0]]: [././B/B/./././.][././././. >>>>> /././.][./././././././.][./././././././.] >>>>> Hello world from process 5 of 8 >>>>> Hello world from process 1 of 8 >>>>> Hello world from process 6 of 8 >>>>> Hello world from process 4 of 8 >>>>> Hello world from process 2 of 8 >>>>> Hello world from process 0 of 8 >>>>> Hello world from process 7 of 8 >>>>> Hello world from process 3 of 8 >>>>> >>>>> "-np 8" and "-cpus-per-proc 4" just filled all sockets. >>>>> In this case, I guess "-map-by socket:span" and "-map-by socket" has >>> same >>>>> meaning. >>>>> Therefore, there's no problem about that. Sorry for distubing. >>>> >>>> No problem - glad you could clear that up :-) >>>> >>>>> >>>>> By the way, through this test, I found another problem. >>>>> Without torque manager and just using rsh, it causes the same error >>> like >>>>> below: >>>>> >>>>> [mishima@manage openmpi-1.7]$ rsh node03 >>>>> Last login: Wed Dec 11 09:42:02 from manage >>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc > 4 >>>>> -map-by socket myprog >>>> >>>> I don't understand the difference here - you are simply starting it > from >>> a different node? It looks like everything is expected to run local to >>> mpirun, yes? So there is no rsh actually involved here. >>>> Are you still running in an allocation? >>>> >>>> If you run this with "-host node03" on the cmd line, do you see the > same >>> problem? >>>> >>>> >>>>> >>> > -------------------------------------------------------------------------- >>>>> A request was made to bind to that would result in binding more >>>>> processes than cpus on a resource: >>>>> >>>>> Bind to: CORE >>>>> Node: node03 >>>>> #processes: 2 >>>>> #cpus: 1 >>>>> >>>>> You can override this protection by adding the "overload-allowed" >>>>> option to your binding directive. 
>>>>> >>> > -------------------------------------------------------------------------- >>>>> [mishima@node03 demos]$ >>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc > 4 >>>>> myprog >>>>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt 0]], >>> socket >>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s >>>>> ocket 1[core 11[hwt 0]]: >>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt 0]], >>> socket >>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>>>> socket 1[core 15[hwt 0]]: >>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] >>>>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt 0]], >>> socket >>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>> socket 2[core 19[hwt 0]]: >>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt 0]], >>> socket >>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>> socket 2[core 23[hwt 0]]: >>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>>>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt 0]], >>> socket >>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>> socket 3[core 27[hwt 0]]: >>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>>>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt 0]], >>> socket >>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>> socket 3[core 31[hwt 0]]: >>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>>>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt 0]], >>> socket >>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>> cket 0[core 3[hwt 0]]: >>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] 
>>>>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt 0]], >>> socket >>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>>>> cket 0[core 7[hwt 0]]: >>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] >>>>> Hello world from process 4 of 8 >>>>> Hello world from process 2 of 8 >>>>> Hello world from process 6 of 8 >>>>> Hello world from process 5 of 8 >>>>> Hello world from process 3 of 8 >>>>> Hello world from process 7 of 8 >>>>> Hello world from process 0 of 8 >>>>> Hello world from process 1 of 8 >>>>> >>>>> Regards, >>>>> Tetsuya Mishima >>>>> >>>>>> Hmmm...that's strange. I only have 2 sockets on my system, but let > me >>>>> poke around a bit and see what might be happening. >>>>>> >>>>>> On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> Hi Ralph, >>>>>>> >>>>>>> Thanks. I didn't know the meaning of "socket:span". >>>>>>> >>>>>>> But it still causes the problem, which seems socket:span doesn't >>> work. >>>>>>> >>>>>>> [mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32 >>>>>>> qsub: waiting for job 8265.manage.cluster to start >>>>>>> qsub: job 8265.manage.cluster ready >>>>>>> >>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings > -cpus-per-proc >>> 4 >>>>>>> -map-by socket:span myprog >>>>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8[hwt 0]], >>>>> socket >>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s >>>>>>> ocket 1[core 11[hwt 0]]: >>>>>>> > [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12[hwt > 0]], >>>>> socket >>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>>>>>> socket 1[core 15[hwt 0]]: >>>>>>> > [./././././././.][././././B/B/B/B][./././././././.][./././././././.] 
>>>>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16[hwt > 0]], >>>>> socket >>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>>>> socket 2[core 19[hwt 0]]: >>>>>>> > [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20[hwt > 0]], >>>>> socket >>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>>>> socket 2[core 23[hwt 0]]: >>>>>>> > [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>>>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24[hwt > 0]], >>>>> socket >>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>>>> socket 3[core 27[hwt 0]]: >>>>>>> > [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>>>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28[hwt > 0]], >>>>> socket >>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>>>> socket 3[core 31[hwt 0]]: >>>>>>> > [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>>>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0[hwt 0]], >>>>> socket >>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>>> cket 0[core 3[hwt 0]]: >>>>>>> > [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4[hwt 0]], >>>>> socket >>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>>>>>> cket 0[core 7[hwt 0]]: >>>>>>> > [././././B/B/B/B][./././././././.][./././././././.][./././././././.] >>>>>>> Hello world from process 0 of 8 >>>>>>> Hello world from process 3 of 8 >>>>>>> Hello world from process 1 of 8 >>>>>>> Hello world from process 4 of 8 >>>>>>> Hello world from process 6 of 8 >>>>>>> Hello world from process 5 of 8 >>>>>>> Hello world from process 2 of 8 >>>>>>> Hello world from process 7 of 8 >>>>>>> >>>>>>> Regards, >>>>>>> Tetsuya Mishima >>>>>>> >>>>>>>> No, that is actually correct. 
We map a socket until full, then > move >>> to >>>>>>> the next. What you want is --map-by socket:span >>>>>>>> >>>>>>>> On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Hi Ralph, >>>>>>>>> >>>>>>>>> I had a time to try your patch yesterday using >>> openmpi-1.7.4a1r29646. >>>>>>>>>>>>>>>> It stopped the error but unfortunately "mapping by > socket" itself >>>>>>> didn't >>>>>>>>> work >>>>>>>>> well as shown bellow: >>>>>>>>> >>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32 >>>>>>>>> qsub: waiting for job 8260.manage.cluster to start >>>>>>>>> qsub: job 8260.manage.cluster ready >>>>>>>>> >>>>>>>>> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>>>>>>>> [mishima@node04 demos]$ mpirun -np 8 -report-bindings >>> -cpus-per-proc >>>>> 4 >>>>>>>>> -map-by socket myprog >>>>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt > 0]], >>>>>>> socket >>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s >>>>>>>>> ocket 1[core 11[hwt 0]]: >>>>>>>>> >>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt >>> 0]], >>>>>>> socket >>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>>>>>>>> socket 1[core 15[hwt 0]]: >>>>>>>>> >>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] >>>>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt >>> 0]], >>>>>>> socket >>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>>>>>> socket 2[core 19[hwt 0]]: >>>>>>>>> >>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt >>> 0]], >>>>>>> socket >>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>>>>>> socket 2[core 23[hwt 0]]: >>>>>>>>> >>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] 
>>>>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt >>> 0]], >>>>>>> socket >>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>>>>>> socket 3[core 27[hwt 0]]: >>>>>>>>> >>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>>>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt >>> 0]], >>>>>>> socket >>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>>>>>> socket 3[core 31[hwt 0]]: >>>>>>>>> >>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>>>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt > 0]], >>>>>>> socket >>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>>>>> cket 0[core 3[hwt 0]]: >>>>>>>>> >>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt > 0]], >>>>>>> socket >>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>>>>>>>> cket 0[core 7[hwt 0]]: >>>>>>>>> >>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] >>>>>>>>> Hello world from process 2 of 8 >>>>>>>>> Hello world from process 1 of 8 >>>>>>>>> Hello world from process 3 of 8 >>>>>>>>> Hello world from process 0 of 8 >>>>>>>>> Hello world from process 6 of 8 >>>>>>>>> Hello world from process 5 of 8 >>>>>>>>> Hello world from process 4 of 8 >>>>>>>>> Hello world from process 7 of 8 >>>>>>>>> >>>>>>>>> I think this should be like this: >>>>>>>>> >>>>>>>>> rank 00 >>>>>>>>> >>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>>>> rank 01 >>>>>>>>> >>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>>>> rank 02 >>>>>>>>> >>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>>>> ... >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Tetsuya Mishima >>>>>>>>> >>>>>>>>>> I fixed this under the trunk (was an issue regardless of RM) and >>>>> have >>>>>>>>> scheduled it for 1.7.4. 
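[The two mapping policies discussed in the quoted history ("we map a socket until full, then move to the next" for -map-by socket, versus socket:span balancing ranks evenly across all sockets) can be sketched with a toy model. This is an illustration matching the placements observed in these logs, not the real rmaps code:]

```python
import math

def map_ranks(nprocs, nsockets, cores_per_socket, cpus_per_proc, span=False):
    """Toy mapper (illustrative only): returns the socket index for each rank.
    span=False fills a socket to capacity before moving to the next;
    span=True balances ranks evenly across all sockets."""
    if span:
        per_socket = math.ceil(nprocs / nsockets)          # even spread
    else:
        per_socket = cores_per_socket // cpus_per_proc     # fill to capacity
    return [min(r // per_socket, nsockets - 1) for r in range(nprocs)]

# 8 procs on 4 sockets x 8 cores, 2 cpus/proc (as in the logs):
# -map-by socket packs sockets 0 and 1; socket:span uses all four.
assert map_ranks(8, 4, 8, 2, span=False) == [0, 0, 0, 0, 1, 1, 1, 1]
assert map_ranks(8, 4, 8, 2, span=True)  == [0, 0, 1, 1, 2, 2, 3, 3]
# With 4 cpus/proc the machine is exactly filled, so (as Tetsuya
# observed) the two policies coincide:
assert map_ranks(8, 4, 8, 4, span=False) == map_ranks(8, 4, 8, 4, span=True)
```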
>>>>>>>>>> >>>>>>>>>> Thanks! >>>>>>>>>> Ralph >>>>>>>>>> >>>>>>>>>> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Hi Ralph, >>>>>>>>>>> >>>>>>>>>>> Thank you very much for your quick response. >>>>>>>>>>> >>>>>>>>>>> I'm afraid to say that I found one more issuse... >>>>>>>>>>> >>>>>>>>>>> It's not so serious. Please check it when you have a lot of > time. >>>>>>>>>>> >>>>>>>>>>> The problem is cpus-per-proc with -map-by option under Torque >>>>>>> manager. >>>>>>>>>>> It doesn't work as shown below. I guess you can get the same >>>>>>>>>>> behaviour under Slurm manager. >>>>>>>>>>> >>>>>>>>>>> Of course, if I remove -map-by option, it works quite well. >>>>>>>>>>> >>>>>>>>>>> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32 >>>>>>>>>>> qsub: waiting for job 8116.manage.cluster to start >>>>>>>>>>> qsub: job 8116.manage.cluster ready >>>>>>>>>>> >>>>>>>>>>> [mishima@node03 ~]$ cd ~/Ducom/testbed2 >>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings >>>>>>> -cpus-per-proc >>>>>>>>> 4 >>>>>>>>>>> -map-by socket mPre >>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >>> > -------------------------------------------------------------------------- >>>>>>>>>>> A request was made to bind to that would result in binding more >>>>>>>>>>> processes than cpus on a resource: >>>>>>>>>>> >>>>>>>>>>> Bind to: CORE >>>>>>>>>>> Node: node03>>>>>>> #processes: 2 >>>>>>>>>>> #cpus: 1 >>>>>>>>>>> >>>>>>>>>>> You can override this protection by adding the > "overload-allowed" >>>>>>>>>>> option to your binding directive. 
>>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >>> > -------------------------------------------------------------------------- >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings >>>>>>> -cpus-per-proc >>>>>>>>> 4 >>>>>>>>>>> mPre >>>>>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt >>> 0]], >>>>>>>>> socket >>>>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s >>>>>>>>>>> ocket 1[core 11[hwt 0]]: >>>>>>>>>>> >>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>>>>>>>>>> socket 1[core 15[hwt 0]]: >>>>>>>>>>> >>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] >>>>>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>>>>>>>> socket 2[core 19[hwt 0]]: >>>>>>>>>>> >>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>>>>>>>> socket 2[core 23[hwt 0]]: >>>>>>>>>>> >>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>>>>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>>>>>>>> socket 3[core 27[hwt 0]]: >>>>>>>>>>> >>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] 
>>>>>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>>>>>>>> socket 3[core 31[hwt 0]]: >>>>>>>>>>> >>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>>>>>>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt >>> 0]], >>>>>>>>> socket >>>>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>>>>>>> cket 0[core 3[hwt 0]]: >>>>>>>>>>> >>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt >>> 0]], >>>>>>>>> socket >>>>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>>>>>>>>>> cket 0[core 7[hwt 0]]: >>>>>>>>>>> >>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] >>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> Tetsuya Mishima >>>>>>>>>>> >>>>>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again! >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> >>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Thanks! That's precisely where I was going to look when I had >>>>>>> time :-) >>>>>>>>>>>> >>>>>>>>>>>> I'll update tomorrow. >>>>>>>>>>>> Ralph >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM, >>>>> <tmish...@jcity.maeda.co.jp>wrote: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Hi Ralph, >>>>>>>>>>>> >>>>>>>>>>>> This is the continuous story of "Segmentation fault in > oob_tcp.c >>>>> of >>>>>>>>>>>> openmpi-1.7.4a1r29646". >>>>>>>>>>>> >>>>>>>>>>>> I found the cause. >>>>>>>>>>>> >>>>>>>>>>>> Firstly, I noticed that your hostfile can work and mine can > not. 
>>>>>>>>>>>> >>>>>>>>>>>> Your host file: >>>>>>>>>>>> cat hosts >>>>>>>>>>>> bend001 slots=12 >>>>>>>>>>>> >>>>>>>>>>>> My host file: >>>>>>>>>>>> cat hosts >>>>>>>>>>>> node08 >>>>>>>>>>>> node08 >>>>>>>>>>>> ...(total 8 lines) >>>>>>>>>>>> >>>>>>>>>>>> I modified my script file to add "slots=1" to each line of my >>>>>>> hostfile >>>>>>>>>>>> just before launching mpirun. Then it worked. >>>>>>>>>>>> >>>>>>>>>>>> My host file(modified): >>>>>>>>>>>> cat hosts >>>>>>>>>>>> node08 slots=1 >>>>>>>>>>>> node08 slots=1 >>>>>>>>>>>> ...(total 8 lines) >>>>>>>>>>>> >>>>>>>>>>>> Secondary, I confirmed that there's a slight difference > between >>>>>>>>>>>> orte/util/hostfile/hostfile.c of 1.7.3 and that of >>> 1.7.4a1r29646. >>>>>>>>>>>> >>>>>>>>>>>> $ diff >>>>>>>>>>>> >>>>>>> >>> hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c >>>>>>>>>>>> 394,401c394,399 >>>>>>>>>>>> < if (got_count) { >>>>>>>>>>>> < node->slots_given = true; >>>>>>>>>>>> < } else if (got_max) { >>>>>>>>>>>> < node->slots = node->slots_max; >>>>>>>>>>>> < node->slots_given = true; >>>>>>>>>>>> < } else { >>>>>>>>>>>> < /* should be set by obj_new, but just to be clear */ >>>>>>>>>>>> < node->slots_given = false; >>>>>>>>>>>> --- >>>>>>>>>>>>> if (!got_count) { >>>>>>>>>>>>> if (got_max) { >>>>>>>>>>>>> node->slots = node->slots_max; >>>>>>>>>>>>> } else { >>>>>>>>>>>>> ++node->slots; >>>>>>>>>>>>> } >>>>>>>>>>>> .... >>>>>>>>>>>> >>>>>>>>>>>> Finally, I added the line 402 below just as a tentative trial. >>>>>>>>>>>> Then, it worked. >>>>>>>>>>>> >>>>>>>>>>>> cat -n orte/util/hostfile/hostfile.c: >>>>>>>>>>>> ... 
>>>>>>>>>>>> 394     if (got_count) {
>>>>>>>>>>>> 395         node->slots_given = true;
>>>>>>>>>>>> 396     } else if (got_max) {
>>>>>>>>>>>> 397         node->slots = node->slots_max;
>>>>>>>>>>>> 398         node->slots_given = true;
>>>>>>>>>>>> 399     } else {
>>>>>>>>>>>> 400         /* should be set by obj_new, but just to be clear */
>>>>>>>>>>>> 401         node->slots_given = false;
>>>>>>>>>>>> 402         ++node->slots;  /* added by tmishima */
>>>>>>>>>>>> 403     }
>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>> Please fix the problem properly, because it's just based on my
>>>>>>>>>>>> random guess. It's related to the treatment of hostfile where slots
>>>>>>>>>>>> information is not given.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> users mailing list
>>>>>>>>>>>> us...@open-mpi.org
>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
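[The hostfile slot-counting behavior discussed at the end of the thread can be modeled as follows. This is an illustrative sketch of the 1.7.3-era semantics (each bare mention of a host contributes one implicit slot, while an explicit slots=N is taken as given), not the actual orte/util/hostfile/hostfile.c code:]

```python
def count_slots(hostfile_lines):
    """Illustrative model: an explicit 'slots=N' fixes a host's slot count;
    otherwise every bare mention of the hostname adds one implicit slot
    (the '++node->slots' behavior the quoted patch restores)."""
    slots = {}
    for line in hostfile_lines:
        parts = line.split()
        host = parts[0]
        explicit = next((int(p.split("=", 1)[1]) for p in parts[1:]
                         if p.startswith("slots=")), None)
        if explicit is not None:
            slots[host] = explicit
        else:
            slots[host] = slots.get(host, 0) + 1
    return slots

# Ralph's hostfile: one line with an explicit count
assert count_slots(["bend001 slots=12"]) == {"bend001": 12}
# Tetsuya's hostfile: 8 bare 'node08' lines should yield 8 implicit slots
assert count_slots(["node08"] * 8) == {"node08": 8}
```

Under this reading, the 1.7.4a1r29646 change dropped the implicit increment for bare hostnames, which is why adding "slots=1" to each line worked around the segfault.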