Actually, it looks like it would happen with hetero-nodes set - all that's required is that at least two nodes have the same architecture. So you might want to give the trunk a shot, as it may well now be fixed.
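For reference, the two cases from Tetsuya's report below make a handy regression check against the trunk (the demo program myprog and the paths are taken from his transcripts, so adjust them for your own setup):

  mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog                 # worked on 1.7.4rc1
  mpirun -np 4 -cpus-per-proc 4 -report-bindings -hetero-nodes myprog   # failed on 1.7.4rc1

The second command is equivalent to running the first with "orte_hetero_nodes = 1" set in your mca-params.conf (typically $HOME/.openmpi/mca-params.conf), so either form should exercise the fixed code path.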
On Dec 19, 2013, at 8:35 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Hmmm...not having any luck tracking this down yet. If anything, based on what
> I saw in the code, I would have expected it to fail when hetero-nodes was
> false, not the other way around.
>
> I'll keep poking around - just wanted to provide an update.
>
> On Dec 19, 2013, at 12:54 AM, tmish...@jcity.maeda.co.jp wrote:
>
>> Hi Ralph, sorry for cutting in with another post.
>>
>> Your advice about -hetero-nodes in the other thread gave me a hint.
>>
>> I had already put "orte_hetero_nodes = 1" in my mca-params.conf, because
>> you told me a month ago that my environment would need this option.
>>
>> Removing this line from mca-params.conf, it works.
>> In other words, you can replicate the problem by adding -hetero-nodes, as
>> shown below.
>>
>> qsub: job 8364.manage.cluster completed
>> [mishima@manage mpi]$ qsub -I -l nodes=2:ppn=8
>> qsub: waiting for job 8365.manage.cluster to start
>> qsub: job 8365.manage.cluster ready
>>
>> [mishima@node11 ~]$ ompi_info --all | grep orte_hetero_nodes
>>               MCA orte: parameter "orte_hetero_nodes" (current value:
>>               "false", data source: default, level: 9 dev/all,
>>               type: bool)
>> [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
>> [node11.cluster:27895] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>> [node11.cluster:27895] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>> [node12.cluster:24891] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>> [node12.cluster:24891] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>> Hello world from process 0 of 4
>> Hello world from process 1 of 4
>> Hello world from process 2 of 4
>> Hello world from process 3 of 4
>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -hetero-nodes myprog
>> --------------------------------------------------------------------------
>> A request was made to bind to that would result in binding more
>> processes than cpus on a resource:
>>
>>    Bind to:     CORE
>>    Node:        node12
>>    #processes:  2
>>    #cpus:       1
>>
>> You can override this protection by adding the "overload-allowed"
>> option to your binding directive.
>> --------------------------------------------------------------------------
>>
>> As far as I checked, data->num_bound seems to go bad in bind_downwards
>> when I put "-hetero-nodes". I hope you can clear up the problem.
>>
>> Regards,
>> Tetsuya Mishima
>>
>>> Yes, it's very strange. But I don't think there's any chance that
>>> I have < 8 actual cores on the node. I guess that you can replicate
>>> it with SLURM, so please try it again.
>>>
>>> I changed to using node10 and node11, and then I got the warning against
>>> node11.
>>>
>>> Furthermore, just as a piece of information for you, I tried adding
>>> "-bind-to core:overload-allowed", and then it worked as shown below.
>>> But I think node11 is never overloaded, because it has 8 cores.
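>>> (If you ever want to double-check the physical core count on a node,
>>> hwloc's standalone tools can confirm it - assuming they are installed,
>>> something like
>>>
>>>   lstopo --only core
>>>
>>> prints one line per core, so node11 should show eight "Core" lines.)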
>>> qsub: job 8342.manage.cluster completed
>>> [mishima@manage ~]$ qsub -I -l nodes=node10:ppn=8+node11:ppn=8
>>> qsub: waiting for job 8343.manage.cluster to start
>>> qsub: job 8343.manage.cluster ready
>>>
>>> [mishima@node10 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>> [mishima@node10 demos]$ cat $PBS_NODEFILE
>>> node10
>>> node10
>>> node10
>>> node10
>>> node10
>>> node10
>>> node10
>>> node10
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
>>> --------------------------------------------------------------------------
>>> A request was made to bind to that would result in binding more
>>> processes than cpus on a resource:
>>>
>>>    Bind to:     CORE
>>>    Node:        node11
>>>    #processes:  2
>>>    #cpus:       1
>>>
>>> You can override this protection by adding the "overload-allowed"
>>> option to your binding directive.
>>> --------------------------------------------------------------------------
>>> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -bind-to core:overload-allowed myprog
>>> [node10.cluster:27020] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> [node10.cluster:27020] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>> [node11.cluster:26597] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>> [node11.cluster:26597] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> Hello world from process 1 of 4
>>> Hello world from process 0 of 4
>>> Hello world from process 3 of 4
>>> Hello world from process 2 of 4
>>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>>> Very strange - I can't seem to replicate it. Is there any chance that you
>>>> have < 8 actual cores on node12?
>>>>
>>>> On Dec 18, 2013, at 4:53 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>
>>>>> Hi Ralph, sorry for confusing you.
>>>>>
>>>>> At that time, I cut and pasted the "cat $PBS_NODEFILE" part.
>>>>> I guess I didn't paste the last line, by my mistake.
>>>>>
>>>>> I retried the test, and the output below is exactly what I got.
>>>>>
>>>>> [mishima@manage ~]$ qsub -I -l nodes=node11:ppn=8+node12:ppn=8
>>>>> qsub: waiting for job 8338.manage.cluster to start
>>>>> qsub: job 8338.manage.cluster ready
>>>>>
>>>>> [mishima@node11 ~]$ cat $PBS_NODEFILE
>>>>> node11
>>>>> node11
>>>>> node11
>>>>> node11
>>>>> node11
>>>>> node11
>>>>> node11
>>>>> node11
>>>>> node12
>>>>> node12
>>>>> node12
>>>>> node12
>>>>> node12
>>>>> node12
>>>>> node12
>>>>> node12
>>>>> [mishima@node11 ~]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
>>>>> --------------------------------------------------------------------------
>>>>> A request was made to bind to that would result in binding more
>>>>> processes than cpus on a resource:
>>>>>
>>>>>    Bind to:     CORE
>>>>>    Node:        node12
>>>>>    #processes:  2
>>>>>    #cpus:       1
>>>>>
>>>>> You can override this protection by adding the "overload-allowed"
>>>>> option to your binding directive.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tetsuya Mishima
>>>>>
>>>>>> I removed the debug in #2 - thanks for reporting it.
>>>>>>
>>>>>> For #1, it actually looks to me like this is correct. If you look at your
>>>>>> allocation, there are only 7 slots being allocated on node12, yet you have
>>>>>> asked for 8 cpus to be assigned (2 procs with 4 cpus/proc). So the warning
>>>>>> is in fact correct.
>>>>>>
>>>>>> On Dec 18, 2013, at 4:04 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>
>>>>>>> Hi Ralph, I found that openmpi-1.7.4rc1 was already uploaded, so I'd like
>>>>>>> to report 3 issues, mainly regarding -cpus-per-proc.
>>>>>>>
>>>>>>> 1) When I use 2 nodes (node11, node12), which have 8 cores each (= 2
>>>>>>> sockets x 4 cores/socket), it starts to produce the error again as shown
>>>>>>> below. At least openmpi-1.7.4a1r29646 did work well.
>>>>>>>
>>>>>>> [mishima@manage ~]$ qsub -I -l nodes=2:ppn=8
>>>>>>> qsub: waiting for job 8336.manage.cluster to start
>>>>>>> qsub: job 8336.manage.cluster ready
>>>>>>>
>>>>>>> [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>> [mishima@node11 demos]$ cat $PBS_NODEFILE
>>>>>>> node11
>>>>>>> node11
>>>>>>> node11
>>>>>>> node11
>>>>>>> node11
>>>>>>> node11
>>>>>>> node11
>>>>>>> node11
>>>>>>> node12
>>>>>>> node12
>>>>>>> node12
>>>>>>> node12
>>>>>>> node12
>>>>>>> node12
>>>>>>> node12
>>>>>>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
>>>>>>> --------------------------------------------------------------------------
>>>>>>> A request was made to bind to that would result in binding more
>>>>>>> processes than cpus on a resource:
>>>>>>>
>>>>>>>    Bind to:     CORE
>>>>>>>    Node:        node12
>>>>>>>    #processes:  2
>>>>>>>    #cpus:       1
>>>>>>>
>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>> option to your binding directive.
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> Of course it works well using only one node.
>>>>>>>
>>>>>>> [mishima@node11 demos]$ mpirun -np 2 -cpus-per-proc 4 -report-bindings myprog
>>>>>>> [node11.cluster:26238] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>>>>> [node11.cluster:26238] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>>>>> Hello world from process 1 of 2
>>>>>>> Hello world from process 0 of 2
>>>>>>>
>>>>>>> 2) Adding "-bind-to numa", it works, but the message "bind:upward target
>>>>>>> NUMANode type NUMANode" appears. As far as I remember, I didn't see this
>>>>>>> kind of message before.
>>>>>>>
>>>>>>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -bind-to numa myprog
>>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>>> [node11.cluster:26260] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>>>>> [node11.cluster:26260] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>>>>> [node12.cluster:23607] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>>>>> [node12.cluster:23607] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>>>>> Hello world from process 1 of 4
>>>>>>> Hello world from process 0 of 4
>>>>>>> Hello world from process 3 of 4
>>>>>>> Hello world from process 2 of 4
>>>>>>>
>>>>>>> 3) I use the PGI compiler. It cannot accept the compiler switch
>>>>>>> "-Wno-variadic-macros", which is included in the configure script:
>>>>>>>
>>>>>>> btl_usnic_CFLAGS="-Wno-variadic-macros"
>>>>>>>
>>>>>>> I removed this switch, and then I could continue to build 1.7.4rc1.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Tetsuya Mishima
>>>>>>>
>>>>>>>> Hmmm...okay, I understand the scenario. Must be something in the algo when
>>>>>>>> it only has one node, so it shouldn't be too hard to track down.
>>>>>>>>
>>>>>>>> I'm off on travel for a few days, but will return to this when I get back.
>>>>>>>>
>>>>>>>> Sorry for the delay - will try to look at this while I'm gone, but can't
>>>>>>>> promise anything :-(
>>>>>>>>
>>>>>>>> On Dec 10, 2013, at 6:58 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>
>>>>>>>>> Hi Ralph, sorry for the confusion.
>>>>>>>>>
>>>>>>>>> We usually log on to "manage", which is our control node.
>>>>>>>>> From manage, we submit jobs or enter a remote node such as
>>>>>>>>> node03 via Torque's interactive mode (qsub -I).
>>>>>>>>>
>>>>>>>>> At that time, instead of going through Torque, I just did rsh to node03
>>>>>>>>> from manage and ran myprog on the node. I hope that makes clear what I did.
>>>>>>>>> >>>>>>>>> Now, I retried with "-host node03", which still causes the >> problem: >>>>>>>>> (I comfirmed local run on manage caused the same problem too) >>>>>>>>> >>>>>>>>> [mishima@manage ~]$ rsh node03 >>>>>>>>> Last login: Wed Dec 11 11:38:57 from manage >>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>>>>>>>> [mishima@node03 demos]$ >>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03 >> -report-bindings >>>>>>>>> -cpus-per-proc 4 -map-by socket myprog >>>>>>>>> >>>>>>> >>>>> >>> >> -------------------------------------------------------------------------- >>>>>>>>> A request was made to bind to that would result in binding more >>>>>>>>> processes than cpus on a resource: >>>>>>>>> >>>>>>>>> Bind to: CORE >>>>>>>>> Node: node03 >>>>>>>>> #processes: 2 >>>>>>>>> #cpus: 1 >>>>>>>>> >>>>>>>>> You can override this protection by adding the "overload-allowed" >>>>>>>>> option to your binding directive. >>>>>>>>> >>>>>>> >>>>> >>> >> -------------------------------------------------------------------------- >>>>>>>>> >>>>>>>>> It' strange, but I have to report that "-map-by socket:span" >> worked >>>>>>> well. >>>>>>>>> >>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03 >> -report-bindings >>>>>>>>> -cpus-per-proc 4 -map-by socket:span myprog >>>>>>>>> [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt >> 0]], >>>>>>> socket >>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s >>>>>>>>> ocket 1[core 11[hwt 0]]: >>>>>>>>> >>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>>>> [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt >>> 0]], >>>>>>> socket >>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>>>>>>>> socket 1[core 15[hwt 0]]: >>>>>>>>> >>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] >>>>>>>>> [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt >>> 0]], >>>>>>> socket >>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>>>>>> socket 2[core 19[hwt 0]]: >>>>>>>>> >>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>>>> [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt >>> 0]], >>>>>>> socket >>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>>>>>> socket 2[core 23[hwt 0]]: >>>>>>>>> >>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>>>>>>>> [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt >>> 0]], >>>>>>> socket >>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>>>>>> socket 3[core 27[hwt 0]]: >>>>>>>>> >>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>>>>>>>> [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt >>> 0]], >>>>>>> socket >>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>>>>>> socket 3[core 31[hwt 0]]: >>>>>>>>> >>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>>>>>>>> [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt >> 0]], >>>>>>> socket >>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>>>>> cket 0[core 3[hwt 0]]: >>>>>>>>> >>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>>>> [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt >> 0]], >>>>>>> socket >>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>>>>>>>> cket 0[core 7[hwt 0]]: >>>>>>>>> >>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] 
>>>>>>>>> Hello world from process 2 of 8 >>>>>>>>> Hello world from process 6 of 8 >>>>>>>>> Hello world from process 3 of 8 >>>>>>>>> Hello world from process 7 of 8 >>>>>>>>> Hello world from process 1 of 8 >>>>>>>>> Hello world from process 5 of 8 >>>>>>>>> Hello world from process 0 of 8 >>>>>>>>> Hello world from process 4 of 8 >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Tetsuya Mishima >>>>>>>>> >>>>>>>>> >>>>>>>>>> On Dec 10, 2013, at 6:05 PM, tmish...@jcity.maeda.co.jp wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Hi Ralph, >>>>>>>>>>> >>>>>>>>>>> I tried again with -cpus-per-proc 2 as shown below. >>>>>>>>>>> Here, I found that "-map-by socket:span" worked well. >>>>>>>>>>> >>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings >>>>> -cpus-per-proc >>>>>>> 2 >>>>>>>>>>> -map-by socket:span myprog >>>>>>>>>>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt >>> 0]], >>>>>>>>> socket >>>>>>>>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././. >>>>>>>>>>> /././.][./././././././.][./././././././.] >>>>>>>>>>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B >>>>>>>>>>> /./././.][./././././././.][./././././././.] >>>>>>>>>>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 2[core 17[hwt 0]]: [./././././././.][./././. >>>>>>>>>>> /./././.][B/B/./././././.][./././././././.] >>>>>>>>>>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 2[core 19[hwt 0]]: [./././././././.][./././. >>>>>>>>>>> /./././.][././B/B/./././.][./././././././.] >>>>>>>>>>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 3[core 25[hwt 0]]: [./././././././.][./././. >>>>>>>>>>> /./././.][./././././././.][B/B/./././././.] >>>>>>>>>>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 3[core 27[hwt 0]]: [./././././././.][./././. >>>>>>>>>>> /./././.][./././././././.][././B/B/./././.] >>>>>>>>>>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt >>> 0]], >>>>>>>>> socket >>>>>>>>>>> 0[core 1[hwt 0]]: [B/B/./././././.][././././. >>>>>>>>>>> /././.][./././././././.][./././././././.] >>>>>>>>>>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt >>> 0]], >>>>>>>>> socket >>>>>>>>>>> 0[core 3[hwt 0]]: [././B/B/./././.][././././. >>>>>>>>>>> /././.][./././././././.][./././././././.] >>>>>>>>>>> Hello world from process 1 of 8 >>>>>>>>>>> Hello world from process 0 of 8 >>>>>>>>>>> Hello world from process 4 of 8 >>>>>>>>>>> Hello world from process 2 of 8 >>>>>>>>>>> Hello world from process 7 of 8 >>>>>>>>>>> Hello world from process 6 of 8 >>>>>>>>>>> Hello world from process 5 of 8> >>>>>>> Hello world from >> process 3 of 8 >>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings >>>>> -cpus-per-proc >>>>>>> 2 >>>>>>>>>>> -map-by socket myprog >>>>>>>>>>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt >>> 0]], >>>>>>>>> socket >>>>>>>>>>> 0[core 5[hwt 0]]: [././././B/B/./.][././././. >>>>>>>>>>> /././.][./././././././.][./././././././.] >>>>>>>>>>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt >>> 0]], >>>>>>>>> socket >>>>>>>>>>> 0[core 7[hwt 0]]: [././././././B/B][././././. >>>>>>>>>>> /././.][./././././././.][./././././././.] 
>>>>>>>>>>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt >>> 0]], >>>>>>>>> socket >>>>>>>>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././. >>>>>>>>>>> /././.][./././././././.][./././././././.] >>>>>>>>>>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B >>>>>>>>>>> /./././.][./././././././.][./././././././.] >>>>>>>>>>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 1[core 13[hwt 0]]: [./././././././.][./././. >>>>>>>>>>> /B/B/./.][./././././././.][./././././././.] >>>>>>>>>>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 1[core 15[hwt 0]]: [./././././././.][./././. >>>>>>>>>>> /././B/B][./././././././.][./././././././.] >>>>>>>>>>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt >>> 0]], >>>>>>>>> socket >>>>>>>>>>> 0[core 1[hwt 0]]: [B/B/./././././.][././././. >>>>>>>>>>> /././.][./././././././.][./././././././.] >>>>>>>>>>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt >>> 0]], >>>>>>>>> socket >>>>>>>>>>> 0[core 3[hwt 0]]: [././B/B/./././.][././././. >>>>>>>>>>> /././.][./././././././.][./././././././.] >>>>>>>>>>> Hello world from process 5 of 8 >>>>>>>>>>> Hello world from process 1 of 8 >>>>>>>>>>> Hello world from process 6 of 8 >>>>>>>>>>> Hello world from process 4 of 8 >>>>>>>>>>> Hello world from process 2 of 8 >>>>>>>>>>> Hello world from process 0 of 8 >>>>>>>>>>> Hello world from process 7 of 8 >>>>>>>>>>> Hello world from process 3 of 8 >>>>>>>>>>> >>>>>>>>>>> "-np 8" and "-cpus-per-proc 4" just filled all sockets. >>>>>>>>>>> In this case, I guess "-map-by socket:span" and "-map-by >> socket" >>>>> has >>>>>>>>> same >>>>>>>>>>> meaning. >>>>>>>>>>> Therefore, there's no problem about that. Sorry for distubing. >>>>>>>>>> >>>>>>>>>> No problem - glad you could clear that up :-) >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> By the way, through this test, I found another problem. >>>>>>>>>>> Without torque manager and just using rsh, it causes the same >>> error >>>>>>>>> like >>>>>>>>>>> below: >>>>>>>>>>> >>>>>>>>>>> [mishima@manage openmpi-1.7]$ rsh node03 >>>>>>>>>>> Last login: Wed Dec 11 09:42:02 from manage >>>>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings >>>>> -cpus-per-proc >>>>>>> 4 >>>>>>>>>>> -map-by socket myprog >>>>>>>>>> >>>>>>>>>> I don't understand the difference here - you are simply starting >>> it >>>>>>> from>>>>> a different node? It looks like everything is expected to >>> run local >>>>> to >>>>>>>>> mpirun, yes? So there is no rsh actually involved here. >>>>>>>>>> Are you still running in an allocation? >>>>>>>>>> >>>>>>>>>> If you run this with "-host node03" on the cmd line, do you see >>> the >>>>>>> same >>>>>>>>> problem? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >>> >> -------------------------------------------------------------------------- >>>>>>>>>>> A request was made to bind to that would result in binding more >>>>>>>>>>> processes than cpus on a resource: >>>>>>>>>>> >>>>>>>>>>> Bind to: CORE >>>>>>>>>>> Node: node03 >>>>>>>>>>> #processes: 2 >>>>>>>>>>> #cpus: 1 >>>>>>>>>>> >>>>>>>>>>> You can override this protection by adding the >> "overload-allowed" >>>>>>>>>>> option to your binding directive. 
>>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >>> >> -------------------------------------------------------------------------- >>>>>>>>>>> [mishima@node03 demos]$ >>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings >>>>> -cpus-per-proc >>>>>>> 4 >>>>>>>>>>> myprog >>>>>>>>>>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt >>> 0]], >>>>>>>>> socket >>>>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s >>>>>>>>>>> ocket 1[core 11[hwt 0]]: >>>>>>>>>>> >>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>>>>>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>>>>>>>>>> socket 1[core 15[hwt 0]]: >>>>>>>>>>> >>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] >>>>>>>>>>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>>>>>>>> socket 2[core 19[hwt 0]]: >>>>>>>>>>> >>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>>>>>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>>>>>>>> socket 2[core 23[hwt 0]]: >>>>>>>>>>> >>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>>>>>>>>>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>>>>>>>> socket 3[core 27[hwt 0]]:>>>>> >>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>>>>>>>>>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt >>>>> 0]], >>>>>>>>> socket >>>>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>>>>>>>> socket 3[core 31[hwt 0]]: >>>>>>>>>>> >>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>>>>>>>>>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt >>> 0]], >>>>>>>>> socket >>>>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>>>>>>> cket 0[core 3[hwt 0]]: >>>>>>>>>>> >>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>>>>>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt >>> 0]], >>>>>>>>> socket >>>>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>>>>>>>>>> cket 0[core 7[hwt 0]]: >>>>>>>>>>> >>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] >>>>>>>>>>> Hello world from process 4 of 8 >>>>>>>>>>> Hello world from process 2 of 8 >>>>>>>>>>> Hello world from process 6 of 8 >>>>>>>>>>> Hello world from process 5 of 8 >>>>>>>>>>> Hello world from process 3 of 8 >>>>>>>>>>> Hello world from process 7 of 8 >>>>>>>>>>> Hello world from process 0 of 8 >>>>>>>>>>> Hello world from process 1 of 8 >>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> Tetsuya Mishima >>>>>>>>>>> >>>>>>>>>>>> Hmmm...that's strange. I only have 2 sockets on my system, but >>> let >>>>>>> me >>>>>>>>>>> poke around a bit and see what might be happening. >>>>>>>>>>>> >>>>>>>>>>>> On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote: >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Ralph, >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks. I didn't know the meaning of "socket:span". >>>>>>>>>>>>> >>>>>>>>>>>>> But it still causes the problem, which seems socket:span >>> doesn't >>>>>>>>> work. 
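>>>>>>>>>>>>> (For comparison - on this 4-socket, 32-core node the two mapping
>>>>>>>>>>>>> policies only differ when the job does not fill every socket, e.g.:
>>>>>>>>>>>>>
>>>>>>>>>>>>>   mpirun -np 8 -cpus-per-proc 2 -map-by socket      myprog   # fills socket 0, then socket 1
>>>>>>>>>>>>>   mpirun -np 8 -cpus-per-proc 2 -map-by socket:span myprog   # two ranks on each of the 4 sockets
>>>>>>>>>>>>>
>>>>>>>>>>>>> With -np 8 and -cpus-per-proc 4 all 32 cores are consumed either way,
>>>>>>>>>>>>> so both policies produce the same layout, as the -cpus-per-proc 2
>>>>>>>>>>>>> transcripts elsewhere in this thread show.)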
>>>>>>>>>>>>> >>>>>>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32 >>>>>>>>>>>>> qsub: waiting for job 8265.manage.cluster to start >>>>>>>>>>>>> qsub: job 8265.manage.cluster ready >>>>>>>>>>>>> >>>>>>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>>>>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings >>>>>>> -cpus-per-proc >>>>>>>>> 4 >>>>>>>>>>>>> -map-by socket:span myprog >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8 >> [hwt >>>>> 0]], >>>>>>>>>>> socket >>>>>>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s >>>>>>>>>>>>> ocket 1[core 11[hwt 0]]: >>>>>>>>>>>>> >>>>>>> >> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12 >> [hwt >>>>>>> 0]], >>>>>>>>>>> socket >>>>>>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>>>>>>>>>>>> socket 1[core 15[hwt 0]]: >>>>>>>>>>>>> >>>>>>> >> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16 >> [hwt >>>>>>> 0]], >>>>>>>>>>> socket >>>>>>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>>>>>>>>>> socket 2[core 19[hwt 0]]: >>>>>>>>>>>>> >>>>>>> >> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20 >> [hwt >>>>>>> 0]], >>>>>>>>>>> socket >>>>>>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>>>>>>>>>> socket 2[core 23[hwt 0]]: >>>>>>>>>>>>> >>>>>>> >> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24 >> [hwt >>>>>>> 0]], >>>>>>>>>>> socket >>>>>>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>>>>>>>>>> socket 3[core 27[hwt 0]]: >>>>>>>>>>>>> >>>>>>> >> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28 >> [hwt >>>>>>> 0]], >>>>>>>>>>> socket >>>>>>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>>>>>>>>>> socket 3[core 31[hwt 0]]: >>>>>>>>>>>>> >>>>>>> >> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0 >> [hwt >>>>> 0]], >>>>>>>>>>> socket >>>>>>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>>>>>>>>> cket 0[core 3[hwt 0]]: >>>>>>>>>>>>> >>>>>>> >> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>>>>>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4 >> [hwt >>>>> 0]], >>>>>>>>>>> socket >>>>>>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>>>>>>>>>>>> cket 0[core 7[hwt 0]]: >>>>>>>>>>>>> >>>>>>> >> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] >>>>>>>>>>>>> Hello world from process 0 of 8 >>>>>>>>>>>>> Hello world from process 3 of 8 >>>>>>>>>>>>> Hello world from process 1 of 8 >>>>>>>>>>>>> Hello world from process 4 of 8 >>>>>>>>>>>>> Hello world from process 6 of 8 >>>>>>>>>>>>> Hello world from process 5 of 8 >>>>>>>>>>>>> Hello world from process 2 of 8 >>>>>>>>>>>>> Hello world from process 7 of 8 >>>>>>>>>>>>> >>>>>>>>>>>>> Regards, >>>>>>>>>>>>> Tetsuya Mishima >>>>>>>>>>>>> >>>>>>>>>>>>>> No, that is actually correct. We map a socket until full, >> then >>>>>>> move >>>>>>>>> to >>>>>>>>>>>>> the next. 
What you want is --map-by socket:span >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp >> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Ralph, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I had a time to try your patch yesterday using >>>>>>>>> openmpi-1.7.4a1r29646. >>>>>>>>>>>>>>>>>>>>>> It stopped the error but unfortunately "mapping by >>>>>>> socket" itself >>>>>>>>>>>>> didn't >>>>>>>>>>>>>>> work >>>>>>>>>>>>>>> well as shown bellow: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32 >>>>>>>>>>>>>>> qsub: waiting for job 8260.manage.cluster to start >>>>>>>>>>>>>>> qsub: job 8260.manage.cluster ready >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>>>>>>>>>>>>>> [mishima@node04 demos]$ mpirun -np 8 -report-bindings >>>>>>>>> -cpus-per-proc >>>>>>>>>>> 4 >>>>>>>>>>>>>>> -map-by socket myprog >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8 >>> [hwt >>>>>>> 0]], >>>>>>>>>>>>> socket >>>>>>>>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s >>>>>>>>>>>>>>> ocket 1[core 11[hwt 0]]: >>>>>>>>>>>>>>> >>>>>>>>> >>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12 >>> [hwt >>>>>>>>> 0]], >>>>>>>>>>>>> socket >>>>>>>>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>>>>>>>>>>>>>> socket 1[core 15[hwt 0]]: >>>>>>>>>>>>>>> >>>>>>>>> >>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16 >>> [hwt >>>>>>>>> 0]], >>>>>>>>>>>>> socket >>>>>>>>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>>>>>>>>>>>> socket 2[core 19[hwt 0]]: >>>>>>>>>>>>>>> >>>>>>>>> >>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20 >>> [hwt >>>>>>>>> 0]], >>>>>>>>>>>>> socket >>>>>>>>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>>>>>>>>>>>> socket 2[core 23[hwt 0]]: >>>>>>>>>>>>>>> >>>>>>>>> >>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24 >>> [hwt >>>>>>>>> 0]], >>>>>>>>>>>>> socket >>>>>>>>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>>>>>>>>>>>> socket 3[core 27[hwt 0]]: >>>>>>>>>>>>>>> >>>>>>>>> >>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28 >>> [hwt >>>>>>>>> 0]], >>>>>>>>>>>>> socket >>>>>>>>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>>>>>>>>>>>> socket 3[core 31[hwt 0]]: >>>>>>>>>>>>>>> >>>>>>>>> >>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0 >>> [hwt >>>>>>> 0]], >>>>>>>>>>>>> socket >>>>>>>>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>>>>>>>>>>> cket 0[core 3[hwt 0]]: >>>>>>>>>>>>>>> >>>>>>>>> >>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4 >>> [hwt >>>>>>> 0]], >>>>>>>>>>>>> socket >>>>>>>>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>>>>>>>>>>>>>> cket 0[core 7[hwt 0]]: >>>>>>>>>>>>>>> >>>>>>>>> >>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] 
>>>>>>>>>>>>>>> Hello world from process 2 of 8 >>>>>>>>>>>>>>> Hello world from process 1 of 8 >>>>>>>>>>>>>>> Hello world from process 3 of 8 >>>>>>>>>>>>>>> Hello world from process 0 of 8 >>>>>>>>>>>>>>> Hello world from process 6 of 8 >>>>>>>>>>>>>>> Hello world from process 5 of 8 >>>>>>>>>>>>>>> Hello world from process 4 of 8 >>>>>>>>>>>>>>> Hello world from process 7 of 8 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I think this should be like this: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> rank 00 >>>>>>>>>>>>>>> >>>>>>>>> >>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>>>>>>>>>> rank 01 >>>>>>>>>>>>>>> >>>>>>>>> >>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>>>>>>>>>> rank 02 >>>>>>>>>>>>>>> >>>>>>>>> >>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>>>>>>>>>> ... >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>> Tetsuya Mishima >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I fixed this under the trunk (was an issue regardless of >> RM) >>>>> and >>>>>>>>>>> have >>>>>>>>>>>>>>> scheduled it for 1.7.4. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>> Ralph >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp >>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Ralph, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thank you very much for your quick response.> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I'm afraid to say that I found one more issuse... >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> It's not so serious. Please check it when you have a lot >> of >>>>>>> time. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> The problem is cpus-per-proc with -map-by option under >>> Torque >>>>>>>>>>>>> manager. >>>>>>>>>>>>>>>>> It doesn't work as shown below. I guess you can get the >>> same >>>>>>>>>>>>>>>>> behaviour under Slurm manager. >> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Of course, if I remove -map-by option, it works quite >> well. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32 >>>>>>>>>>>>>>>>> qsub: waiting for job 8116.manage.cluster to start >>>>>>>>>>>>>>>>> qsub: job 8116.manage.cluster ready >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> [mishima@node03 ~]$ cd ~/Ducom/testbed2 >>>>>>>>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings >>>>>>>>>>>>> -cpus-per-proc >>>>>>>>>>>>>>> 4 >>>>>>>>>>>>>>>>> -map-by socket mPre >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >>> >> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> A request was made to bind to that would result in >> binding >>>>> more >>>>>>>>>>>>>>>>> processes than cpus on a resource: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Bind to: CORE >>>>>>>>>>>>>>>>> Node: node03>>>>>>> #processes: 2 >>>>>>>>>>>>>>>>> #cpus: 1 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> You can override this protection by adding the >>>>>>> "overload-allowed" >>>>>>>>>>>>>>>>> option to your binding directive. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >>> >> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings >>>>>>>>>>>>> -cpus-per-proc >>>>>>>>>>>>>>> 4 >>>>>>>>>>>>>>>>> mPre >>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core >> 8 >>>>> [hwt >>>>>>>>> 0]], >>>>>>>>>>>>>>> socket >>>>>>>>>>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s >>>>>>>>>>>>>>>>> ocket 1[core 11[hwt 0]]: >>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core >> 12 >>>>> [hwt >>>>>>>>>>> 0]], >>>>>>>>>>>>>>> socket >>>>>>>>>>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>>>>>>>>>>>>>>>> socket 1[core 15[hwt 0]]: >>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] >>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core >> 16 >>>>> [hwt >>>>>>>>>>> 0]], >>>>>>>>>>>>>>> socket >>>>>>>>>>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>>>>>>>>>>>>>> socket 2[core 19[hwt 0]]: >>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core >> 20 >>>>> [hwt >>>>>>>>>>> 0]], >>>>>>>>>>>>>>> socket >>>>>>>>>>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>>>>>>>>>>>>>> socket 2[core 23[hwt 0]]: >>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core >> 24 >>>>> [hwt >>>>>>>>>>> 0]], >>>>>>>>>>>>>>> socket >>>>>>>>>>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>>>>>>>>>>>>>> socket 3[core 27[hwt 0]]: >>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core >> 28 >>>>> [hwt >>>>>>>>>>> 0]], >>>>>>>>>>>>>>> socket >>>>>>>>>>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>>>>>>>>>>>>>> socket 3[core 31[hwt 0]]: >>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>> >>> >> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]>>>>>>>>>>>>> >> >>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0 >>>>> [hwt >>>>>>>>> 0]], >>>>>>>>>>>>>>> socket >>>>>>>>>>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>>>>>>>>>>>>> cket 0[core 3[hwt 0]]: >>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core >> 4 >>>>> [hwt >>>>>>>>> 0]], >>>>>>>>>>>>>>> socket >>>>>>>>>>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>>>>>>>>>>>>>>>> cket 0[core 7[hwt 0]]: >>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>> Regards, >>>>>>>>>>>>>>>>> Tetsuya Mishima >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again! >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain >>>>> <r...@open-mpi.org> >>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks! 
>>>>>>>>>>>>>>>>>> That's precisely where I was going to look when I had time :-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'll update tomorrow.
>>>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM, <tmish...@jcity.maeda.co.jp> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This is a continuation of "Segmentation fault in oob_tcp.c of
>>>>>>>>>>>>>>>>>> openmpi-1.7.4a1r29646".
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I found the cause.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Firstly, I noticed that your hostfile works and mine does not.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Your host file:
>>>>>>>>>>>>>>>>>> cat hosts
>>>>>>>>>>>>>>>>>> bend001 slots=12
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> My host file:
>>>>>>>>>>>>>>>>>> cat hosts
>>>>>>>>>>>>>>>>>> node08
>>>>>>>>>>>>>>>>>> node08
>>>>>>>>>>>>>>>>>> ...(total 8 lines)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I modified my script file to add "slots=1" to each line of my hostfile
>>>>>>>>>>>>>>>>>> just before launching mpirun. Then it worked.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> My host file (modified):
>>>>>>>>>>>>>>>>>> cat hosts
>>>>>>>>>>>>>>>>>> node08 slots=1
>>>>>>>>>>>>>>>>>> node08 slots=1
>>>>>>>>>>>>>>>>>> ...(total 8 lines)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Secondly, I confirmed that there's a slight difference between
>>>>>>>>>>>>>>>>>> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
>>>>>>>>>>>>>>>>>> 394,401c394,399
>>>>>>>>>>>>>>>>>> <     if (got_count) {
>>>>>>>>>>>>>>>>>> <         node->slots_given = true;
>>>>>>>>>>>>>>>>>> <     } else if (got_max) {
>>>>>>>>>>>>>>>>>> <         node->slots = node->slots_max;
>>>>>>>>>>>>>>>>>> <         node->slots_given = true;
>>>>>>>>>>>>>>>>>> <     } else {
>>>>>>>>>>>>>>>>>> <         /* should be set by obj_new, but just to be clear */
>>>>>>>>>>>>>>>>>> <         node->slots_given = false;
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>> >     if (!got_count) {
>>>>>>>>>>>>>>>>>> >         if (got_max) {
>>>>>>>>>>>>>>>>>> >             node->slots = node->slots_max;
>>>>>>>>>>>>>>>>>> >         } else {
>>>>>>>>>>>>>>>>>> >             ++node->slots;
>>>>>>>>>>>>>>>>>> >         }
>>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Finally, I added line 402 below, just as a tentative trial.
>>>>>>>>>>>>>>>>>> Then it worked.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> cat -n orte/util/hostfile/hostfile.c:
>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>    394          if (got_count) {
>>>>>>>>>>>>>>>>>>    395              node->slots_given = true;
>>>>>>>>>>>>>>>>>>    396          } else if (got_max) {
>>>>>>>>>>>>>>>>>>    397              node->slots = node->slots_max;
>>>>>>>>>>>>>>>>>>    398              node->slots_given = true;
>>>>>>>>>>>>>>>>>>    399          } else {
>>>>>>>>>>>>>>>>>>    400              /* should be set by obj_new, but just to be clear */
>>>>>>>>>>>>>>>>>>    401              node->slots_given = false;
>>>>>>>>>>>>>>>>>>    402              ++node->slots; /* added by tmishima */
>>>>>>>>>>>>>>>>>>    403          }
>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please fix the problem properly, because my change is just based on a
>>>>>>>>>>>>>>>>>> random guess. It's related to the treatment of hostfiles where slots
>>>>>>>>>>>>>>>>>> information is not given.
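>>>>>>>>>>>>>>>>>> Condensed into a sketch (my reading of the diff above; names as in
>>>>>>>>>>>>>>>>>> hostfile.c, untested):
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    /* 1.7.3: a bare hostname line adds one slot each time it appears,
>>>>>>>>>>>>>>>>>>     * so "node08" listed 8 times ends up with node->slots == 8 */
>>>>>>>>>>>>>>>>>>    if (!got_count) {
>>>>>>>>>>>>>>>>>>        if (got_max) {
>>>>>>>>>>>>>>>>>>            node->slots = node->slots_max;
>>>>>>>>>>>>>>>>>>        } else {
>>>>>>>>>>>>>>>>>>            ++node->slots;
>>>>>>>>>>>>>>>>>>        }
>>>>>>>>>>>>>>>>>>    }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    /* 1.7.4a1: the bare-hostname branch only records that no count was
>>>>>>>>>>>>>>>>>>     * given and never touches node->slots, so the same hostfile ends up
>>>>>>>>>>>>>>>>>>     * with zero slots unless every line carries an explicit slots=N */
>>>>>>>>>>>>>>>>>>    if (got_count) {
>>>>>>>>>>>>>>>>>>        node->slots_given = true;
>>>>>>>>>>>>>>>>>>    } else if (got_max) {
>>>>>>>>>>>>>>>>>>        node->slots = node->slots_max;
>>>>>>>>>>>>>>>>>>        node->slots_given = true;
>>>>>>>>>>>>>>>>>>    } else {
>>>>>>>>>>>>>>>>>>        node->slots_given = false;
>>>>>>>>>>>>>>>>>>    }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The tentative line 402 restores the increment while keeping slots_given
>>>>>>>>>>>>>>>>>> set to false, so later code can still tell that the slot count was a
>>>>>>>>>>>>>>>>>> default rather than something the user specified.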
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Tetsuya Mishima

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users