Hi Ralph, sorry for cutting into this thread.

Your advice about -hetero-nodes in the other thread gave me a hint. I already
had "orte_hetero_nodes = 1" in my mca-params.conf, because you told me a month
ago that my environment would need this option. Removing this line from
mca-params.conf makes it work.
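For reference, this is how the parameter was set on my side -- a minimal
sketch, assuming the usual per-user MCA parameter file location for the 1.7
series (adjust the path if your installation uses the system-wide
etc/openmpi-mca-params.conf instead):

    # $HOME/.openmpi/mca-params.conf -- the line I removed
    orte_hetero_nodes = 1

    # equivalent one-off forms on the mpirun command line:
    #   mpirun --mca orte_hetero_nodes 1 ...
    #   mpirun -hetero-nodes ...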
In other words, you can replicate it by adding -hetero-nodes, as shown below.

qsub: job 8364.manage.cluster completed
[mishima@manage mpi]$ qsub -I -l nodes=2:ppn=8
qsub: waiting for job 8365.manage.cluster to start
qsub: job 8365.manage.cluster ready

[mishima@node11 ~]$ ompi_info --all | grep orte_hetero_nodes
          MCA orte: parameter "orte_hetero_nodes" (current value: "false", data source: default, level: 9 dev/all, type: bool)
[mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
[mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
[node11.cluster:27895] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
[node11.cluster:27895] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
[node12.cluster:24891] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
[node12.cluster:24891] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
[mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -hetero-nodes myprog
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node12
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

As far as I checked, data->num_bound seems to go bad in bind_downwards when I
add "-hetero-nodes". I hope you can clear up the problem. (See the toy sketch
after my signature for the kind of check I mean.)

Regards,
Tetsuya Mishima
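P.S. To show concretely the kind of check I mean, below is a toy model I put
together -- this is not the actual Open MPI bind_downwards code, just a
minimal standalone C sketch of a downward-binding loop guarded by a
num_bound-style counter. The node size, names, and messages are my own
assumptions; only the shape of the overload check is meant to match.

/* toy_bind.c -- toy model of a binding loop with an overload guard */
#include <stdio.h>

#define NCORES 8                     /* cores per node (assumed) */

int main(void)
{
    int cpus_per_proc = 4;           /* mimics -cpus-per-proc 4 */
    int nprocs = 2;                  /* procs mapped onto this node */
    int num_bound[NCORES] = {0};     /* procs already bound per core */
    int next = 0;

    for (int p = 0; p < nprocs; p++) {
        for (int c = 0; c < cpus_per_proc; c++) {
            int core = next + c;
            if (core >= NCORES) {
                fprintf(stderr, "ran out of cores at proc %d\n", p);
                return 1;
            }
            /* overload guard: refuse a second proc on the same core
             * unless something like overload-allowed is requested */
            if (num_bound[core] >= 1) {
                fprintf(stderr, "overload: proc %d would share core %d\n",
                        p, core);
                return 1;
            }
            num_bound[core]++;
        }
        next += cpus_per_proc;
    }
    printf("bound %d procs with %d cores each: OK\n", nprocs, cpus_per_proc);
    return 0;
}

With 2 procs x 4 cores on an 8-core node this prints OK, which matches the
run without -hetero-nodes; if num_bound starts from a stale or wrong value,
the guard fires even though the cores are actually free -- which is exactly
the symptom above.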
> Yes, it's very strange. But I don't think there's any chance that
> I have < 8 actual cores on the node. I guess that you can replicate
> it with SLURM, so please try it again.
>
> I changed to use node10 and node11, and then I got the warning
> against node11.
>
> Furthermore, just as information for you, I tried to add
> "-bind-to core:overload-allowed", and then it worked as shown below.
> But I think node11 is never overloaded because it has 8 cores.
>
> qsub: job 8342.manage.cluster completed
> [mishima@manage ~]$ qsub -I -l nodes=node10:ppn=8+node11:ppn=8
> qsub: waiting for job 8343.manage.cluster to start
> qsub: job 8343.manage.cluster ready
>
> [mishima@node10 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> [mishima@node10 demos]$ cat $PBS_NODEFILE
> node10
> node10
> node10
> node10
> node10
> node10
> node10
> node10
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
>    Bind to:     CORE
>    Node:        node11
>    #processes:  2
>    #cpus:       1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -bind-to core:overload-allowed myprog
> [node10.cluster:27020] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> [node10.cluster:27020] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node11.cluster:26597] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node11.cluster:26597] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> Hello world from process 1 of 4
> Hello world from process 0 of 4
> Hello world from process 3 of 4
> Hello world from process 2 of 4
>
> Regards,
> Tetsuya Mishima
>
> > Very strange - I can't seem to replicate it. Is there any chance
> > that you have < 8 actual cores on node12?
> >
> > On Dec 18, 2013, at 4:53 PM, tmish...@jcity.maeda.co.jp wrote:
> >
> > > Hi Ralph, sorry for confusing you.
> > >
> > > At that time, I cut and pasted the part of "cat $PBS_NODEFILE".
> > > I guess I didn't paste the last line by mistake.
> > >
> > > I retried the test, and below is exactly what I got.
> > >
> > > [mishima@manage ~]$ qsub -I -l nodes=node11:ppn=8+node12:ppn=8
> > > qsub: waiting for job 8338.manage.cluster to start
> > > qsub: job 8338.manage.cluster ready
> > >
> > > [mishima@node11 ~]$ cat $PBS_NODEFILE
> > > node11
> > > node11
> > > node11
> > > node11
> > > node11
> > > node11
> > > node11
> > > node11
> > > node12
> > > node12
> > > node12
> > > node12
> > > node12
> > > node12
> > > node12
> > > node12
> > > [mishima@node11 ~]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
> > > --------------------------------------------------------------------------
> > > A request was made to bind to that would result in binding more
> > > processes than cpus on a resource:
> > >
> > >    Bind to:     CORE
> > >    Node:        node12
> > >    #processes:  2
> > >    #cpus:       1
> > >
> > > You can override this protection by adding the "overload-allowed"
> > > option to your binding directive.
> > > --------------------------------------------------------------------------
> > >
> > > Regards,
> > > Tetsuya Mishima
> > >
> > >> I removed the debug in #2 - thanks for reporting it
> > >>
> > >> For #1, it actually looks to me like this is correct.
> > >> If you look at your allocation, there are only 7 slots being
> > >> allocated on node12, yet you have asked for 8 cpus to be assigned
> > >> (2 procs with 4 cpus/proc). So the warning is in fact correct.
> > >>
> > >> On Dec 18, 2013, at 4:04 PM, tmish...@jcity.maeda.co.jp wrote:
> > >>
> > >>> Hi Ralph, I found that openmpi-1.7.4rc1 was already uploaded,
> > >>> so I'd like to report three issues, mainly regarding -cpus-per-proc.
> > >>>
> > >>> 1) When I use 2 nodes (node11 and node12), which have 8 cores each
> > >>> (= 2 sockets x 4 cores/socket), it starts to produce the error
> > >>> again, as shown below. At least openmpi-1.7.4a1r29646 did work well.
> > >>>
> > >>> [mishima@manage ~]$ qsub -I -l nodes=2:ppn=8
> > >>> qsub: waiting for job 8336.manage.cluster to start
> > >>> qsub: job 8336.manage.cluster ready
> > >>>
> > >>> [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > >>> [mishima@node11 demos]$ cat $PBS_NODEFILE
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node12
> > >>> node12
> > >>> node12
> > >>> node12
> > >>> node12
> > >>> node12
> > >>> node12
> > >>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
> > >>> --------------------------------------------------------------------------
> > >>> A request was made to bind to that would result in binding more
> > >>> processes than cpus on a resource:
> > >>>
> > >>>    Bind to:     CORE
> > >>>    Node:        node12
> > >>>    #processes:  2
> > >>>    #cpus:       1
> > >>>
> > >>> You can override this protection by adding the "overload-allowed"
> > >>> option to your binding directive.
> > >>> --------------------------------------------------------------------------
> > >>>
> > >>> Of course it works well using only one node.
> > >>>
> > >>> [mishima@node11 demos]$ mpirun -np 2 -cpus-per-proc 4 -report-bindings myprog
> > >>> [node11.cluster:26238] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> > >>> [node11.cluster:26238] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> > >>> Hello world from process 1 of 2
> > >>> Hello world from process 0 of 2
> > >>>
> > >>> 2) Adding "-bind-to numa", it works, but the message "bind:upward
> > >>> target NUMANode type NUMANode" appears. As far as I remember,
> > >>> I didn't see this kind of message before.
> > >>>
> > >>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -bind-to numa myprog
> > >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
> > >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
> > >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
> > >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
> > >>> [node11.cluster:26260] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> > >>> [node11.cluster:26260] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> > >>> [node12.cluster:23607] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> > >>> [node12.cluster:23607] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> > >>> Hello world from process 1 of 4
> > >>> Hello world from process 0 of 4
> > >>> Hello world from process 3 of 4
> > >>> Hello world from process 2 of 4
> > >>>
> > >>> 3) I use the PGI compiler. It cannot accept the compiler switch
> > >>> "-Wno-variadic-macros", which is included in the configure script:
> > >>>
> > >>> btl_usnic_CFLAGS="-Wno-variadic-macros"
> > >>>
> > >>> I removed this switch, and then I could continue to build 1.7.4rc1.
> > >>>
> > >>> Regards,
> > >>> Tetsuya Mishima
> > >>>
> > >>>> Hmmm...okay, I understand the scenario. Must be something in the
> > >>>> algo when it only has one node, so it shouldn't be too hard to
> > >>>> track down.
> > >>>>
> > >>>> I'm off on travel for a few days, but will return to this when I
> > >>>> get back.
> > >>>>
> > >>>> Sorry for the delay - will try to look at this while I'm gone,
> > >>>> but can't promise anything :-(
> > >>>>
> > >>>> On Dec 10, 2013, at 6:58 PM, tmish...@jcity.maeda.co.jp wrote:
> > >>>>
> > >>>>> Hi Ralph, sorry for the confusion.
> > >>>>>
> > >>>>> We usually log on to "manage", which is our control node.
> > >>>>> From manage, we submit jobs or enter a remote node such as
> > >>>>> node03 by torque interactive mode (qsub -I).
> > >>>>>
> > >>>>> At that time, instead of torque, I just did rsh to node03 from
> > >>>>> manage and ran myprog on the node. I hope you can understand
> > >>>>> what I did.
> > >>>>>
> > >>>>> Now, I retried with "-host node03", which still causes the
> > >>>>> problem (I confirmed that a local run on manage caused the same
> > >>>>> problem too):
> > >>>>>
> > >>>>> [mishima@manage ~]$ rsh node03
> > >>>>> Last login: Wed Dec 11 11:38:57 from manage
> > >>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > >>>>> [mishima@node03 demos]$
> > >>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket myprog
> > >>>>> --------------------------------------------------------------------------
> > >>>>> A request was made to bind to that would result in binding more
> > >>>>> processes than cpus on a resource:
> > >>>>>
> > >>>>>    Bind to:     CORE
> > >>>>>    Node:        node03
> > >>>>>    #processes:  2
> > >>>>>    #cpus:       1
> > >>>>>
> > >>>>> You can override this protection by adding the "overload-allowed"
> > >>>>> option to your binding directive.
> > >>>>> --------------------------------------------------------------------------
> > >>>>>
> > >>>>> It's strange, but I have to report that "-map-by socket:span"
> > >>>>> worked well.
> > >>>>>
> > >>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
> > >>>>> [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>> [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > >>>>> [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>> [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > >>>>> [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > >>>>> [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > >>>>> [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>> [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > >>>>> Hello world from process 2 of 8
> > >>>>> Hello world from process 6 of 8
> > >>>>> Hello world from process 3 of 8
> > >>>>> Hello world from process 7 of 8
> > >>>>> Hello world from process 1 of 8
> > >>>>> Hello world from process 5 of 8
> > >>>>> Hello world from process 0 of 8
> > >>>>> Hello world from process 4 of 8
> > >>>>>
> > >>>>> Regards,
> > >>>>> Tetsuya Mishima
> > >>>>>
> > >>>>>> On Dec 10, 2013, at 6:05 PM, tmish...@jcity.maeda.co.jp wrote:
> > >>>>>>
> > >>>>>>> Hi Ralph,
> > >>>>>>>
> > >>>>>>> I tried again with -cpus-per-proc 2, as shown below.
> > >>>>>>> Here, I found that "-map-by socket:span" worked well.
> > >>>>>>>
> > >>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket:span myprog
> > >>>>>>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]]: [./././././././.][./././././././.][B/B/./././././.][./././././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][././B/B/./././.][./././././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/./././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][././B/B/./././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>> Hello world from process 1 of 8
> > >>>>>>> Hello world from process 0 of 8
> > >>>>>>> Hello world from process 4 of 8
> > >>>>>>> Hello world from process 2 of 8
> > >>>>>>> Hello world from process 7 of 8
> > >>>>>>> Hello world from process 6 of 8
> > >>>>>>> Hello world from process 5 of 8
> > >>>>>>> Hello world from process 3 of 8
> > >>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket myprog
> > >>>>>>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>> Hello world from process 5 of 8
> > >>>>>>> Hello world from process 1 of 8
> > >>>>>>> Hello world from process 6 of 8
> > >>>>>>> Hello world from process 4 of 8
> > >>>>>>> Hello world from process 2 of 8
> > >>>>>>> Hello world from process 0 of 8
> > >>>>>>> Hello world from process 7 of 8
> > >>>>>>> Hello world from process 3 of 8
> > >>>>>>>
> > >>>>>>> "-np 8" and "-cpus-per-proc 4" just filled all sockets.
> > >>>>>>> In this case, I guess "-map-by socket:span" and "-map-by socket"
> > >>>>>>> have the same meaning. Therefore, there's no problem about that.
> > >>>>>>> Sorry for disturbing you.
> > >>>>>>
> > >>>>>> No problem - glad you could clear that up :-)
> > >>>>>>
> > >>>>>>> By the way, through this test, I found another problem.
> > >>>>>>> Without the torque manager, just using rsh, it causes the same
> > >>>>>>> error, like below:
> > >>>>>>>
> > >>>>>>> [mishima@manage openmpi-1.7]$ rsh node03
> > >>>>>>> Last login: Wed Dec 11 09:42:02 from manage
> > >>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > >>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
> > >>>>>>
> > >>>>>> I don't understand the difference here - you are simply starting
> > >>>>>> it from a different node? It looks like everything is expected
> > >>>>>> to run local to mpirun, yes? So there is no rsh actually involved
> > >>>>>> here. Are you still running in an allocation?
> > >>>>>>
> > >>>>>> If you run this with "-host node03" on the cmd line, do you see
> > >>>>>> the same problem?
> > >>>>>>
> > >>>>>>> --------------------------------------------------------------------------
> > >>>>>>> A request was made to bind to that would result in binding more
> > >>>>>>> processes than cpus on a resource:
> > >>>>>>>
> > >>>>>>>    Bind to:     CORE
> > >>>>>>>    Node:        node03
> > >>>>>>>    #processes:  2
> > >>>>>>>    #cpus:       1
> > >>>>>>>
> > >>>>>>> You can override this protection by adding the "overload-allowed"
> > >>>>>>> option to your binding directive.
> > >>>>>>> --------------------------------------------------------------------------
> > >>>>>>> [mishima@node03 demos]$
> > >>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 myprog
> > >>>>>>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > >>>>>>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > >>>>>>> Hello world from process 4 of 8
> > >>>>>>> Hello world from process 2 of 8
> > >>>>>>> Hello world from process 6 of 8
> > >>>>>>> Hello world from process 5 of 8
> > >>>>>>> Hello world from process 3 of 8
> > >>>>>>> Hello world from process 7 of 8
> > >>>>>>> Hello world from process 0 of 8
> > >>>>>>> Hello world from process 1 of 8
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Tetsuya Mishima
> > >>>>>>>
> > >>>>>>>> Hmmm...that's strange. I only have 2 sockets on my system, but
> > >>>>>>>> let me poke around a bit and see what might be happening.
> > >>>>>>>>
> > >>>>>>>> On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Ralph,
> > >>>>>>>>>
> > >>>>>>>>> Thanks. I didn't know the meaning of "socket:span".
> > >>>>>>>>>
> > >>>>>>>>> But it still causes the problem; it seems that socket:span
> > >>>>>>>>> doesn't work.
> > >>>>>>>>>
> > >>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32
> > >>>>>>>>> qsub: waiting for job 8265.manage.cluster to start
> > >>>>>>>>> qsub: job 8265.manage.cluster ready
> > >>>>>>>>>
> > >>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > >>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
> > >>>>>>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>> Hello world from process 0 of 8
> > >>>>>>>>> Hello world from process 3 of 8
> > >>>>>>>>> Hello world from process 1 of 8
> > >>>>>>>>> Hello world from process 4 of 8
> > >>>>>>>>> Hello world from process 6 of 8
> > >>>>>>>>> Hello world from process 5 of 8
> > >>>>>>>>> Hello world from process 2 of 8
> > >>>>>>>>> Hello world from process 7 of 8
> > >>>>>>>>>
> > >>>>>>>>> Regards,
> > >>>>>>>>> Tetsuya Mishima
> > >>>>>>>>>
> > >>>>>>>>>> No, that is actually correct. We map a socket until full,
> > >>>>>>>>>> then move to the next. What you want is --map-by socket:span
> > >>>>>>>>>>
> > >>>>>>>>>> On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi Ralph,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I had time to try your patch yesterday using openmpi-1.7.4a1r29646.
> > >>>>>>>>>>> It stopped the error, but unfortunately "mapping by socket"
> > >>>>>>>>>>> itself didn't work well, as shown below:
> > >>>>>>>>>>>
> > >>>>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32
> > >>>>>>>>>>> qsub: waiting for job 8260.manage.cluster to start
> > >>>>>>>>>>> qsub: job 8260.manage.cluster ready
> > >>>>>>>>>>>
> > >>>>>>>>>>> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > >>>>>>>>>>> [mishima@node04 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>>>> Hello world from process 2 of 8
> > >>>>>>>>>>> Hello world from process 1 of 8
> > >>>>>>>>>>> Hello world from process 3 of 8
> > >>>>>>>>>>> Hello world from process 0 of 8
> > >>>>>>>>>>> Hello world from process 6 of 8
> > >>>>>>>>>>> Hello world from process 5 of 8
> > >>>>>>>>>>> Hello world from process 4 of 8
> > >>>>>>>>>>> Hello world from process 7 of 8
> > >>>>>>>>>>>
> > >>>>>>>>>>> I think this should be like this:
> > >>>>>>>>>>>
> > >>>>>>>>>>> rank 00  [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>>>> rank 01  [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>>>>>>>> rank 02  [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>>>>>>>> ...
> > >>>>>>>>>>>
> > >>>>>>>>>>> Regards,
> > >>>>>>>>>>> Tetsuya Mishima
> > >>>>>>>>>>>
> > >>>>>>>>>>>> I fixed this under the trunk (was an issue regardless of RM)
> > >>>>>>>>>>>> and have scheduled it for 1.7.4.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks!
> > >>>>>>>>>>>> Ralph
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Ralph,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thank you very much for your quick response.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I'm afraid to say that I found one more issue...
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> It's not so serious. Please check it when you have a lot of time.
> > >>>>>>>>>>>>> The problem is cpus-per-proc with the -map-by option under
> > >>>>>>>>>>>>> the Torque manager. It doesn't work as shown below. I guess
> > >>>>>>>>>>>>> you can get the same behaviour under the Slurm manager.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Of course, if I remove the -map-by option, it works quite well.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
> > >>>>>>>>>>>>> qsub: waiting for job 8116.manage.cluster to start
> > >>>>>>>>>>>>> qsub: job 8116.manage.cluster ready
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> [mishima@node03 ~]$ cd ~/Ducom/testbed2
> > >>>>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
> > >>>>>>>>>>>>> --------------------------------------------------------------------------
> > >>>>>>>>>>>>> A request was made to bind to that would result in binding more
> > >>>>>>>>>>>>> processes than cpus on a resource:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>    Bind to:     CORE
> > >>>>>>>>>>>>>    Node:        node03
> > >>>>>>>>>>>>>    #processes:  2
> > >>>>>>>>>>>>>    #cpus:       1
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> You can override this protection by adding the "overload-allowed"
> > >>>>>>>>>>>>> option to your binding directive.
> > >>>>>>>>>>>>> --------------------------------------------------------------------------
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>> Tetsuya Mishima
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks! That's precisely where I was going to look when
> > >>>>>>>>>>>>>> I had time :-)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I'll update tomorrow.
> > >>>>>>>>>>>>>> Ralph
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM, <tmish...@jcity.maeda.co.jp> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hi Ralph,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> This is the continuation of "Segmentation fault in
> > >>>>>>>>>>>>>> oob_tcp.c of openmpi-1.7.4a1r29646".
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I found the cause.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Firstly, I noticed that your hostfile works and mine does not.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Your host file:
> > >>>>>>>>>>>>>> cat hosts
> > >>>>>>>>>>>>>> bend001 slots=12
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> My host file:
> > >>>>>>>>>>>>>> cat hosts
> > >>>>>>>>>>>>>> node08
> > >>>>>>>>>>>>>> node08
> > >>>>>>>>>>>>>> ...(total 8 lines)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I modified my script file to add "slots=1" to each line
> > >>>>>>>>>>>>>> of my hostfile just before launching mpirun. Then it worked.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> My host file (modified):
> > >>>>>>>>>>>>>> cat hosts
> > >>>>>>>>>>>>>> node08 slots=1
> > >>>>>>>>>>>>>> node08 slots=1
> > >>>>>>>>>>>>>> ...(total 8 lines)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Secondly, I confirmed that there's a slight difference
> > >>>>>>>>>>>>>> between orte/util/hostfile/hostfile.c of 1.7.3 and that
> > >>>>>>>>>>>>>> of 1.7.4a1r29646.
> > >>>>>>>>>>>>>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
> > >>>>>>>>>>>>>> 394,401c394,399
> > >>>>>>>>>>>>>> <     if (got_count) {
> > >>>>>>>>>>>>>> <         node->slots_given = true;
> > >>>>>>>>>>>>>> <     } else if (got_max) {
> > >>>>>>>>>>>>>> <         node->slots = node->slots_max;
> > >>>>>>>>>>>>>> <         node->slots_given = true;
> > >>>>>>>>>>>>>> <     } else {
> > >>>>>>>>>>>>>> <         /* should be set by obj_new, but just to be clear */
> > >>>>>>>>>>>>>> <         node->slots_given = false;
> > >>>>>>>>>>>>>> ---
> > >>>>>>>>>>>>>> >     if (!got_count) {
> > >>>>>>>>>>>>>> >         if (got_max) {
> > >>>>>>>>>>>>>> >             node->slots = node->slots_max;
> > >>>>>>>>>>>>>> >         } else {
> > >>>>>>>>>>>>>> >             ++node->slots;
> > >>>>>>>>>>>>>> >         }
> > >>>>>>>>>>>>>> ....
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Finally, I added line 402 below, just as a tentative trial.
> > >>>>>>>>>>>>>> Then, it worked.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> cat -n orte/util/hostfile/hostfile.c:
> > >>>>>>>>>>>>>> ...
> > >>>>>>>>>>>>>>    394      if (got_count) {
> > >>>>>>>>>>>>>>    395          node->slots_given = true;
> > >>>>>>>>>>>>>>    396      } else if (got_max) {
> > >>>>>>>>>>>>>>    397          node->slots = node->slots_max;
> > >>>>>>>>>>>>>>    398          node->slots_given = true;
> > >>>>>>>>>>>>>>    399      } else {
> > >>>>>>>>>>>>>>    400          /* should be set by obj_new, but just to be clear */
> > >>>>>>>>>>>>>>    401          node->slots_given = false;
> > >>>>>>>>>>>>>>    402          ++node->slots;   /* added by tmishima */
> > >>>>>>>>>>>>>>    403      }
> > >>>>>>>>>>>>>> ...
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Please fix the problem properly, because this is just based
> > >>>>>>>>>>>>>> on my random guess. It's related to the treatment of a
> > >>>>>>>>>>>>>> hostfile where slots information is not given.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>> Tetsuya Mishima

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users