Hmmm...okay, I understand the scenario. Must be something in the algo when it only has one node, so it shouldn't be too hard to track down.
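The request itself should fit exactly: 8 procs x 4 cpus-per-proc = 32 cpus against 4 sockets x 8 cores = 32 cores, so the overload message reporting "#cpus: 1" suggests the mapper is miscounting the available resources in the single-node case. A quick standalone sketch of that arithmetic (illustrative only, not the actual ORTE mapper code; topology and request taken from the reports below):

    #include <stdio.h>

    /* Illustrative arithmetic only; not Open MPI source.
     * node03: 4 sockets x 8 cores; request: mpirun -np 8 -cpus-per-proc 4 -map-by socket */
    int main(void)
    {
        const int sockets = 4, cores_per_socket = 8;
        const int nprocs = 8, cpus_per_proc = 4;

        const int cpus_available   = sockets * cores_per_socket;        /* 32 */
        const int cpus_requested   = nprocs * cpus_per_proc;            /* 32 */
        const int procs_per_socket = cores_per_socket / cpus_per_proc;  /*  2 */

        printf("available=%d requested=%d procs/socket=%d overload=%s\n",
               cpus_available, cpus_requested, procs_per_socket,
               cpus_requested > cpus_available ? "yes" : "no");
        return 0;
    }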
I'm off on travel for a few days, but will return to this when I get back. Sorry for delay - will try to look at this while I'm gone, but can't promise anything :-( On Dec 10, 2013, at 6:58 PM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, sorry for confusing. > > We usually logon to "manage", which is our control node. > From manage, we submit job or enter a remote node such as > node03 by torque interactive mode(qsub -I). > > At that time, instead of torque, I just did rsh to node03 from manage > and ran myprog on the node. I hope you could understand what I did. > > Now, I retried with "-host node03", which still causes the problem: > (I comfirmed local run on manage caused the same problem too) > > [mishima@manage ~]$ rsh node03 > Last login: Wed Dec 11 11:38:57 from manage > [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/ > [mishima@node03 demos]$ > [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings > -cpus-per-proc 4 -map-by socket myprog > -------------------------------------------------------------------------- > A request was made to bind to that would result in binding more > processes than cpus on a resource: > > Bind to: CORE > Node: node03 > #processes: 2 > #cpus: 1 > > You can override this protection by adding the "overload-allowed" > option to your binding directive. > -------------------------------------------------------------------------- > > It' strange, but I have to report that "-map-by socket:span" worked well. > > [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings > -cpus-per-proc 4 -map-by socket:span myprog > [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket > 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s > ocket 1[core 11[hwt 0]]: > [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] > [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket > 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], > socket 1[core 15[hwt 0]]: > [./././././././.][././././B/B/B/B][./././././././.][./././././././.] > [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket > 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], > socket 2[core 19[hwt 0]]: > [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] > [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket > 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], > socket 2[core 23[hwt 0]]: > [./././././././.][./././././././.][././././B/B/B/B][./././././././.] > [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket > 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], > socket 3[core 27[hwt 0]]: > [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] > [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket > 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], > socket 3[core 31[hwt 0]]: > [./././././././.][./././././././.][./././././././.][././././B/B/B/B] > [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket > 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so > cket 0[core 3[hwt 0]]: > [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] > [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket > 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so > cket 0[core 7[hwt 0]]: > [././././B/B/B/B][./././././././.][./././././././.][./././././././.] 
> Hello world from process 2 of 8 > Hello world from process 6 of 8 > Hello world from process 3 of 8 > Hello world from process 7 of 8 > Hello world from process 1 of 8 > Hello world from process 5 of 8 > Hello world from process 0 of 8 > Hello world from process 4 of 8 > > Regards, > Tetsuya Mishima > > >> On Dec 10, 2013, at 6:05 PM, tmish...@jcity.maeda.co.jp wrote: >> >>> >>> >>> Hi Ralph, >>> >>> I tried again with -cpus-per-proc 2 as shown below. >>> Here, I found that "-map-by socket:span" worked well. >>> >>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 >>> -map-by socket:span myprog >>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt 0]], > socket >>> 1[core 9[hwt 0]]: [./././././././.][B/B/././. >>> /././.][./././././././.][./././././././.] >>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt 0]], > socket >>> 1[core 11[hwt 0]]: [./././././././.][././B/B >>> /./././.][./././././././.][./././././././.] >>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt 0]], > socket >>> 2[core 17[hwt 0]]: [./././././././.][./././. >>> /./././.][B/B/./././././.][./././././././.] >>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt 0]], > socket >>> 2[core 19[hwt 0]]: [./././././././.][./././. >>> /./././.][././B/B/./././.][./././././././.] >>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt 0]], > socket >>> 3[core 25[hwt 0]]: [./././././././.][./././. >>> /./././.][./././././././.][B/B/./././././.] >>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt 0]], > socket >>> 3[core 27[hwt 0]]: [./././././././.][./././. >>> /./././.][./././././././.][././B/B/./././.] >>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt 0]], > socket >>> 0[core 1[hwt 0]]: [B/B/./././././.][././././. >>> /././.][./././././././.][./././././././.] >>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt 0]], > socket >>> 0[core 3[hwt 0]]: [././B/B/./././.][././././. >>> /././.][./././././././.][./././././././.] >>> Hello world from process 1 of 8 >>> Hello world from process 0 of 8 >>> Hello world from process 4 of 8 >>> Hello world from process 2 of 8 >>> Hello world from process 7 of 8 >>> Hello world from process 6 of 8 >>> Hello world from process 5 of 8 >>> Hello world from process 3 of 8 >>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 >>> -map-by socket myprog >>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt 0]], > socket >>> 0[core 5[hwt 0]]: [././././B/B/./.][././././. >>> /././.][./././././././.][./././././././.] >>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt 0]], > socket >>> 0[core 7[hwt 0]]: [././././././B/B][././././. >>> /././.][./././././././.][./././././././.] >>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt 0]], > socket >>> 1[core 9[hwt 0]]: [./././././././.][B/B/././. >>> /././.][./././././././.][./././././././.] >>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt 0]], > socket >>> 1[core 11[hwt 0]]: [./././././././.][././B/B >>> /./././.][./././././././.][./././././././.] >>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt 0]], > socket >>> 1[core 13[hwt 0]]: [./././././././.][./././. >>> /B/B/./.][./././././././.][./././././././.] >>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt 0]], > socket >>> 1[core 15[hwt 0]]: [./././././././.][./././. >>> /././B/B][./././././././.][./././././././.] 
>>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt 0]], > socket >>> 0[core 1[hwt 0]]: [B/B/./././././.][././././. >>> /././.][./././././././.][./././././././.] >>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt 0]], > socket >>> 0[core 3[hwt 0]]: [././B/B/./././.][././././. >>> /././.][./././././././.][./././././././.] >>> Hello world from process 5 of 8 >>> Hello world from process 1 of 8 >>> Hello world from process 6 of 8 >>> Hello world from process 4 of 8 >>> Hello world from process 2 of 8 >>> Hello world from process 0 of 8 >>> Hello world from process 7 of 8 >>> Hello world from process 3 of 8 >>> >>> "-np 8" and "-cpus-per-proc 4" just filled all sockets. >>> In this case, I guess "-map-by socket:span" and "-map-by socket" has > same >>> meaning. >>> Therefore, there's no problem about that. Sorry for distubing. >> >> No problem - glad you could clear that up :-) >> >>> >>> By the way, through this test, I found another problem. >>> Without torque manager and just using rsh, it causes the same error > like >>> below: >>> >>> [mishima@manage openmpi-1.7]$ rsh node03 >>> Last login: Wed Dec 11 09:42:02 from manage >>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 >>> -map-by socket myprog >> >> I don't understand the difference here - you are simply starting it from > a different node? It looks like everything is expected to run local to > mpirun, yes? So there is no rsh actually involved here. >> Are you still running in an allocation? >> >> If you run this with "-host node03" on the cmd line, do you see the same > problem? >> >> >>> > -------------------------------------------------------------------------- >>> A request was made to bind to that would result in binding more >>> processes than cpus on a resource: >>> >>> Bind to: CORE >>> Node: node03 >>> #processes: 2 >>> #cpus: 1 >>> >>> You can override this protection by adding the "overload-allowed" >>> option to your binding directive. >>> > -------------------------------------------------------------------------- >>> [mishima@node03 demos]$ >>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 >>> myprog >>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt 0]], > socket >>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s >>> ocket 1[core 11[hwt 0]]: >>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt 0]], > socket >>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>> socket 1[core 15[hwt 0]]: >>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] >>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt 0]], > socket >>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>> socket 2[core 19[hwt 0]]: >>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt 0]], > socket >>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>> socket 2[core 23[hwt 0]]: >>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt 0]], > socket >>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>> socket 3[core 27[hwt 0]]: >>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] 
>>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt 0]], > socket >>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>> socket 3[core 31[hwt 0]]: >>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt 0]], > socket >>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>> cket 0[core 3[hwt 0]]: >>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt 0]], > socket >>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>> cket 0[core 7[hwt 0]]: >>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] >>> Hello world from process 4 of 8 >>> Hello world from process 2 of 8 >>> Hello world from process 6 of 8 >>> Hello world from process 5 of 8 >>> Hello world from process 3 of 8 >>> Hello world from process 7 of 8 >>> Hello world from process 0 of 8 >>> Hello world from process 1 of 8 >>> >>> Regards, >>> Tetsuya Mishima >>> >>>> Hmmm...that's strange. I only have 2 sockets on my system, but let me >>> poke around a bit and see what might be happening. >>>> >>>> On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote: >>>> >>>>> >>>>> >>>>> Hi Ralph, >>>>> >>>>> Thanks. I didn't know the meaning of "socket:span". >>>>> >>>>> But it still causes the problem, which seems socket:span doesn't > work. >>>>> >>>>> [mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32 >>>>> qsub: waiting for job 8265.manage.cluster to start >>>>> qsub: job 8265.manage.cluster ready >>>>> >>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc > 4 >>>>> -map-by socket:span myprog >>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8[hwt 0]], >>> socket >>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s >>>>> ocket 1[core 11[hwt 0]]: >>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12[hwt 0]], >>> socket >>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>>>> socket 1[core 15[hwt 0]]: >>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] >>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16[hwt 0]], >>> socket >>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>> socket 2[core 19[hwt 0]]: >>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20[hwt 0]], >>> socket >>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>> socket 2[core 23[hwt 0]]: >>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24[hwt 0]], >>> socket >>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>> socket 3[core 27[hwt 0]]: >>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28[hwt 0]], >>> socket >>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>> socket 3[core 31[hwt 0]]: >>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0[hwt 0]], >>> socket >>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>> cket 0[core 3[hwt 0]]: >>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] 
>>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4[hwt 0]], >>> socket >>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>>>> cket 0[core 7[hwt 0]]: >>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] >>>>> Hello world from process 0 of 8 >>>>> Hello world from process 3 of 8 >>>>> Hello world from process 1 of 8 >>>>> Hello world from process 4 of 8 >>>>> Hello world from process 6 of 8 >>>>> Hello world from process 5 of 8 >>>>> Hello world from process 2 of 8 >>>>> Hello world from process 7 of 8 >>>>> >>>>> Regards, >>>>> Tetsuya Mishima >>>>> >>>>>> No, that is actually correct. We map a socket until full, then move > to >>>>> the next. What you want is --map-by socket:span >>>>>> >>>>>> On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> Hi Ralph, >>>>>>> >>>>>>> I had a time to try your patch yesterday using > openmpi-1.7.4a1r29646. >>>>>>> >>>>>>> It stopped the error but unfortunately "mapping by socket" itself >>>>> didn't >>>>>>> work >>>>>>> well as shown bellow: >>>>>>> >>>>>>> [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32 >>>>>>> qsub: waiting for job 8260.manage.cluster to start >>>>>>> qsub: job 8260.manage.cluster ready >>>>>>> >>>>>>> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/ >>>>>>> [mishima@node04 demos]$ mpirun -np 8 -report-bindings > -cpus-per-proc >>> 4 >>>>>>> -map-by socket myprog >>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt 0]], >>>>> socket >>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s >>>>>>> ocket 1[core 11[hwt 0]]: >>>>>>> > [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt > 0]], >>>>> socket >>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>>>>>> socket 1[core 15[hwt 0]]: >>>>>>> > [./././././././.][././././B/B/B/B][./././././././.][./././././././.] >>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt > 0]], >>>>> socket >>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>>>> socket 2[core 19[hwt 0]]: >>>>>>> > [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt > 0]], >>>>> socket >>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>>>> socket 2[core 23[hwt 0]]: >>>>>>> > [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt > 0]], >>>>> socket >>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>>>> socket 3[core 27[hwt 0]]: >>>>>>> > [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt > 0]], >>>>> socket >>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>>>> socket 3[core 31[hwt 0]]: >>>>>>> > [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt 0]], >>>>> socket >>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>>> cket 0[core 3[hwt 0]]: >>>>>>> > [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt 0]], >>>>> socket >>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>>>>>> cket 0[core 7[hwt 0]]: >>>>>>> > [././././B/B/B/B][./././././././.][./././././././.][./././././././.] 
>>>>>>> Hello world from process 2 of 8 >>>>>>> Hello world from process 1 of 8 >>>>>>> Hello world from process 3 of 8 >>>>>>> Hello world from process 0 of 8 >>>>>>> Hello world from process 6 of 8 >>>>>>> Hello world from process 5 of 8 >>>>>>> Hello world from process 4 of 8 >>>>>>> Hello world from process 7 of 8 >>>>>>> >>>>>>> I think this should be like this: >>>>>>> >>>>>>> rank 00 >>>>>>> > [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>> rank 01 >>>>>>> > [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>> rank 02 >>>>>>> > [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] >>>>>>> ... >>>>>>> >>>>>>> Regards, >>>>>>> Tetsuya Mishima >>>>>>> >>>>>>>> I fixed this under the trunk (was an issue regardless of RM) and >>> have >>>>>>> scheduled it for 1.7.4. >>>>>>>> >>>>>>>> Thanks! >>>>>>>> Ralph >>>>>>>> >>>>>>>> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Hi Ralph, >>>>>>>>> >>>>>>>>> Thank you very much for your quick response. >>>>>>>>> >>>>>>>>> I'm afraid to say that I found one more issuse... >>>>>>>>> >>>>>>>>> It's not so serious. Please check it when you have a lot of time. >>>>>>>>> >>>>>>>>> The problem is cpus-per-proc with -map-by option under Torque >>>>> manager. >>>>>>>>> It doesn't work as shown below. I guess you can get the same >>>>>>>>> behaviour under Slurm manager. >>>>>>>>> >>>>>>>>> Of course, if I remove -map-by option, it works quite well. >>>>>>>>> >>>>>>>>> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32 >>>>>>>>> qsub: waiting for job 8116.manage.cluster to start >>>>>>>>> qsub: job 8116.manage.cluster ready >>>>>>>>> >>>>>>>>> [mishima@node03 ~]$ cd ~/Ducom/testbed2 >>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings >>>>> -cpus-per-proc >>>>>>> 4 >>>>>>>>> -map-by socket mPre >>>>>>>>> >>>>>>> >>>>> >>> > -------------------------------------------------------------------------- >>>>>>>>> A request was made to bind to that would result in binding more >>>>>>>>> processes than cpus on a resource: >>>>>>>>> >>>>>>>>> Bind to: CORE >>>>>>>>> Node: node03>>>>>>> #processes: 2 >>>>>>>>> #cpus: 1 >>>>>>>>> >>>>>>>>> You can override this protection by adding the "overload-allowed" >>>>>>>>> option to your binding directive. >>>>>>>>> >>>>>>> >>>>> >>> > -------------------------------------------------------------------------- >>>>>>>>> >>>>>>>>> >>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings >>>>> -cpus-per-proc >>>>>>> 4 >>>>>>>>> mPre >>>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt > 0]], >>>>>>> socket >>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s >>>>>>>>> ocket 1[core 11[hwt 0]]: >>>>>>>>> >>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] >>>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt >>> 0]], >>>>>>> socket >>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], >>>>>>>>> socket 1[core 15[hwt 0]]: >>>>>>>>> >>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] >>>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt >>> 0]], >>>>>>> socket >>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], >>>>>>>>> socket 2[core 19[hwt 0]]: >>>>>>>>> >>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] 
>>>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt >>> 0]], >>>>>>> socket >>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], >>>>>>>>> socket 2[core 23[hwt 0]]: >>>>>>>>> >>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] >>>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt >>> 0]], >>>>>>> socket >>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], >>>>>>>>> socket 3[core 27[hwt 0]]: >>>>>>>>> >>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] >>>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt >>> 0]], >>>>>>> socket >>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], >>>>>>>>> socket 3[core 31[hwt 0]]: >>>>>>>>> >>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] >>>>>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt > 0]], >>>>>>> socket >>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so >>>>>>>>> cket 0[core 3[hwt 0]]: >>>>>>>>> >>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] >>>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt > 0]], >>>>>>> socket >>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so >>>>>>>>> cket 0[core 7[hwt 0]]: >>>>>>>>> >>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Tetsuya Mishima >>>>>>>>> >>>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again! >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> >>> wrote: >>>>>>>>>> >>>>>>>>>> Thanks! That's precisely where I was going to look when I had >>>>> time :-) >>>>>>>>>> >>>>>>>>>> I'll update tomorrow. >>>>>>>>>> Ralph >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM, >>> <tmish...@jcity.maeda.co.jp>wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Hi Ralph, >>>>>>>>>> >>>>>>>>>> This is the continuous story of "Segmentation fault in oob_tcp.c >>> of >>>>>>>>>> openmpi-1.7.4a1r29646". >>>>>>>>>> >>>>>>>>>> I found the cause. >>>>>>>>>> >>>>>>>>>> Firstly, I noticed that your hostfile can work and mine can not. >>>>>>>>>> >>>>>>>>>> Your host file: >>>>>>>>>> cat hosts >>>>>>>>>> bend001 slots=12 >>>>>>>>>> >>>>>>>>>> My host file: >>>>>>>>>> cat hosts >>>>>>>>>> node08 >>>>>>>>>> node08 >>>>>>>>>> ...(total 8 lines) >>>>>>>>>> >>>>>>>>>> I modified my script file to add "slots=1" to each line of my >>>>> hostfile >>>>>>>>>> just before launching mpirun. Then it worked. >>>>>>>>>> >>>>>>>>>> My host file(modified): >>>>>>>>>> cat hosts >>>>>>>>>> node08 slots=1 >>>>>>>>>> node08 slots=1 >>>>>>>>>> ...(total 8 lines) >>>>>>>>>> >>>>>>>>>> Secondary, I confirmed that there's a slight difference between >>>>>>>>>> orte/util/hostfile/hostfile.c of 1.7.3 and that of > 1.7.4a1r29646. >>>>>>>>>> >>>>>>>>>> $ diff >>>>>>>>>> >>>>> > hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c >>>>>>>>>> 394,401c394,399 >>>>>>>>>> < if (got_count) { >>>>>>>>>> < node->slots_given = true; >>>>>>>>>> < } else if (got_max) { >>>>>>>>>> < node->slots = node->slots_max; >>>>>>>>>> < node->slots_given = true; >>>>>>>>>> < } else { >>>>>>>>>> < /* should be set by obj_new, but just to be clear */ >>>>>>>>>> < node->slots_given = false; >>>>>>>>>> --- >>>>>>>>>>> if (!got_count) { >>>>>>>>>>> if (got_max) { >>>>>>>>>>> node->slots = node->slots_max; >>>>>>>>>>> } else { >>>>>>>>>>> ++node->slots; >>>>>>>>>>> } >>>>>>>>>> .... 
>>>>>>>>>> >>>>>>>>>> Finally, I added the line 402 below just as a tentative trial. >>>>>>>>>> Then, it worked. >>>>>>>>>> >>>>>>>>>> cat -n orte/util/hostfile/hostfile.c: >>>>>>>>>> ... >>>>>>>>>> 394 if (got_count) { >>>>>>>>>> 395 node->slots_given = true; >>>>>>>>>> 396 } else if (got_max) { >>>>>>>>>> 397 node->slots = node->slots_max; >>>>>>>>>> 398 node->slots_given = true; >>>>>>>>>> 399 } else { >>>>>>>>>> 400 /* should be set by obj_new, but just to be clear */ >>>>>>>>>> 401 node->slots_given = false; >>>>>>>>>> 402 ++node->slots; /* added by tmishima */ >>>>>>>>>> 403 } >>>>>>>>>> ... >>>>>>>>>> >>>>>>>>>> Please fix the problem properly, because it's just based on my >>>>>>>>>> random guess. It's related to the treatment of hostfile where slots >>>>>>>>>> information is not given. >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> Tetsuya Mishima
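For reference, here is the slot-counting logic from the diff quoted above, condensed into a standalone sketch. The struct and function names are made up for illustration; only the if/else bodies come from the quoted orte/util/hostfile/hostfile.c code:

    #include <stdio.h>
    #include <stdbool.h>

    struct node_t {
        int  slots;
        int  slots_max;
        bool slots_given;
    };

    /* 1.7.3 logic: a hostfile line without "slots=N" still adds one slot,
     * so listing the same node on eight bare lines yields slots == 8. */
    static void set_slots_1_7_3(struct node_t *node, bool got_count, bool got_max)
    {
        if (!got_count) {
            if (got_max) {
                node->slots = node->slots_max;
            } else {
                ++node->slots;
            }
        }
    }

    /* 1.7.4a1r29646 logic: with no count given, slots_given stays false and
     * slots is never incremented. */
    static void set_slots_1_7_4(struct node_t *node, bool got_count, bool got_max)
    {
        if (got_count) {
            node->slots_given = true;
        } else if (got_max) {
            node->slots = node->slots_max;
            node->slots_given = true;
        } else {
            /* should be set by obj_new, but just to be clear */
            node->slots_given = false;
            /* tentative fix from the thread: ++node->slots; */
        }
    }

    int main(void)
    {
        struct node_t a = {0, 0, false};
        struct node_t b = {0, 0, false};
        int i;

        /* simulate a hostfile that lists the same node on 8 bare lines */
        for (i = 0; i < 8; i++) {
            set_slots_1_7_3(&a, false, false);
            set_slots_1_7_4(&b, false, false);
        }
        printf("1.7.3 slots = %d, 1.7.4a1r29646 slots = %d\n", a.slots, b.slots);
        return 0;
    }

Running the sketch prints 8 slots for the 1.7.3 logic and 0 for the newer one, which matches the observation above that a hostfile repeating "node08" on eight bare lines only works again once "slots=1" is added to each line (or once something like the tentative ++node->slots is restored).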