Hmmm...that's strange. I only have 2 sockets on my system, but let me poke around a bit and see what might be happening.

On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph,
>
> Thanks. I didn't know the meaning of "socket:span".
> But it still causes the problem; it seems socket:span doesn't work:
>
> [mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32
> qsub: waiting for job 8265.manage.cluster to start
> qsub: job 8265.manage.cluster ready
>
> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:
> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:
> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]:
> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]:
> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]:
> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]:
> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> Hello world from process 0 of 8
> Hello world from process 3 of 8
> Hello world from process 1 of 8
> Hello world from process 4 of 8
> Hello world from process 6 of 8
> Hello world from process 5 of 8
> Hello world from process 2 of 8
> Hello world from process 7 of 8
>
> Regards,
> Tetsuya Mishima
>
>> No, that is actually correct. We map a socket until full, then move to the next. What you want is --map-by socket:span
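
To make the intended difference concrete, here is a tiny standalone sketch -- illustrative arithmetic only, not our actual mapper code -- of where the two directives should put things given the geometry in the runs above (np=8, -cpus-per-proc 4, four sockets of 8 cores; the file name map_sketch.c is just a placeholder):

/* map_sketch.c -- illustrative only, not Open MPI's mapper code.
 * Intended placements for: mpirun -np 8 -cpus-per-proc 4 ...
 * on one node with four sockets of 8 cores each.                 */
#include <stdio.h>

int main(void)
{
    const int np = 8, cpus = 4, sockets = 4, cores = 8;
    int r;

    printf("-map-by socket (fill one socket, then move on):\n");
    for (r = 0; r < np; r++) {
        int first = r * cpus;              /* next free core on the node */
        printf("  rank %d -> socket %d, local cores %d-%d\n",
               r, first / cores, first % cores,
               first % cores + cpus - 1);
    }

    printf("-map-by socket:span (round-robin across sockets):\n");
    for (r = 0; r < np; r++) {
        int s = r % sockets;               /* rotate over the sockets    */
        int off = (r / sockets) * cpus;    /* next free cores in socket  */
        printf("  rank %d -> socket %d, local cores %d-%d\n",
               r, s, off, off + cpus - 1);
    }
    return 0;
}

By that arithmetic, both runs shown in this thread are producing the "fill" pattern, which is why socket:span looks broken.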
>>
>> On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp wrote:
>>
>>> Hi Ralph,
>>>
>>> I had time to try your patch yesterday using openmpi-1.7.4a1r29646.
>>> It stopped the error, but unfortunately "mapping by socket" itself didn't work well, as shown below:
>>>
>>> [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32
>>> qsub: waiting for job 8260.manage.cluster to start
>>> qsub: job 8260.manage.cluster ready
>>>
>>> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>> [mishima@node04 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:
>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:
>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]:
>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]:
>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]:
>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]:
>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>> Hello world from process 2 of 8
>>> Hello world from process 1 of 8
>>> Hello world from process 3 of 8
>>> Hello world from process 0 of 8
>>> Hello world from process 6 of 8
>>> Hello world from process 5 of 8
>>> Hello world from process 4 of 8
>>> Hello world from process 7 of 8
>>>
>>> I think this should be like this:
>>>
>>> rank 00
>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>> rank 01
>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>> rank 02
>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>> ...
>>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>>> I fixed this under the trunk (it was an issue regardless of the RM) and have scheduled it for 1.7.4.
>>>>
>>>> Thanks!
>>>> Ralph
>>>>
>>>> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>
>>>>> Hi Ralph,
>>>>>
>>>>> Thank you very much for your quick response.
>>>>>
>>>>> I'm afraid to say that I found one more issue...
>>>>> It's not so serious. Please check it when you have time.
>>>>>
>>>>> The problem is cpus-per-proc with the -map-by option under the Torque manager. It doesn't work, as shown below.
>>>>> I guess you can get the same behaviour under the Slurm manager.
>>>>>
>>>>> Of course, if I remove the -map-by option, it works quite well.
>>>>>
>>>>> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
>>>>> qsub: waiting for job 8116.manage.cluster to start
>>>>> qsub: job 8116.manage.cluster ready
>>>>>
>>>>> [mishima@node03 ~]$ cd ~/Ducom/testbed2
>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
>>>>> --------------------------------------------------------------------------
>>>>> A request was made to bind to that would result in binding more
>>>>> processes than cpus on a resource:
>>>>>
>>>>>    Bind to:     CORE
>>>>>    Node:        node03
>>>>>    #processes:  2
>>>>>    #cpus:       1
>>>>>
>>>>> You can override this protection by adding the "overload-allowed"
>>>>> option to your binding directive.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:
>>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:
>>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]:
>>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]:
>>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]:
>>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]:
>>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
>>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
>>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>
>>>>> Regards,
>>>>> Tetsuya Mishima
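
A quick aside for anyone hitting the overload warning above while this is still unfixed: as the message itself says, the protection can be overridden by adding the "overload-allowed" qualifier to the binding directive. The invocation should look something like the following (untested here, and note it only silences the check rather than fixing the bad mapping, so the resulting bindings may still be wrong):

mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket -bind-to core:overload-allowed mPre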
>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again!
>>>>>>
>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>> Thanks! That's precisely where I was going to look when I had time :-)
>>>>>>
>>>>>> I'll update tomorrow.
>>>>>> Ralph
>>>>>>
>>>>>> On Sun, Nov 17, 2013 at 7:01 PM, <tmish...@jcity.maeda.co.jp> wrote:
>>>>>>
>>>>>> Hi Ralph,
>>>>>>
>>>>>> This is the continuation of "Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646".
>>>>>>
>>>>>> I found the cause.
>>>>>>
>>>>>> First, I noticed that your hostfile works and mine does not.
>>>>>>
>>>>>> Your hostfile:
>>>>>> cat hosts
>>>>>> bend001 slots=12
>>>>>>
>>>>>> My hostfile:
>>>>>> cat hosts
>>>>>> node08
>>>>>> node08
>>>>>> ...(total 8 lines)
>>>>>>
>>>>>> I modified my script to add "slots=1" to each line of my hostfile just before launching mpirun. Then it worked.
>>>>>>
>>>>>> My hostfile (modified):
>>>>>> cat hosts
>>>>>> node08 slots=1
>>>>>> node08 slots=1
>>>>>> ...(total 8 lines)
>>>>>>
>>>>>> Secondly, I confirmed that there's a slight difference between orte/util/hostfile/hostfile.c in 1.7.3 and in 1.7.4a1r29646:
>>>>>>
>>>>>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
>>>>>> 394,401c394,399
>>>>>> <     if (got_count) {
>>>>>> <         node->slots_given = true;
>>>>>> <     } else if (got_max) {
>>>>>> <         node->slots = node->slots_max;
>>>>>> <         node->slots_given = true;
>>>>>> <     } else {
>>>>>> <         /* should be set by obj_new, but just to be clear */
>>>>>> <         node->slots_given = false;
>>>>>> ---
>>>>>> >     if (!got_count) {
>>>>>> >         if (got_max) {
>>>>>> >             node->slots = node->slots_max;
>>>>>> >         } else {
>>>>>> >             ++node->slots;
>>>>>> >         }
>>>>>> ....
>>>>>>
>>>>>> Finally, I added line 402 below as a tentative trial. Then it worked.
>>>>>>
>>>>>> cat -n orte/util/hostfile/hostfile.c:
>>>>>> ...
>>>>>> 394      if (got_count) {
>>>>>> 395          node->slots_given = true;
>>>>>> 396      } else if (got_max) {
>>>>>> 397          node->slots = node->slots_max;
>>>>>> 398          node->slots_given = true;
>>>>>> 399      } else {
>>>>>> 400          /* should be set by obj_new, but just to be clear */
>>>>>> 401          node->slots_given = false;
>>>>>> 402          ++node->slots; /* added by tmishima */
>>>>>> 403      }
>>>>>> ...
>>>>>>
>>>>>> Please fix the problem properly, because my change is just based on a guess. It's related to the treatment of a hostfile in which slot counts are not given.
>>>>>>
>>>>>> Regards,
>>>>>> Tetsuya Mishima
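
A footnote on the slot-counting logic above, for readers of the archive: here is a minimal standalone sketch of the rule Tetsuya's tentative line 402 restores. The struct and helper below are hypothetical stand-ins, not ORTE's real types, but the branch mirrors the patched block at lines 394-403: an explicit count pins the slot count and sets slots_given, while a bare hostname line leaves slots_given false yet still contributes one slot, so eight bare "node08" lines end up with slots=8 -- the result Tetsuya otherwise had to force by writing slots=1 on every line.

/* slots_sketch.c -- hypothetical stand-in, not ORTE source.       */
#include <stdbool.h>
#include <stdio.h>

struct node { int slots; bool slots_given; };

/* Apply one hostfile line for this host. got_count/got_max mirror
 * the flags used in orte/util/hostfile/hostfile.c.                */
static void count_line(struct node *n, bool got_count, bool got_max,
                       int count, int max_count)
{
    if (got_count) {
        n->slots = count;            /* explicit "slots=N"          */
        n->slots_given = true;
    } else if (got_max) {
        n->slots = max_count;        /* explicit "max-slots=N"      */
        n->slots_given = true;
    } else {
        n->slots_given = false;      /* no explicit count given...  */
        ++n->slots;                  /* ...the line tmishima added  */
    }
}

int main(void)
{
    struct node n = { 0, false };
    int i;
    for (i = 0; i < 8; i++)          /* eight bare "node08" lines   */
        count_line(&n, false, false, 0, 0);
    printf("slots=%d slots_given=%d\n", n.slots, n.slots_given);
    return 0;
}

This matches the 1.7.3 behaviour shown in the diff, with the newer slots_given flag layered on top.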