I fixed this in the trunk (it was an issue regardless of the RM) and have scheduled it for 1.7.4.
Thanks!
Ralph

On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph,
>
> Thank you very much for your quick response.
>
> I'm afraid I have found one more issue...
> It's not so serious. Please check it when you have time.
>
> The problem is -cpus-per-proc combined with the -map-by option under the
> Torque manager. It doesn't work, as shown below. I guess you would get
> the same behaviour under the Slurm manager.
>
> Of course, if I remove the -map-by option, it works quite well.
>
> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
> qsub: waiting for job 8116.manage.cluster to start
> qsub: job 8116.manage.cluster ready
>
> [mishima@node03 ~]$ cd ~/Ducom/testbed2
> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
>    Bind to:     CORE
>    Node:        node03
>    #processes:  2
>    #cpus:       1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
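As a stopgap until the fix lands in 1.7.4, the protection can be overridden the way the message suggests, by attaching the "overload-allowed" qualifier to the binding directive. A minimal sketch (untested here; the qualifier spelling follows the help message above and may differ in this 1.7 snapshot, and it will deliberately pack more than one process per core):

    # override the overload protection; processes may then share cores
    mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket \
           -bind-to core:overload-allowed mPre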
> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]],
> socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:
> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]],
> socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:
> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]],
> socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]:
> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]],
> socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]:
> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]],
> socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]:
> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]],
> socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]:
> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]],
> socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]],
> socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>
> Regards,
> Tetsuya Mishima
>
>> Fixed and scheduled to move to 1.7.4. Thanks again!
>>
>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Thanks! That's precisely where I was going to look when I had time :-)
>>
>> I'll update tomorrow.
>>
>> Ralph
>>
>> On Sun, Nov 17, 2013 at 7:01 PM, <tmish...@jcity.maeda.co.jp> wrote:
>>
>> Hi Ralph,
>>
>> This is a continuation of "Segmentation fault in oob_tcp.c of
>> openmpi-1.7.4a1r29646".
>>
>> I found the cause.
>>
>> Firstly, I noticed that your hostfile works and mine does not.
>>
>> Your host file:
>> cat hosts
>> bend001 slots=12
>>
>> My host file:
>> cat hosts
>> node08
>> node08
>> ...(total 8 lines)
>>
>> I modified my script file to add "slots=1" to each line of my hostfile
>> just before launching mpirun. Then it worked.
>>
>> My host file (modified):
>> cat hosts
>> node08 slots=1
>> node08 slots=1
>> ...(total 8 lines)
>>
>> Secondly, I confirmed that there's a slight difference between
>> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
>>
>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
>> 394,401c394,399
>> <     if (got_count) {
>> <         node->slots_given = true;
>> <     } else if (got_max) {
>> <         node->slots = node->slots_max;
>> <         node->slots_given = true;
>> <     } else {
>> <         /* should be set by obj_new, but just to be clear */
>> <         node->slots_given = false;
>> ---
>> >     if (!got_count) {
>> >         if (got_max) {
>> >             node->slots = node->slots_max;
>> >         } else {
>> >             ++node->slots;
>> >         }
>> ....
>>
>> Finally, I added line 402 below, just as a tentative trial. Then it
>> worked.
>>
>> cat -n orte/util/hostfile/hostfile.c:
>> ...
>>    394          if (got_count) {
>>    395              node->slots_given = true;
>>    396          } else if (got_max) {
>>    397              node->slots = node->slots_max;
>>    398              node->slots_given = true;
>>    399          } else {
>>    400              /* should be set by obj_new, but just to be clear */
>>    401              node->slots_given = false;
>>    402              ++node->slots;  /* added by tmishima */
>>    403          }
>> ...
>>
>> Please fix the problem properly, because my change is just based on a
>> rough guess. It's related to the treatment of a hostfile where no slots
>> information is given.
>>
>> Regards,
>> Tetsuya Mishima
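For anyone hitting this before the fix: the workaround Tetsuya describes (giving every hostfile line an explicit slot count just before launching) is easy to script. A minimal sketch, assuming the hostfile is built from Torque's $PBS_NODEFILE; the file name, process count, and binary are illustrative:

    # append an explicit "slots=1" to every host entry, then launch
    sed 's/$/ slots=1/' "$PBS_NODEFILE" > hosts
    mpirun -np 8 -hostfile hosts mPre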