Hi,
I used interactive mode just because it was easy to report the behavior. I'm sure
that submitting a batch job gives the same result. Therefore, I think the Torque
environment variables are also set in an interactive session. Anyway, I'm away
from the cluster now. Regarding "$ env | grep PBS", I'll send the actual output later.
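
For what it's worth, I would expect the usual Torque variables to show up there.
The listing below is only a sketch from memory, not real output from the cluster
(the exact set of PBS_* names also depends on the Torque version), and the values
are just my guess based on the job shown further down in this thread:

$ env | grep PBS
PBS_ENVIRONMENT=PBS_INTERACTIVE
PBS_JOBID=8116.manage.cluster
PBS_NODEFILE=...        (file listing the 32 slots of the allocation)
PBS_NUM_NODES=1
PBS_NUM_PPN=32
PBS_O_WORKDIR=...
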
Regards,
Tetsuya Mishima


> Hi,
>
> On 26.11.2013 at 01:22, tmish...@jcity.maeda.co.jp wrote:
>
> > Thank you very much for your quick response.
> >
> > I'm afraid to say that I found one more issue...
> >
> > It's not so serious. Please check it when you have time.
> >
> > The problem is cpus-per-proc with the -map-by option under the Torque manager.
> > It doesn't work as shown below. I guess you would get the same
> > behaviour under the Slurm manager.
> >
> > Of course, if I remove the -map-by option, it works quite well.
> >
> > [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
> > qsub: waiting for job 8116.manage.cluster to start
> > qsub: job 8116.manage.cluster ready
>
> Are the environment variables of Torque also set in an interactive session? What is the output of:
>
> $ env | grep PBS
>
> inside such a session.
>
> -- Reuti
>
> > [mishima@node03 ~]$ cd ~/Ducom/testbed2
> > [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
> > --------------------------------------------------------------------------
> > A request was made to bind to that would result in binding more
> > processes than cpus on a resource:
> >
> >    Bind to:     CORE
> >    Node:        node03
> >    #processes:  2
> >    #cpus:       1
> >
> > You can override this protection by adding the "overload-allowed"
> > option to your binding directive.
> > --------------------------------------------------------------------------
> >
> > [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
> > [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >
> > Regards,
> > Tetsuya Mishima
> >
> >> Fixed and scheduled to move to 1.7.4. Thanks again!
> >>
> >> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>
> >> Thanks! That's precisely where I was going to look when I had time :-)
> >>
> >> I'll update tomorrow.
> >> Ralph
> >>
> >> On Sun, Nov 17, 2013 at 7:01 PM, <tmish...@jcity.maeda.co.jp> wrote:
> >>
> >> Hi Ralph,
> >>
> >> This is the continuation of "Segmentation fault in oob_tcp.c of
> >> openmpi-1.7.4a1r29646".
> >>
> >> I found the cause.
> >>
> >> Firstly, I noticed that your hostfile works and mine does not.
> >>
> >> Your hostfile:
> >> cat hosts
> >> bend001 slots=12
> >>
> >> My hostfile:
> >> cat hosts
> >> node08
> >> node08
> >> ...(total 8 lines)
> >>
> >> I modified my script file to add "slots=1" to each line of my hostfile
> >> just before launching mpirun. Then it worked.
> >>
> >> My hostfile (modified):
> >> cat hosts
> >> node08 slots=1
> >> node08 slots=1
> >> ...(total 8 lines)
> >>
> >> Secondly, I confirmed that there is a slight difference between
> >> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
> >>
> >> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
> >> 394,401c394,399
> >> <         if (got_count) {
> >> <             node->slots_given = true;
> >> <         } else if (got_max) {
> >> <             node->slots = node->slots_max;
> >> <             node->slots_given = true;
> >> <         } else {
> >> <             /* should be set by obj_new, but just to be clear */
> >> <             node->slots_given = false;
> >> ---
> >> >         if (!got_count) {
> >> >             if (got_max) {
> >> >                 node->slots = node->slots_max;
> >> >             } else {
> >> >                 ++node->slots;
> >> >             }
> >> ....
> >>
> >> Finally, I added line 402 below just as a tentative trial.
> >> Then it worked.
> >>
> >> cat -n orte/util/hostfile/hostfile.c:
> >> ...
> >>    394          if (got_count) {
> >>    395              node->slots_given = true;
> >>    396          } else if (got_max) {
> >>    397              node->slots = node->slots_max;
> >>    398              node->slots_given = true;
> >>    399          } else {
> >>    400              /* should be set by obj_new, but just to be clear */
> >>    401              node->slots_given = false;
> >>    402              ++node->slots;    /* added by tmishima */
> >>    403          }
> >> ...
> >>
> >> Please fix the problem properly, because my change is just based on a
> >> random guess. It is related to the treatment of a hostfile where slot
> >> information is not given.
> >>
> >> Regards,
> >> Tetsuya Mishima
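
P.S. In case anyone else hits the hostfile problem quoted above before it is
fixed in a release: the workaround described there (appending "slots=1" to every
hostfile line before calling mpirun) can be scripted, for example as below. This
is only a sketch; it assumes each line of the hostfile holds nothing but a
hostname, and the program name is just a placeholder:

sed -e 's/$/ slots=1/' hosts > hosts.tmp && mv hosts.tmp hosts
mpirun -np 8 -hostfile hosts ./a.out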