Hi,
Here is the output of "printenv | grep PBS". It seems that all the variables are set as I expected.

[mishima@manage mpi_demo]$ qsub -I -l nodes=1:ppn=32
qsub: waiting for job 8120.manage.cluster to start
qsub: job 8120.manage.cluster ready

[mishima@node03 ~]$ printenv | grep PBS
PBS_VERSION=TORQUE-2.3.6
PBS_JOBNAME=STDIN
PBS_ENVIRONMENT=PBS_INTERACTIVE
PBS_O_WORKDIR=/home/mishima/mis/openmpi/mpi_demo
PBS_TASKNUM=1
PBS_O_HOME=/home/mishima
PBS_MOMPORT=15003
PBS_O_QUEUE=default
PBS_O_LOGNAME=mishima
PBS_O_LANG=en_US.UTF-8
PBS_JOBCOOKIE=D2C01A2A13513BE20A1EC27B2B67FF5F
PBS_NODENUM=0
PBS_O_SHELL=/bin/bash
PBS_SERVER=manage.cluster
PBS_JOBID=8120.manage.cluster
PBS_O_HOST=manage.cluster
PBS_VNODENUM=0
PBS_QUEUE=default
PBS_O_MAIL=/var/spool/mail/mishima
PBS_NODEFILE=/var/spool/torque/aux//8120.manage.cluster
PBS_O_PATH=/opt/pgi/linux86-64/2013/bin:/opt/mpi/openmpi-pgi/bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/dell/srvadmin/bin:/home/mishima/bin

In addition, I confirmed that submitting a batch job gives the same result:

[mishima@manage mpi_demo]$ cat myscript.sh
#!/bin/sh
#PBS -l nodes=node03:ppn=32
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE
mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog

[mishima@manage mpi_demo]$ qsub myscript.sh
8119.manage.cluster

[mishima@manage mpi_demo]$ cat myscript.sh.o8119
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
node03
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node03
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
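As an additional data point, it may help to see what mpirun itself reports about the Torque allocation before the binding step fails. The commands below are only a sketch: "myprog" is the same placeholder program as in the script above, and -display-allocation / -display-map are mpirun's options for printing the detected allocation and the computed process map, which can then be compared against the 32 entries in $PBS_NODEFILE.

# How many slots Torque hands out per node (should show "32 node03" here)
sort $PBS_NODEFILE | uniq -c

# Print the allocation read from Torque and the computed map before binding
mpirun -np 8 -display-allocation -display-map \
       -report-bindings -cpus-per-proc 4 -map-by socket myprog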
Regards,
Tetsuya Mishima

> Hi,
>
> On 26.11.2013 at 01:22, tmish...@jcity.maeda.co.jp wrote:
>
> > Thank you very much for your quick response.
> >
> > I'm afraid to say that I found one more issue...
> >
> > It's not so serious. Please check it when you have time.
> >
> > The problem is cpus-per-proc with the -map-by option under the Torque
> > manager. It doesn't work as shown below. I guess you would get the same
> > behaviour under the Slurm manager.
> >
> > Of course, if I remove the -map-by option, it works quite well.
> >
> > [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
> > qsub: waiting for job 8116.manage.cluster to start
> > qsub: job 8116.manage.cluster ready
>
> Are the environment variables of Torque also set in an interactive
> session? What is the output of:
>
> $ env | grep PBS
>
> inside such a session.
>
> -- Reuti
>
> > [mishima@node03 ~]$ cd ~/Ducom/testbed2
> > [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
> > --------------------------------------------------------------------------
> > A request was made to bind to that would result in binding more
> > processes than cpus on a resource:
> >
> >    Bind to:     CORE
> >    Node:        node03
> >    #processes:  2
> >    #cpus:       1
> >
> > You can override this protection by adding the "overload-allowed"
> > option to your binding directive.
> > --------------------------------------------------------------------------
> >
> > [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
> > [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:
> >   [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:
> >   [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]:
> >   [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]:
> >   [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]:
> >   [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]:
> >   [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> >   [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
> >   [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >
> > Regards,
> > Tetsuya Mishima
> >
> >> Fixed and scheduled to move to 1.7.4. Thanks again!
> >>
> >> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>
> >> Thanks! That's precisely where I was going to look when I had time :-)
> >>
> >> I'll update tomorrow.
> >> Ralph
> >>
> >> On Sun, Nov 17, 2013 at 7:01 PM, <tmish...@jcity.maeda.co.jp> wrote:
> >>
> >> Hi Ralph,
> >>
> >> This is the continuation of "Segmentation fault in oob_tcp.c of
> >> openmpi-1.7.4a1r29646".
> >>
> >> I found the cause.
> >>
> >> Firstly, I noticed that your hostfile works and mine does not.
> >>
> >> Your hostfile:
> >> cat hosts
> >> bend001 slots=12
> >>
> >> My hostfile:
> >> cat hosts
> >> node08
> >> node08
> >> ...(total 8 lines)
> >>
> >> I modified my script to add "slots=1" to each line of my hostfile
> >> just before launching mpirun. Then it worked.
> >>
> >> My hostfile (modified):
> >> cat hosts
> >> node08 slots=1
> >> node08 slots=1
> >> ...(total 8 lines)
> >>
> >> Secondly, I confirmed that there is a slight difference between
> >> orte/util/hostfile/hostfile.c in 1.7.3 and in 1.7.4a1r29646.
> >>
> >> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
> >> 394,401c394,399
> >> <         if (got_count) {
> >> <             node->slots_given = true;
> >> <         } else if (got_max) {
> >> <             node->slots = node->slots_max;
> >> <             node->slots_given = true;
> >> <         } else {
> >> <             /* should be set by obj_new, but just to be clear */
> >> <             node->slots_given = false;
> >> ---
> >> >         if (!got_count) {
> >> >             if (got_max) {
> >> >                 node->slots = node->slots_max;
> >> >             } else {
> >> >                 ++node->slots;
> >> >             }
> >> ....
> >>
> >> Finally, I added line 402 below as a tentative trial. Then it worked.
> >>
> >> cat -n orte/util/hostfile/hostfile.c:
> >> ...
> >>    394          if (got_count) {
> >>    395              node->slots_given = true;
> >>    396          } else if (got_max) {
> >>    397              node->slots = node->slots_max;
> >>    398              node->slots_given = true;
> >>    399          } else {
> >>    400              /* should be set by obj_new, but just to be clear */
> >>    401              node->slots_given = false;
> >>    402              ++node->slots; /* added by tmishima */
> >>    403          }
> >> ...
> >>
> >> Please fix the problem properly, since my change is just based on a guess.
> >> It is related to the treatment of a hostfile in which no slots
> >> information is given.
> >>
> >> Regards,
> >> Tetsuya Mishima
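For anyone who wants to reproduce the hostfile behaviour outside of a batch system, a minimal sketch follows. The node name "node08" and "./myprog" are placeholders, and the expected behaviour is the one described in the quoted messages above: a hostname repeated N times with no "slots=" value should count as N slots.

# Hostfile with no explicit slot counts; each repeated line should add one slot
for i in 1 2 3 4 5 6 7 8; do echo node08; done > hosts_noslots

# Equivalent hostfile with an explicit slot count
echo "node08 slots=8" > hosts_slots

# With 1.7.3 both invocations behave the same; with the unpatched 1.7.4a1r29646
# the first form was mishandled, as described above.
mpirun -np 8 -hostfile hosts_noslots -report-bindings ./myprog
mpirun -np 8 -hostfile hosts_slots   -report-bindings ./myprog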