Hi Reuti and Gus,
Thank you for your comments. Our cluster is a bit heterogeneous, with 4-,
8-, and 32-core nodes. I used the 8-core nodes for "-l nodes=4:ppn=8" and
the 4-core nodes for "-l nodes=2:ppn=4" (strictly speaking, Torque picked
the appropriate nodes). As I mentioned before, I usually use openmpi-1.6.x,
which has no trouble with that kind of use. I encountered the issue while
evaluating openmpi-1.7 to see when we could move to it, although we have no
pressing reason to do so at the moment.

As Gus pointed out, I use a script like the one below for practical runs
with openmpi-1.6.x:

#PBS -l nodes=2:ppn=32
# even "-l nodes=1:ppn=32+4:ppn=8" works fine
export OMP_NUM_THREADS=4
modify $PBS_NODEFILE pbs_hosts
# 64 lines are condensed to 16 lines here (one way to do this is sketched below)
mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -report-bindings \
       -x OMP_NUM_THREADS ./my_program
# a 32-core node has 8 NUMA nodes, an 8-core node has 2 NUMA nodes
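The "modify $PBS_NODEFILE pbs_hosts" line is a placeholder for a
site-specific step. A minimal sketch of one way to do that condensation,
assuming the usual Torque layout in which $PBS_NODEFILE lists each hostname
once per allocated core, grouped by node:

# hypothetical replacement for the "modify" step:
# keep one line out of every $OMP_NUM_THREADS lines of $PBS_NODEFILE,
# so each remaining entry stands for one 4-thread MPI process
# (64 lines -> 16 lines for nodes=2:ppn=32 with OMP_NUM_THREADS=4)
awk -v n="$OMP_NUM_THREADS" 'NR % n == 1' "$PBS_NODEFILE" > pbs_hosts

Any equivalent per-host counting would do; the point is simply to end up
with one hostfile entry per MPI rank.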
It works well under the combination of openmpi-1.6.x and Torque. The
problem is only openmpi-1.7's behavior.

Regards,
Tetsuya Mishima

> Hi Tetsuya Mishima
>
> Mpiexec offers you a number of possibilities that you could try:
> --bynode,
> --pernode,
> --npernode,
> --bysocket,
> --bycore,
> --cpus-per-proc,
> --cpus-per-rank,
> --rankfile
> and more.
>
> Most likely one or more of them will fit your needs.
>
> There are also associated flags to bind processes to cores,
> to sockets, etc, to report the bindings, and so on.
>
> Check the mpiexec man page for details.
>
> Nevertheless, I am surprised that modifying the
> $PBS_NODEFILE doesn't work for you in OMPI 1.7.
> I have done this many times in older versions of OMPI.
>
> Would it work for you to go back to the stable OMPI 1.6.X,
> or does it lack any special feature that you need?
>
> I hope this helps,
> Gus Correa
>
> On 03/19/2013 03:00 AM, tmish...@jcity.maeda.co.jp wrote:
> >
> > Hi Jeff,
> >
> > I didn't have much time to test this morning, so I checked it again
> > now. The trouble seems to depend on the number of nodes used.
> >
> > This works (nodes < 4):
> > mpiexec -bynode -np 4 ./my_program    # with #PBS -l nodes=2:ppn=8
> > (OMP_NUM_THREADS=4)
> >
> > This causes an error (nodes >= 4):
> > mpiexec -bynode -np 8 ./my_program    # with #PBS -l nodes=4:ppn=8
> > (OMP_NUM_THREADS=4)
> >
> > Regards,
> > Tetsuya Mishima
> >
> >> Oy; that's weird.
> >>
> >> I'm afraid we're going to have to wait for Ralph to answer why that
> >> is happening -- sorry!
> >>
> >> On Mar 18, 2013, at 4:45 PM, <tmish...@jcity.maeda.co.jp> wrote:
> >>
> >>> Hi Correa and Jeff,
> >>>
> >>> Thank you for your comments. I quickly checked your suggestion.
> >>>
> >>> As a result, my simple example case worked well:
> >>> export OMP_NUM_THREADS=4
> >>> mpiexec -bynode -np 2 ./my_program    # with #PBS -l nodes=2:ppn=4
> >>>
> >>> But the practical case below, where more than one process is
> >>> allocated to a node, did not work:
> >>> export OMP_NUM_THREADS=4
> >>> mpiexec -bynode -np 4 ./my_program    # with #PBS -l nodes=2:ppn=8
> >>>
> >>> The error message is as follows:
> >>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
> >>> attempting to be sent to a process whose contact information is
> >>> unknown in file rml_oob_send.c at line 316
> >>> [node08.cluster:11946] [[30666,0],3] unable to find address for
> >>> [[30666,0],1]
> >>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
> >>> attempting to be sent to a process whose contact information is
> >>> unknown in file base/grpcomm_base_rollup.c at line 123
> >>>
> >>> Here is our openmpi configuration:
> >>> ./configure \
> >>>   --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
> >>>   --with-tm \
> >>>   --with-verbs \
> >>>   --disable-ipv6 \
> >>>   CC=pgcc CFLAGS="-fast -tp k8-64e" \
> >>>   CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
> >>>   F77=pgfortran FFLAGS="-fast -tp k8-64e" \
> >>>   FC=pgfortran FCFLAGS="-fast -tp k8-64e"
> >>>
> >>> Regards,
> >>> Tetsuya Mishima
> >>>
> >>>> On Mar 17, 2013, at 10:55 PM, Gustavo Correa
> >>>> <g...@ldeo.columbia.edu> wrote:
> >>>>
> >>>>> In your example, have you tried not to modify the node file,
> >>>>> launch two mpi processes with mpiexec, and request a "-bynode"
> >>>>> distribution of processes:
> >>>>>
> >>>>> mpiexec -bynode -np 2 ./my_program
> >>>>
> >>>> This should work in 1.7, too (I use these kinds of options with
> >>>> SLURM all the time).
> >>>>
> >>>> However, we should probably verify that the hostfile functionality
> >>>> in batch jobs hasn't been broken in 1.7, too, because I'm pretty
> >>>> sure that what you described should work. However, Ralph, our
> >>>> run-time guy, is on vacation this week. There might be a delay in
> >>>> checking into this.
> >>>>
> >>>> --
> >>>> Jeff Squyres
> >>>> jsquy...@cisco.com
> >>>> For corporate legal information go to:
> >>>> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>
> >> --
> >> Jeff Squyres
> >> jsquy...@cisco.com
> >> For corporate legal information go to:
> >> http://www.cisco.com/web/about/doing_business/legal/cri/
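For reference, a minimal sketch of the --rankfile route from Gus's list
above, applied to the nodes=2:ppn=8 case with 4 OpenMP threads per rank.
The hostnames node07/node08 and the file name myrankfile are illustrative
placeholders, not taken from the thread:

myrankfile (one 4-core slot range per MPI rank on two 8-core nodes):
  rank 0=node07 slot=0-3
  rank 1=node07 slot=4-7
  rank 2=node08 slot=0-3
  rank 3=node08 slot=4-7

mpirun -np 4 --rankfile myrankfile -report-bindings \
       -x OMP_NUM_THREADS ./my_program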