I've submitted a patch to fix the Torque launch issue - just some leftover garbage that existed at the time of the 1.7.0 branch and didn't get removed.

For the hostfile issue, I'm stumped, as I can't see how the problem would come about. Could you please rerun your original test and add "--display-allocation" to your command line? Let's see if it is correctly finding the original allocation.

Thanks
Ralph
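For reference, a minimal sketch of the rerun being requested, reusing the job settings from the earlier report (nodes=4:ppn=8, OMP_NUM_THREADS=4, and ./my_program are all taken from that message):

#PBS -l nodes=4:ppn=8
export OMP_NUM_THREADS=4
mpirun -np 8 --display-allocation ./my_program

--display-allocation only prints the allocation that mpirun detected; it does not change how the job is launched.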
On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Gus,
>
> Thank you for your comments. I understand your advice.
> Our script used to be of the --npernode type as well.
>
> As I said before, our cluster consists of nodes with 4, 8, and 32 cores,
> although it was homogeneous when it was first set up. Furthermore, since
> the performance of each core is almost the same, a mixed use of nodes with
> different numbers of cores is possible, such as #PBS -l nodes=1:ppn=32+4:ppn=8.
>
> The --npernode approach is not applicable to such a mixed use.
> That's why I'd like to continue using a modified hostfile.
>
> By the way, the problem I reported to Jeff yesterday was that openmpi-1.7
> with Torque misbehaves: it produced an error even for a case as simple as
> the one shown below, which surprised me. So the problem is not limited to
> modified hostfiles, I guess.
>
> #PBS -l nodes=4:ppn=8
> mpirun -np 8 ./my_program     (OMP_NUM_THREADS=4)
>
> Regards,
> Tetsuya Mishima
>
>> Hi Tetsuya
>>
>> Your script that edits $PBS_NODEFILE into a separate hostfile is very
>> similar to some that I used here for hybrid OpenMP+MPI programs on older
>> versions of OMPI. I haven't tried this in 1.6.X, but it looks like you
>> did and it works there too. I haven't tried 1.7 either.
>> Since we run production machines, I try to stick to the stable versions
>> of OMPI (even numbered: 1.6.X, 1.4.X, 1.2.X).
>>
>> I believe you can get the same effect even if you don't edit your
>> $PBS_NODEFILE and let OMPI use it as is, if you carefully choose the
>> values in your
>> #PBS -l nodes=?:ppn=?
>> and your
>> $OMP_NUM_THREADS
>> and use mpiexec with --npernode or --cpus-per-proc.
>>
>> For instance, for twelve MPI processes, with two threads each,
>> on nodes with eight cores each, I would try (but I haven't tried!):
>>
>> #PBS -l nodes=3:ppn=8
>>
>> export OMP_NUM_THREADS=2
>>
>> mpiexec -np 12 -npernode 4 ./my_program
>>
>> or perhaps, more tightly:
>>
>> mpiexec -np 12 --report-bindings --bind-to-core --cpus-per-proc 2 ./my_program
>>
>> I hope this helps,
>> Gus Correa
>>
>> On 03/19/2013 03:12 PM, tmish...@jcity.maeda.co.jp wrote:
>>>
>>> Hi Reuti and Gus,
>>>
>>> Thank you for your comments.
>>>
>>> Our cluster is a little bit heterogeneous: it has nodes with 4, 8,
>>> and 32 cores.
>>> I used 8-core nodes for "-l nodes=4:ppn=8" and 4-core nodes for
>>> "-l nodes=2:ppn=4".
>>> (Strictly speaking, Torque picked the proper nodes.)
>>>
>>> As I mentioned before, I usually use openmpi-1.6.x, which has no trouble
>>> with that kind of use. I encountered the issue when I was evaluating
>>> openmpi-1.7 to check when we could move to it, although we have no
>>> pressing reason to do so at this moment.
>>>
>>> As Gus pointed out, I use a script like the one below for practical
>>> use of openmpi-1.6.x:
>>>
>>> #PBS -l nodes=2:ppn=32       # even "-l nodes=1:ppn=32+4:ppn=8" works fine
>>> export OMP_NUM_THREADS=4
>>> modify $PBS_NODEFILE pbs_hosts   # 64 lines are condensed to 16 lines here
>>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -report-bindings \
>>>     -x OMP_NUM_THREADS ./my_program
>>> # a 32-core node has 8 numanodes, an 8-core node has 2 numanodes
>>>
>>> It works well under the combination of openmpi-1.6.x and Torque.
>>> The problem is just openmpi-1.7's behavior.
>>>
>>> Regards,
>>> Tetsuya Mishima
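The "modify $PBS_NODEFILE pbs_hosts" step in the script above is the poster's own helper and its contents are not shown. As a rough sketch only, assuming Torque's $PBS_NODEFILE lists each node name once per allocated slot and grouped by node, the 64-to-16-line condensation that matches -cpus-per-proc 4 could be done with something like:

# keep every fourth line, i.e. one hostfile entry per 4 cores of each node
awk 'NR % 4 == 1' $PBS_NODEFILE > pbs_hosts

The actual helper script may well do this differently; this is only meant to illustrate the idea of condensing the node file to one slot per group of cores.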
>>>> Hi Tetsuya Mishima
>>>>
>>>> Mpiexec offers you a number of possibilities that you could try:
>>>> --bynode,
>>>> --pernode,
>>>> --npernode,
>>>> --bysocket,
>>>> --bycore,
>>>> --cpus-per-proc,
>>>> --cpus-per-rank,
>>>> --rankfile
>>>> and more.
>>>>
>>>> Most likely one or more of them will fit your needs.
>>>>
>>>> There are also associated flags to bind processes to cores,
>>>> to sockets, etc., to report the bindings, and so on.
>>>>
>>>> Check the mpiexec man page for details.
>>>>
>>>> Nevertheless, I am surprised that modifying the $PBS_NODEFILE
>>>> doesn't work for you in OMPI 1.7.
>>>> I have done this many times in older versions of OMPI.
>>>>
>>>> Would it work for you to go back to the stable OMPI 1.6.X,
>>>> or does it lack any special feature that you need?
>>>>
>>>> I hope this helps,
>>>> Gus Correa
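One option from Gus's list above that is not illustrated elsewhere in the thread is --rankfile. A hedged sketch of how it might be used for the hybrid case discussed here, with 4 ranks of 4 cores each on two 8-core nodes (the host names node01/node02 and the file name my_rankfile are hypothetical):

cat > my_rankfile << EOF
rank 0=node01 slot=0-3
rank 1=node01 slot=4-7
rank 2=node02 slot=0-3
rank 3=node02 slot=4-7
EOF
export OMP_NUM_THREADS=4
mpiexec -np 4 --rankfile my_rankfile -x OMP_NUM_THREADS ./my_program

Whether this sidesteps the 1.7 problem being reported is a separate question; it is only meant to show the rankfile format.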
>>>>
>>>> On 03/19/2013 03:00 AM, tmish...@jcity.maeda.co.jp wrote:
>>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> I didn't have much time to test this morning, so I checked it again
>>>>> now. The trouble seems to depend on the number of nodes used.
>>>>>
>>>>> This works (nodes < 4):
>>>>> mpiexec -bynode -np 4 ./my_program     (with #PBS -l nodes=2:ppn=8, OMP_NUM_THREADS=4)
>>>>>
>>>>> This causes an error (nodes >= 4):
>>>>> mpiexec -bynode -np 8 ./my_program     (with #PBS -l nodes=4:ppn=8, OMP_NUM_THREADS=4)
>>>>>
>>>>> Regards,
>>>>> Tetsuya Mishima
>>>>>
>>>>>> Oy; that's weird.
>>>>>>
>>>>>> I'm afraid we're going to have to wait for Ralph to answer why that
>>>>>> is happening -- sorry!
>>>>>>
>>>>>> On Mar 18, 2013, at 4:45 PM, <tmish...@jcity.maeda.co.jp> wrote:
>>>>>>>
>>>>>>> Hi Correa and Jeff,
>>>>>>>
>>>>>>> Thank you for your comments. I quickly checked your suggestion.
>>>>>>>
>>>>>>> As a result, my simple example case worked well:
>>>>>>> export OMP_NUM_THREADS=4
>>>>>>> mpiexec -bynode -np 2 ./my_program     (with #PBS -l nodes=2:ppn=4)
>>>>>>>
>>>>>>> But the practical case, where more than one process is allocated
>>>>>>> to a node as below, did not work:
>>>>>>> export OMP_NUM_THREADS=4
>>>>>>> mpiexec -bynode -np 4 ./my_program     (with #PBS -l nodes=2:ppn=8)
>>>>>>>
>>>>>>> The error message is as follows:
>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
>>>>>>> attempting to be sent to a process whose contact information is
>>>>>>> unknown in file rml_oob_send.c at line 316
>>>>>>> [node08.cluster:11946] [[30666,0],3] unable to find address for
>>>>>>> [[30666,0],1]
>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
>>>>>>> attempting to be sent to a process whose contact information is
>>>>>>> unknown in file base/grpcomm_base_rollup.c at line 123
>>>>>>>
>>>>>>> Here is our openmpi configuration:
>>>>>>> ./configure \
>>>>>>>   --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
>>>>>>>   --with-tm \
>>>>>>>   --with-verbs \
>>>>>>>   --disable-ipv6 \
>>>>>>>   CC=pgcc CFLAGS="-fast -tp k8-64e" \
>>>>>>>   CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
>>>>>>>   F77=pgfortran FFLAGS="-fast -tp k8-64e" \
>>>>>>>   FC=pgfortran FCFLAGS="-fast -tp k8-64e"
>>>>>>>
>>>>>>> Regards,
>>>>>>> Tetsuya Mishima
>>>>>>>
>>>>>>>> On Mar 17, 2013, at 10:55 PM, Gustavo Correa <g...@ldeo.columbia.edu> wrote:
>>>>>>>>
>>>>>>>>> In your example, have you tried not modifying the node file,
>>>>>>>>> launching two MPI processes with mpiexec, and requesting a
>>>>>>>>> "-bynode" distribution of processes:
>>>>>>>>>
>>>>>>>>> mpiexec -bynode -np 2 ./my_program
>>>>>>>>
>>>>>>>> This should work in 1.7, too (I use these kinds of options with
>>>>>>>> SLURM all the time).
>>>>>>>>
>>>>>>>> However, we should probably also verify that the hostfile
>>>>>>>> functionality in batch jobs hasn't been broken in 1.7, because I'm
>>>>>>>> pretty sure that what you described should work. Ralph, our
>>>>>>>> run-time guy, is on vacation this week, though, so there might be
>>>>>>>> a delay in checking into this.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jeff Squyres
>>>>>>>> jsquy...@cisco.com
>>>>>>
>>>>>> --
>>>>>> Jeff Squyres
>>>>>> jsquy...@cisco.com