Hi Ralph,
I have the line below in ~/.openmpi/mca-params.conf to use rsh:

  orte_rsh_agent = /usr/bin/rsh

I changed this line to:

  plm_rsh_agent = /usr/bin/rsh   # for openmpi-1.7

and the error message disappeared. Thanks.

Returning to the subject, I can rebuild with --enable-debug; please wait until it completes.
(A sketch of the exact rebuild command I plan to use is in the P.S. at the very end of this mail, after the quoted thread.)

Regards,
Tetsuya Mishima


> You obviously have some MCA params set somewhere:
>
> > --------------------------------------------------------------------------
> > A deprecated MCA parameter value was specified in an MCA parameter
> > file. Deprecated MCA parameters should be avoided; they may disappear
> > in future releases.
> >
> > Deprecated parameter: orte_rsh_agent
> > --------------------------------------------------------------------------
>
> Check your environment for anything with OMPI_MCA_xxx, and your default MCA parameter file, to see what has been specified.
>
> The allocation looks okay - I'll have to look for other debug flags you can set. Meantime, can you please add --enable-debug to your configure cmd line and rebuild?
>
> Thanks
> Ralph
>
> On Mar 20, 2013, at 4:39 PM, tmish...@jcity.maeda.co.jp wrote:
>
> > Hi Ralph,
> >
> > Here is the result of a rerun with --display-allocation.
> > I set OMP_NUM_THREADS=1 to make the problem clear.
> >
> > Regards,
> > Tetsuya Mishima
> >
> > P.S. As far as I checked, these 2 cases are OK (no problem):
> > (1) mpirun -v -np $NPROCS -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre m02-ld
> > (2) mpirun -v -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre m02-ld
> >
> > Script File:
> >
> > #!/bin/sh
> > #PBS -A tmishima
> > #PBS -N Ducom-run
> > #PBS -j oe
> > #PBS -l nodes=2:ppn=4
> > export OMP_NUM_THREADS=1
> > cd $PBS_O_WORKDIR
> > cp $PBS_NODEFILE pbs_hosts
> > NPROCS=`wc -l < pbs_hosts`
> > mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre m02-ld
> >
> > Output:
> > --------------------------------------------------------------------------
> > A deprecated MCA parameter value was specified in an MCA parameter
> > file. Deprecated MCA parameters should be avoided; they may disappear
> > in future releases.
> >
> > Deprecated parameter: orte_rsh_agent
> > --------------------------------------------------------------------------
> >
> > ======================   ALLOCATED NODES   ======================
> >
> > Data for node: node06   Num slots: 4   Max slots: 0
> > Data for node: node05   Num slots: 4   Max slots: 0
> >
> > =================================================================
> > --------------------------------------------------------------------------
> > A hostfile was provided that contains at least one node not
> > present in the allocation:
> >
> >   hostfile: pbs_hosts
> >   node:     node06
> >
> > If you are operating in a resource-managed environment, then only
> > nodes that are in the allocation can be used in the hostfile. You
> > may find relative node syntax to be a useful alternative to
> > specifying absolute node names; see the orte_hosts man page for
> > further information.
> > --------------------------------------------------------------------------
> >
> >> I've submitted a patch to fix the Torque launch issue - just some leftover garbage that existed at the time of the 1.7.0 branch and didn't get removed.
> >>
> >> For the hostfile issue, I'm stumped, as I can't see how the problem would come about. Could you please rerun your original test and add "--display-allocation" to your cmd line?
> >> Let's see if it is correctly finding the original allocation.
> >>
> >> Thanks
> >> Ralph
> >>
> >> On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>> Hi Gus,
> >>>
> >>> Thank you for your comments. I understand your advice.
> >>> Our script used to be of the --npernode type as well.
> >>>
> >>> As I told you before, our cluster consists of nodes with 4, 8,
> >>> and 32 cores, although it was homogeneous when it started out.
> >>> Furthermore, since the performance of each core is almost the same,
> >>> a mixed use of nodes with different numbers of cores is possible,
> >>> e.g. #PBS -l nodes=1:ppn=32+4:ppn=8.
> >>>
> >>> The --npernode approach is not applicable to such mixed use.
> >>> That's why I'd like to keep using a modified hostfile.
> >>>
> >>> By the way, the problem I reported to Jeff yesterday
> >>> was that something is wrong with openmpi-1.7 under Torque,
> >>> because it caused an error even in a case as simple as the one
> >>> shown below, which surprised me. So the problem is not
> >>> limited to modified hostfiles, I guess.
> >>>
> >>> #PBS -l nodes=4:ppn=8
> >>> mpirun -np 8 ./my_program
> >>> (OMP_NUM_THREADS=4)
> >>>
> >>> Regards,
> >>> Tetsuya Mishima
> >>>
> >>>> Hi Tetsuya
> >>>>
> >>>> Your script that edits $PBS_NODEFILE into a separate hostfile
> >>>> is very similar to some that I used here for
> >>>> hybrid OpenMP+MPI programs on older versions of OMPI.
> >>>> I haven't tried this in 1.6.X,
> >>>> but it looks like you did and it works there too.
> >>>> I haven't tried 1.7 either.
> >>>> Since we run production machines,
> >>>> I try to stick to the stable versions of OMPI (even numbered:
> >>>> 1.6.X, 1.4.X, 1.2.X).
> >>>>
> >>>> I believe you can get the same effect even if you
> >>>> don't edit your $PBS_NODEFILE and let OMPI use it as is.
> >>>> Say, if you carefully choose the values in your
> >>>> #PBS -l nodes=?:ppn=?
> >>>> and your
> >>>> $OMP_NUM_THREADS
> >>>> and use mpiexec with --npernode or --cpus-per-proc.
> >>>>
> >>>> For instance, for twelve MPI processes, with two threads each,
> >>>> on nodes with eight cores each, I would try
> >>>> (but I haven't tried!):
> >>>>
> >>>> #PBS -l nodes=3:ppn=8
> >>>>
> >>>> export OMP_NUM_THREADS=2
> >>>>
> >>>> mpiexec -np 12 -npernode 4
> >>>>
> >>>> or perhaps more tightly:
> >>>>
> >>>> mpiexec -np 12 --report-bindings --bind-to-core --cpus-per-proc 2
> >>>>
> >>>> I hope this helps,
> >>>> Gus Correa
> >>>>
> >>>> On 03/19/2013 03:12 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>
> >>>>> Hi Reuti and Gus,
> >>>>>
> >>>>> Thank you for your comments.
> >>>>>
> >>>>> Our cluster is a little bit heterogeneous; it has nodes with 4, 8, and 32 cores.
> >>>>> I used 8-core nodes for "-l nodes=4:ppn=8" and 4-core nodes for "-l nodes=2:ppn=4".
> >>>>> (Strictly speaking, Torque picked the proper nodes.)
> >>>>>
> >>>>> As I mentioned before, I usually use openmpi-1.6.x, which has no trouble with that
> >>>>> kind of use. I encountered the issue when I was evaluating openmpi-1.7 to check when
> >>>>> we could move on to it, although we have no particular reason to do so at this moment.
> >>>>>
> >>>>> As Gus pointed out, I use a script file as shown below for practical use of openmpi-1.6.x.
> >>>>>
> >>>>> #PBS -l nodes=2:ppn=32           # even "-l nodes=1:ppn=32+4:ppn=8" works fine
> >>>>> export OMP_NUM_THREADS=4
> >>>>> modify $PBS_NODEFILE pbs_hosts   # 64 lines are condensed to 16 lines here
> >>>>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -report-bindings \
> >>>>>        -x OMP_NUM_THREADS ./my_program   # a 32-core node has 8 numanodes, an 8-core node has 2 numanodes
> >>>>>
> >>>>> It works well under the combination of openmpi-1.6.x and Torque. The
> >>>>> problem is just openmpi-1.7's behavior.
> >>>>>
> >>>>> Regards,
> >>>>> Tetsuya Mishima
> >>>>>
> >>>>>> Hi Tetsuya Mishima
> >>>>>>
> >>>>>> Mpiexec offers you a number of possibilities that you could try:
> >>>>>> --bynode,
> >>>>>> --pernode,
> >>>>>> --npernode,
> >>>>>> --bysocket,
> >>>>>> --bycore,
> >>>>>> --cpus-per-proc,
> >>>>>> --cpus-per-rank,
> >>>>>> --rankfile
> >>>>>> and more.
> >>>>>>
> >>>>>> Most likely one or more of them will fit your needs.
> >>>>>>
> >>>>>> There are also associated flags to bind processes to cores,
> >>>>>> to sockets, etc., to report the bindings, and so on.
> >>>>>>
> >>>>>> Check the mpiexec man page for details.
> >>>>>>
> >>>>>> Nevertheless, I am surprised that modifying the
> >>>>>> $PBS_NODEFILE doesn't work for you in OMPI 1.7.
> >>>>>> I have done this many times in older versions of OMPI.
> >>>>>>
> >>>>>> Would it work for you to go back to the stable OMPI 1.6.X,
> >>>>>> or does it lack any special feature that you need?
> >>>>>>
> >>>>>> I hope this helps,
> >>>>>> Gus Correa
> >>>>>>
> >>>>>> On 03/19/2013 03:00 AM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>
> >>>>>>> Hi Jeff,
> >>>>>>>
> >>>>>>> I didn't have much time to test this morning, so I checked it again
> >>>>>>> now. The trouble seems to depend on the number of nodes used.
> >>>>>>>
> >>>>>>> This works (nodes < 4):
> >>>>>>> mpiexec -bynode -np 4 ./my_program    (#PBS -l nodes=2:ppn=8)
> >>>>>>> (OMP_NUM_THREADS=4)
> >>>>>>>
> >>>>>>> This causes an error (nodes >= 4):
> >>>>>>> mpiexec -bynode -np 8 ./my_program    (#PBS -l nodes=4:ppn=8)
> >>>>>>> (OMP_NUM_THREADS=4)
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Tetsuya Mishima
> >>>>>>>
> >>>>>>>> Oy; that's weird.
> >>>>>>>>
> >>>>>>>> I'm afraid we're going to have to wait for Ralph to answer why that is happening -- sorry!
> >>>>>>>>
> >>>>>>>> On Mar 18, 2013, at 4:45 PM, <tmish...@jcity.maeda.co.jp> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Correa and Jeff,
> >>>>>>>>>
> >>>>>>>>> Thank you for your comments. I quickly checked your suggestion.
> >>>>>>>>>
> >>>>>>>>> As a result, my simple example case worked well:
> >>>>>>>>> export OMP_NUM_THREADS=4
> >>>>>>>>> mpiexec -bynode -np 2 ./my_program    (#PBS -l nodes=2:ppn=4)
> >>>>>>>>>
> >>>>>>>>> But a practical case where more than one process is allocated to a node,
> >>>>>>>>> like the one below, did not work:
> >>>>>>>>> export OMP_NUM_THREADS=4
> >>>>>>>>> mpiexec -bynode -np 4 ./my_program    (#PBS -l nodes=2:ppn=8)
> >>>>>>>>>
> >>>>>>>>> The error message is as follows:
> >>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 316
> >>>>>>>>> [node08.cluster:11946] [[30666,0],3] unable to find address for [[30666,0],1]
> >>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file base/grpcomm_base_rollup.c at line 123
> >>>>>>>>>
> >>>>>>>>> Here is our openmpi configuration:
> >>>>>>>>> ./configure \
> >>>>>>>>>   --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
> >>>>>>>>>   --with-tm \
> >>>>>>>>>   --with-verbs \
> >>>>>>>>>   --disable-ipv6 \
> >>>>>>>>>   CC=pgcc CFLAGS="-fast -tp k8-64e" \
> >>>>>>>>>   CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
> >>>>>>>>>   F77=pgfortran FFLAGS="-fast -tp k8-64e" \
> >>>>>>>>>   FC=pgfortran FCFLAGS="-fast -tp k8-64e"
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Tetsuya Mishima
> >>>>>>>>>
> >>>>>>>>>> On Mar 17, 2013, at 10:55 PM, Gustavo Correa <g...@ldeo.columbia.edu> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> In your example, have you tried not to modify the node file,
> >>>>>>>>>>> launch two mpi processes with mpiexec, and request a "-bynode"
> >>>>>>>>>>> distribution of processes:
> >>>>>>>>>>>
> >>>>>>>>>>> mpiexec -bynode -np 2 ./my_program
> >>>>>>>>>>
> >>>>>>>>>> This should work in 1.7, too (I use these kinds of options with SLURM all the time).
> >>>>>>>>>>
> >>>>>>>>>> However, we should probably verify that the hostfile functionality in batch jobs
> >>>>>>>>>> hasn't been broken in 1.7, too, because I'm pretty sure that what you described
> >>>>>>>>>> should work. However, Ralph, our run-time guy, is on vacation this week.
> >>>>>>>>>> There might be a delay in checking into this.
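P.S. For reference, here is a rough sketch of the rebuild I plan to run. It simply reuses the configure line quoted above in this thread with --enable-debug added, as Ralph asked; the prefix and the PGI compiler flags are just our local settings, and the make steps are only my assumption of the usual sequence, nothing taken from this thread:

  ./configure \
      --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
      --with-tm \
      --with-verbs \
      --disable-ipv6 \
      --enable-debug \
      CC=pgcc CFLAGS="-fast -tp k8-64e" \
      CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
      F77=pgfortran FFLAGS="-fast -tp k8-64e" \
      FC=pgfortran FCFLAGS="-fast -tp k8-64e"
  make && make install

And, per Ralph's suggestion, a quick way to confirm nothing else still sets the old parameter would be something like:

  grep rsh_agent ~/.openmpi/mca-params.conf   # should show only plm_rsh_agent = /usr/bin/rsh
  env | grep OMPI_MCA_                        # should print nothing if no OMPI_MCA_xxx variables are set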