You obviously have some MCA params set somewhere:

> --------------------------------------------------------------------------
> A deprecated MCA parameter value was specified in an MCA parameter
> file. Deprecated MCA parameters should be avoided; they may disappear
> in future releases.
>
> Deprecated parameter: orte_rsh_agent
> --------------------------------------------------------------------------
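A minimal sketch of where such a setting usually hides, assuming a standard install layout ($MPIHOME below is only a placeholder for your install prefix, not a path taken from this thread):

  # MCA settings injected through the environment
  env | grep OMPI_MCA
  # per-user default parameter file, if present
  cat $HOME/.openmpi/mca-params.conf
  # system-wide default parameter file shipped with the install
  cat $MPIHOME/etc/openmpi-mca-params.conf

Whichever of these mentions orte_rsh_agent (or sets OMPI_MCA_orte_rsh_agent) is the place to clean up.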
Check your environment for anything with OMPI_MCA_xxx, and your default MCA
parameter file, to see what has been specified.

The allocation looks okay - I'll have to look for other debug flags you can
set. Meantime, can you please add --enable-debug to your configure cmd line
and rebuild?

Thanks
Ralph

On Mar 20, 2013, at 4:39 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph,
>
> Here is the result of a rerun with --display-allocation.
> I set OMP_NUM_THREADS=1 to make the problem clear.
>
> Regards,
> Tetsuya Mishima
>
> P.S. As far as I checked, these 2 cases are OK (no problem):
> (1) mpirun -v -np $NPROCS -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre m02-ld
> (2) mpirun -v -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre m02-ld
>
> Script File:
>
> #!/bin/sh
> #PBS -A tmishima
> #PBS -N Ducom-run
> #PBS -j oe
> #PBS -l nodes=2:ppn=4
> export OMP_NUM_THREADS=1
> cd $PBS_O_WORKDIR
> cp $PBS_NODEFILE pbs_hosts
> NPROCS=`wc -l < pbs_hosts`
> mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre m02-ld
>
> Output:
> --------------------------------------------------------------------------
> A deprecated MCA parameter value was specified in an MCA parameter
> file. Deprecated MCA parameters should be avoided; they may disappear
> in future releases.
>
> Deprecated parameter: orte_rsh_agent
> --------------------------------------------------------------------------
>
> ======================   ALLOCATED NODES   ======================
>
>   Data for node: node06   Num slots: 4   Max slots: 0
>   Data for node: node05   Num slots: 4   Max slots: 0
>
> =================================================================
> --------------------------------------------------------------------------
> A hostfile was provided that contains at least one node not
> present in the allocation:
>
>   hostfile:  pbs_hosts
>   node:      node06
>
> If you are operating in a resource-managed environment, then only
> nodes that are in the allocation can be used in the hostfile. You
> may find relative node syntax to be a useful alternative to
> specifying absolute node names - see the orte_hosts man page for
> further information.
> --------------------------------------------------------------------------
>
>> I've submitted a patch to fix the Torque launch issue - just some
>> leftover garbage that existed at the time of the 1.7.0 branch and didn't
>> get removed.
>>
>> For the hostfile issue, I'm stumped, as I can't see how the problem would
>> come about. Could you please rerun your original test and add
>> "--display-allocation" to your cmd line? Let's see if it is correctly
>> finding the original allocation.
>>
>> Thanks
>> Ralph
>>
>> On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote:
>>
>>> Hi Gus,
>>>
>>> Thank you for your comments. I understand your advice.
>>> Our script used to be of the --npernode type as well.
>>>
>>> As I told you before, our cluster consists of nodes with 4, 8,
>>> and 32 cores, although it used to be homogeneous in the beginning.
>>> Furthermore, since the performance of each core is almost the same,
>>> a mixed use of nodes with different numbers of cores is possible,
>>> just like #PBS -l nodes=1:ppn=32+4:ppn=8.
>>>
>>> The --npernode approach is not applicable to such a mixed use.
>>> That's why I'd like to continue to use a modified hostfile.
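For concreteness, the nodefile-condensing step being described (the actual "modify" script never appears in the thread, so this is a hypothetical sketch) could look something like the following. It assumes each node's ppn is a multiple of OMP_NUM_THREADS and that OMP_NUM_THREADS is already exported:

  # collapse every OMP_NUM_THREADS consecutive entries of the Torque-generated
  # nodefile into one, leaving one hostfile line per MPI rank
  awk -v t="$OMP_NUM_THREADS" '(NR - 1) % t == 0' $PBS_NODEFILE > pbs_hosts
  NPROCS=`wc -l < pbs_hosts`
  mpirun -hostfile pbs_hosts -np $NPROCS -cpus-per-proc $OMP_NUM_THREADS \
         -report-bindings -x OMP_NUM_THREADS ./my_program

With nodes=2:ppn=32 and OMP_NUM_THREADS=4, for example, the 64-line nodefile collapses to 16 lines, matching the "-np 16 -cpus-per-proc 4" invocation quoted below.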
>>> By the way, the problem I reported to Jeff yesterday
>>> was that something seems to be wrong with openmpi-1.7 under Torque,
>>> because it caused an error even in the simple case shown
>>> below, which surprised me. Now the problem is not
>>> limited to the modified hostfile, I guess.
>>>
>>> #PBS -l nodes=4:ppn=8
>>> mpirun -np 8 ./my_program
>>> (OMP_NUM_THREADS=4)
>>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>>> Hi Tetsuya
>>>>
>>>> Your script that edits $PBS_NODEFILE into a separate hostfile
>>>> is very similar to some that I used here for
>>>> hybrid OpenMP+MPI programs on older versions of OMPI.
>>>> I haven't tried this in 1.6.X,
>>>> but it looks like you did and it works there too.
>>>> I haven't tried 1.7 either.
>>>> Since we run production machines,
>>>> I try to stick to the stable versions of OMPI (even numbered:
>>>> 1.6.X, 1.4.X, 1.2.X).
>>>>
>>>> I believe you can get the same effect even if you
>>>> don't edit your $PBS_NODEFILE and let OMPI use it as is.
>>>> Say, if you choose the values of your
>>>> #PBS -l nodes=?:ppn=?
>>>> and of your
>>>> $OMP_NUM_THREADS
>>>> carefully, and use mpiexec with --npernode or --cpus-per-proc.
>>>>
>>>> For instance, for twelve MPI processes, with two threads each,
>>>> on nodes with eight cores each, I would try
>>>> (but I haven't tried!):
>>>>
>>>> #PBS -l nodes=3:ppn=8
>>>>
>>>> export OMP_NUM_THREADS=2
>>>>
>>>> mpiexec -np 12 -npernode 4
>>>>
>>>> or perhaps more tightly:
>>>>
>>>> mpiexec -np 12 --report-bindings --bind-to-core --cpus-per-proc 2
>>>>
>>>> I hope this helps,
>>>> Gus Correa
>>>>
>>>> On 03/19/2013 03:12 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>
>>>>> Hi Reuti and Gus,
>>>>>
>>>>> Thank you for your comments.
>>>>>
>>>>> Our cluster is a little bit heterogeneous; it has nodes with 4, 8,
>>>>> and 32 cores.
>>>>> I used 8-core nodes for "-l nodes=4:ppn=8" and 4-core nodes for
>>>>> "-l nodes=2:ppn=4".
>>>>> (Strictly speaking, Torque picked up the proper nodes.)
>>>>>
>>>>> As I mentioned before, I usually use openmpi-1.6.x, which has no
>>>>> trouble with that kind of use. I encountered the issue when I was
>>>>> evaluating openmpi-1.7 to check when we could move on to it, although
>>>>> we have no compelling reason to do that at this moment.
>>>>>
>>>>> As Gus pointed out, I use a script file as shown below for practical
>>>>> use with openmpi-1.6.x.
>>>>>
>>>>> #PBS -l nodes=2:ppn=32   # even "-l nodes=1:ppn=32+4:ppn=8" works fine
>>>>> export OMP_NUM_THREADS=4
>>>>> modify $PBS_NODEFILE pbs_hosts  # 64 lines are condensed to 16 lines here
>>>>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -report-bindings \
>>>>>   -x OMP_NUM_THREADS ./my_program  # a 32-core node has 8 numanodes,
>>>>>                                    # an 8-core node has 2 numanodes
>>>>>
>>>>> It works well under the combination of openmpi-1.6.x and Torque. The
>>>>> problem is just openmpi-1.7's behavior.
>>>>>
>>>>> Regards,
>>>>> Tetsuya Mishima
>>>>>
>>>>>> Hi Tetsuya Mishima
>>>>>>
>>>>>> Mpiexec offers you a number of possibilities that you could try:
>>>>>> --bynode,
>>>>>> --pernode,
>>>>>> --npernode,
>>>>>> --bysocket,
>>>>>> --bycore,
>>>>>> --cpus-per-proc,
>>>>>> --cpus-per-rank,
>>>>>> --rankfile
>>>>>> and more.
>>>>>>
>>>>>> Most likely one or more of them will fit your needs.
>>>>>>
>>>>>> There are also associated flags to bind processes to cores,
>>>>>> to sockets, etc., to report the bindings, and so on.
>>>>>>
>>>>>> Check the mpiexec man page for details.
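One low-risk way to experiment with those flags before wiring them into a production script is to launch something trivial and just inspect the reported bindings - a sketch reusing the 1.6-style option spelling from the examples above, with "hostname" standing in for a real application:

  # 4 ranks, 2 per node, each bound to 2 cores; mpirun reports the bindings
  mpiexec -np 4 --npernode 2 --bind-to-core --cpus-per-proc 2 \
          --report-bindings hostname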
>>>>>> Nevertheless, I am surprised that modifying the
>>>>>> $PBS_NODEFILE doesn't work for you in OMPI 1.7.
>>>>>> I have done this many times in older versions of OMPI.
>>>>>>
>>>>>> Would it work for you to go back to the stable OMPI 1.6.X,
>>>>>> or does it lack any special feature that you need?
>>>>>>
>>>>>> I hope this helps,
>>>>>> Gus Correa
>>>>>>
>>>>>> On 03/19/2013 03:00 AM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>
>>>>>>> Hi Jeff,
>>>>>>>
>>>>>>> I didn't have much time to test this morning, so I checked it again
>>>>>>> now. The trouble seems to depend on the number of nodes used.
>>>>>>>
>>>>>>> This works (nodes < 4):
>>>>>>> mpiexec -bynode -np 4 ./my_program && #PBS -l nodes=2:ppn=8
>>>>>>> (OMP_NUM_THREADS=4)
>>>>>>>
>>>>>>> This causes an error (nodes >= 4):
>>>>>>> mpiexec -bynode -np 8 ./my_program && #PBS -l nodes=4:ppn=8
>>>>>>> (OMP_NUM_THREADS=4)
>>>>>>>
>>>>>>> Regards,
>>>>>>> Tetsuya Mishima
>>>>>>>
>>>>>>>> Oy; that's weird.
>>>>>>>>
>>>>>>>> I'm afraid we're going to have to wait for Ralph to answer why that
>>>>>>>> is happening -- sorry!
>>>>>>>>
>>>>>>>> On Mar 18, 2013, at 4:45 PM, <tmish...@jcity.maeda.co.jp> wrote:
>>>>>>>>
>>>>>>>>> Hi Correa and Jeff,
>>>>>>>>>
>>>>>>>>> Thank you for your comments. I quickly checked your suggestion.
>>>>>>>>>
>>>>>>>>> As a result, my simple example case worked well:
>>>>>>>>> export OMP_NUM_THREADS=4
>>>>>>>>> mpiexec -bynode -np 2 ./my_program && #PBS -l nodes=2:ppn=4
>>>>>>>>>
>>>>>>>>> But a practical case, where more than one process is allocated to
>>>>>>>>> a node as below, did not work:
>>>>>>>>> export OMP_NUM_THREADS=4
>>>>>>>>> mpiexec -bynode -np 4 ./my_program && #PBS -l nodes=2:ppn=8
>>>>>>>>>
>>>>>>>>> The error message is as follows:
>>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
>>>>>>>>> attempting to be sent to a process whose contact information is
>>>>>>>>> unknown in file rml_oob_send.c at line 316
>>>>>>>>> [node08.cluster:11946] [[30666,0],3] unable to find address for
>>>>>>>>> [[30666,0],1]
>>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
>>>>>>>>> attempting to be sent to a process whose contact information is
>>>>>>>>> unknown in file base/grpcomm_base_rollup.c at line 123
>>>>>>>>>
>>>>>>>>> Here is our openmpi configuration:
>>>>>>>>> ./configure \
>>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
>>>>>>>>> --with-tm \
>>>>>>>>> --with-verbs \
>>>>>>>>> --disable-ipv6 \
>>>>>>>>> CC=pgcc CFLAGS="-fast -tp k8-64e" \
>>>>>>>>> CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
>>>>>>>>> F77=pgfortran FFLAGS="-fast -tp k8-64e" \
>>>>>>>>> FC=pgfortran FCFLAGS="-fast -tp k8-64e"
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Tetsuya Mishima
>>>>>>>>>
>>>>>>>>>> On Mar 17, 2013, at 10:55 PM, Gustavo Correa <g...@ldeo.columbia.edu>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> In your example, have you tried not modifying the node file,
>>>>>>>>>>> launching two MPI processes with mpiexec, and requesting a
>>>>>>>>>>> "-bynode" distribution of processes:
>>>>>>>>>>>
>>>>>>>>>>> mpiexec -bynode -np 2 ./my_program
>>>>>>>>>>
>>>>>>>>>> This should work in 1.7, too (I use these kinds of options with
>>>>>>>>>> SLURM all the time).
>>>>>>>>>>
>>>>>>>>>> However, we should probably verify that the hostfile functionality
>>>>>>>>>> in batch jobs hasn't been broken in 1.7, too, because I'm pretty
>>>>>>>>>> sure that what you described should work.
>>>>>>>>>> However, Ralph, our run-time guy, is on vacation this week.
>>>>>>>>>> There might be a delay in checking into this.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jeff Squyres
>>>>>>>>>> jsquy...@cisco.com
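For the rebuild requested at the top of this message, a sketch of the configure line quoted earlier in the thread with debugging enabled - only --enable-debug is new here, the remaining flags are taken verbatim from the quoted configuration:

  ./configure \
    --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
    --with-tm \
    --with-verbs \
    --disable-ipv6 \
    --enable-debug \
    CC=pgcc CFLAGS="-fast -tp k8-64e" \
    CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
    F77=pgfortran FFLAGS="-fast -tp k8-64e" \
    FC=pgfortran FCFLAGS="-fast -tp k8-64e"

Note that --enable-debug adds internal checking and debugging symbols, so a build like this is meant for diagnosis rather than production runs.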