Could you please apply the attached patch and try it again? If you haven't had time to configure with --enable-debug, that is fine - the patch will produce its output regardless.
Thanks
Ralph
[Attachment: user.diff (binary data)]
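(In case it helps, here is a rough sketch of how the attached patch could be applied and Open MPI rebuilt. The source directory name, install prefix, and -p level below are only placeholders - adjust them for your own tree; as noted above, --enable-debug is optional for this test.)

  cd openmpi-1.7                        # assumed source directory
  patch -p0 < user.diff                 # apply the attached patch; adjust the -p level if it does not apply cleanly
  ./configure --prefix=$HOME/opt/mpi/openmpi-1.7-debug \
      --with-tm --with-verbs \
      --enable-debug                    # optional here; add your usual CC/FC/compiler flags
  make -j4 all
  make install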
On Mar 20, 2013, at 4:59 PM, Ralph Castain <r...@open-mpi.org> wrote:

> You obviously have some MCA params set somewhere:
>
>> --------------------------------------------------------------------------
>> A deprecated MCA parameter value was specified in an MCA parameter
>> file. Deprecated MCA parameters should be avoided; they may disappear
>> in future releases.
>>
>> Deprecated parameter: orte_rsh_agent
>> --------------------------------------------------------------------------
>
> Check your environment for anything with OMPI_MCA_xxx, and your default MCA
> parameter file to see what has been specified.
>
> The allocation looks okay - I'll have to look for other debug flags you can
> set. Meantime, can you please add --enable-debug to your configure cmd line
> and rebuild?
>
> Thanks
> Ralph
>
> On Mar 20, 2013, at 4:39 PM, tmish...@jcity.maeda.co.jp wrote:
>
>> Hi Ralph,
>>
>> Here is a result of rerun with --display-allocation.
>> I set OMP_NUM_THREADS=1 to make the problem clear.
>>
>> Regards,
>> Tetsuya Mishima
>>
>> P.S. As far as I checked, these 2 cases are OK (no problem).
>> (1) mpirun -v -np $NPROCS -x OMP_NUM_THREADS --display-allocation
>>     ~/Ducom/testbed/mPre m02-ld
>> (2) mpirun -v -x OMP_NUM_THREADS --display-allocation
>>     ~/Ducom/testbed/mPre m02-ld
>>
>> Script File:
>>
>> #!/bin/sh
>> #PBS -A tmishima
>> #PBS -N Ducom-run
>> #PBS -j oe
>> #PBS -l nodes=2:ppn=4
>> export OMP_NUM_THREADS=1
>> cd $PBS_O_WORKDIR
>> cp $PBS_NODEFILE pbs_hosts
>> NPROCS=`wc -l < pbs_hosts`
>> mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS \
>>     --display-allocation ~/Ducom/testbed/mPre m02-ld
>>
>> Output:
>> --------------------------------------------------------------------------
>> A deprecated MCA parameter value was specified in an MCA parameter
>> file. Deprecated MCA parameters should be avoided; they may disappear
>> in future releases.
>>
>> Deprecated parameter: orte_rsh_agent
>> --------------------------------------------------------------------------
>>
>> ======================   ALLOCATED NODES   ======================
>>
>> Data for node: node06   Num slots: 4   Max slots: 0
>> Data for node: node05   Num slots: 4   Max slots: 0
>>
>> =================================================================
>> --------------------------------------------------------------------------
>> A hostfile was provided that contains at least one node not
>> present in the allocation:
>>
>>   hostfile: pbs_hosts
>>   node:     node06
>>
>> If you are operating in a resource-managed environment, then only
>> nodes that are in the allocation can be used in the hostfile. You
>> may find relative node syntax to be a useful alternative to
>> specifying absolute node names - see the orte_hosts man page for
>> further information.
>> --------------------------------------------------------------------------
>>
>>> I've submitted a patch to fix the Torque launch issue - just some
>>> leftover garbage that existed at the time of the 1.7.0 branch and
>>> didn't get removed.
>>>
>>> For the hostfile issue, I'm stumped as I can't see how the problem
>>> would come about. Could you please rerun your original test and add
>>> "--display-allocation" to your cmd line? Let's see if it is
>>> correctly finding the original allocation.
>>>
>>> Thanks
>>> Ralph
>>>
>>> On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote:
>>>
>>>> Hi Gus,
>>>>
>>>> Thank you for your comments. I understand your advice.
>>>> Our script used to be --npernode type as well.
>>>>
>>>> As I told before, our cluster consists of nodes having 4, 8,
>>>> and 32 cores, although it used to be homogeneous at the
>>>> starting time. Furthermore, since the performance of each core
>>>> is almost the same, a mixed use of nodes with different numbers
>>>> of cores is possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8.
>>>>
>>>> The --npernode approach is not applicable to such a mixed use.
>>>> That's why I'd like to continue to use a modified hostfile.
>>>>
>>>> By the way, the problem I reported to Jeff yesterday
>>>> was that something is wrong with openmpi-1.7 under Torque,
>>>> because it caused an error even for a case as simple as the
>>>> one shown below, which surprised me. So the problem is not
>>>> limited to modified hostfiles, I guess.
>>>>
>>>> #PBS -l nodes=4:ppn=8
>>>> mpirun -np 8 ./my_program
>>>> (OMP_NUM_THREADS=4)
>>>>
>>>> Regards,
>>>> Tetsuya Mishima
>>>>
>>>>> Hi Tetsuya
>>>>>
>>>>> Your script that edits $PBS_NODEFILE into a separate hostfile
>>>>> is very similar to some that I used here for
>>>>> hybrid OpenMP+MPI programs on older versions of OMPI.
>>>>> I haven't tried this in 1.6.X,
>>>>> but it looks like you did and it works also.
>>>>> I haven't tried 1.7 either.
>>>>> Since we run production machines,
>>>>> I try to stick to the stable versions of OMPI (even numbered:
>>>>> 1.6.X, 1.4.X, 1.2.X).
>>>>>
>>>>> I believe you can get the same effect even if you
>>>>> don't edit your $PBS_NODEFILE and let OMPI use it as is.
>>>>> Say, if you choose carefully the values in your
>>>>> #PBS -l nodes=?:ppn=?
>>>>> and your
>>>>> $OMP_NUM_THREADS
>>>>> and use mpiexec with --npernode or --cpus-per-proc.
>>>>>
>>>>> For instance, for twelve MPI processes, with two threads each,
>>>>> on nodes with eight cores each, I would try
>>>>> (but I haven't tried!):
>>>>>
>>>>> #PBS -l nodes=3:ppn=8
>>>>>
>>>>> export OMP_NUM_THREADS=2
>>>>>
>>>>> mpiexec -np 12 -npernode 4
>>>>>
>>>>> or perhaps more tightly:
>>>>>
>>>>> mpiexec -np 12 --report-bindings --bind-to-core --cpus-per-proc 2
>>>>>
>>>>> I hope this helps,
>>>>> Gus Correa
>>>>>
>>>>> On 03/19/2013 03:12 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>
>>>>>> Hi Reuti and Gus,
>>>>>>
>>>>>> Thank you for your comments.
>>>>>>
>>>>>> Our cluster is a little bit heterogeneous; it has nodes with 4, 8,
>>>>>> and 32 cores.
>>>>>> I used 8-core nodes for "-l nodes=4:ppn=8" and 4-core nodes for
>>>>>> "-l nodes=2:ppn=4".
>>>>>> (Strictly speaking, Torque picked up the proper nodes.)
>>>>>>
>>>>>> As I mentioned before, I usually use openmpi-1.6.x, which has no
>>>>>> trouble with that kind of use. I encountered the issue when I was
>>>>>> evaluating openmpi-1.7 to check when we could move on to it,
>>>>>> although we have no positive reason to do that at this moment.
>>>>>>
>>>>>> As Gus pointed out, I use a script file as shown below for
>>>>>> practical use of openmpi-1.6.x.
>>>>>>
>>>>>> #PBS -l nodes=2:ppn=32   # even "-l nodes=1:ppn=32+4:ppn=8" works fine
>>>>>> export OMP_NUM_THREADS=4
>>>>>> modify $PBS_NODEFILE pbs_hosts   # 64 lines are condensed to 16 lines here
>>>>>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -report-bindings \
>>>>>>   -x OMP_NUM_THREADS ./my_program   # a 32-core node has 8 numanodes,
>>>>>>                                     # an 8-core node has 2 numanodes
>>>>>>
>>>>>> It works well under the combination of openmpi-1.6.x and Torque.
>>>>>> The problem is just openmpi-1.7's behavior.
>>>>>>
>>>>>> Regards,
>>>>>> Tetsuya Mishima
>>>>>>
>>>>>>> Hi Tetsuya Mishima
>>>>>>>
>>>>>>> Mpiexec offers you a number of possibilities that you could try:
>>>>>>> --bynode,
>>>>>>> --pernode,
>>>>>>> --npernode,
>>>>>>> --bysocket,
>>>>>>> --bycore,
>>>>>>> --cpus-per-proc,
>>>>>>> --cpus-per-rank,
>>>>>>> --rankfile
>>>>>>> and more.
>>>>>>>
>>>>>>> Most likely one or more of them will fit your needs.
>>>>>>>
>>>>>>> There are also associated flags to bind processes to cores,
>>>>>>> to sockets, etc, to report the bindings, and so on.
>>>>>>>
>>>>>>> Check the mpiexec man page for details.
>>>>>>>
>>>>>>> Nevertheless, I am surprised that modifying the
>>>>>>> $PBS_NODEFILE doesn't work for you in OMPI 1.7.
>>>>>>> I have done this many times in older versions of OMPI.
>>>>>>>
>>>>>>> Would it work for you to go back to the stable OMPI 1.6.X,
>>>>>>> or does it lack any special feature that you need?
>>>>>>>
>>>>>>> I hope this helps,
>>>>>>> Gus Correa
>>>>>>>
>>>>>>> On 03/19/2013 03:00 AM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>
>>>>>>>> Hi Jeff,
>>>>>>>>
>>>>>>>> I didn't have much time to test this morning. So, I checked it
>>>>>>>> again now. The trouble seems to depend on the number of nodes used.
>>>>>>>>
>>>>>>>> This works (nodes < 4):
>>>>>>>> mpiexec -bynode -np 4 ./my_program  && #PBS -l nodes=2:ppn=8
>>>>>>>> (OMP_NUM_THREADS=4)
>>>>>>>>
>>>>>>>> This causes an error (nodes >= 4):
>>>>>>>> mpiexec -bynode -np 8 ./my_program  && #PBS -l nodes=4:ppn=8
>>>>>>>> (OMP_NUM_THREADS=4)
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Tetsuya Mishima
>>>>>>>>
>>>>>>>>> Oy; that's weird.
>>>>>>>>>
>>>>>>>>> I'm afraid we're going to have to wait for Ralph to answer why
>>>>>>>>> that is happening -- sorry!
>>>>>>>>>
>>>>>>>>> On Mar 18, 2013, at 4:45 PM, <tmish...@jcity.maeda.co.jp> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Correa and Jeff,
>>>>>>>>>>
>>>>>>>>>> Thank you for your comments. I quickly checked your suggestion.
>>>>>>>>>>
>>>>>>>>>> As a result, my simple example case worked well:
>>>>>>>>>> export OMP_NUM_THREADS=4
>>>>>>>>>> mpiexec -bynode -np 2 ./my_program  && #PBS -l nodes=2:ppn=4
>>>>>>>>>>
>>>>>>>>>> But the practical case, where more than one process is allocated
>>>>>>>>>> to a node as below, did not work:
>>>>>>>>>> export OMP_NUM_THREADS=4
>>>>>>>>>> mpiexec -bynode -np 4 ./my_program  && #PBS -l nodes=2:ppn=8
>>>>>>>>>>
>>>>>>>>>> The error message is as follows:
>>>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
>>>>>>>>>> attempting to be sent to a process whose contact information is
>>>>>>>>>> unknown in file rml_oob_send.c at line 316
>>>>>>>>>> [node08.cluster:11946] [[30666,0],3] unable to find address for
>>>>>>>>>> [[30666,0],1]
>>>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
>>>>>>>>>> attempting to be sent to a process whose contact information is
>>>>>>>>>> unknown in file base/grpcomm_base_rollup.c at line 123
>>>>>>>>>>
>>>>>>>>>> Here is our openmpi configuration:
>>>>>>>>>> ./configure \
>>>>>>>>>>   --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
>>>>>>>>>>   --with-tm \
>>>>>>>>>>   --with-verbs \
>>>>>>>>>>   --disable-ipv6 \
>>>>>>>>>>   CC=pgcc CFLAGS="-fast -tp k8-64e" \
>>>>>>>>>>   CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
>>>>>>>>>>   F77=pgfortran FFLAGS="-fast -tp k8-64e" \
>>>>>>>>>>   FC=pgfortran FCFLAGS="-fast -tp k8-64e"
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>
>>>>>>>>>>> On Mar 17, 2013, at 10:55 PM, Gustavo Correa
>>>>>>>>>>> <g...@ldeo.columbia.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> In your example, have you tried not to modify the node file,
>>>>>>>>>>>> launch two mpi processes with mpiexec, and request a "-bynode"
>>>>>>>>>>>> distribution of processes:
>>>>>>>>>>>>
>>>>>>>>>>>> mpiexec -bynode -np 2 ./my_program
>>>>>>>>>>>
>>>>>>>>>>> This should work in 1.7, too (I use these kinds of options with
>>>>>>>>>>> SLURM all the time).
>>>>>>>>>>>
>>>>>>>>>>> However, we should probably verify that the hostfile
>>>>>>>>>>> functionality in batch jobs hasn't been broken in 1.7, too,
>>>>>>>>>>> because I'm pretty sure that what you described should work.
>>>>>>>>>>> However, Ralph, our run-time guy, is on vacation this week.
>>>>>>>>>>> There might be a delay in checking into this.