Hi Ralph,
I have completed the rebuild of openmpi-1.7rc8. To save time, I added
--disable-vt. (Is that OK?) Well, what shall I do next?

./configure \
--prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
--with-tm \
--with-verbs \
--disable-ipv6 \
--disable-vt \
--enable-debug \
CC=pgcc CFLAGS="-fast -tp k8-64e" \
CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
F77=pgfortran FFLAGS="-fast -tp k8-64e" \
FC=pgfortran FCFLAGS="-fast -tp k8-64e"

Note: I tried to apply the patch user.diff after rebuilding openmpi-1.7rc8,
but I got an error and could not go forward.

$ patch -p0 < user.diff   # this is OK
$ make                    # this fails
  CC       util/hostfile/hostfile.lo
PGC-S-0037-Syntax error: Recovery attempted by deleting <string>
(util/hostfile/hostfile.c: 728)
PGC/x86-64 Linux 12.9-0: compilation completed with severe errors

Regards,
Tetsuya Mishima
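(A minimal sketch of how the failing spot could be narrowed down, assuming the
patched file is orte/util/hostfile/hostfile.c and the source tree sits in
~/openmpi-1.7rc8 -- both are assumptions, and the commands are untested:)

$ cd ~/openmpi-1.7rc8                               # top of the source tree (assumed path)
$ sed -n '720,735p' orte/util/hostfile/hostfile.c   # show the region around line 728 that PGI rejects
$ patch -p0 -R < user.diff                          # back the patch out
$ make                                              # the unpatched tree should compile cleanly again
$ patch -p0 < user.diff                             # reapply before the next attempt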
> Could you please apply the attached patch and try it again? If you haven't
> had time to configure with --enable-debug, that is fine - this will output
> regardless.
>
> Thanks
> Ralph
>
> - user.diff
>
>
> On Mar 20, 2013, at 4:59 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> > You obviously have some MCA params set somewhere:
> >
> >> --------------------------------------------------------------------------
> >> A deprecated MCA parameter value was specified in an MCA parameter
> >> file. Deprecated MCA parameters should be avoided; they may disappear
> >> in future releases.
> >>
> >> Deprecated parameter: orte_rsh_agent
> >> --------------------------------------------------------------------------
> >
> > Check your environment for anything with OMPI_MCA_xxx, and your default
> > MCA parameter file to see what has been specified.
> >
> > The allocation looks okay - I'll have to look for other debug flags you
> > can set. Meantime, can you please add --enable-debug to your configure
> > cmd line and rebuild?
> >
> > Thanks
> > Ralph
> >
> >
> > On Mar 20, 2013, at 4:39 PM, tmish...@jcity.maeda.co.jp wrote:
> >
> >>
> >>
> >> Hi Ralph,
> >>
> >> Here is a result of rerun with --display-allocation.
> >> I set OMP_NUM_THREADS=1 to make the problem clear.
> >>
> >> Regards,
> >> Tetsuya Mishima
> >>
> >> P.S. As far as I checked, these 2 cases are OK (no problem).
> >> (1) mpirun -v -np $NPROCS -x OMP_NUM_THREADS --display-allocation
> >>     ~/Ducom/testbed/mPre m02-ld
> >> (2) mpirun -v -x OMP_NUM_THREADS --display-allocation
> >>     ~/Ducom/testbed/mPre m02-ld
> >>
> >> Script File:
> >>
> >> #!/bin/sh
> >> #PBS -A tmishima
> >> #PBS -N Ducom-run
> >> #PBS -j oe
> >> #PBS -l nodes=2:ppn=4
> >> export OMP_NUM_THREADS=1
> >> cd $PBS_O_WORKDIR
> >> cp $PBS_NODEFILE pbs_hosts
> >> NPROCS=`wc -l < pbs_hosts`
> >> mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS \
> >>     --display-allocation ~/Ducom/testbed/mPre m02-ld
> >>
> >> Output:
> >> --------------------------------------------------------------------------
> >> A deprecated MCA parameter value was specified in an MCA parameter
> >> file. Deprecated MCA parameters should be avoided; they may disappear
> >> in future releases.
> >>
> >> Deprecated parameter: orte_rsh_agent
> >> --------------------------------------------------------------------------
> >>
> >> ======================   ALLOCATED NODES   ======================
> >>
> >> Data for node: node06   Num slots: 4   Max slots: 0
> >> Data for node: node05   Num slots: 4   Max slots: 0
> >>
> >> =================================================================
> >> --------------------------------------------------------------------------
> >> A hostfile was provided that contains at least one node not
> >> present in the allocation:
> >>
> >>   hostfile: pbs_hosts
> >>   node:     node06
> >>
> >> If you are operating in a resource-managed environment, then only
> >> nodes that are in the allocation can be used in the hostfile. You
> >> may find relative node syntax to be a useful alternative to
> >> specifying absolute node names. See the orte_hosts man page for
> >> further information.
> >> --------------------------------------------------------------------------
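(Side note: the warning above itself points at relative node syntax as an
alternative to absolute node names. A rough sketch of a cross-check plus that
syntax, assuming the two-node allocation shown above; the slots= values are an
assumption, the commands are untested, and the exact hostfile rules are in the
orte_hosts man page:)

$ sort -u $PBS_NODEFILE   # node names Torque allocated (node05/node06 above)
$ sort -u pbs_hosts       # names given to -hostfile; the same list, since the script copies it

# Relative node syntax refers to allocated nodes by position instead of name
# (+n0 = first node of the allocation, +n1 = second).
$ cat > pbs_hosts_rel << 'EOF'
+n0 slots=4
+n1 slots=4
EOF
$ mpirun -v -np $NPROCS -hostfile pbs_hosts_rel -x OMP_NUM_THREADS \
      --display-allocation ~/Ducom/testbed/mPre m02-ld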
> >>
> >>> I've submitted a patch to fix the Torque launch issue - just some
> >>> leftover garbage that existed at the time of the 1.7.0 branch and
> >>> didn't get removed.
> >>>
> >>> For the hostfile issue, I'm stumped as I can't see how the problem
> >>> would come about. Could you please rerun your original test and add
> >>> "--display-allocation" to your cmd line? Let's see if it is
> >>> correctly finding the original allocation.
> >>>
> >>> Thanks
> >>> Ralph
> >>>
> >>> On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>
> >>>>
> >>>>
> >>>> Hi Gus,
> >>>>
> >>>> Thank you for your comments. I understand your advice.
> >>>> Our script used to be --npernode type as well.
> >>>>
> >>>> As I told before, our cluster consists of nodes having 4, 8,
> >>>> and 32 cores, although it used to be homogeneous at the
> >>>> starting time. Furthermore, since performance of each core
> >>>> is almost same, a mixed use of nodes with different number
> >>>> of cores is possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8.
> >>>>
> >>>> --npernode type is not applicable to such a mixed use.
> >>>> That's why I'd like to continue to use modified hostfile.
> >>>>
> >>>> By the way, the problem I reported to Jeff yesterday
> >>>> was that openmpi-1.7 with torque is something wrong,
> >>>> because it caused error against such a simple case as
> >>>> shown below, which surprised me. Now, the problem is not
> >>>> limited to modified hostfile, I guess.
> >>>>
> >>>> #PBS -l nodes=4:ppn=8
> >>>> mpirun -np 8 ./my_program
> >>>> (OMP_NUM_THREADS=4)
> >>>>
> >>>> Regards,
> >>>> Tetsuya Mishima
> >>>>
> >>>>> Hi Tetsuya
> >>>>>
> >>>>> Your script that edits $PBS_NODEFILE into a separate hostfile
> >>>>> is very similar to some that I used here for
> >>>>> hybrid OpenMP+MPI programs on older versions of OMPI.
> >>>>> I haven't tried this in 1.6.X,
> >>>>> but it looks like you did and it works also.
> >>>>> I haven't tried 1.7 either.
> >>>>> Since we run production machines,
> >>>>> I try to stick to the stable versions of OMPI (even numbered:
> >>>>> 1.6.X, 1.4.X, 1.2.X).
> >>>>>
> >>>>> I believe you can get the same effect even if you
> >>>>> don't edit your $PBS_NODEFILE and let OMPI use it as is.
> >>>>> Say, if you choose carefully the values in your
> >>>>> #PBS -l nodes=?:ppn=?
> >>>>> of your
> >>>>> $OMP_NUM_THREADS
> >>>>> and use an mpiexec with --npernode or --cpus-per-proc.
> >>>>>
> >>>>> For instance, for twelve MPI processes, with two threads each,
> >>>>> on nodes with eight cores each, I would try
> >>>>> (but I haven't tried!):
> >>>>>
> >>>>> #PBS -l nodes=3:ppn=8
> >>>>>
> >>>>> export $OMP_NUM_THREADS=2
> >>>>>
> >>>>> mpiexec -np 12 -npernode 4
> >>>>>
> >>>>> or perhaps more tightly:
> >>>>>
> >>>>> mpiexec -np 12 --report-bindings --bind-to-core --cpus-per-proc 2
> >>>>>
> >>>>> I hope this helps,
> >>>>> Gus Correa
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 03/19/2013 03:12 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi Reuti and Gus,
> >>>>>>
> >>>>>> Thank you for your comments.
> >>>>>>
> >>>>>> Our cluster is a little bit heterogeneous, which has nodes with
> >>>>>> 4, 8, 32 cores.
> >>>>>> I used 8-core nodes for "-l nodes=4:ppn=8" and 4-core nodes for
> >>>>>> "-l nodes=2:ppn=4".
> >>>>>> (strictly speaking, Torque picked up proper nodes.)
> >>>>>>
> >>>>>> As I mentioned before, I usually use openmpi-1.6.x, which has no
> >>>>>> troble against that kind of use. I encountered the issue when I
> >>>>>> was evaluating openmpi-1.7 to check when we could move on to it,
> >>>>>> although we have no positive reason to do that at this moment.
> >>>>>>
> >>>>>> As Gus pointed out, I use a script file as shown below for a
> >>>>>> practical use of openmpi-1.6.x.
> >>>>>>
> >>>>>> #PBS -l nodes=2:ppn=32  # even "-l nodes=1:ppn=32+4:ppn=8" works fine
> >>>>>> export OMP_NUM_THREADS=4
> >>>>>> modify $PBS_NODEFILE pbs_hosts  # 64 lines are condensed to 16 lines here
> >>>>>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -report-bindings \
> >>>>>> -x OMP_NUM_THREADS ./my_program  # 32-core node has 8 numanodes,
> >>>>>>                                  # 8-core node has 2 numanodes
> >>>>>>
> >>>>>> It works well under the combination of openmpi-1.6.x and Torque.
> >>>>>> The problem is just openmpi-1.7's behavior.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Tetsuya Mishima
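(The "modify $PBS_NODEFILE pbs_hosts" line above is a site-specific helper
script not shown in the thread. One hypothetical way to get the same 64-to-16
condensation for -cpus-per-proc 4 -- a sketch, not the script actually used:)

$ awk 'NR % 4 == 1' $PBS_NODEFILE > pbs_hosts   # keep one line per group of 4 cores
$ wc -l < pbs_hosts                             # 64 entries -> 16 for nodes=2:ppn=32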
> >>>>>>
> >>>>>>> Hi Tetsuya Mishima
> >>>>>>>
> >>>>>>> Mpiexec offers you a number of possibilities that you could try:
> >>>>>>> --bynode,
> >>>>>>> --pernode,
> >>>>>>> --npernode,
> >>>>>>> --bysocket,
> >>>>>>> --bycore,
> >>>>>>> --cpus-per-proc,
> >>>>>>> --cpus-per-rank,
> >>>>>>> --rankfile
> >>>>>>> and more.
> >>>>>>>
> >>>>>>> Most likely one or more of them will fit your needs.
> >>>>>>>
> >>>>>>> There are also associated flags to bind processes to cores,
> >>>>>>> to sockets, etc, to report the bindings, and so on.
> >>>>>>>
> >>>>>>> Check the mpiexec man page for details.
> >>>>>>>
> >>>>>>> Nevertheless, I am surprised that modifying the
> >>>>>>> $PBS_NODEFILE doesn't work for you in OMPI 1.7.
> >>>>>>> I have done this many times in older versions of OMPI.
> >>>>>>>
> >>>>>>> Would it work for you to go back to the stable OMPI 1.6.X,
> >>>>>>> or does it lack any special feature that you need?
> >>>>>>>
> >>>>>>> I hope this helps,
> >>>>>>> Gus Correa
> >>>>>>>
> >>>>>>> On 03/19/2013 03:00 AM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Jeff,
> >>>>>>>>
> >>>>>>>> I didn't have much time to test this morning. So, I checked it
> >>>>>>>> again now. Then, the trouble seems to depend on the number of
> >>>>>>>> nodes to use.
> >>>>>>>>
> >>>>>>>> This works (nodes < 4):
> >>>>>>>> mpiexec -bynode -np 4 ./my_program   # #PBS -l nodes=2:ppn=8
> >>>>>>>> (OMP_NUM_THREADS=4)
> >>>>>>>>
> >>>>>>>> This causes error (nodes >= 4):
> >>>>>>>> mpiexec -bynode -np 8 ./my_program   # #PBS -l nodes=4:ppn=8
> >>>>>>>> (OMP_NUM_THREADS=4)
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Tetsuya Mishima
> >>>>>>>>
> >>>>>>>>> Oy; that's weird.
> >>>>>>>>>
> >>>>>>>>> I'm afraid we're going to have to wait for Ralph to answer why
> >>>>>>>>> that is happening -- sorry!
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mar 18, 2013, at 4:45 PM, <tmish...@jcity.maeda.co.jp> wrote:
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi Correa and Jeff,
> >>>>>>>>>>
> >>>>>>>>>> Thank you for your comments. I quickly checked your suggestion.
> >>>>>>>>>>
> >>>>>>>>>> As a result, my simple example case worked well.
> >>>>>>>>>> export OMP_NUM_THREADS=4
> >>>>>>>>>> mpiexec -bynode -np 2 ./my_program   # #PBS -l nodes=2:ppn=4
> >>>>>>>>>>
> >>>>>>>>>> But, practical case that more than 1 process was allocated to
> >>>>>>>>>> a node like below did not work.
> >>>>>>>>>> export OMP_NUM_THREADS=4
> >>>>>>>>>> mpiexec -bynode -np 4 ./my_program   # #PBS -l nodes=2:ppn=8
> >>>>>>>>>>
> >>>>>>>>>> The error message is as follows:
> >>>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
> >>>>>>>>>> attempting to be sent to a process whose contact information
> >>>>>>>>>> is unknown in file rml_oob_send.c at line 316
> >>>>>>>>>> [node08.cluster:11946] [[30666,0],3] unable to find address for
> >>>>>>>>>> [[30666,0],1]
> >>>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
> >>>>>>>>>> attempting to be sent to a process whose contact information
> >>>>>>>>>> is unknown in file base/grpcomm_base_rollup.c at line 123
> >>>>>>>>>>
> >>>>>>>>>> Here is our openmpi configuration:
> >>>>>>>>>> ./configure \
> >>>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
> >>>>>>>>>> --with-tm \
> >>>>>>>>>> --with-verbs \
> >>>>>>>>>> --disable-ipv6 \
> >>>>>>>>>> CC=pgcc CFLAGS="-fast -tp k8-64e" \
> >>>>>>>>>> CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
> >>>>>>>>>> F77=pgfortran FFLAGS="-fast -tp k8-64e" \
> >>>>>>>>>> FC=pgfortran FCFLAGS="-fast -tp k8-64e"
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>
> >>>>>>>>>>> On Mar 17, 2013, at 10:55 PM, Gustavo Correa
> >>>>>>>>>>> <g...@ldeo.columbia.edu> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> In your example, have you tried not to modify the node file,
> >>>>>>>>>>>> launch two mpi processes with mpiexec, and request a "-bynode"
> >>>>>>>>>>>> distribution of processes:
> >>>>>>>>>>>>
> >>>>>>>>>>>> mpiexec -bynode -np 2 ./my_program
> >>>>>>>>>>>
> >>>>>>>>>>> This should work in 1.7, too (I use these kinds of options with
> >>>>>>>>>>> SLURM all the time).
> >>>>>>>>>>>
> >>>>>>>>>>> However, we should probably verify that the hostfile
> >>>>>>>>>>> functionality in batch jobs hasn't been broken in 1.7, too,
> >>>>>>>>>>> because I'm pretty sure that what you described should work.
> >>>>>>>>>>> However, Ralph, our run-time guy, is on vacation this week.
> >>>>>>>>>>> There might be a delay in checking into this.
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Jeff Squyres
> >>>>>>>>>>> jsquy...@cisco.com
> >>>>>>>>>>> For corporate legal information go to:
> >>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
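(To keep the failing case in one place: a minimal Torque job script put
together from the fragments quoted above. It is a sketch, not a script taken
verbatim from the thread, and ./my_program stands in for the poster's own
binary:)

#!/bin/sh
#PBS -l nodes=4:ppn=8
export OMP_NUM_THREADS=4
cd $PBS_O_WORKDIR
# -bynode places the 8 ranks round-robin, two per node; this is the layout
# reported above to fail under Torque with openmpi-1.7rc8 while the same job
# runs fine with openmpi-1.6.x.
mpiexec -bynode -np 8 ./my_program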
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users