Please try it again with the attached patch. The --disable-vt is fine.

Thanks
Ralph
user2.diff
Description: Binary data
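
For reference, the "modify $PBS_NODEFILE pbs_hosts" step that comes up further
down in this thread is never actually shown. A minimal, hypothetical sketch of
such a condensation (one hostfile line per OMP_NUM_THREADS slots, so that mpirun
starts one rank per group of cores; this is not the script Tetsuya actually uses)
could look like:

#!/bin/sh
# Hypothetical: condense the Torque nodefile for -cpus-per-proc launches.
export OMP_NUM_THREADS=4
sort $PBS_NODEFILE | uniq -c | \
  awk -v t=$OMP_NUM_THREADS '{ for (i = 0; i < $1 / t; i++) print $2 }' > pbs_hosts
NPROCS=`wc -l < pbs_hosts`
mpirun -np $NPROCS -hostfile pbs_hosts -cpus-per-proc $OMP_NUM_THREADS \
  -report-bindings -x OMP_NUM_THREADS ./my_program

With 64 Torque slots and OMP_NUM_THREADS=4 this would yield a 16-line hostfile,
matching the "64 lines are condensed to 16 lines" description quoted below.
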
On Mar 20, 2013, at 7:47 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph,
>
> I have completed the rebuild of openmpi-1.7rc8.
> To save time, I added --disable-vt. (Is it OK?)
>
> Well, what shall I do?
>
> ./configure \
> --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
> --with-tm \
> --with-verbs \
> --disable-ipv6 \
> --disable-vt \
> --enable-debug \
> CC=pgcc CFLAGS="-fast -tp k8-64e" \
> CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
> F77=pgfortran FFLAGS="-fast -tp k8-64e" \
> FC=pgfortran FCFLAGS="-fast -tp k8-64e"
>
> Note:
> I tried the patch user.diff after rebuilding openmpi-1.7rc8,
> but I got an error and could not go forward.
>
> $ patch -p0 < user.diff   # this is OK
> $ make                    # I got an error
>
>   CC       util/hostfile/hostfile.lo
> PGC-S-0037-Syntax error: Recovery attempted by deleting <string>
> (util/hostfile/hostfile.c: 728)
> PGC/x86-64 Linux 12.9-0: compilation completed with severe errors
>
> Regards,
> Tetsuya Mishima
>
>> Could you please apply the attached patch and try it again? If you
>> haven't had time to configure with --enable-debug, that is fine - this
>> will output regardless.
>>
>> Thanks
>> Ralph
>>
>> - user.diff
>>
>>
>> On Mar 20, 2013, at 4:59 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> You obviously have some MCA params set somewhere:
>>>
>>>> --------------------------------------------------------------------------
>>>> A deprecated MCA parameter value was specified in an MCA parameter
>>>> file. Deprecated MCA parameters should be avoided; they may disappear
>>>> in future releases.
>>>>
>>>> Deprecated parameter: orte_rsh_agent
>>>> --------------------------------------------------------------------------
>>>
>>> Check your environment for anything with OMPI_MCA_xxx, and your default
>>> MCA parameter file to see what has been specified.
>>>
>>> The allocation looks okay - I'll have to look for other debug flags you
>>> can set. Meantime, can you please add --enable-debug to your configure
>>> cmd line and rebuild?
>>>
>>> Thanks
>>> Ralph
>>>
>>>
>>> On Mar 20, 2013, at 4:39 PM, tmish...@jcity.maeda.co.jp wrote:
>>>
>>>> Hi Ralph,
>>>>
>>>> Here is the result of the rerun with --display-allocation.
>>>> I set OMP_NUM_THREADS=1 to make the problem clear.
>>>>
>>>> Regards,
>>>> Tetsuya Mishima
>>>>
>>>> P.S. As far as I checked, these 2 cases are OK (no problem).
>>>> (1) mpirun -v -np $NPROCS -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre m02-ld
>>>> (2) mpirun -v -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre m02-ld
>>>>
>>>> Script File:
>>>>
>>>> #!/bin/sh
>>>> #PBS -A tmishima
>>>> #PBS -N Ducom-run
>>>> #PBS -j oe
>>>> #PBS -l nodes=2:ppn=4
>>>> export OMP_NUM_THREADS=1
>>>> cd $PBS_O_WORKDIR
>>>> cp $PBS_NODEFILE pbs_hosts
>>>> NPROCS=`wc -l < pbs_hosts`
>>>> mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre m02-ld
>>>>
>>>> Output:
>>>> --------------------------------------------------------------------------
>>>> A deprecated MCA parameter value was specified in an MCA parameter
>>>> file. Deprecated MCA parameters should be avoided; they may disappear
>>>> in future releases.
>>>>
>>>> Deprecated parameter: orte_rsh_agent
>>>> --------------------------------------------------------------------------
>>>>
>>>> ======================   ALLOCATED NODES   ======================
>>>>
>>>>  Data for node: node06  Num slots: 4    Max slots: 0
>>>>  Data for node: node05  Num slots: 4    Max slots: 0
>>>>
>>>> =================================================================
>>>> --------------------------------------------------------------------------
>>>> A hostfile was provided that contains at least one node not
>>>> present in the allocation:
>>>>
>>>>   hostfile:  pbs_hosts
>>>>   node:      node06
>>>>
>>>> If you are operating in a resource-managed environment, then only
>>>> nodes that are in the allocation can be used in the hostfile. You
>>>> may find relative node syntax to be a useful alternative to
>>>> specifying absolute node names see the orte_hosts man page for
>>>> further information.
>>>> --------------------------------------------------------------------------
>>>>
>>>>
>>>>> I've submitted a patch to fix the Torque launch issue - just some
>>>>> leftover garbage that existed at the time of the 1.7.0 branch and
>>>>> didn't get removed.
>>>>>
>>>>> For the hostfile issue, I'm stumped as I can't see how the problem
>>>>> would come about. Could you please rerun your original test and add
>>>>> "--display-allocation" to your cmd line? Let's see if it is
>>>>> correctly finding the original allocation.
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>> On Mar 19, 2013, at 5:08 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>
>>>>>> Hi Gus,
>>>>>>
>>>>>> Thank you for your comments. I understand your advice.
>>>>>> Our script used to be --npernode type as well.
>>>>>>
>>>>>> As I told before, our cluster consists of nodes having 4, 8,
>>>>>> and 32 cores, although it used to be homogeneous at the
>>>>>> starting time. Furthermore, since performance of each core
>>>>>> is almost same, a mixed use of nodes with different number
>>>>>> of cores is possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8.
>>>>>>
>>>>>> --npernode type is not applicable to such a mixed use.
>>>>>> That's why I'd like to continue to use modified hostfile.
>>>>>>
>>>>>> By the way, the problem I reported to Jeff yesterday
>>>>>> was that openmpi-1.7 with torque is something wrong,
>>>>>> because it caused error against such a simple case as
>>>>>> shown below, which surprised me. Now, the problem is not
>>>>>> limited to modified hostfile, I guess.
>>>>>>
>>>>>> #PBS -l nodes=4:ppn=8
>>>>>> mpirun -np 8 ./my_program
>>>>>> (OMP_NUM_THREADS=4)
>>>>>>
>>>>>> Regards,
>>>>>> Tetsuya Mishima
>>>>>>
>>>>>>> Hi Tetsuya
>>>>>>>
>>>>>>> Your script that edits $PBS_NODEFILE into a separate hostfile
>>>>>>> is very similar to some that I used here for
>>>>>>> hybrid OpenMP+MPI programs on older versions of OMPI.
>>>>>>> I haven't tried this in 1.6.X,
>>>>>>> but it looks like you did and it works also.
>>>>>>> I haven't tried 1.7 either.
>>>>>>> Since we run production machines,
>>>>>>> I try to stick to the stable versions of OMPI (even numbered:
>>>>>>> 1.6.X, 1.4.X, 1.2.X).
>>>>>>>
>>>>>>> I believe you can get the same effect even if you
>>>>>>> don't edit your $PBS_NODEFILE and let OMPI use it as is.
>>>>>>> Say, if you choose carefully the values in your
>>>>>>> #PBS -l nodes=?:ppn=?
>>>>>>> and your
>>>>>>> $OMP_NUM_THREADS
>>>>>>> and use an mpiexec with --npernode or --cpus-per-proc.
>>>>>>>
>>>>>>> For instance, for twelve MPI processes, with two threads each,
>>>>>>> on nodes with eight cores each, I would try
>>>>>>> (but I haven't tried!):
>>>>>>>
>>>>>>> #PBS -l nodes=3:ppn=8
>>>>>>>
>>>>>>> export OMP_NUM_THREADS=2
>>>>>>>
>>>>>>> mpiexec -np 12 -npernode 4
>>>>>>>
>>>>>>> or perhaps more tightly:
>>>>>>>
>>>>>>> mpiexec -np 12 --report-bindings --bind-to-core --cpus-per-proc 2
>>>>>>>
>>>>>>> I hope this helps,
>>>>>>> Gus Correa
>>>>>>>
>>>>>>>
>>>>>>> On 03/19/2013 03:12 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>
>>>>>>>> Hi Reuti and Gus,
>>>>>>>>
>>>>>>>> Thank you for your comments.
>>>>>>>>
>>>>>>>> Our cluster is a little bit heterogeneous, which has nodes with
>>>>>>>> 4, 8, 32 cores.
>>>>>>>> I used 8-core nodes for "-l nodes=4:ppn=8" and 4-core nodes for
>>>>>>>> "-l nodes=2:ppn=4".
>>>>>>>> (strictly speaking, Torque picked up proper nodes.)
>>>>>>>>
>>>>>>>> As I mentioned before, I usually use openmpi-1.6.x, which has no
>>>>>>>> trouble against that kind of use. I encountered the issue when I
>>>>>>>> was evaluating openmpi-1.7 to check when we could move on to it,
>>>>>>>> although we have no positive reason to do that at this moment.
>>>>>>>>
>>>>>>>> As Gus pointed out, I use a script file as shown below for a
>>>>>>>> practical use of openmpi-1.6.x.
>>>>>>>>
>>>>>>>> #PBS -l nodes=2:ppn=32   # even "-l nodes=1:ppn=32+4:ppn=8" works fine
>>>>>>>> export OMP_NUM_THREADS=4
>>>>>>>> modify $PBS_NODEFILE pbs_hosts   # 64 lines are condensed to 16 lines here
>>>>>>>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -report-bindings \
>>>>>>>>   -x OMP_NUM_THREADS ./my_program   # 32-core node has 8 numanodes, 8-core node has 2 numanodes
>>>>>>>>
>>>>>>>> It works well under the combination of openmpi-1.6.x and Torque.
>>>>>>>> The problem is just openmpi-1.7's behavior.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Tetsuya Mishima
>>>>>>>>
>>>>>>>>> Hi Tetsuya Mishima
>>>>>>>>>
>>>>>>>>> Mpiexec offers you a number of possibilities that you could try:
>>>>>>>>> --bynode,
>>>>>>>>> --pernode,
>>>>>>>>> --npernode,
>>>>>>>>> --bysocket,
>>>>>>>>> --bycore,
>>>>>>>>> --cpus-per-proc,
>>>>>>>>> --cpus-per-rank,
>>>>>>>>> --rankfile
>>>>>>>>> and more.
>>>>>>>>>
>>>>>>>>> Most likely one or more of them will fit your needs.
>>>>>>>>>
>>>>>>>>> There are also associated flags to bind processes to cores,
>>>>>>>>> to sockets, etc, to report the bindings, and so on.
>>>>>>>>>
>>>>>>>>> Check the mpiexec man page for details.
>>>>>>>>>
>>>>>>>>> Nevertheless, I am surprised that modifying the
>>>>>>>>> $PBS_NODEFILE doesn't work for you in OMPI 1.7.
>>>>>>>>> I have done this many times in older versions of OMPI.
>>>>>>>>>
>>>>>>>>> Would it work for you to go back to the stable OMPI 1.6.X,
>>>>>>>>> or does it lack any special feature that you need?
>>>>>>>>>
>>>>>>>>> I hope this helps,
>>>>>>>>> Gus Correa
>>>>>>>>>
>>>>>>>>> On 03/19/2013 03:00 AM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Jeff,
>>>>>>>>>>
>>>>>>>>>> I didn't have much time to test this morning. So, I checked it
>>>>>>>>>> again now. Then, the trouble seems to depend on the number of
>>>>>>>>>> nodes to use.
>>>>>>>>>>
>>>>>>>>>> This works (nodes < 4):
>>>>>>>>>> mpiexec -bynode -np 4 ./my_program && #PBS -l nodes=2:ppn=8
>>>>>>>>>> (OMP_NUM_THREADS=4)
>>>>>>>>>>
>>>>>>>>>> This causes error (nodes >= 4):
>>>>>>>>>> mpiexec -bynode -np 8 ./my_program && #PBS -l nodes=4:ppn=8
>>>>>>>>>> (OMP_NUM_THREADS=4)
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>
>>>>>>>>>>> Oy; that's weird.
>>>>>>>>>>>
>>>>>>>>>>> I'm afraid we're going to have to wait for Ralph to answer why
>>>>>>>>>>> that is happening -- sorry!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mar 18, 2013, at 4:45 PM, <tmish...@jcity.maeda.co.jp> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Correa and Jeff,
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for your comments. I quickly checked your suggestion.
>>>>>>>>>>>>
>>>>>>>>>>>> As a result, my simple example case worked well.
>>>>>>>>>>>> export OMP_NUM_THREADS=4
>>>>>>>>>>>> mpiexec -bynode -np 2 ./my_program && #PBS -l nodes=2:ppn=4
>>>>>>>>>>>>
>>>>>>>>>>>> But, practical case that more than 1 process was allocated to
>>>>>>>>>>>> a node like below did not work.
>>>>>>>>>>>> export OMP_NUM_THREADS=4
>>>>>>>>>>>> mpiexec -bynode -np 4 ./my_program && #PBS -l nodes=2:ppn=8
>>>>>>>>>>>>
>>>>>>>>>>>> The error message is as follows:
>>>>>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
>>>>>>>>>>>> attempting to be sent to a process whose contact information
>>>>>>>>>>>> is unknown in file rml_oob_send.c at line 316
>>>>>>>>>>>> [node08.cluster:11946] [[30666,0],3] unable to find address for
>>>>>>>>>>>> [[30666,0],1]
>>>>>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
>>>>>>>>>>>> attempting to be sent to a process whose contact information
>>>>>>>>>>>> is unknown in file base/grpcomm_base_rollup.c at line 123
>>>>>>>>>>>>
>>>>>>>>>>>> Here is our openmpi configuration:
>>>>>>>>>>>> ./configure \
>>>>>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
>>>>>>>>>>>> --with-tm \
>>>>>>>>>>>> --with-verbs \
>>>>>>>>>>>> --disable-ipv6 \
>>>>>>>>>>>> CC=pgcc CFLAGS="-fast -tp k8-64e" \
>>>>>>>>>>>> CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
>>>>>>>>>>>> F77=pgfortran FFLAGS="-fast -tp k8-64e" \
>>>>>>>>>>>> FC=pgfortran FCFLAGS="-fast -tp k8-64e"
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 17, 2013, at 10:55 PM, Gustavo Correa
>>>>>>>>>>>>> <g...@ldeo.columbia.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> In your example, have you tried not to modify the node file,
>>>>>>>>>>>>>> launch two mpi processes with mpiexec, and request a "-bynode"
>>>>>>>>>>>>>> distribution of processes:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> mpiexec -bynode -np 2 ./my_program
>>>>>>>>>>>>>
>>>>>>>>>>>>> This should work in 1.7, too (I use these kinds of options
>>>>>>>>>>>>> with SLURM all the time).
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, we should probably verify that the hostfile
>>>>>>>>>>>>> functionality in batch jobs hasn't been broken in 1.7, too,
>>>>>>>>>>>>> because I'm pretty sure that what you described should work.
>>>>>>>>>>>>> However, Ralph, our run-time guy, is on vacation this week.
>>>>>>>>>>>>> There might be a delay in checking into this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>>> jsquy...@cisco.com
>>>>>>>>>>>>> For corporate legal information go to:
>>>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
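
Following up on Gus Correa's suggestion quoted above, a minimal, untested
sketch of the hostfile-free alternative under Torque (using only options
already quoted in this thread; my_program and the thread count are
placeholders) might look like:

#!/bin/sh
#PBS -l nodes=3:ppn=8
# 12 MPI ranks x 2 OpenMP threads on three 8-core nodes, letting Open MPI
# read the Torque allocation directly instead of a modified hostfile.
export OMP_NUM_THREADS=2
cd $PBS_O_WORKDIR
mpiexec -np 12 -npernode 4 -x OMP_NUM_THREADS ./my_program
# or, binding each rank to two cores:
# mpiexec -np 12 -report-bindings -bind-to-core -cpus-per-proc 2 \
#   -x OMP_NUM_THREADS ./my_program

Whether this avoids the 1.7rc8 failures reported above is exactly what the
thread is trying to establish; it is shown only to illustrate the
-npernode / -cpus-per-proc approach.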