On 22.11.2013 at 19:34, Jason Gans wrote:

> On 11/22/13 11:18 AM, Lloyd Brown wrote:
>> As far as I understand, the mpirun will assign processes to hosts in the
>> hostlist ($PBS_NODEFILE) sequentially, and if it runs out of hosts in
>> the list, it starts over at the top of the file.
>>
>> Theoretically, you should be able to request specific hostnames, and the
>> processor counts per hostname, in your Torque submit request. I'm not
>> sure this is exact (we don't use Torque here anymore, and I'm going
>> from memory), but it should be approximately correct:
>>
>>> qsub -l
>>> nodes=n0000:2+n0001:2+n0002:8+n0003:8+n0004:2+n0005:2+n0006:2+n0007:4 ...
>
> Thanks! This is awkward, but it did the trick. To get the desired behavior I
> first had to provide a "fake" nodes file to Torque (where all of the nodes
> were listed as having a large number of processors, i.e. np=8). Now I can
> submit jobs using:
>
> qsub -I -l nodes=n0000:ppn=2+n0001:ppn=2+n0002:ppn=8+...

This shouldn't be necessary when Torque knows the number of cores in each
machine and you simply request the suggested 24 cores.

-- Reuti

> and get the expected behavior (including the expected $PBS_NODEFILE, where
> the name of each node appears "ppn" number of times).
>
> Thanks to everyone who responded!
>
> Regards,
>
> Jason
>
>> Granted, that's awkward, but I'm not sure if there's another way in
>> Torque to request different numbers of processors per node. You might
>> ask on the Torque Users list. They might tell you to change the nodes
>> file to reflect the number of actual processes you want on each node,
>> rather than the number of physical processors on the hosts. Whether
>> this works for you depends on whether you want this type of
>> oversubscription to happen all the time, or on a per-job basis, etc.
>>
>>
>> Lloyd Brown
>> Systems Administrator
>> Fulton Supercomputing Lab
>> Brigham Young University
>> http://marylou.byu.edu
>>
>> On 11/22/2013 11:11 AM, Gans, Jason D wrote:
>>> I have tried the 1.7 series (specifically 1.7.3) and I get the same
>>> behavior.
>>>
>>> When I run "mpirun -oversubscribe -np 24 hostname", three instances of
>>> "hostname" are run on each node.
>>>
>>> The contents of the $PBS_NODEFILE are:
>>> n0007
>>> n0006
>>> n0005
>>> n0004
>>> n0003
>>> n0002
>>> n0001
>>> n0000
>>>
>>> but, since I have compiled OpenMPI using "--with-tm", it appears
>>> that OpenMPI is not using the $PBS_NODEFILE (which I tested by modifying
>>> the Torque pbs_mom to write a $PBS_NODEFILE that contained "slot=xx"
>>> information for each node; mpirun complained when I did this).
>>>
>>> Regards,
>>>
>>> Jason
>>>
>>> ------------------------------------------------------------------------
>>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain
>>> [r...@open-mpi.org]
>>> Sent: Friday, November 22, 2013 11:04 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Oversubscription of nodes with Torque and
>>> OpenMPI
>>>
>>> Really shouldn't matter - this is clearly a bug in OMPI if it is doing
>>> mapping as you describe. Out of curiosity, have you tried the 1.7
>>> series? Does it behave the same?
>>>
>>> I can take a look at the code later today and try to figure out what
>>> happened.
>>>
>>> On Nov 22, 2013, at 9:56 AM, Jason Gans <jg...@lanl.gov> wrote:
>>>
>>>> On 11/22/13 10:47 AM, Reuti wrote:
>>>>> Hi,
>>>>>
>>>>> On 22.11.2013 at 17:32, Gans, Jason D wrote:
>>>>>
>>>>>> I would like to run an instance of my application on every *core* of
>>>>>> a small cluster. I am using Torque 2.5.12 to run jobs on the
>>>>>> cluster. The cluster in question is a heterogeneous collection of
>>>>>> machines that are all past their prime. Specifically, the number of
>>>>>> cores ranges from 2-8. Here is the Torque "nodes" file:
>>>>>>
>>>>>> n0000 np=2
>>>>>> n0001 np=2
>>>>>> n0002 np=8
>>>>>> n0003 np=8
>>>>>> n0004 np=2
>>>>>> n0005 np=2
>>>>>> n0006 np=2
>>>>>> n0007 np=4
>>>>>>
>>>>>> When I use openmpi-1.6.3, I can oversubscribe nodes, but the tasks
>>>>>> are allocated to nodes without regard to the number of cores on each
>>>>>> node (specified by the "np=xx" in the nodes file). For example, when
>>>>>> I run "mpirun -np 24 hostname", mpirun places three instances of
>>>>>> "hostname" on each node, despite the fact that some nodes only have
>>>>>> two processors and some have more.
>>>>>
>>>>> You submitted the job itself by requesting 24 cores for it too?
>>>>>
>>>>> -- Reuti
>>>>
>>>> Since there are only 8 Torque nodes in the cluster, I submitted the
>>>> job by requesting 8 nodes, i.e. "qsub -I -l nodes=8".
>>>>
>>>>>> Is there a way to have OpenMPI "gracefully" oversubscribe nodes by
>>>>>> allocating instances based on the "np=xx" information in the Torque
>>>>>> nodes file? Is this a Torque problem?
>>>>>>
>>>>>> p.s. I do get the desired behavior when I run *without* Torque and
>>>>>> specify the following machine file to mpirun:
>>>>>>
>>>>>> n0000 slots=2
>>>>>> n0001 slots=2
>>>>>> n0002 slots=8
>>>>>> n0003 slots=8
>>>>>> n0004 slots=2
>>>>>> n0005 slots=2
>>>>>> n0006 slots=2
>>>>>> n0007 slots=4
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Jason
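
For reference, the interactive request that Jason abbreviates above would,
going by Lloyd's suggested node list and the per-node core counts in the
Torque "nodes" file, presumably be spelled out roughly like this (illustrative
only; exact syntax can vary between Torque versions):

    qsub -I -l nodes=n0000:ppn=2+n0001:ppn=2+n0002:ppn=8+n0003:ppn=8+n0004:ppn=2+n0005:ppn=2+n0006:ppn=2+n0007:ppn=4

With a request of this form, each hostname should appear "ppn" times in the
resulting $PBS_NODEFILE, which is the per-node slot information mpirun needs.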
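
And a minimal sketch of the stand-alone (non-Torque) run that already behaves
correctly, assuming a hostfile named "my_hosts" with the slots= lines from
Jason's p.s.; the sort/uniq pipeline is just a quick way to count how many
ranks landed on each host:

    # hypothetical hostfile name; contents taken from Jason's p.s. above
    cat > my_hosts <<'EOF'
    n0000 slots=2
    n0001 slots=2
    n0002 slots=8
    n0003 slots=8
    n0004 slots=2
    n0005 slots=2
    n0006 slots=2
    n0007 slots=4
    EOF

    # 24 ranks should now be spread according to the slot counts
    mpirun --hostfile my_hosts -np 24 hostname | sort | uniq -c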