On Nov 22, 2013, at 10:26 AM, Jason Gans <jg...@lanl.gov> wrote:

> On 11/22/13 11:15 AM, Ralph Castain wrote:
>>
>> On Nov 22, 2013, at 10:03 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>>
>>> On 22.11.2013, at 18:56, Jason Gans wrote:
>>>
>>>> On 11/22/13 10:47 AM, Reuti wrote:
>>>>> Hi,
>>>>>
>>>>> On 22.11.2013, at 17:32, Gans, Jason D wrote:
>>>>>
>>>>>> I would like to run an instance of my application on every *core* of a
>>>>>> small cluster. I am using Torque 2.5.12 to run jobs on the cluster. The
>>>>>> cluster in question is a heterogeneous collection of machines that are
>>>>>> all past their prime. Specifically, the number of cores ranges from 2-8.
>>>>>> Here is the Torque "nodes" file:
>>>>>>
>>>>>> n0000 np=2
>>>>>> n0001 np=2
>>>>>> n0002 np=8
>>>>>> n0003 np=8
>>>>>> n0004 np=2
>>>>>> n0005 np=2
>>>>>> n0006 np=2
>>>>>> n0007 np=4
>>>>>>
>>>>>> When I use openmpi-1.6.3, I can oversubscribe nodes, but the tasks are
>>>>>> allocated to nodes without regard to the number of cores on each node
>>>>>> (specified by the "np=xx" in the nodes file). For example, when I run
>>>>>> "mpirun -np 24 hostname", mpirun places three instances of "hostname" on
>>>>>> each node, despite the fact that some nodes only have two processors and
>>>>>> some have more.
>>>>>
>>>>> You submitted the job itself by requesting 24 cores for it too?
>>>>>
>>>>> -- Reuti
>>>>
>>>> Since there are only 8 Torque nodes in the cluster, I submitted the job by
>>>> requesting 8 nodes, i.e. "qsub -I -l nodes=8".
>>>
>>> No, AFAICT it's necessary to request 24 there too. To investigate further,
>>> it would also be good to copy the $PBS_NODEFILE to your home directory in
>>> your job script for later inspection, i.e. to check whether you are already
>>> getting the correct values at that point.
>>
>> Not really - we take the number of slots on each node and add them together.
>>
>> Question: is that a copy/paste of the actual PBS_NODEFILE? It doesn't look
>> right to me - there is supposed to be one node entry for each slot. In other
>> words, it should have looked like this:
>>
>>> n0000
>>> n0000
>>> n0001
>>> n0001
>>> n0002
>>> n0002
>> ...
>>
> That is what I expected -- however, the $PBS_NODEFILE lists each node just
> once.
Been a while since I used Torque, but I suspect Reuti is right - you have to
ask for 24 slots. Sounds like Torque is only assigning you one slot/node.

>>> -- Reuti
>>>
>>>>>> Is there a way to have OpenMPI "gracefully" oversubscribe nodes by
>>>>>> allocating instances based on the "np=xx" information in the Torque
>>>>>> nodes file? Is this a Torque problem?
>>>>>>
>>>>>> p.s. I do get the desired behavior when I run *without* Torque and
>>>>>> specify the following machine file to mpirun:
>>>>>>
>>>>>> n0000 slots=2
>>>>>> n0001 slots=2
>>>>>> n0002 slots=8
>>>>>> n0003 slots=8
>>>>>> n0004 slots=2
>>>>>> n0005 slots=2
>>>>>> n0006 slots=2
>>>>>> n0007 slots=4
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Jason
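
For concreteness, a sketch of what requesting all 24 slots could look like
under Torque 2.5. Note the "procs" resource is an assumption here - it is
scheduler-dependent (Moab/Maui honor it; a stock pbs_sched build may not) -
so treat this as a sketch, not a confirmed recipe:

    # Ask Torque for 24 processors in total, packed onto nodes however
    # the scheduler sees fit (assumes the scheduler supports "procs"):
    qsub -I -l procs=24

    # Or spell out the heterogeneous layout host by host, matching the
    # np=xx values from the nodes file above:
    qsub -I -l nodes=n0000:ppn=2+n0001:ppn=2+n0002:ppn=8+n0003:ppn=8+n0004:ppn=2+n0005:ppn=2+n0006:ppn=2+n0007:ppn=4

With either form, $PBS_NODEFILE should then contain one line per slot, which
is what Open MPI counts when deciding how many processes fit on each node.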
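
And a minimal job-script sketch of Reuti's debugging suggestion, assuming a
hypothetical copy destination in $HOME:

    #!/bin/sh
    # Save the node file Torque generated for this job so it can be
    # inspected after the run. With one slot per core there should be
    # one line per slot, e.g. n0000 listed twice and n0002 listed
    # eight times.
    cp "$PBS_NODEFILE" "$HOME/pbs_nodefile.$PBS_JOBID"

    # Under Torque, Open MPI reads the allocation itself (via its tm
    # support), so no -np or machinefile is needed here:
    mpirun hostname

If the saved file still lists each node only once, the problem is in the
Torque request or configuration rather than in Open MPI's mapping.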