On Nov 2, 2007, at 11:02 AM, himanshu khandelia wrote:
This question is about the use of the simulation package GROMACS. On our cluster (quad-core nodes), GROMACS does not scale well beyond 4 CPUs, so I wish to run two different simulations in one job, requesting 2 nodes (one simulation on each node) to best exploit the policies of our Maui scheduler.

So I am requesting two 4-CPU nodes on a cluster using PBS, and I want to run a separate simulation on each 4-CPU node. However, on 2 nodes, each simulation runs 50 to 100% slower than the same simulation in a job that requests only one node. I am guessing this is because Open MPI fails to assign all CPUs of the same node to one simulation; instead, CPUs from different nodes are being used to run each simulation.

This is what I have in the PBS script:

1.
########
mpirun -np 4 my-gromacs-executable-for-simulation-1 -np 4 &
mpirun -np 4 my-gromacs-executable-for-simulation-2 -np 4 &
In this case, Open MPI does not realize that you have executed 2 mpiruns and therefore assigns both the first and second job to the same 4 processors. Hence, they run at half speed (or slower) because they're competing for the same CPUs.
# (THE GROMACS EXECUTABLE DOES REQUIRE A REPEAT REQUEST FOR THE NUMBER OF PROCESSORS)
wait
########

Open MPI does have a mechanism whereby one can assign specific processes to specific nodes:
http://www.open-mpi.org/faq/?category=running#mpirun-scheduling

So I also tried both of the following in the PBS script, where the --bynode or the --byslot option is used:

2.
########
mpirun -np 4 --bynode my-gromacs-executable-for-simulation-1 -np 4 &
mpirun -np 4 --bynode my-gromacs-executable-for-simulation-2 -np 4 &
wait
########

3.
########
mpirun -np 4 --byslot my-gromacs-executable-for-simulation-1 -np 4 &
mpirun -np 4 --byslot my-gromacs-executable-for-simulation-2 -np 4 &
wait
########

But these methods also result in similar performance losses.
The same thing will happen here -- OMPI is unaware of the 2 mpiruns, and therefore schedules on the same first 4 nodes or slots.
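To illustrate: with a request like "-l nodes=2:ppn=4", Torque gives the whole job a single allocation, roughly what you would see in $PBS_NODEFILE (the hostnames below are hypothetical):

$ cat $PBS_NODEFILE
node01
node01
node01
node01
node02
node02
node02
node02

Each mpirun sees that identical list. With the default --byslot mapping both mpiruns land on node01's four slots; with --bynode both spread across the same two nodes. Either way the two simulations end up sharing CPUs.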
It sounds like you really want to run two Torque jobs, not one. Is there a reason you're not doing that?
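A minimal sketch of that two-job approach (the script and executable names here are placeholders, not from the thread): submit each simulation as its own single-node job and let Maui place them on separate nodes.

# run-sim1.pbs (run-sim2.pbs is identical apart from the executable name)
#PBS -l nodes=1:ppn=4
cd $PBS_O_WORKDIR
mpirun -np 4 my-gromacs-executable-for-simulation-1 -np 4

# submit both jobs:
qsub run-sim1.pbs
qsub run-sim2.pbs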
Failing that, you could do an end-run around Open MPI's Torque support and use the rsh launcher to precisely control where you launch jobs, but that's a bunch of trouble and not really how we intended the system to be used.
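If you do stay with a single 2-node job, one untested sketch of that end-run is to split the Torque allocation into two per-node Open MPI hostfiles and point each mpirun at one of them; the exact MCA parameter that selects the rsh launcher differs between Open MPI releases, so check your version before relying on this.

#PBS -l nodes=2:ppn=4
cd $PBS_O_WORKDIR

# build one Open MPI hostfile per allocated node (4 slots each)
sort -u $PBS_NODEFILE | sed -n 1p | awk '{print $1 " slots=4"}' > host1
sort -u $PBS_NODEFILE | sed -n 2p | awk '{print $1 " slots=4"}' > host2

# force the rsh/ssh launcher so the per-node hostfiles are honored
# (the framework is named "pls" in the 1.2 series, "plm" in later releases;
#  this also requires passwordless rsh/ssh to the compute nodes)
mpirun --mca pls rsh -np 4 --hostfile host1 my-gromacs-executable-for-simulation-1 -np 4 &
mpirun --mca pls rsh -np 4 --hostfile host2 my-gromacs-executable-for-simulation-2 -np 4 &
wait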
So how does one assign the CPUs properly using mpirun when running different simulations in the same PBS job?

Thank you for the help,
-Himanshu
--
Jeff Squyres
Cisco Systems