[OMPI users] nodes are oversubscribed in 1.1.1

2007-01-23 Thread Geoff Galitz



Hello,

On the following system:

OpenMPI 1.1.1
SGE 6.0 (with tight integration)
Scientific Linux 4.3
Dual Dual-Core Opterons


MPI jobs are oversubscribing the nodes.  No matter where jobs are  
launched by the scheduler, they always stack up on the first node  
(node00) and continue to stack even though the system load exceeds 6  
(on a 4-processor box).  Each node is defined as 4 slots with 4 max  
slots.  The MPI jobs launch via "mpirun -np  
(some-number-of-processors)" from within the scheduler.


It seems to me that MPI is not detecting that the nodes are  
overloaded, and that this is due to the way the job slots are defined  
and how mpirun is being called.  If I read the documentation  
correctly, a single mpirun run consumes one job slot no matter how  
many processes it launches.  We can change the number of job slots,  
but then we expect to waste processors, since only one mpirun job  
will run on any node, even if the job is only a two-processor job.


Can someone enlighten me?

-geoff




Re: [OMPI users] nodes are oversubscribed in 1.1.1

2007-01-24 Thread Geoff Galitz


On Jan 24, 2007, at 7:03 AM, Pak Lui wrote:


Geoff Galitz wrote:

Hello,
On the following system:
OpenMPI 1.1.1
SGE 6.0 (with tight integration)
Scientific Linux 4.3
Dual Dual-Core Opterons
MPI jobs are oversubscribing the nodes.  No matter where jobs  
are launched by the scheduler, they always stack up on the first  
node (node00) and continue to stack even though the system load  
exceeds 6 (on a 4-processor box).  Each node is defined as 4  
slots with 4 max slots.  The MPI jobs launch via "mpirun -np  
(some-number-of-processors)" from within the scheduler.


Hi Geoff,

I think we first started shipping SGE support in 1.2, not in 1.1.1.  
Unless you have modified your installation to include the gridengine  
ras/pls modules from v1.2, you are probably not using the SGE tight  
integration.  So even though you start mpirun in the SGE parallel  
environment, ORTE does not have the gridengine modules for allocating  
and launching the jobs, which could be why all of the processes are  
launched on the same node (there is no node list available from  
gridengine, so it defaults to a single node).
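
One quick way to check whether the gridengine components are actually  
present in a build is to ask ompi_info; in a v1.2 (or backported)  
install you would expect the gridengine component to be listed under  
the "ras" and "pls" frameworks, and nothing at all in a stock 1.1.1:

  # list any gridengine allocation/launch components built into Open MPI
  ompi_info | grep gridengine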




I have used the backport instructions provided by Olli-Pekka Lehto.   
Of course, whether it is running properly in my case I can't say, as  
I am certainly not getting the expected behavior, although the jobs do run.


On a related note, there is a way to control how SGE allocates and  
assigns slots for launching tasks: the allocation rule in the  
parallel environment (PE).  If all of the slots are allocated on the  
same node, it sounds like the allocation rule has been set to  
$fill_up.  Maybe you can try with $round_robin instead?
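
For reference, a minimal sketch of what the PE definition might look  
like, as shown by "qconf -sp mpi" (the PE name and slot count here  
are placeholders, not your actual configuration):

  pe_name           mpi
  slots             64
  user_lists        NONE
  xuser_lists       NONE
  start_proc_args   /bin/true
  stop_proc_args    /bin/true
  allocation_rule   $round_robin
  control_slaves    TRUE
  job_is_first_task FALSE

With $fill_up SGE packs slots onto as few nodes as possible,  
$round_robin hands out one slot per node before wrapping around, and  
a fixed integer (e.g. 4) allocates exactly that many slots per node.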





If I use $round_robin, one MPI process starts up per node and then  
wraps around the cluster.  So if I have a 4-process MPI job, it  
starts 1 process on each of 4 nodes, which is certainly not the most  
efficient method.


It seems to me that MPI is not detecting that the nodes are  
overloaded, and that this is due to the way the job slots are  
defined and how mpirun is being called.  If I read the  
documentation correctly, a single mpirun run consumes one job slot  
no matter how many processes it launches.  We can change the  
number of job slots, but then we expect to waste processors, since  
only one mpirun job will run on any node, even if the job is only  
a two-processor job.


As for oversubscription, I remember we started having the  
--nooversubscribe option in v1.2, so you can use that if you want to  
keep ORTE from oversubscribing; by default, oversubscription is allowed.
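
Assuming a v1.2 (or backported) install, that would look something  
like the following (the application name is a placeholder):

  # refuse to start more processes than there are available slots
  mpirun --nooversubscribe -np 4 ./my_mpi_app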




So it seems the real story for me is that OpenMPI 1.1.1 has no logic  
that detects the oversubscription condition and re-schedules the job  
onto another node in the MPI node list?  If so, that would certainly  
explain what I am seeing.  Is that correct?


-geoff


[OMPI users] hostfile syntax

2007-03-22 Thread Geoff Galitz


Does the hostfile understand the syntax:

mybox cpu=4

I have some legacy code and scripts that I'd like to move without  
modifying if possible.  I understand the syntax is supposed to be:


mybox slots=4

but using "cpu" seems to work.  Does that achieve the same thing?

-geoff



[OMPI users] migration FAQ

2007-03-30 Thread Geoff Galitz



Sidenote -- maybe I should create an "I used to be a LAM user" section
of the FAQ...



Actually, a migration FAQ would be a good idea.  I am another former  
LAM user and had lots of questions about parameter syntax and "I did  
it this way in LAM, how do I do it here?"  I had the luxury of time  
to do some empirical testing, but a migration FAQ would be useful to  
folks, I think.


-geoff

Geoff Galitz
ge...@galitz.org