Hello,

On the following system:

OpenMPI 1.1.1
SGE 6.0 (with tight integration)
Scientific Linux 4.3
Dual Dual-Core Opterons


MPI jobs are oversubscribing the nodes. No matter where the scheduler launches jobs, they always stack up on the first node (node00) and keep stacking even though the system load exceeds 6 (on a 4-processor box). Each node is defined as 4 slots with 4 max slots. The MPI jobs launch via "mpirun -np <number of processors>" from within the scheduler.
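
For reference, the submit script is roughly like the following (the PE name, slot count, and program name here are placeholders, not our exact setup):

    #!/bin/sh
    #$ -N mpi_job                  # job name (placeholder)
    #$ -pe mpi 4                   # parallel environment and slot count (placeholders)
    #$ -cwd
    # the process count is given explicitly on the mpirun line
    mpirun -np 4 ./our_mpi_program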

It seems to me that MPI is not detecting that the nodes are overloaded, and that this is due to the way the job slots are defined and how mpirun is being called. If I read the documentation correctly, a single mpirun run consumes one job slot no matter how many processes it launches. We can change the number of job slots, but then we expect to waste processors, since only one mpirun job will run on any node even if that job only needs two processors.
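
To illustrate, the per-node slot definitions look something like this (Open MPI hostfile-style syntax, quoted from memory, so it may not be exact):

    node00 slots=4 max_slots=4
    node01 slots=4 max_slots=4
    node02 slots=4 max_slots=4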

Can someone enlighten me?

-geoff

