Afraid I have no idea - we regularly run on Torque machines with the nodes fully populated. While most runs are only for a few hours, some runs go for days.
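One quick way to separate a launch problem from a communication problem (a sketch only; `hostname` stands in for any non-MPI executable, and the -np/-machinefile values are the ones from your failing 8+8+8 case below):

  $ mpirun -np 24 -machinefile $PBS_NODEFILE hostname | sort | uniq -c

If all 24 copies report back (8 per node), the runtime is launching fine on the fully packed nodes and the hang is somewhere in the MPI communication itself.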
How was OMPI configured? What OS version?

On Apr 13, 2011, at 9:09 AM, Rushton Martin wrote:

> Version 1.3.2
>
> Consider a job that will run with 28 processes. The user submits it
> with:
>
>   $ qsub -l nodes=4:ppn=7 ...
>
> which reserves 7 cores on (in this case) each of x3550x014, x3550x015,
> x3550x016 and x3550x020. Torque generates a file (PBS_NODEFILE) which
> lists each node 7 times.
>
> The mpirun command given within the batch script is:
>
>   $ mpirun -np 28 -machinefile $PBS_NODEFILE <executable>
>
> This is what I would refer to as 7+7+7+7, and it runs fine.
>
> The problem occurs if, for instance, a 24-core job is attempted. qsub
> gets nodes=3:ppn=8 and mpirun has -np 24. The job is now running on
> three nodes, using all eight cores on each node (8+8+8). This sort of
> job will eventually hang and has to be killed off.
>
>   Cores/node   Nodes   ppn    Result
>   ----------   -----   ----   ------
>        8         1     any    works
>        8        >1     1-7    works
>        8        >1     8      hangs
>       16         1     any    works
>       16        >1     1-15   works
>       16        >1     16     hangs
>
> We have also tried test jobs on 8+7 (or 7+8), with inconclusive
> results. Some of the live jobs run for a month or more, and cut-down
> versions do not model well.
>
> Martin Rushton
> HPC System Manager, Weapons Technologies
> Tel: 01959 514777, Mobile: 07939 219057
> email: jmrush...@qinetiq.com
> www.QinetiQ.com
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
> On Behalf Of Ralph Castain
> Sent: 13 April 2011 15:34
> To: Open MPI Users
> Subject: Re: [OMPI users] Over committing?
>
> On Apr 13, 2011, at 8:13 AM, Rushton Martin wrote:
>
>> The bulk of our compute nodes have 8 cores each (twin 4-core IBM
>> x3550-M2). Jobs are submitted by Torque/MOAB. When run with up to
>> np=8 there is good performance. Attempting to run with more
>> processes brings problems: specifically, if any one node of a group
>> of nodes has all 8 cores in use, the job hangs. For instance,
>> running with 14 cores (7+7) is fine, but running with 16 (8+8)
>> hangs.
>>
>> From the FAQs I note the issues of over committing and aggressive
>> scheduling. Is it possible for mpirun (or orted on the remote nodes)
>> to be blocked from progressing by a fully committed node? We have a
>> few x3755-M2 machines with 16 cores, and we have detected a similar
>> issue with 16+16.
>
> I'm not entirely sure I understand your notation, but we have never
> seen an issue when running with fully loaded nodes (i.e., where the
> number of MPI procs on the node = the number of cores).
>
> What version of OMPI are you using? Are you binding the procs?
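P.S. The configuration and binding details I'm after can usually be captured with something like the following (a sketch; the grep patterns are guesses at the relevant ompi_info labels, and the release file path varies by distribution):

  $ ompi_info --all | grep -i configure
  $ ompi_info --param mpi all | grep -i paffinity
  $ uname -a
  $ cat /etc/redhat-release    # or your distribution's equivalent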