...and note that it matters greatly where those barriers are - i.e., if a loop
over a collective exists and the barriers aren't inside that loop, then they
do nothing to address the problem I described.
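
To make the pattern concrete, here is a minimal sketch in C (not GRIM's
actual code - the loop body, the variable names, and the interval of 1000
are just placeholders) of a loop over a collective with an occasional
barrier placed *inside* the loop, where it can actually help:

  #include <mpi.h>

  static void iterate(MPI_Comm comm, int niters)
  {
      double local = 1.0, global;
      for (int i = 0; i < niters; i++) {
          /* ... local work that may run at slightly different speeds
             on different ranks ... */
          MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);

          /* An occasional barrier *inside* the loop gives a lagging rank
             a chance to catch up; a barrier outside the loop does not
             help.  The interval here is arbitrary. */
          if ((i + 1) % 1000 == 0)
              MPI_Barrier(comm);
      }
  }

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      iterate(MPI_COMM_WORLD, 100000);
      MPI_Finalize();
      return 0;
  }

If modifying the source isn't an option, the coll_sync MCA parameters
mentioned further down in this thread do much the same thing from the
command line - something like (again, the interval is just an example):

  $ mpirun -np 24 -machinefile $PBS_NODEFILE -mca coll_sync_barrier_after 1000 <executable>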


On Apr 14, 2011, at 5:27 AM, Jeff Squyres wrote:

> I think the next reasonable step is to use some kind of diagnostic to find 
> out where and why the application is hung.  padb is a great free/open source 
> tool that can be used here.
> 
> 
> On Apr 14, 2011, at 4:46 AM, Rushton Martin wrote:
> 
>> I forwarded your question to the code custodian and received the
>> following reply (GRIM is the major code, the one which shows the
>> problem): "I've not tried the debugger but GRIM does have a number of
>> mpi_barrier calls in it so I would think we are safe there. There is of
>> course a performance downside with an over-use of barriers! As mentioned
>> in the e-trail."
>> 
>> 
>> Martin Rushton
>> HPC System Manager, Weapons Technologies
>> Tel: 01959 514777, Mobile: 07939 219057
>> email: jmrush...@qinetiq.com
>> www.QinetiQ.com
>> QinetiQ - Delivering customer-focused solutions
>> 
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>> Behalf Of Ralph Castain
>> Sent: 14 April 2011 04:55
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Over committing?
>> 
>> Have you folks used a debugger such as TotalView or padb to look at
>> these stalls?
>> 
>> I ask because we discovered a long time ago that MPI collectives can
>> "hang" in the scenario you describe. It is caused by one rank falling
>> behind, and then never catching up due to resource allocations - i.e.,
>> once you fall behind due to the processor being used by something else,
>> you never catch up.
>> 
>> The code that causes this is generally a loop around a collective such
>> as Allreduce. The solution was to inject a "barrier" operation in the
>> loop periodically, thus ensuring that all ranks had an opportunity to
>> catch up.
>> 
>> There is an MCA param you can set that will inject the barrier - it
>> specifies to inject it every N collective operations (either before or
>> after the Nth op):
>> 
>> -mca coll_sync_barrier_before N
>> 
>> or 
>> 
>> -mca coll_sync_barrier_after N
>> 
>> It'll slow the job down a little, depending upon how often you inject
>> the barrier. But it did allow us to run jobs reliably to completion when
>> the code involved such issues.
>> 
>> 
>> On Apr 13, 2011, at 10:07 AM, Rushton Martin wrote:
>> 
>>> The 16 cores refers to the x3755-m2s.  We have a mix of 3550s and 3755s
>>> in the cluster.
>>> 
>>> It could be memory, but I think not.  The jobs are well within memory
>>> capacity, and the memory is mainly static.  If memory were the problem,
>>> these jobs would be the first candidates to show it.  Larger jobs run on
>>> the 3755s, which as well as having more memory have local disks to page
>>> to.
>>> 
>>> 
>>> Martin Rushton
>>> HPC System Manager, Weapons Technologies
>>> Tel: 01959 514777, Mobile: 07939 219057
>>> email: jmrush...@qinetiq.com
>>> www.QinetiQ.com
>>> QinetiQ - Delivering customer-focused solutions
>>> 
>>> -----Original Message-----
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] 
>>> On Behalf Of Reuti
>>> Sent: 13 April 2011 16:53
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Over committing?
>>> 
>>> On 13.04.2011 at 17:09, Rushton Martin wrote:
>>> 
>>>> Version 1.3.2
>>>> 
>>>> Consider a job that will run with 28 processes.  The user submits it
>>>> with:
>>>> 
>>>> $ qsub -l nodes=4:ppn=7 ...
>>>> 
>>>> which reserves 7 cores on (in this case) each of x3550x014, x3550x015,
>>>> x3550x016 and x3550x020.  Torque generates a file (PBS_NODEFILE) which
>>>> lists each node 7 times.
>>>> 
>>>> The mpirun command given within the batch script is:
>>>> 
>>>> $ mpirun -np 28 -machinefile $PBS_NODEFILE <executable>
>>>> 
>>>> This is what I would refer to as 7+7+7+7, and it runs fine.
>>>> 
>>>> The problem occurs if, for instance, a 24 core job is attempted.  qsub
>>>> gets nodes=3:ppn=8 and mpirun has -np 24.  The job is now running on
>>>> three nodes, using all eight cores on each node - 8+8+8.  This sort of
>>>> job will eventually hang and has to be killed off.
>>>> 
>>>> Cores   Nodes   Ppn    Result
>>>> -----   -----   ----   ------
>>>> 8       1       any    works
>>>> 8       >1      1-7    works
>>>> 8       >1      8      hangs
>>>> 16      1       any    works
>>>> 16      >1      1-15   works
>>>> 16      >1      16     hangs
>>> 
>>> How many cores do you have in each system? Looks like 8 is the maximum
>>> IBM offers from their datasheet, and still you can request 16 per node?
>>> 
>>> Can it be a memory problem?
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> We have also tried test jobs on 8+7 (or 7+8) with inconclusive results.
>>>> Some of the live jobs run for a month or more and cut down versions do
>>>> not model well.
>>>> 
>>>> Martin Rushton
>>>> HPC System Manager, Weapons Technologies
>>>> Tel: 01959 514777, Mobile: 07939 219057
>>>> email: jmrush...@qinetiq.com
>>>> www.QinetiQ.com
>>>> QinetiQ - Delivering customer-focused solutions
>>>> 
>>>> -----Original Message-----
>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>>>> On Behalf Of Ralph Castain
>>>> Sent: 13 April 2011 15:34
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] Over committing?
>>>> 
>>>> 
>>>> On Apr 13, 2011, at 8:13 AM, Rushton Martin wrote:
>>>> 
>>>>> The bulk of our compute nodes are 8 cores (twin 4-core IBM x3550-m2).
>>>>> Jobs are submitted by Torque/MOAB.  When run with up to np=8 there is
>>>>> good performance.  Attempting to run with more processors brings
>>>>> problems, specifically if any one node of a group of nodes has all 8
>>>>> cores in use the job hangs.  For instance running with 14 cores (7+7)
>>>>> is fine, but running with 16 (8+8) hangs.
>>>>> 
>>>>> From the FAQs I note the issues of over committing and aggressive
>>>>> scheduling.  Is it possible for mpirun (or orted on the remote nodes)
>>>>> to be blocked from progressing by a fully committed node?  We have a
>>>>> few x3755-m2 machines with 16 cores, and we have detected a similar
>>>>> issue with 16+16.
>>>> 
>>>> I'm not entirely sure I understand your notation, but we have never 
>>>> seen an issue when running with fully loaded nodes (i.e., where the 
>>>> number of MPI procs on the node = the number of cores).
>>>> 
>>>> What version of OMPI are you using? Are you binding the procs?
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 