...and note that it matters greatly where those barriers are. If a loop over a collective exists and the barriers aren't inside that loop, then they do nothing to address the problem I described.
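[For illustration, a minimal C sketch of the pattern Ralph describes: a loop over MPI_Allreduce with a barrier injected inside the loop so that a lagging rank is periodically pulled back into step. The loop bounds, the interval, and the computation are placeholders, not code from GRIM.]

  #include <mpi.h>

  #define SYNC_EVERY 1000   /* arbitrary resync interval (placeholder) */

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      double local = 1.0, global = 0.0;
      for (long iter = 0; iter < 1000000; iter++) {
          /* ... per-rank computation would go here ... */

          MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                        MPI_SUM, MPI_COMM_WORLD);

          /* Every SYNC_EVERY iterations, force all ranks to the same
           * point so a rank that has fallen behind can catch up
           * instead of falling further behind the others. */
          if (iter % SYNC_EVERY == SYNC_EVERY - 1)
              MPI_Barrier(MPI_COMM_WORLD);
      }

      MPI_Finalize();
      return 0;
  }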
On Apr 14, 2011, at 5:27 AM, Jeff Squyres wrote:

I think the next reasonable step is to use some kind of diagnostic to find out where and why the application is hung. padb is a great free/open-source tool that can be used here.

On Apr 14, 2011, at 4:46 AM, Rushton Martin wrote:

I forwarded your question to the code custodian and received the following reply (GRIM is the major code, the one which shows the problem): "I've not tried the debugger, but GRIM does have a number of mpi_barrier calls in it, so I would think we are safe there. There is of course a performance downside to over-using barriers, as mentioned in the e-mail trail."

On Apr 14, 2011, at 4:55 AM, Ralph Castain wrote:

Have you folks used a debugger such as TotalView or padb to look at these stalls?

I ask because we discovered a long time ago that MPI collectives can "hang" in the scenario you describe. It is caused by one rank falling behind and then never catching up because of resource contention: once a rank falls behind because its processor is being used by something else, it never catches up.

The code that causes this is generally a loop around a collective such as Allreduce. The solution was to inject a "barrier" operation into the loop periodically, thus ensuring that all ranks had an opportunity to catch up.

There is an MCA parameter you can set that will inject the barrier; it specifies injecting one every N collective operations (either before or after the Nth op):

  -mca coll_sync_barrier_before N

or

  -mca coll_sync_barrier_after N

It'll slow the job down a little, depending upon how often you inject the barrier, but it did allow us to run jobs reliably to completion when the code involved such issues.
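[For illustration, here is how the two suggestions above might look on the command line. The interval of 100, the process count, and the executable name grim.exe are placeholders; the padb invocation follows the form its documentation gives for merged stack traces, so check the options against your installed version.]

  $ mpirun -np 24 -machinefile $PBS_NODEFILE \
        -mca coll_sync_barrier_before 100 ./grim.exe

  $ padb --all --stack-trace --tree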
On Apr 13, 2011, at 10:07 AM, Rushton Martin wrote:

The 16 cores refers to the x3755-m2s. We have a mix of x3550s and x3755s in the cluster.

It could be memory, but I think not. The jobs are well within memory capacity, and the memory use is mainly static. If we were short of memory, these jobs would be the first casualties. Larger jobs run on the x3755s, which as well as having more memory have local disks to page to.

On Apr 13, 2011, at 5:09 PM, Rushton Martin wrote:

Version 1.3.2.

Consider a job that will run with 28 processes. The user submits it with:

  $ qsub -l nodes=4:ppn=7 ...

which reserves 7 cores on (in this case) each of x3550x014, x3550x015, x3550x016 and x3550x020. Torque generates a file ($PBS_NODEFILE) which lists each node 7 times.

The mpirun command given within the batch script is:

  $ mpirun -np 28 -machinefile $PBS_NODEFILE <executable>

This is what I would refer to as 7+7+7+7, and it runs fine.

The problem occurs if, for instance, a 24-core job is attempted: qsub gets nodes=3:ppn=8 and mpirun has -np 24. The job is now running on three nodes, using all eight cores on each node (8+8+8). This sort of job will eventually hang and has to be killed off.

  Cores  Nodes  Ppn    Result
  -----  -----  -----  ------
  8      1      any    works
  8      >1     1-7    works
  8      >1     8      hangs
  16     1      any    works
  16     >1     1-15   works
  16     >1     16     hangs

We have also tried test jobs on 8+7 (or 7+8), with inconclusive results. Some of the live jobs run for a month or more, and cut-down versions do not model the problem well.

On Apr 13, 2011, at 4:53 PM, Reuti wrote:

How many cores do you have in each system? Eight looks like the maximum IBM offers in the datasheet, and yet you can request 16 per node? Could it be a memory problem?

On Apr 13, 2011, at 8:13 AM, Rushton Martin wrote:

The bulk of our compute nodes have 8 cores (twin 4-core IBM x3550-m2). Jobs are submitted by Torque/MOAB. When run with up to np=8 there is good performance. Attempting to run with more processors brings problems: specifically, if any one node of a group of nodes has all 8 cores in use, the job hangs. For instance, running with 14 cores (7+7) is fine, but running with 16 (8+8) hangs.

From the FAQs I note the issues of over-committing and aggressive scheduling. Is it possible for mpirun (or orted on the remote nodes) to be blocked from progressing by a fully committed node? We have a few x3755-m2 machines with 16 cores, and we have detected a similar issue with 16+16.

On Apr 13, 2011, at 3:34 PM, Ralph Castain replied:

I'm not entirely sure I understand your notation, but we have never seen an issue when running with fully loaded nodes (i.e., where the number of MPI procs on the node equals the number of cores).

What version of OMPI are you using? Are you binding the procs?
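[Pulling together the submission details from Martin Rushton's message above, a minimal sketch of a Torque batch script for the working 28-process (7+7+7+7) case might look like the following. The executable name and the cd into the submission directory are illustrative assumptions, not taken from the original script.]

  #!/bin/bash
  #PBS -l nodes=4:ppn=7

  # Torque starts batch jobs in $HOME by default; move to the
  # directory the job was submitted from.
  cd $PBS_O_WORKDIR

  # $PBS_NODEFILE lists each allocated node once per requested
  # slot, i.e. 7 lines per node and 28 lines in total here.
  mpirun -np 28 -machinefile $PBS_NODEFILE ./grim.exe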