Afraid I have no idea - we regularly run on Torque machines with the nodes fully populated. While most runs are only for a few hours, some runs go for days.
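One quick way to separate a launch problem from a communication problem (a sketch only; `hostname` stands in for any non-MPI executable, and the -np/-machinefile values are the ones from your failing 8+8+8 case below):

  $ mpirun -np 24 -machinefile $PBS_NODEFILE hostname | sort | uniq -c

If all 24 copies report back (8 per node), the runtime is launching fine on the fully packed nodes and the hang is somewhere in the MPI communication itself.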
How was OMPI configured? What OS version?

On Apr 13, 2011, at 9:09 AM, Rushton Martin wrote:

> Version 1.3.2
>
> Consider a job that will run with 28 processes. The user submits it
> with:
>
>   $ qsub -l nodes=4:ppn=7 ...
>
> which reserves 7 cores on (in this case) each of x3550x014, x3550x015,
> x3550x016 and x3550x020. Torque generates a file (PBS_NODEFILE) which
> lists each node 7 times.
>
> The mpirun command given within the batch script is:
>
>   $ mpirun -np 28 -machinefile $PBS_NODEFILE <executable>
>
> This is what I would refer to as 7+7+7+7, and it runs fine.
>
> The problem occurs if, for instance, a 24-core job is attempted. qsub
> gets nodes=3:ppn=8 and mpirun has -np 24. The job is now running on
> three nodes, using all eight cores on each node (8+8+8). This sort of
> job will eventually hang and has to be killed off.
>
>   Cores/node   Nodes   ppn    Result
>   ----------   -----   ----   ------
>        8         1     any    works
>        8        >1     1-7    works
>        8        >1     8      hangs
>       16         1     any    works
>       16        >1     1-15   works
>       16        >1     16     hangs
>
> We have also tried test jobs on 8+7 (or 7+8), with inconclusive
> results. Some of the live jobs run for a month or more, and cut-down
> versions do not model well.
>
> Martin Rushton
> HPC System Manager, Weapons Technologies
> Tel: 01959 514777, Mobile: 07939 219057
> email: jmrush...@qinetiq.com
> www.QinetiQ.com
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
> On Behalf Of Ralph Castain
> Sent: 13 April 2011 15:34
> To: Open MPI Users
> Subject: Re: [OMPI users] Over committing?
>
> On Apr 13, 2011, at 8:13 AM, Rushton Martin wrote:
>
>> The bulk of our compute nodes have 8 cores each (twin 4-core IBM
>> x3550-M2). Jobs are submitted by Torque/MOAB. When run with up to
>> np=8 there is good performance. Attempting to run with more
>> processes brings problems: specifically, if any one node of a group
>> of nodes has all 8 cores in use, the job hangs. For instance,
>> running with 14 cores (7+7) is fine, but running with 16 (8+8)
>> hangs.
>>
>> From the FAQs I note the issues of over committing and aggressive
>> scheduling. Is it possible for mpirun (or orted on the remote nodes)
>> to be blocked from progressing by a fully committed node? We have a
>> few x3755-M2 machines with 16 cores, and we have detected a similar
>> issue with 16+16.
>
> I'm not entirely sure I understand your notation, but we have never
> seen an issue when running with fully loaded nodes (i.e., where the
> number of MPI procs on the node = the number of cores).
>
> What version of OMPI are you using? Are you binding the procs?
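P.S. The configuration and binding details I'm after can usually be captured with something like the following (a sketch; the grep patterns are guesses at the relevant ompi_info labels, and the release file path varies by distribution):

  $ ompi_info --all | grep -i configure
  $ ompi_info --param mpi all | grep -i paffinity
  $ uname -a
  $ cat /etc/redhat-release    # or your distribution's equivalent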