On Thu, 15 Mar 2012 at 1:53pm, Reuti wrote:

> PS: In your example you also had the case of 2 slots in the low-priority queue; what is the actual setup in your cluster?

Our actual setup is:

 o lab.q, slots=numprocs, load_thresholds=np_load_avg=1.5, labs (=SGE
   projects) limited by RQS to a number of slots equal to their "share" of
   the cluster, seq_no=0, priority=0.

 o long.q, slots=numprocs, load_thresholds=np_load_avg=0.9, seq_no=1,
   priority=19.

 o short.q, slots=numprocs, load_thresholds=np_load_avg=1.25, users
   limited by RQS to 200 slots, runtime limited to 30 minutes, seq_no=2,
   priority=10.

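For the curious, here's roughly what that looks like in qconf terms. This is a paraphrase rather than a dump of our actual config; host lists and slot counts are elided, and the lab names and per-lab limits (lab_a, lab_b and their numbers) are made-up placeholders:

    # qconf -sq lab.q (excerpt)
    qname            lab.q
    seq_no           0
    priority         0
    load_thresholds  np_load_avg=1.50

    # qconf -sq long.q (excerpt)
    qname            long.q
    seq_no           1
    priority         19
    load_thresholds  np_load_avg=0.90

    # qconf -sq short.q (excerpt)
    qname            short.q
    seq_no           2
    priority         10
    load_thresholds  np_load_avg=1.25
    h_rt             0:30:00

    # qconf -srqs (sketch; one limit per lab project, plus the short.q cap)
    {
       name     lab_shares
       enabled  TRUE
       limit    projects lab_a to slots=64
       limit    projects lab_b to slots=128
    }
    {
       name     short_q_per_user
       enabled  TRUE
       limit    users {*} queues short.q to slots=200
    }
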
Users are instructed not to select a queue when submitting jobs. The theory is that even if non-contributing users have filled the cluster with long.q jobs, contributing users still get instant access to "their" lab.q slots: lab.q's higher load threshold lets it overload those nodes, and its priority of 0 (i.e. a lower nice value than long.q's 19) means the lab.q jobs run at higher OS priority than the long.q jobs. Conversely, long.q jobs won't start on nodes already full of lab.q jobs, since long.q's np_load_avg=0.9 threshold takes those nodes out of consideration. And short.q is for quick, high-priority jobs regardless of cluster status (the main use case being processing MRI data into images while a patient is physically in the scanner).
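
(In case it matters: all of this assumes the scheduler sorts queues by sequence number rather than by load, i.e. roughly this in the output of qconf -ssconf:

    queue_sort_method                 seqno

With that set, a bare "qsub job.sh" is considered against lab.q first, then long.q, then short.q, and lands in the first queue whose limits and load thresholds it passes.)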

The truth is our cluster is primarily used for, and thus SGE is tuned for, large numbers of serial jobs. We do have *some* folks running parallel code, and it *is* starting to get to the point where I need to reconfigure things to make that part work better.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
