On Thu, 15 Mar 2012 at 1:53pm, Reuti wrote:

> PS: In your example you also had the case of 2 slots in the low-priority queue; what is the actual setup in your cluster?

Our actual setup is:

 o lab.q, slots=numprocs, load_thresholds=np_load_avg=1.5, labs (=SGE
   projects) limited by RQS to a number of slots equal to their "share" of
   the cluster, seq_no=0, priority=0.

 o long.q, slots=numprocs, load_thresholds=np_load_avg=0.9, seq_no=1,
   priority=19.

 o short.q, slots=numprocs, load_thresholds=np_load_avg=1.25, users
   limited by RQS to 200 slots, runtime limited to 30 minutes, seq_no=2,
   priority=10.

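For the curious, here's roughly what that looks like in qconf terms. This is a paraphrase rather than a dump of our actual config; host lists and slot counts are elided, and the lab names and per-lab limits (lab_a, lab_b and their numbers) are made-up placeholders:

    # qconf -sq lab.q (excerpt)
    qname            lab.q
    seq_no           0
    priority         0
    load_thresholds  np_load_avg=1.50

    # qconf -sq long.q (excerpt)
    qname            long.q
    seq_no           1
    priority         19
    load_thresholds  np_load_avg=0.90

    # qconf -sq short.q (excerpt)
    qname            short.q
    seq_no           2
    priority         10
    load_thresholds  np_load_avg=1.25
    h_rt             0:30:00

    # qconf -srqs (sketch; one limit per lab project, plus the short.q cap)
    {
       name     lab_shares
       enabled  TRUE
       limit    projects lab_a to slots=64
       limit    projects lab_b to slots=128
    }
    {
       name     short_q_per_user
       enabled  TRUE
       limit    users {*} queues short.q to slots=200
    }
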
Users are instructed not to select a queue when submitting jobs. The theory is that even if non-contributing users have filled the cluster with long.q jobs, contributing users still get instant access to "their" lab.q slots: lab.q's higher load threshold lets it overload those nodes, and its priority of 0 (i.e. a lower nice value than long.q's 19) means the lab.q jobs run at higher OS priority than the long.q jobs. Conversely, long.q jobs won't start on nodes already full of lab.q jobs, since long.q's np_load_avg=0.9 threshold takes those nodes out of consideration. And short.q is for quick, high-priority jobs regardless of cluster status (the main use case being processing MRI data into images while a patient is physically in the scanner).
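
(In case it matters: all of this assumes the scheduler sorts queues by sequence number rather than by load, i.e. roughly this in the output of qconf -ssconf:

    queue_sort_method                 seqno

With that set, a bare "qsub job.sh" is considered against lab.q first, then long.q, then short.q, and lands in the first queue whose limits and load thresholds it passes.)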

The truth is our cluster is primarily used for, and thus SGE is tuned for, large numbers of serial jobs. We do have *some* folks running parallel code, and it *is* starting to get to the point where I need to reconfigure things to make that part work better.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
