Heya,

On 19:21 Wed 26 Mar, Gus Correa wrote:
> On 03/26/2014 05:26 PM, Ross Boylan wrote:
> > [Main part is at the bottom]
> > On Wed, 2014-03-26 at 19:28 +0100, Andreas Schäfer wrote:
> >> On 09:08 Wed 26 Mar, Ross Boylan wrote:
> >>> Second, we do not operate in a batch queuing environment
> >> Why not fix that?
> > I'm not the sysadmin, though I'm involved in the group that sets policy.
> > At one point we were using Sun's grid engine, but I don't think it's
> > installed now. I'm not sure why.
> >
> > We have discussed putting in a batch queuing system and nobody was
> > really pushing for it. My impression was (and probably still is) that
> > it was more pain than gain. There is hassle not only for the sysadmin
> > to set it up (and, I suppose, monitor it), but for users. Personally I
> > run a lot of interactive parallel jobs (the interaction is on rank 0
> > only). I have the impression that won't work under a batch system,
> > though I could be wrong. I also had the impression we'd need to have an
> > estimate of how long the job would run when we submit, and we don't
> > always know.
> >
> > But I've never really used such a system, and may not appreciate what it
> > would get us. The other reason we haven't bothered is that the load on
> > the cluster was relatively light and contention was low. That is less
> > and less true, which probably starts tipping the balance toward a
> > queuing system.
> >
> > This is wandering off topic, but if you or anyone else could say more
> > about why you regard the absence of a queuing system as a problem that
> > should be fixed, I'd love to hear it.
> >
> > Ross
>
> Hi Ross
>
> Some pros:
> (I don't know of any cons.)

I second Gus' statement that there are no real downsides to a queueing
system. These systems actually relieve both users and admins of a lot
of tedious fiddling and debugging. If you're doing a fresh install,
then I'd suggest you use Slurm[1]. It's a breeze to install and easy to
maintain. It also integrates well with all major MPI implementations.
Yes, the admin and users need to invest time to learn the ropes, but
the payoff is almost instant.

Source: I'm the sysadmin for our research clusters.

> Queue systems won't allow resources to be oversubscribed.

I'm fairly confident that you can configure Slurm to oversubscribe
nodes: just specify more cores for a node than are actually present.

> Queue systems do support interactive jobs (even with X-windows GUIs, if
> needed).

Right, actually we've just moved a couple of systems, which are
primarily running interactive jobs, to Slurm to ease arbitration of
resources. Previously users were frequently stepping on each other's
toes (Who's pinning jobs to which core? Who's using which GPU? How much
RAM do you consume?). These problems are gone now.

Cheers
-Andreas

[1] https://computing.llnl.gov/linux/slurm/

-- 
==========================================================
Andreas Schäfer
HPC and Grid Computing
Chair of Computer Science 3
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-27910
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==========================================================

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your
signature to help him gain world domination!