Hi folks,

We have had a number of occasions where users run front-end code in their Slurm jobs on the launch nodes of our BlueGene/Q system and use up a large amount of memory/CPU in error.
Now on a BG/Q system, all that jobs on the launch node are meant to do is start back-end applications via srun, which then run on the compute nodes. So if a user manages to run the system out of memory and the resulting OOM kill takes out (say) GPFS, then we lose everyone's jobs running on that launch node.

I'd like to be able to use control groups to limit the amount of memory any single job can use on the launch node, and I'm wondering if anyone else is doing this? I can see you can set cores to be unconstrained (which is important, as our front-end node doesn't have 65,535 cores to match our 4-rack system), so it would appear possible, but I'd love to hear from anyone who is...

Of course there may be occasions where only part of the application runs on the back end and other parts that users need have to run on the front end (yes, GROMACS, I'm looking at you), but if we permit users to request memory then we are likely to be able to handle this gracefully.

cheers!
Chris

-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/        http://twitter.com/vlsci
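PS: for the record, the sort of thing I had in mind is roughly the below. The parameter names come from the slurm.conf and cgroup.conf man pages, but I've not tried this on our launch node yet, so please treat it as a sketch rather than a known-good config:

  # slurm.conf -- switch on the cgroup task plugin so each job step
  # started on the launch node lands in its own cgroup
  TaskPlugin=task/cgroup
  ProctrackType=proctrack/cgroup

  # cgroup.conf -- constrain RAM (and swap) but leave cores unconstrained,
  # since the front end has nowhere near the cores of the compute racks
  CgroupAutomount=yes
  ConstrainCores=no
  ConstrainRAMSpace=yes
  ConstrainSwapSpace=yes
  AllowedRAMSpace=100      # limit = 100% of the memory the job requested
  AllowedSwapSpace=0       # no extra swap on top of the RAM limit

For the RAM limit to actually bite, jobs would of course need to request memory (e.g. --mem=... on sbatch/srun), or we'd need a sensible DefMemPerNode set for the partition the launch node jobs run in.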