Hi folks,

We have had a number of occasions where users run front-end code in their Slurm jobs on the launch nodes of our BlueGene/Q system and use up a large amount of memory/CPU in error.
Now on a BG/Q system, all that jobs on the launch node are meant to do is start back-end applications via srun, which then run on the compute nodes. So if a user manages to run the system out of memory and the resulting OOM kill takes out (say) GPFS, then we lose everyone's jobs running on that launch node.

I'd like to be able to use control groups to limit the amount of memory any single job can use on the launch node, and I'm wondering if anyone else is doing this? I can see you can set cores to be unconstrained (which is important, as our front-end node doesn't have 65,535 cores to match our 4-rack system), so it would appear possible, but I'd love to hear from anyone who is...

Of course there may be occasions where only part of the application runs on the back end and other parts that users need have to run on the front end (yes, GROMACS, I'm looking at you), but if we permit users to request memory then we are likely to be able to handle this gracefully.

cheers!
Chris

-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/        http://twitter.com/vlsci
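PS: for the record, the sort of thing I had in mind is roughly the below. The parameter names come from the slurm.conf and cgroup.conf man pages, but I've not tried this on our launch node yet, so please treat it as a sketch rather than a known-good config:

  # slurm.conf -- switch on the cgroup task plugin so each job step
  # started on the launch node lands in its own cgroup
  TaskPlugin=task/cgroup
  ProctrackType=proctrack/cgroup

  # cgroup.conf -- constrain RAM (and swap) but leave cores unconstrained,
  # since the front end has nowhere near the cores of the compute racks
  CgroupAutomount=yes
  ConstrainCores=no
  ConstrainRAMSpace=yes
  ConstrainSwapSpace=yes
  AllowedRAMSpace=100      # limit = 100% of the memory the job requested
  AllowedSwapSpace=0       # no extra swap on top of the RAM limit

For the RAM limit to actually bite, jobs would of course need to request memory (e.g. --mem=... on sbatch/srun), or we'd need a sensible DefMemPerNode set for the partition the launch node jobs run in.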