I think cgroups is probably more elegant, but here is another script:
https://github.com/FredHutch/IT/blob/master/py/loadwatcher.py#L59

The email text is hard-coded, so please change it before using. We put this in place in October 2017, when things were getting out of control because folks were using much more multithreaded software than before. Since then we have had 95 users removed from one of the login nodes and several hundred warnings sent. The "killall -9 -v -g -u username" has been very effective. We have 3 login nodes with 28 cores and almost 400 GB RAM.

Dirk

-----Original Message-----
From: hpcxxx...@lists.fhcrc.org [mailto:hpcxxxxxxxx...@lists.fhcrc.org] On Behalf Of loadwatchxxxxxxxx...@fhcrc.org
Sent: Tuesday, November 14, 2017 11:45 AM
To: Doe, John <xxxxxxxxx @fredhutch.org>
Subject: [hpcpol] RHINO3: Your jobs have been removed!

This is a notification message from loadwatcher.py, running on host RHINO3. Please review the following message:

jdoe, your CPU utilization on rhino3 is currently 4499%! For short-term jobs you can use no more than 400%, or 4.0 CPU cores, on the Rhino machines. We have removed all your processes from this computer. Please try again and submit batch jobs, or use the 'grabnode' command for interactive jobs.

See:
http://scicomp.fhcrc.org/Gizmo%20Cluster%20Quickstart.aspx
http://scicomp.fhcrc.org/Grab%20Commands.aspx
http://scicomp.fhcrc.org/SciComp%20Office%20Hours.aspx

If output is being captured, you may find additional information in your logs.

Dirk Petersen
Scientific Computing Director
Fred Hutch
1100 Fairview Ave. N.  Mail Stop M4-A882
Seattle, WA 98109
Phone: 206.667.5926  Skype: internetchen
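For anyone who wants the gist without reading the full script: the core idea is to sum per-user CPU utilization on the login node, then warn or kill users who exceed the threshold. Below is a minimal sketch of that approach, not the actual loadwatcher.py code; the 400% limit, the helper names, and the use of ps output are assumptions for illustration.

```python
import subprocess
from collections import defaultdict

# Assumed policy from the notification above: 400% = 4.0 cores on login nodes.
CPU_LIMIT_PCT = 400.0

def cpu_by_user(ps_output=None):
    """Sum %CPU per user. If ps_output is None, query the live system
    via `ps -eo user,pcpu` (procps on Linux); otherwise parse the given
    text, which makes the function easy to test."""
    if ps_output is None:
        ps_output = subprocess.check_output(
            ["ps", "-eo", "user,pcpu", "--no-headers"], text=True)
    totals = defaultdict(float)
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            user, pcpu = parts
            totals[user] += float(pcpu)
    return dict(totals)

def offenders(totals, limit=CPU_LIMIT_PCT):
    """Return the users whose summed %CPU exceeds the limit."""
    return {user: cpu for user, cpu in totals.items() if cpu > limit}

# For each offender, the real script emails a warning and, on repeat
# offenses, removes every process in the user's sessions with:
#   killall -9 -v -g -u <username>
```

The -g flag kills entire process groups, which is what makes this effective against multithreaded jobs that respawn workers.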