I think cgroups is probably more elegant, but here is another script:

https://github.com/FredHutch/IT/blob/master/py/loadwatcher.py#L59
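
For comparison, the cgroups route can be as simple as putting a CPUQuota on each user's systemd slice. Below is a minimal sketch of that idea; it assumes a systemd-based login node with per-user slices, it is not taken from loadwatcher.py, and the 400% figure just mirrors the policy text further down.

import pwd
import subprocess

# Same 4-core ceiling the policy email mentions; adjust to taste.
CPU_QUOTA = "400%"

def cap_user_cpu(username, quota=CPU_QUOTA):
    """Put a persistent CPUQuota on the user's systemd slice (cgroup)."""
    uid = pwd.getpwnam(username).pw_uid
    # systemd owns the per-user cgroup, so set the limit through systemctl
    # rather than writing to /sys/fs/cgroup directly.
    subprocess.check_call(
        ["systemctl", "set-property", "user-%d.slice" % uid,
         "CPUQuota=%s" % quota]
    )

if __name__ == "__main__":
    cap_user_cpu("jdoe")   # 'jdoe' is just the example user from the notice below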

The email text is hard-coded, so please change it before using. We put this in 
place in Oct 2017, when things were getting out of control because folks were 
using much more multithreaded software than before. Since then we have had 95 
users removed from one of the login nodes and several hundred warnings sent. 
The 'killall -9 -v -g -u username' has been very effective. We have 3 login 
nodes with 28 cores and almost 400 GB of RAM.
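
In case it is useful, the enforcement step boils down to something like the sketch below. It is simplified and not the actual loadwatcher.py code; it assumes procps ps and psmisc killall are available, and it leaves out the warning emails, grace periods and exemptions the real script handles.

import subprocess
from collections import defaultdict

CPU_LIMIT_PCT = 400.0   # 4 cores, matching the policy text below

def cpu_by_user():
    """Sum %CPU per user as reported by ps."""
    out = subprocess.check_output(
        ["ps", "-eo", "user:32,pcpu", "--no-headers"],
        universal_newlines=True,
    )
    totals = defaultdict(float)
    for line in out.splitlines():
        user, pcpu = line.split()
        totals[user] += float(pcpu)
    return totals

def enforce(limit=CPU_LIMIT_PCT):
    for user, pct in cpu_by_user().items():
        if user == "root" or pct <= limit:
            continue
        # -9 SIGKILL, -v verbose, -g whole process group, -u owned by this user
        subprocess.call(["killall", "-9", "-v", "-g", "-u", user])

if __name__ == "__main__":
    enforce()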

Dirk


-----Original Message-----
From: hpcxxx...@lists.fhcrc.org [mailto:hpcxxxxxxxx...@lists.fhcrc.org] On 
Behalf Of loadwatchxxxxxxxx...@fhcrc.org
Sent: Tuesday, November 14, 2017 11:45 AM
To: Doe, John <xxxxxxxxx @fredhutch.org>
Subject: [hpcpol] RHINO3: Your jobs have been removed!

This is a notification message from loadwatcher.py, running on host RHINO3.
Please review the following message:

jdoe, your CPU utilization on rhino3 is currently 4499 %!

For short term jobs you can use no more than 400 % or 4.0 CPU cores on the
Rhino machines.

We have removed all your processes from this computer.

Please try again and submit batch jobs or use the 'grabnode' command for
interactive jobs.

see http://scicomp.fhcrc.org/Gizmo%20Cluster%20Quickstart.aspx
or http://scicomp.fhcrc.org/Grab%20Commands.aspx
or http://scicomp.fhcrc.org/SciComp%20Office%20Hours.aspx

If output is being captured, you may find additional information in your logs.


Dirk Petersen
Scientific Computing Director
Fred Hutch
1100 Fairview Ave. N.
Mail Stop M4-A882
Seattle, WA 98109
Phone: 206.667.5926
Skype: internetchen

