On May 4, 2021, at 12:41, Bill Anderson via lustre-discuss
<[email protected]> wrote:

   Hi All,

   Can you recommend good ways to identify Lustre client hosts that might be 
causing stability or performance problems for the entire filesystem?

   For example, if a user is inadvertently doing something that's creating an 
RPC storm, what are good ways to identify the client host that has triggered 
the storm?

If you have JobID enabled on the clients (which can be done even if they are 
not batch scheduled, e.g. "procname_uid" for login nodes), then you can watch 
"lctl get_param *.*.job_stats | grep -v ' 0, unit:'" (to filter out unused 
stats) to see if there are *jobs* putting a high RPC load on that server.
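
As a rough illustration of reading that output, here is a minimal Python 
sketch that sums the "samples" counters per job_id across all targets on a 
server. It is not an existing Lustre tool: the function name is hypothetical, 
and it assumes the YAML-like job_stats layout ("- job_id: ..." followed by 
"op: { samples: N, unit: ... }" lines), which may vary between Lustre 
versions.

```python
import re
import sys
from collections import Counter

# Assumed layout of job_stats entries (may differ between Lustre versions):
#   - job_id:          bash.1000
#     write_bytes:     { samples: 12, unit: bytes, min: ..., max: ..., sum: ... }
JOB_RE = re.compile(r"^-\s*job_id:\s*(\S+)")
OP_RE = re.compile(r"^\s+(\w+):\s*\{\s*samples:\s*(\d+),")


def rpc_samples_per_job(text):
    """Sum the 'samples' counters of every operation, grouped by job_id."""
    totals = Counter()
    job = None
    for line in text.splitlines():
        m = JOB_RE.match(line)
        if m:
            job = m.group(1)
            continue
        m = OP_RE.match(line)
        if m and job is not None:
            count = int(m.group(2))
            if count:  # skip unused counters, like the grep -v above
                totals[job] += count
    return totals


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass a file name, or "-" to read the lctl output from stdin.
    stream = sys.stdin if sys.argv[1] == "-" else open(sys.argv[1])
    for job, count in rpc_samples_per_job(stream.read()).most_common(10):
        print(f"{count:>12}  {job}")
```

Saved under a name of your choosing (say job_top.py), it could be run on a 
server as "lctl get_param *.*.job_stats | python3 job_top.py -". The counters 
are cumulative, so comparing two runs a few seconds apart gives a better 
picture of the current load than a single snapshot.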

If you are looking for a particular *client* you can look at "lctl get_param 
*.*.exports.*.stats" to see if any are driving a lot of RPCs, possibly after 
clearing those stats with "lctl set_param *.*.exports.*.stats=0".

If you feel inclined, it would be quite useful to add a mode to the "llstat" 
utility to be able to read and aggregate stats from e.g. all the 
"exports.*.stats" files and show the top users by NID and RPC count.  I think 
several people have written scripts to this effect (you might even find some 
on GitHub), but nobody has ever submitted one for inclusion in the repo for 
everyone to use.  There are more elaborate monitoring systems (e.g. IML, lltop, 
Grafana) that need agents installed, central monitoring, etc., but having a 
simple "check load on the local node like 'top'" tool would still be helpful.
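
A minimal Python sketch of such an aggregation is shown below. It is only an 
illustration, not the proposed llstat mode: the function name is hypothetical, 
and it assumes each stats block starts with a header line ending in 
".exports.<NID>.stats=" followed by "<op> <count> samples [unit] ..." lines. 
It treats the sum of all sample counts as an approximate RPC count per NID.

```python
import re
import sys
from collections import Counter

# Assumed header line, e.g. "obdfilter.fs-OST0000.exports.10.0.0.1@tcp.stats="
HEADER_RE = re.compile(r"^\S+?\.exports\.(.+)\.stats=$")


def rpc_samples_per_nid(text):
    """Sum the per-operation sample counts, grouped by client NID."""
    totals = Counter()
    nid = None
    for line in text.splitlines():
        m = HEADER_RE.match(line.strip())
        if m:
            nid = m.group(1)
            continue
        fields = line.split()
        # Counter lines look like "<op> <count> samples [unit] ...";
        # timestamp lines such as snapshot_time do not match and are skipped.
        if (nid is not None and len(fields) >= 3
                and fields[2] == "samples" and fields[1].isdigit()):
            totals[nid] += int(fields[1])
    return totals


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass a file name, or "-" to read the lctl output from stdin.
    stream = sys.stdin if sys.argv[1] == "-" else open(sys.argv[1])
    for nid, count in rpc_samples_per_nid(stream.read()).most_common(10):
        print(f"{count:>12}  {nid}")
```

Run as "lctl get_param *.*.exports.*.stats | python3 nid_top.py -" (the 
script name is arbitrary), ideally shortly after clearing the stats so that 
the totals reflect recent activity rather than everything since mount.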

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
