Thank you so much!
On Tue, May 4, 2021 at 1:31 PM Andreas Dilger <[email protected]> wrote:

> On May 4, 2021, at 12:41, Bill Anderson via lustre-discuss
> <[email protected]> wrote:
>
> > Hi All,
> >
> > Can you recommend good ways to identify Lustre client hosts that might
> > be causing stability or performance problems for the entire filesystem?
> >
> > For example, if a user is inadvertently doing something that's creating
> > an RPC storm, what are good ways to identify the client host that has
> > triggered the storm?
>
> If you have a JobID enabled on the clients (which can be done even if they
> are not batch scheduled, like "procname_uid" for login nodes), then you can
> watch "lctl get_param *.*.job_stats | grep -v ' 0, unit:'" (to filter out
> unused stats) to see if there are *jobs* which put a high RPC load on that
> server.
>
> If you are looking for a particular *client* you can look at "lctl
> get_param *.*.exports.*.stats" to see if any are driving a lot of RPCs,
> possibly after clearing those stats with "lctl set_param
> *.*.exports.*.stats=0".
>
> If you feel inclined, it would be quite useful to add a mode to the
> "llstat" utility to be able to read and aggregate stats from e.g. all the
> "exports.*.stats" files and show the top users by NID and RPC count. I
> think several people have made scripts to this effect (you might even find
> some on GitHub), but nobody has ever submitted one to be included in the
> repo for everyone to use. There are more elaborate monitoring systems
> (e.g. IML, lltop, Grafana) that need agents installed, central monitoring,
> etc., but having a simple "check load on the local node like 'top'" tool
> would still be helpful.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Lustre Architect
> Whamcloud
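
For anyone who finds this thread later, here is a rough sketch of the "top users by NID" idea described above. It is not the llstat mode Andreas suggests, just a small stand-alone script. It assumes the usual "lctl get_param" layout, where each export is introduced by a line like "obdfilter.testfs-OST0000.exports.10.0.0.5@tcp.stats=" followed by counter lines of the form "<name> <count> samples [<unit>] ...". The function name top_clients is made up for illustration, and summing the sample counts is only a proxy for RPC count, so treat the ranking as a hint.

```python
#!/usr/bin/env python3
"""Rank Lustre client NIDs by total samples seen in exports.*.stats output."""
import re
import sys
from collections import Counter

# Assumed header format printed by lctl before each export's stats, e.g.
#   obdfilter.testfs-OST0000.exports.10.0.0.5@tcp.stats=
HEADER = re.compile(r"^\S+?\.exports\.(?P<nid>\S+)\.stats=$")


def top_clients(text, limit=10):
    """Return [(nid, total_samples), ...] sorted by total, highest first."""
    totals = Counter()
    nid = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            # Start of a new export; the same NID may appear under several targets.
            nid = match.group("nid")
            continue
        fields = line.split()
        # Counter lines look like "<name> <count> samples [<unit>] ...";
        # snapshot_time and similar lines do not match and are skipped.
        if nid and len(fields) >= 3 and fields[2] == "samples" and fields[1].isdigit():
            totals[nid] += int(fields[1])
    return totals.most_common(limit)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: save the lctl output to a file first, then pass the file name.
    with open(sys.argv[1]) as handle:
        for client, count in top_clients(handle.read()):
            print(f"{count:>12}  {client}")
```

To try it on a server, save the output of "lctl get_param '*.*.exports.*.stats'" to a file and pass that file name to the script. Clearing the stats first, as suggested above, makes the totals reflect recent activity only.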
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
