We're moving to Prometheus for lots of our monitoring functions. We've got Nagios and Ganglia in place, but Prometheus and Grafana make a really nice combo for monitoring and alerting.
There's even an exporter for Slurm, https://github.com/vpenso/prometheus-slurm-exporter, that includes node data, job information, and scheduling statistics. We haven't had a chance to install it yet, but I expect we will soon: scheduler performance is one area we need to watch a little more closely.

Michael

On Thu, Jan 18, 2018 at 1:34 PM, Lachlan Musicman <data...@gmail.com> wrote:

> On 19 January 2018 at 07:29, Ryan Novosielski <novos...@rutgers.edu>
> wrote:
>
>> Hi all,
>>
>> Looked back at the mailing list to see if there was a question about this
>> already. There was some mention of /using/ Nagios, but no real mention of
>> specifics. What do people monitor with Nagios? We monitor, so far,
>> slurmctld, slurmdbd, and MySQL, but there are probably some others. Might
>> be helpful to run “scontrol ping” for example, or similar, on our login
>> nodes.
>>
>> Does anyone have any plugins they’ve written or ideas they can share?
>> Nagios Exchange doesn’t have anything with SLURM anywhere in the name.
>>
>> Thanks!
>>
>
> Off the top of my head the only other two that I would want explicitly
> would be:
> - ntp/chrony and their respective ntpd. Nodes go offline when the timing
>   slides too far, especially if you are using Munge.
> - authentication system - in our case ipa/sssd. Without that, even the
>   queued jobs will fail.
>
> We use Zabbix in house. I was under the impression that people were moving
> toward Icinga 2 over Nagios.
>
> Cheers
> L.
>
> ------
> "The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic
> civics is the insistence that we cannot ignore the truth, nor should we
> panic about it. It is a shared consciousness that our institutions have
> failed and our ecosystem is collapsing, yet we are still here - and we are
> creative agents who can shape our destinies. Apocalyptic civics is the
> conviction that the only way out is through, and the only way through is
> together."
>
> *Greg Bloom* @greggish
> https://twitter.com/greggish/status/873177525903609857
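For the "scontrol ping" idea in the quoted question, here is a rough sketch of what a Nagios-style check might look like. It is not a plugin anyone in this thread has posted. The function names (evaluate, main) are made up for illustration, and it assumes that each line of "scontrol ping" output ends in a state word such as UP or DOWN, or a slash-joined pair like UP/DOWN on releases that report both controllers on one line. Check that against the output of your own Slurm version before relying on it. The exit codes are the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
#!/usr/bin/env python3
"""Nagios-style wrapper around `scontrol ping` (illustrative sketch)."""
import subprocess

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(ping_output):
    """Map `scontrol ping` text to a (status, message) pair.

    Assumption: the last word of each line carries the controller state,
    either "UP"/"DOWN" or a slash-joined pair such as "UP/DOWN".
    """
    states = []
    for line in ping_output.splitlines():
        words = line.split()
        if words:
            states.extend(s for s in words[-1].split("/") if s in ("UP", "DOWN"))
    if not states:
        return UNKNOWN, "UNKNOWN - no UP/DOWN state found in scontrol ping output"
    summary = " ".join(ping_output.split())  # collapse to a single status line
    if "UP" not in states:
        return CRITICAL, f"CRITICAL - {summary}"
    if "DOWN" in states:
        return WARNING, f"WARNING - {summary}"
    return OK, f"OK - {summary}"


def main():
    """Run `scontrol ping`, print one status line, return the exit code."""
    try:
        result = subprocess.run(
            ["scontrol", "ping"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"UNKNOWN - could not run scontrol: {exc}")
        return UNKNOWN
    status, message = evaluate(result.stdout)
    print(message)
    return status

# When installed as a plugin, end the script with: raise SystemExit(main())
```

Run from a login node, this would report WARNING when one controller is down but another is still up, and CRITICAL when none respond. Whether a dead backup deserves WARNING or CRITICAL is a site decision, so treat the mapping as a starting point.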