We're using Icinga2, storing accounting data in InfluxDB for Grafana dashboards. In terms of monitoring I prefer to check end-user functionality, so apart from service checks we also have a plugin that submits a job to the cluster (to idle nodes, with a deadline of a few minutes). The job simply creates files on the shared filesystem, effectively monitoring slurmctld, slurmd, sssd, filesystems, etc.
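
For anyone who wants to try the same idea, here is a minimal sketch of such a check. It is not our actual plugin: the function names, the job name and the marker path are made up for illustration, and it does not try to pick idle nodes. It only assumes the standard sbatch options --parsable, --time, --deadline and --wrap, plus the usual Nagios/Icinga exit codes.

```python
import shlex
import subprocess
import time
from pathlib import Path

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard monitoring-plugin exit codes


def submit_canary(marker: Path, deadline_min: int = 5, run=subprocess.run) -> str:
    """Submit a tiny job that touches `marker` and return its job id."""
    cmd = [
        "sbatch", "--parsable",                    # print only "jobid[;cluster]"
        "--job-name=monitoring-canary",            # hypothetical job name
        "--time=1",                                # one-minute time limit
        f"--deadline=now+{deadline_min}minutes",   # drop the job if it cannot finish in time
        f"--wrap=touch {shlex.quote(str(marker))}",
    ]
    result = run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip().split(";")[0]


def wait_for_marker(marker: Path, timeout_s: float, poll_s: float = 5.0) -> bool:
    """Poll the shared filesystem until the marker appears or the timeout expires."""
    end = time.monotonic() + timeout_s
    while time.monotonic() < end:
        if marker.exists():
            return True
        time.sleep(poll_s)
    return marker.exists()


def check(marker: Path, timeout_s: float = 300.0, poll_s: float = 5.0,
          run=subprocess.run):
    """Return (exit_code, message) for one end-to-end test of the cluster."""
    marker.unlink(missing_ok=True)  # remove a stale marker from a previous run
    try:
        job_id = submit_canary(marker, run=run)
    except (OSError, subprocess.CalledProcessError) as exc:
        return CRITICAL, f"CRITICAL - sbatch failed: {exc}"
    if wait_for_marker(marker, timeout_s, poll_s):
        marker.unlink(missing_ok=True)
        return OK, f"OK - job {job_id} wrote {marker}"
    return CRITICAL, f"CRITICAL - job {job_id} did not write {marker} within {timeout_s:.0f}s"


def main() -> int:
    # Hypothetical marker path; it must be on the filesystem shared with the compute nodes.
    code, message = check(Path("/shared/monitoring/canary.marker"))
    print(message)
    return code
```

To use it as a plugin you would end the script with sys.exit(main()), so that Icinga2 sees the exit code and the one-line message. The `run` argument is only there so the check can be exercised without a real cluster.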

cheers,
Marcin

2018-01-19 5:44 GMT+01:00 Ryan Novosielski <novos...@rutgers.edu>:

> > On Jan 18, 2018, at 4:34 PM, Lachlan Musicman <data...@gmail.com> wrote:
> >
> > On 19 January 2018 at 07:29, Ryan Novosielski <novos...@rutgers.edu> wrote:
> > Hi all,
> >
> > Looked back at the mailing list to see if there was a question about
> > this already. There was some mention of /using/ Nagios, but no real
> > mention of specifics. What do people monitor with Nagios? We monitor,
> > so far, slurmctld, slurmdbd, and MySQL, but there are probably some
> > others. Might be helpful to run “scontrol ping” for example, or
> > similar, on our login nodes.
> >
> > Does anyone have any plugins they’ve written or ideas they can share?
> > Nagios Exchange doesn’t have anything with SLURM anywhere in the name.
> >
> > Thanks!
> >
> >
> > Off the top of my head the only other two that I would want explicitly
> > would be:
> > - ntp/chrony and their respective ntpd. Nodes go offline when the
> >   timing slides too far, especially if you are using Munge.
> > - authentication system - in our case ipa/sssd. Without that, even the
> >   queued jobs will fail.
> >
> > We use Zabbix in house. I was under the impression that people were
> > moving toward Icinga2 over Nagios.
>
> I wouldn’t mind moving to Icinga2 over Nagios, but really, it’s more or
> less a nicer version of the same thing, so I’d have the same question
> with Icinga2.
>
> Thanks for the NTP/Chrony tip though - if I get only that from this
> thread, it will have been worth it. That’s caused us trouble more than
> once. We do already monitor our LDAP, but SSSD is a good idea.
>
> --
> ____
> || \\UTGERS,     |---------------------------*O*---------------------------
> ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>      `'