Public bug reported:

Currently, when using charmhelpers.contrib.charmsupport.nrpe.add_check(), service checks are defined with max_check_attempts = 4 and retry_check_interval = 1. This means that when a service fault is detected, 4 consecutive checks of that service must return a non-OK result before it becomes a HARD fault that triggers notification through alerting (PagerDuty, email, etc.).
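To make the timing concrete, here is a rough sketch of the arithmetic behind those two settings. It assumes the Nagios default interval_length of 60 seconds (so a retry interval of 1 means one minute) and that the first non-OK result counts as attempt 1. The function names are hypothetical and are not part of charmhelpers or either charm.

```python
import math


def minutes_until_hard(max_check_attempts, retry_interval_minutes=1):
    """Minutes between the first non-OK result and the HARD state.

    Assumption: the first non-OK result is attempt 1, and each later
    attempt runs one retry interval after the previous one.
    """
    if max_check_attempts < 1 or retry_interval_minutes <= 0:
        raise ValueError("attempts must be >= 1 and interval must be > 0")
    return (max_check_attempts - 1) * retry_interval_minutes


def attempts_for_delay(delay_minutes, retry_interval_minutes=1):
    """Smallest max_check_attempts that holds off a HARD state for delay_minutes."""
    if delay_minutes < 0 or retry_interval_minutes <= 0:
        raise ValueError("delay must be >= 0 and interval must be > 0")
    return math.ceil(delay_minutes / retry_interval_minutes) + 1


# Current defaults described in this bug: 4 attempts, 1 minute apart.
print(minutes_until_hard(4))    # 3
# Holding an alert back for 2 hours at a 1 minute retry interval.
print(attempts_for_delay(120))  # 121
```

Under these assumptions the current defaults escalate about 3 minutes after the first failed check, while a 2 hour grace period would need max_check_attempts = 121 at the same retry interval (or fewer attempts with a longer retry interval).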
Some checks defined in NRPE and by other charms cross their thresholds in a known ebb-and-flow pattern, which produces alerts that resolve themselves. One example is rabbitmq-server's unconsumed messages threshold: when a nova/neutron node restarts, the unconsumed fanout queues swell for up to 30 minutes until nova or neutron reaps them.

It would be very useful to let charm developers and NRPE check developers set different max_check_attempts values, so they can identify which checks should alert immediately and which should, potentially, not alert unless they have been active for 2 hours.

See https://bugs.launchpad.net/charm-hw-health/+bug/1876931 for an example where the ability to ignore IPMI hardware timeouts for a couple of hours would reduce operational overhead. Such services are known to have issues that self-resolve in normal circumstances, whereas an actual issue would persist well past the check attempt window.

** Affects: charm-nagios
     Importance: Undecided
         Status: New

** Affects: charm-nrpe
     Importance: Undecided
         Status: New

** Also affects: charm-nagios
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1877400

Title:
  Need ability to tune service checks to non-default notification profiles

Status in Nagios Charm:
  New
Status in NRPE Charm:
  New

