On Mon, Jan 13, 2014 at 10:53 PM, Michael Friedrich
<michael.friedr...@gmail.com> wrote:
> On 12.12.2013 13:32, Gerd Radecke wrote:
>>
>> Hi everybody,
>>
>> I'm looking at an issue with notifications and I'm unsure whether this
>> is working as designed or not.
>>
>> I'm getting service notifications when a service that has been in that
>> state for a long time changes from WARNING;HARD to CRITICAL;HARD
>> after one check because of a check timeout.
>> Three seconds later, the host check returns with DOWN;SOFT, yet only
>> once, so the host never gets to DOWN;HARD.
>>
>> I thought that if the host is down, no service notifications would be
>> sent.
>> http://docs.icinga.org/latest/en/checkscheduling.html#hostcheckscheduling
>> actually states that "when Icinga is check [sic!] the status of a
>> host, it holds off on doing anything else" - so I would expect it to
>> also not send the service notification I'm seeing until it's sure what
>> the host status is :/
>>
>> The log with comments is here:
>>
>> # 1. Status of db_server;Disk_E is WARNING;HARD and has been so for a
>> # while (also see the last line in this log)
>>
>> Dec 11 23:05:34 icinga_server icinga: SERVICE ALERT:
>> db_server;Disk_E;CRITICAL;HARD;3;CRITICAL - Socket timeout after 10
>> seconds
>>
>> # 2. When we get a CRITICAL for Disk_E because of the timeout, the
>> # status goes to CRITICAL;HARD, which conforms to
>> # http://docs.icinga.org/latest/en/statetypes.html - 5.8.4 and 5.8.5
>>
>> # 3. If I understand
>> # http://docs.icinga.org/latest/en/checkscheduling.html#hostcheckscheduling
>> # correctly, on every service state change, Icinga will do a check of
>> # the host to see if its status changed as well.

> There's a cache involved, not immediately forcing an actual host check
> itself.
> http://docs.icinga.org/latest/en/configmain.html#configmain-cached_host_check_horizon
>
> That said, if a previous failing service check triggered a host check,
> resulting in an UP state, it could happen that a service check
> afterwards within that check horizon will result in "host is assumed
> UP, please notify the service".

>> So in this case, a host check should be performed before any further
>> action is taken. What actually happens is that the result is
>> processed and a service notification is sent out immediately:
>>
>> Dec 11 23:05:34 icinga_server icinga: SERVICE NOTIFICATION:
>> prio1;db_server;Disk_E;CRITICAL;notify_service_email_24x7;CRITICAL -
>> Socket timeout after 10 seconds

> Any debug logs for specifically the host and all surrounding service
> checks? (level checks/events or higher, verbosity 2)

>> # 4. Only a few seconds afterwards does Icinga show new results for
>> # the host state, but they are still SOFT.
>>
>> Dec 11 23:05:37 icinga_server icinga: HOST ALERT:
>> db_server;DOWN;SOFT;1;CRITICAL - Host Unreachable (172.16.28.132)

> max_check_attempts of that host? What state (log entry) did it have
> before?

>> # 5. The host is reachable again.
>>
>> Dec 11 23:08:44 icinga_server icinga: HOST ALERT:
>> db_server;UP;SOFT;2;PING OK - Packet loss = 0%, RTA = 45.00 ms
>>
>> # 6. Service status goes back to WARNING.
>>
>> Dec 11 23:20:24 icinga_server icinga: SERVICE ALERT:
>> db_server;Disk_E;WARNING;HARD;3;e:\ - total: 180.00 Gb - used: 163.08
>> Gb (91%) - free 16.91 Gb (9%)
>>
>> So I'm wondering: is sending notifications on this described change
>> from WARNING -> CRITICAL
>> a) the correct behavior, or
>> b) should Icinga not send this service notification because the host
>> is DOWN and the service state can therefore not be determined?

> Depends on the host state in that specific situation, and whether it
> changed / was cached.
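Just to check that I'm reading the horizon behaviour right, here is a toy
Python sketch of that logic as I understand it from the docs (the function
and variable names are mine for illustration, not Icinga internals):

```python
# Toy model of cached_host_check_horizon: if the last real host check
# is younger than the horizon, its result is reused and no on-demand
# check is launched for the failing service.
import time

CACHED_HOST_CHECK_HORIZON = 15  # seconds, as in our icinga.cfg


def run_active_host_check(host):
    """Placeholder for an on-demand plugin run (e.g. check_ping)."""
    host["last_check"] = time.time()
    return host["state"]


def host_state_seen_by_service(host, now):
    """State a failing service check would base its notification on."""
    if now - host["last_check"] <= CACHED_HOST_CHECK_HORIZON:
        return host["state"]  # cached result - possibly a stale "UP"
    return run_active_host_check(host)


host = {"state": "UP", "last_check": 100.0}
print(host_state_seen_by_service(host, now=110.0))  # within horizon -> cached
print(host_state_seen_by_service(host, now=130.0))  # horizon passed -> fresh check
```

If that model holds, a cached UP from a host check up to 15 seconds old
would explain an immediate service notification.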
>
> --
> DI (FH) Michael Friedrich
>
> mail: michael.friedr...@gmail.com
> twitter: https://twitter.com/dnsmichi
> jabber: dnsmi...@jabber.ccc.de
> irc: irc.freenode.net/icinga dnsmichi
>
> icinga open source monitoring
> position: lead core developer
> url: https://www.icinga.org
>
> _______________________________________________
> icinga-users mailing list
> icinga-users@lists.icinga.org
> https://lists.icinga.org/mailman/listinfo/icinga-users
Hi Michael,

Thanks for reviving this thread. I'll try to give all the detail that
is needed. The involved config would be:

$ grep horizon /etc/icinga/icinga.cfg
cached_host_check_horizon=15
cached_service_check_horizon=15

and

debug_level=16
debug_verbosity=2

db_server host object:

define host {
        host_name                       db_server
        display_name                    db_server
        initial_state                   o
        check_command                   check_ping!1000,5%!4000,20%!1000,5%!4000,20%
        retry_interval                  60
        max_check_attempts              3
        check_interval                  60
        passive_checks_enabled          1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        notification_interval           0
        notification_period             24x7
        notifications_enabled           1
        failure_prediction_enabled      1
        active_checks_enabled           0
        alias                           db_server
        address                         10.10.10.161
        notification_options            d,u
        parents                         db_server_parent
        contact_groups                  admins
        contacts                        gradecke
}

So we have max_check_attempts = 3 for the host with a check_interval
of 60 (I know it's odd to only check it once an hour, but that's how
it was set up, and it actually makes it easier to ensure that the
cached_host_check_horizon is definitely over).

A custom check so I can return service states as I wish:

define service {
        host_name                       db_server
        service_description             Test check with manual result
        process_perf_data               1
        action_url                      /pnp4nagios/index.php/graph?host=$HOSTNAME$&srv=$SERVICEDESC$
        is_volatile                     0
        max_check_attempts              3
        normal_check_interval           5
        retry_interval                  2
        active_checks_enabled           1
        passive_checks_enabled          1
        check_period                    24x7
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        event_handler_enabled           1
        flap_detection_enabled          0
        retain_status_information       1
        retain_nonstatus_information    1
        notification_interval           0
        notification_period             24x7
        notifications_enabled           1
        failure_prediction_enabled      0
        display_name                    Test check with manual result
        check_command                   check_by_ssh_using_root!/root/manual_check.sh
        notification_options            w,u,c,r
        contact_groups                  admins
}

There are no other active checks on that host that could trigger a
host check.

The attached log is an icinga.debug log (I removed any redundant check
reaper messages and stuff not related to this host) where the
following happened:

1. 08:28 AM - Run a forced host check on db_server to ensure it is up
2. 08:31 AM - A scheduled check of "Test check with manual result" ran
   - state is WARNING;HARD - 3/3 attempts - next check is scheduled
   for 08:36
3. 08:33 AM - I took the db_server offline
4. 08:36 AM - A scheduled check of "Test check with manual result" ran
   - it failed with UNKNOWN because there is no SSH response (other
   checks might fail with CRITICAL - either way, the state changes
   from WARNING to some other non-OK status, which might trigger an
   alert)
   4.a) A check for host db_server is started
   4.b) The alert for the service is sent out
   4.c) The result of the db_server host check is reaped and it comes
   back as DOWN;SOFT - attempts 1/3

[timestamp "Test check with manual result" ran

The steps above are also included in the log file, prepended by "## ".

From how I understand cached_host_check_horizon and what I see in the
logs, I'd say the cache is not used. So I guess my question is still:
is sending notifications on this described change from WARNING ->
CRITICAL/UNKNOWN

a) the correct behavior, or
b) should Icinga not send this service notification because the host
is DOWN and the service state can therefore not be determined?

Regards,
Gerd
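PS: To pin down what I mean by option (b), here is a toy Python sketch of
the suppression rule I would expect (the names are mine for illustration,
not anything from the Icinga source, and I'm ignoring recovery
notifications to keep it short):

```python
# Toy restatement of option (b): notify on a HARD non-OK service state
# only while the host is (believed to be) UP. Recovery notifications
# are deliberately left out of this sketch.
def should_notify_service(state, state_type, host_state):
    if state_type != "HARD" or state == "OK":
        return False
    return host_state == "UP"


# 23:05:34 - host state still the cached/assumed UP -> alert goes out:
print(should_notify_service("CRITICAL", "HARD", "UP"))    # True
# What (b) argues for: with the host already known DOWN, suppress it:
print(should_notify_service("CRITICAL", "HARD", "DOWN"))  # False
```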
[Attachment: icinga.debug-shortened_final - binary data]
_______________________________________________
icinga-users mailing list
icinga-users@lists.icinga.org
https://lists.icinga.org/mailman/listinfo/icinga-users