On Mon, Jan 13, 2014 at 10:53 PM, Michael Friedrich
<michael.friedr...@gmail.com> wrote:
> On 12.12.2013 13:32, Gerd Radecke wrote:
>>
>> Hi everybody,
>>
>> I'm looking at an issue with notifications and I'm unsure whether this
>> is working as designed or not.
>>
>> I'm getting service notifications when a service that has been in that
>> state for a long time changes from WARNING;HARD to CRITICAL;HARD
>> after one check because of a check timeout.
>> Three seconds later, the host check returns with DOWN;SOFT, yet only
>> once, so the host never gets to DOWN;HARD.
>>
>> I thought that if the host is down, no service notifications would be
>> sent.
>> http://docs.icinga.org/latest/en/checkscheduling.html#hostcheckscheduling
>> actually states that "when Icinga is check [sic!] the status of a
>> host, it holds off on doing anything else" - so I would expect it to
>> also not send the service notification I'm seeing until it's sure what
>> the host status is :/
>>
>> The log with comments is here:
>>
>> # 1. Status of db_server;Disk_E is WARNING;HARD and has been so for a
>> # while (also see the last line in this log)
>>
>> Dec 11 23:05:34 icinga_server icinga: SERVICE ALERT:
>> db_server;Disk_E;CRITICAL;HARD;3;CRITICAL - Socket timeout after 10
>> seconds
>>
>> # 2. When we get a CRITICAL for Disk_E because of the timeout, the
>> # status goes to CRITICAL;HARD, which conforms to
>> # http://docs.icinga.org/latest/en/statetypes.html - 5.8.4 and 5.8.5
>>
>> # 3. If I understand
>> # http://docs.icinga.org/latest/en/checkscheduling.html#hostcheckscheduling
>> # correctly, on every service state change, Icinga will do a check of
>> # the host to see if its status changed as well.

> There's a cache involved, not immediately forcing an actual host check
> itself.
> http://docs.icinga.org/latest/en/configmain.html#configmain-cached_host_check_horizon
>
> That said, if a previous failing service check triggered a host check,
> resulting in an UP state, it could happen that a service check
> afterwards within that check horizon will result in "host is assumed
> UP, please notify the service".

>> So in this case, a host check should be performed before any further
>> action is taken. What actually happens is that the result is
>> processed and a service notification is sent out immediately:
>>
>> Dec 11 23:05:34 icinga_server icinga: SERVICE NOTIFICATION:
>> prio1;db_server;Disk_E;CRITICAL;notify_service_email_24x7;CRITICAL -
>> Socket timeout after 10 seconds

> Any debug logs for specifically the host and all surrounding service
> checks? (level checks/events or higher, verbosity 2)

>> # 4. Only a few seconds afterwards does Icinga show new results for
>> # the host state, but they are still SOFT.
>>
>> Dec 11 23:05:37 icinga_server icinga: HOST ALERT:
>> db_server;DOWN;SOFT;1;CRITICAL - Host Unreachable (172.16.28.132)

> max_check_attempts of that host? What state (log entry) did it have
> before?

>> # 5. The host is reachable again.
>>
>> Dec 11 23:08:44 icinga_server icinga: HOST ALERT:
>> db_server;UP;SOFT;2;PING OK - Packet loss = 0%, RTA = 45.00 ms
>>
>> # 6. Service status goes back to WARNING.
>>
>> Dec 11 23:20:24 icinga_server icinga: SERVICE ALERT:
>> db_server;Disk_E;WARNING;HARD;3;e:\ - total: 180.00 Gb - used: 163.08
>> Gb (91%) - free 16.91 Gb (9%)
>>
>> So I'm wondering: is sending notifications on this described change
>> from WARNING -> CRITICAL
>> a) the correct behavior, or
>> b) should Icinga not send this service notification because the host
>> is DOWN and the service state can therefore not be determined?

> Depends on the host state in that specific situation, and whether it
> changed / was cached.
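Just to check that I'm reading the horizon behaviour right, here is a toy
Python sketch of that logic as I understand it from the docs (the function
and variable names are mine for illustration, not Icinga internals):

```python
# Toy model of cached_host_check_horizon: if the last real host check
# is younger than the horizon, its result is reused and no on-demand
# check is launched for the failing service.
import time

CACHED_HOST_CHECK_HORIZON = 15  # seconds, as in our icinga.cfg


def run_active_host_check(host):
    """Placeholder for an on-demand plugin run (e.g. check_ping)."""
    host["last_check"] = time.time()
    return host["state"]


def host_state_seen_by_service(host, now):
    """State a failing service check would base its notification on."""
    if now - host["last_check"] <= CACHED_HOST_CHECK_HORIZON:
        return host["state"]  # cached result - possibly a stale "UP"
    return run_active_host_check(host)


host = {"state": "UP", "last_check": 100.0}
print(host_state_seen_by_service(host, now=110.0))  # within horizon -> cached
print(host_state_seen_by_service(host, now=130.0))  # horizon passed -> fresh check
```

If that model holds, a cached UP from a host check up to 15 seconds old
would explain an immediate service notification.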
>
> --
> DI (FH) Michael Friedrich
>
> mail: michael.friedr...@gmail.com
> twitter: https://twitter.com/dnsmichi
> jabber: dnsmi...@jabber.ccc.de
> irc: irc.freenode.net/icinga dnsmichi
>
> icinga open source monitoring
> position: lead core developer
> url: https://www.icinga.org
>
> _______________________________________________
> icinga-users mailing list
> icinga-users@lists.icinga.org
> https://lists.icinga.org/mailman/listinfo/icinga-users
Hi Michael,

Thanks for reviving this thread. I'll try to give all the detail that
is needed. The involved config would be:

$ grep horizon /etc/icinga/icinga.cfg
cached_host_check_horizon=15
cached_service_check_horizon=15

and

debug_level=16
debug_verbosity=2

db_server host object:

define host {
        host_name                       db_server
        display_name                    db_server
        initial_state                   o
        check_command                   check_ping!1000,5%!4000,20%!1000,5%!4000,20%
        retry_interval                  60
        max_check_attempts              3
        check_interval                  60
        passive_checks_enabled          1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        notification_interval           0
        notification_period             24x7
        notifications_enabled           1
        failure_prediction_enabled      1
        active_checks_enabled           0
        alias                           db_server
        address                         10.10.10.161
        notification_options            d,u
        parents                         db_server_parent
        contact_groups                  admins
        contacts                        gradecke
}

So we have max_check_attempts = 3 for the host with a check_interval
of 60 (I know it's odd to only check it once an hour, but that's how
it was set up, and it actually makes it easier to ensure that the
cached_host_check_horizon is definitely over).

A custom check so I can return service states as I wish:

define service {
        host_name                       db_server
        service_description             Test check with manual result
        process_perf_data               1
        action_url                      /pnp4nagios/index.php/graph?host=$HOSTNAME$&srv=$SERVICEDESC$
        is_volatile                     0
        max_check_attempts              3
        normal_check_interval           5
        retry_interval                  2
        active_checks_enabled           1
        passive_checks_enabled          1
        check_period                    24x7
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        event_handler_enabled           1
        flap_detection_enabled          0
        retain_status_information       1
        retain_nonstatus_information    1
        notification_interval           0
        notification_period             24x7
        notifications_enabled           1
        failure_prediction_enabled      0
        display_name                    Test check with manual result
        check_command                   check_by_ssh_using_root!/root/manual_check.sh
        notification_options            w,u,c,r
        contact_groups                  admins
}

There are no other active checks on that host that could trigger a
host check.

The attached log is an icinga.debug log (I removed any redundant check
reaper messages and stuff not related to this host) where the
following happened:

1. 08:28 AM - Run a forced host check on db_server to ensure it is up
2. 08:31 AM - A scheduled check of "Test check with manual result" ran
   - state is WARNING;HARD - 3/3 attempts - next check is scheduled
   for 08:36
3. 08:33 AM - I took the db_server offline
4. 08:36 AM - A scheduled check of "Test check with manual result" ran
   - it failed with UNKNOWN because there is no SSH response (other
   checks might fail with CRITICAL - either way, the state changes
   from WARNING to some other non-OK status, which might trigger an
   alert)
   4.a) A check for host db_server is started
   4.b) The alert for the service is sent out
   4.c) The result of the db_server host check is reaped and it comes
   back as DOWN;SOFT - attempts 1/3

[timestamp "Test check with manual result" ran

The steps above are also included in the log file, prepended by "## ".

From how I understand cached_host_check_horizon and what I see in the
logs, I'd say the cache is not used. So I guess my question is still:
is sending notifications on this described change from WARNING ->
CRITICAL/UNKNOWN

a) the correct behavior, or
b) should Icinga not send this service notification because the host
is DOWN and the service state can therefore not be determined?

Regards,
Gerd
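PS: To pin down what I mean by option (b), here is a toy Python sketch of
the suppression rule I would expect (the names are mine for illustration,
not anything from the Icinga source, and I'm ignoring recovery
notifications to keep it short):

```python
# Toy restatement of option (b): notify on a HARD non-OK service state
# only while the host is (believed to be) UP. Recovery notifications
# are deliberately left out of this sketch.
def should_notify_service(state, state_type, host_state):
    if state_type != "HARD" or state == "OK":
        return False
    return host_state == "UP"


# 23:05:34 - host state still the cached/assumed UP -> alert goes out:
print(should_notify_service("CRITICAL", "HARD", "UP"))    # True
# What (b) argues for: with the host already known DOWN, suppress it:
print(should_notify_service("CRITICAL", "HARD", "DOWN"))  # False
```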
[Attachment: icinga.debug-shortened_final - binary data]
_______________________________________________
icinga-users mailing list
icinga-users@lists.icinga.org
https://lists.icinga.org/mailman/listinfo/icinga-users