On 28.11.2014 at 21:14, Johannes Oettl wrote:
Hi,
I am running an Icinga 1.9.1 installation with IDO, PNP4Nagios and NoMa
(Netways Notification Manager).
Hmmm, 1.9.1 is rather old and may not contain certain improvements from
later versions, the IDO bottleneck fixes being one of them.
And yes, every event in Icinga Core 1.x is processed in a sequential
manner, causing the parent process to wait for the child process to
return from its action - be it check plugins, notification scripts,
event handler scripts or even performance data rotation commands.
Therefore long-running "scripts" ("commands") may block the core process
and increase overall latency in certain scenarios. While there are
addons and methods available for Core 1.x to partly resolve these
bottlenecks (using an external process/daemon to execute checks, for
instance), the old architecture inherited from Nagios is a dead end.
If you're looking for solutions, consider upgrading to Icinga 2, where
this kind of blocking does not happen: Icinga 2 was designed for
large-scale deployments and does its job asynchronously with the help
of threads and work queues. More details at
http://docs.icinga.org/icinga2/latest/doc/module/icinga2/chapter/about-icinga2#icinga2-in-a-nutshell
Stats:
Icinga Stats 1.9.1
Copyright (c) 2009 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 05-22-2013
License: GPL
CURRENT STATUS DATA
------------------------------------------------------
Status File: /var/lib/icinga/status.dat
Status File Age: 0d 0h 0m 2s
Status File Version: 1.9.1
Program Running Time: 0d 0h 57m 34s
Icinga PID: 13488
Used/High/Total Command Buffers: 0 / 714 / 32768
Total Services: 11724
Services Checked: 11724
Services Scheduled: 7585
Services Actively Checked: 7586
Services Passively Checked: 4138
Total Service State Change: 0.000 / 26.180 / 0.009 %
Active Service Latency: 0.001 / 0.626 / 0.173 sec
Active Service Execution Time: 0.004 / 15.314 / 1.740 sec
Looks like there are some plugins where the execution time is rather
high. Any SNMP (Perl) plugins running, for example?
You should also analyse this a bit more, and graph the performance over time:
https://wiki.icinga.org/display/howtos/Icinga+performance+analysis
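If you don't want to set up full graphing right away, a minimal sketch
like the following could at least collect a few values for trending; the
icingastats binary path, the MRTG variable names and the output file are
assumptions for a typical setup, so adjust them to yours:

#!/bin/sh
# sketch: append average active service latency/execution time (in ms)
# to a CSV for later graphing; run it e.g. every 5 minutes from cron
# (binary and output paths are assumptions)
STATS=$(/usr/sbin/icingastats --mrtg --data=AVGACTSVCLAT,AVGACTSVCEXT | paste -sd, -)
echo "$(date +%s),$STATS" >> /var/log/icinga/perfstats.csv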
Active Service State Change: 0.000 / 17.630 / 0.008 %
Active Services Last 1/5/15/60 min: 1500 / 7326 / 7532 / 7586
Passive Service Latency: 0.061 / 11.296 / 1.508 sec
Passive Service State Change: 0.000 / 26.180 / 0.012 %
Passive Services Last 1/5/15/60 min: 478 / 3941 / 4021 / 4043
Services Ok/Warn/Unk/Crit: 11682 / 13 / 3 / 26
Services Flapping: 0
Services In Downtime: 2
Total Hosts: 2067
Hosts Checked: 2066
Hosts Scheduled: 0
Hosts Actively Checked: 2067
Hosts Passively Checked: 0
Total Host State Change: 0.000 / 11.450 / 0.013 %
Active Host Latency: 0.000 / 0.624 / 0.000 sec
Active Host Execution Time: 0.000 / 3.038 / 0.023 sec
Active Host State Change: 0.000 / 11.450 / 0.013 %
Active Hosts Last 1/5/15/60 min: 0 / 1 / 3 / 13
Passive Host Latency: 0.000 / 0.000 / 0.000 sec
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 2064 / 3 / 0
Hosts Flapping: 0
Hosts In Downtime: 0
Active Host Checks Last 1/5/15 min: 8 / 36 / 102
Scheduled: 0 / 0 / 0
On-demand: 8 / 36 / 102
Parallel: 0 / 1 / 3
Serial: 0 / 0 / 0
Cached: 8 / 35 / 99
Passive Host Checks Last 1/5/15 min: 0 / 0 / 0
Active Service Checks Last 1/5/15 min: 1661 / 7552 / 22640
Scheduled: 1661 / 7552 / 22640
On-demand: 0 / 0 / 0
Cached: 0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 370 / 388 / 388
External Commands Last 1/5/15 min: 890 / 4359 / 12693
As you can see, I use a lot of passive checks (Check-mk); I wrote
some similar checks for Cisco routers to have only one SNMP check
per host, and submit the results as passive checks.
Hmmm, the core process will call the checkmk script to collect all
the data, which is then fed passively back to the core. That would
explain why some checks last 15 seconds.
The external commands number suggests that you're using the command
pipe to pass check results from checkmk to the Icinga core. You'll
likely want to try the checkresult spool dir as a tuning alternative.
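For illustration, a minimal sketch of both submission paths; the pipe
and spool paths, the host/service names and the exact checkresult file
fields below are assumptions based on common defaults, so adjust them
to your installation:

# command pipe variant - one external command per result:
now=$(date +%s)
printf "[%s] PROCESS_SERVICE_CHECK_RESULT;router1;CPU Load;0;OK - load fine\n" "$now" \
  > /var/lib/icinga/rw/icinga.cmd

# checkresult spool dir variant - drop a checkresult file plus a
# companion .ok marker telling the reaper the file is complete:
f=$(mktemp /var/lib/icinga/spool/checkresults/cXXXXXX)
cat > "$f" <<EOF
host_name=router1
service_description=CPU Load
check_type=1
start_time=$now.0
finish_time=$now.0
early_timeout=0
exited_ok=1
return_code=0
output=OK - load fine
EOF
touch "$f.ok"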
I am getting into high latency when I submit a lot of check results
into the command pipe, and each result creates a notification. When I
write 200 check results into the command pipe, a notification is sent
out every 3 seconds, so for 200 checks it takes 600 seconds to
process them all.
The question is - why should all these 200 checks notify at once? That
looks more like a problem with the notification logic itself than with
the overall process of handling a checkresult, triggering a hard state
change and causing the notification events to be fired.
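If such bursts are expected, one (hypothetical) way to dampen them is via
the notification-related directives on the services themselves, e.g. in a
template like this sketch - names and values are made up for illustration:

define service {
    name                       dampened-notifications ; hypothetical template name
    max_check_attempts         3   ; several soft attempts before a hard state change
    first_notification_delay   5   ; wait some "time units" after the hard state before notifying
    notification_interval      60  ; re-notify at most once per hour while the problem persists
    register                   0
}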
During this time I get very high latency, and the load on the
machine goes up. I also found this in the icinga.log:
# grep -i reaper icinga.log
[1417156402] Warning: Breaking out of check result reaper: max reaper time (60)
exceeded. Reaped 53 results, but more checkresults to process.
[1417156468] Warning: Breaking out of check result reaper: max reaper time (60)
exceeded. Reaped 56 results, but more checkresults to process.
[1417156531] Warning: Breaking out of check result reaper: max reaper time (60)
exceeded. Reaped 24 results, but more checkresults to process.
[1417156597] Warning: Breaking out of check result reaper: max reaper time (60)
exceeded. Reaped 37 results, but more checkresults to process.
[1417156761] Warning: Breaking out of check result reaper: max reaper time (60)
exceeded. Reaped 89 results, but more checkresults to process.
[1417156824] Warning: Breaking out of check result reaper: max reaper time (60)
exceeded. Reaped 19 results, but more checkresults to process.
[1417163827] Warning: Breaking out of check result reaper: max reaper time (60)
exceeded. Reaped 40 results, but more checkresults to process.
[1417164030] Warning: Breaking out of check result reaper: max reaper time (60)
exceeded. Reaped 64 results, but more checkresults to process.
[1417164096] Warning: Breaking out of check result reaper: max reaper time (60)
exceeded. Reaped 58 results, but more checkresults to process.
Which means that your core process already has a huge number of
checkresults in its memory and cannot reap any further checkresults
until those have been processed. Newer versions of Icinga will also
log the exact number of checkresults in memory waiting for processing
into the debug log.
That requires setting max_check_result_list_items to a value other
than 0 in your icinga.cfg though.
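A minimal icinga.cfg sketch for that, using the 1024 value that is
mentioned as a reported sweet spot further down and the stock debug log
directives (the debug file path is an assumption):

# cap the in-memory checkresult list (experimental, 0 = disabled/default)
max_check_result_list_items=1024
# enable debug logging so the number of queued checkresults gets logged
debug_level=-1
debug_verbosity=1
debug_file=/var/log/icinga/icinga.debug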
commit 3f3840b9ab77b71a11ca89867117f609d67d0c52
Author: Michael Friedrich <michael.friedr...@gmail.com>
Date: Mon Sep 10 17:12:59 2012 +0200
core: only reap results when checkresult_list is not overloaded
(max_check_result_list_items) (thx Thomas Gelf) #3099 - MF
when the check result reaper event gets called in order to reap new
check result files from disk, there may still be unprocessed results
left on the checkresult list. this is especially the case when
there are performance issues on the core itself, not leaving enough
time to actually process the checkresult list, but rather reaping
more and more results onto it. the larger the list grows, the worse
O(n) operations such as determining the size of the list will get.
on various larger setups this could lead to a long in-memory list
slowing down everything. once you kill the core, you will lose all
results from memory, including those already reaped. rather than
reaping *everything* on the scheduled reaping interval, we should
only reap a specific number, until the checkresult list is "full".
max_check_result_list_items will take care of that - once set to
a value greater than 0, the reaper will only put checkresults onto
the list as long as the number of list items does not exceed the maximum.
this will allow us to only process smaller chunks of checkresult
files, waiting for them to be processed, and then reaping the rest.
one might set the max fileage a bit higher, but that should not be
an issue with the default values.
in order to stay safe, this feature is disabled by default (same as
setting max_check_result_list_items=0 in icinga.cfg). the default
and smaller setups won't need it anyways.
since the checkresult list is not threadsafe in any way, and neb
modules such as mod_gearman or dnx fiddle with the checkresult list
in memory in order to stash their own checkresult queue into the core,
we cannot just add our own counter as e.g. a global variable, as those
addons do not know how to modify that one. so the patch reads the
checkresult list length before deciding to bail early or not - make
sure to find the best value by yourself (reports say 1024 is good
enough). setting this value too large might double up the performance
issues you had already before - therefore this config item is tagged
'experimental'.
the better solution - have a clean api for stashing checkresults
into the core, rather than letting neb modules fiddle with inner
core structures.
thanks to Thomas Gelf for the initial patch.
refs #3099
reaper settings in icinga.cfg:
# HOST AND SERVICE CHECK REAPER FREQUENCY
#check_result_reaper_frequency=1
#check_result_reaper_frequency=10
check_result_reaper_frequency=5
# MAX CHECK RESULT REAPER TIME
# This is the max amount of time (in seconds) that a single
# check result reaper event will be allowed to run before
# returning control back to Icinga.
max_check_result_reaper_time=60
Has anyone else seen such behaviour? Any suggestions on what I can do
to eliminate this problem?
Skip the command pipe, and use a ramdisk for the checkresult spool dir.
Ask the guys at ACOnet & University of Vienna about their setup too.
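A minimal sketch of the ramdisk part, assuming the same checkresult_path
as in the example above (size and ownership are examples):

# /etc/fstab - keep the checkresult spool dir in RAM so reaping is not
# bound by disk I/O
tmpfs  /var/lib/icinga/spool/checkresults  tmpfs  size=64m,mode=0755  0  0
# then:
#   mount /var/lib/icinga/spool/checkresults
#   chown icinga:icinga /var/lib/icinga/spool/checkresults
# and make sure icinga.cfg points there:
#   checkresult_path=/var/lib/icinga/spool/checkresults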
Kind regards,
Michael
--
Michael Friedrich, DI (FH)
Application Developer
NETWAYS GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
Tel: +49 911 92885-0 | Fax: +49 911 92885-77
GF: Julian Hein, Bernd Erk | AG Nuernberg HRB18461
http://www.netways.de | michael.friedr...@netways.de
** OSMC 2014 - November - netways.de/osmc **
** OpenNebula Conf 2014 - Dezember - opennebulaconf.com **
** OSDC 2015 - April - osdc.de **
** Puppet Camp Berlin 2015 - April - netways.de/puppetcamp **