[icinga-users] How to reduce latency on an unloaded server

Michael Martinez Fri, 12 Dec 2014 10:33:59 -0800

I've got an icinga 1.8 installation that has the following latencies:

Active Service Latency:                 0.000 / 1390.955 / 1077.921 sec
Active Host Latency:                    0.000 / 1383.793 / 1351.764 sec


633 hosts. 3,800 active service checks.

The server runs Red Hat Linux as a virtual machine on a hypervisor. Yes, I know we 
shouldn't run Icinga on a VM, but before you assume this is the problem, note 
that there are no bottlenecks whatsoever on the server: none in I/O, 
memory, or CPU. As you can see from vmstat, the machine shows no steal time 
either (the st column is 0):

:/usr/local/icinga/etc# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
3  0   8536 1065916 457056 4057120    0    0     0    25    0    0  6 26 68  0  0
1  0   8536 1077524 457056 4057080    0    0     0   288 3722 1286 10  7 83  0  0
1  0   8536 1074784 457056 4059056    0    0     0     0 5255 6933 15  8 77  0  0

CPU is healthy (idle most of the time) and the load average is low:
/usr/local/icinga/etc# uptime
10:19:45 up 254 days,  9:24,  4 users,  load average: 1.65, 1.64, 1.61

Plenty of RAM:
:/usr/local/icinga/etc# free
             total       used       free     shared    buffers     cached
Mem:       8028216    6944252    1083964          0     457092    4068612
-/+ buffers/cache:    2418548    5609668
Swap:      4194296       8536    4185760


-----
I have been trying to reduce this latency so that all checks complete within 
roughly a 5-minute window, but I have been unable to get anything better than 
the roughly 1000-second latencies shown above. Things I have done:

*         status.dat, object.cache, and spool/checkresults are on a ramdisk

*         Running icinga with renice -15

*         Pertinent icinga.cfg lines are as follows:

status_update_interval=14
use_syslog=0
service_inter_check_delay_method=n
max_service_check_spread=5
service_interleave_factor=633 # number of hosts monitored
host_inter_check_delay_method=n
max_host_check_spread=5
max_concurrent_checks=501
check_result_reaper_frequency=3
max_check_result_reaper_time=20
cached_host_check_horizon=60
cached_service_check_horizon=60
sleep_time=0.02
check_service_freshness=1
service_freshness_check_interval=60
check_host_freshness=1
host_freshness_check_interval=60
service_check_timeout=80
host_check_timeout=30
event_handler_timeout=90
notification_timeout=120
ocsp_timeout=7
perfdata_timeout=5
enable_embedded_perl=0
use_embedded_perl_implicitly=0
use_retained_program_state=1
use_retained_scheduling_info=0

The above is the current configuration with the 1000s latencies.
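
As a rough sanity check on the 5-minute target, the figures quoted in this post (633 hosts, 3,800 service checks, max_concurrent_checks=501) can be turned into a required check rate and an estimate of how many checks must be in flight at once. This is only a sketch: it assumes every host and service is actively checked once per window, and the average execution times in the loop are hypothetical placeholders, not measurements from this server.

```python
# Back-of-the-envelope throughput for the numbers quoted in this post.
HOSTS = 633
SERVICE_CHECKS = 3800
TARGET_WINDOW_S = 5 * 60       # "roughly a 5-minute window"
MAX_CONCURRENT_CHECKS = 501    # max_concurrent_checks from icinga.cfg


def required_rate(checks: int, window_s: float) -> float:
    """Checks per second needed to run every check once per window."""
    return checks / window_s


def needed_concurrency(rate_per_s: float, avg_exec_s: float) -> float:
    """Little's law: average checks in flight = start rate * execution time."""
    return rate_per_s * avg_exec_s


# Assumption: each host and each service is actively checked once per window.
total_checks = HOSTS + SERVICE_CHECKS
rate = required_rate(total_checks, TARGET_WINDOW_S)
print(f"required rate: {rate:.2f} checks/s")

# Hypothetical average execution times; replace with measured values.
for avg_exec_s in (1.0, 5.0, 30.0):
    in_flight = needed_concurrency(rate, avg_exec_s)
    verdict = "fits within" if in_flight <= MAX_CONCURRENT_CHECKS else "exceeds"
    print(f"avg exec {avg_exec_s:4.1f}s -> ~{in_flight:.0f} in flight "
          f"({verdict} max_concurrent_checks)")
```

If the measured average execution time puts the in-flight estimate well below the limit, raw concurrency is probably not what is holding the checks back, and scheduling or result reaping would be the next place to look.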

For my debug_level I'm using 16|8|64. I'm not seeing anything useful in the 
debug file, perhaps because I don't know what I'm looking for, or perhaps 
because there is nothing informative regarding performance.
In my log file, there are no "orphaned" checks. In the past 10 hours there have 
been a handful of warnings about "Breaking out of check result reaper", but 
nothing else to indicate performance issues.

So, I have no idea what's going on and why I'm not getting better latencies. 
Any thoughts?

