Another thing - while I was digging through the Sydney DevOps meetups for a talk about monitoring by a guy from Google, I stumbled across a reference to InfluxDB: http://influxdb.com/.
On 16 June 2014 10:49, Amos Shapira <amos.shap...@gmail.com> wrote:

> For a start, it looks like you put both trending and alerting in one
> basket. I'd keep them separate, though alerting based on collected trending
> data is useful (e.g. don't alert just when a load threshold is crossed, but
> only if the trending average for the past X minutes is above the threshold,
> or even only if its derivative shows that it's not going to get better
> soon enough).
>
> See http://fractio.nl/2013/03/25/data-failures-compartments-pipelines/
> for high-level theory about monitoring pipelines, and a bit of a pitch for
> Flapjack (start by reading the first link from it). Lindsay is a very
> eloquent speaker and author in general, and fun to watch and read.
>
> Bottom line from the above - I'm currently not aware of a single silver
> bullet to do everything you need for proper monitoring.
>
> Last time I had to set up such a system (monitoring hundreds of servers for
> trends AND alerts) I used:
> 1. collectd (https://collectd.org/) for trending data - it can sample
> things down to once a second if you want.
> 2. statsd (https://github.com/etsy/statsd/) for event counting (e.g.
> every time a Bamboo build plan started or stopped, failed or succeeded,
> or other such events happened, an event was shot over to statsd to coalesce
> and ship over to Graphite). Nice overview:
> http://codeascraft.com/2011/02/15/measure-anything-measure-everything/
> 3. Both of the above send data to Graphite
> (https://github.com/graphite-project).
> 4. To track things like "upgraded Bamboo" events, we used tricks like
> http://codeascraft.com/2010/12/08/track-every-release/. I have since
> learned about another project to help attach extra data to events (e.g.
> the version that Bamboo was upgraded to), but I can't find it right now.
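(The trailing-average-plus-derivative alerting Amos describes at the top could be sketched in a few lines of Python - the window size, threshold, and the crude "derivative" are just illustrative choices, not taken from any particular tool:)

```python
from collections import deque

def make_trend_alert(window, threshold):
    """Return a check that takes one sample per call and alerts only
    when the rolling average over the last `window` samples is above
    `threshold` AND the trend is not already heading back down."""
    samples = deque(maxlen=window)

    def check(value):
        samples.append(value)
        if len(samples) < window:
            return False  # not enough history yet
        avg = sum(samples) / window
        # crude derivative: average of the newer half vs. the older half
        half = window // 2
        older = sum(list(samples)[:half]) / half
        newer = sum(list(samples)[half:]) / (window - half)
        return avg > threshold and newer >= older
    return check
```

The point being that a single load spike never fires, while a sustained rise does - exactly the "don't alert just when a threshold is crossed" behaviour.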
> Here is a good summary with Graphite tips:
> http://kevinmccarthy.org/blog/2013/07/18/10-things-i-learned-deploying-graphite/
>
> Alerts were generated by Opsview (stay away from it, it was a mistake),
> which is yet another Nagios wrapper. Many of the checks were based on
> reading the Graphite data whenever it was available
> (https://github.com/olivierHa/check_graphite), but many also used plain
> old NRPE (e.g. "is the collectd/bamboo/apache/mysql/postgres/whatever
> process still running?").
>
> I don't like Nagios specifically, and its centralization in general (which
> affects all the other "Nagios replacement" implementations), and would
> rather look for something else, perhaps Sensu (http://sensuapp.org/),
> though it wasn't ready when I last evaluated it about a year ago.
>
> My main beef with Nagios and the other central monitoring systems is that
> there is a central server which orchestrates most of the monitoring. This
> means that:
> 1. There is one server which has to go through all the checks on all
> monitored servers in each iteration to trigger a check. With hundreds of
> servers and thousands of checks this can take a very long time. It could
> be busy checking whether the root filesystem on a throw-away Bamboo agent
> is full (while the previous check showed that it's far from that) while
> your central Maven repository is burning for a few minutes. And it wouldn't
> help to say "check the Maven repo more often", because that would be like
> the IBM vs. DEC boat race - "row harder!"
> (http://www.panix.com/~clp/humor/computers/programming/dec-ibm.html).
> 2. That server is a single point of failure, or you have to start using
> complex clustering solutions to keep it (and only one of it!) up - no
> parallel servers.
> 3. This server has to be very beefy to keep up with all the checks AND
> serve the results.
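(A Graphite-backed check in the check_graphite style boils down to fetching /render?format=json and mapping the latest datapoint to a Nagios-style state. A minimal sketch of the evaluation half - the fetching, metric name, and thresholds are left as placeholders, and Graphite's JSON shape of [value, timestamp] pairs with nulls for missing data is the documented render format:)

```python
import json

# Nagios-style exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def evaluate(render_json, warn, crit):
    """Map a Graphite /render?format=json response body to a state:
    take the most recent non-null datapoint of the first target and
    compare it against the warning and critical thresholds."""
    series = json.loads(render_json)
    if not series:
        return UNKNOWN
    # datapoints are [value, timestamp] pairs; null values mean "no data"
    values = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if not values:
        return UNKNOWN
    latest = values[-1]
    if latest >= crit:
        return CRITICAL
    if latest >= warn:
        return WARNING
    return OK
```

In a real check you'd fetch the body with urllib from something like http://graphite/render?target=...&from=-5min&format=json and sys.exit() with the returned code.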
> In one of my former workplaces (the second largest
> Australian ISP at the time) there was a cluster of four such servers, with
> the checks carefully spread among them. Updating the cluster configuration
> was a delicate business, keeping them up wasn't pleasant, and it was still
> very slow to serve the web interface.
> 4. The amount of traffic and load on the network and monitored servers is
> VERY wasteful - open a TCP connection for each check, fork/exec via the
> NRPE agent, process exit, collect results, rinse, repeat, millions of
> times a day.
>
> Nagios doesn't encourage what it calls "passive monitoring" (i.e. the
> monitored servers initiate checks and send results, whether positive or
> negative, to a central server), and in general its protocol (NRPE) means
> that the central monitoring data collector is a bottleneck.
>
> Sensu, on the other hand, works around this by encouraging more "passive
> monitoring": each monitored server is responsible for monitoring itself,
> without the overhead of a central server doing the rounds and loading the
> network. It uses a RabbitMQ message bus, so its data transport and
> collection servers are more scalable (it also supports multiple servers),
> and it's OK with not sending anything if there is nothing to report (the
> system still has "keepalive" checks (http://sensuapp.org/docs/0.12/keepalives)
> to monitor for nodes which went down).
>
> But my favourite idea for scalability is the one presented in
> http://linux-ha.org/source-doc/assimilation/html/index.html - each
> monitored host is responsible for monitoring itself, without bothering
> anyone if there is nothing to write home about (so a bit like Sensu),
> plus a couple of servers near it, so the "is this host alive" external
> monitoring is distributed across the network (and doesn't fall on the
> central servers alone, like in Sensu); it also saves unnecessary network
> traffic.
> Unfortunately, it seems not to be ready yet
> (http://linux-ha.org/source-doc/assimilation/html/_release_descriptions.html).
>
> More points:
>
> Lack of VPN - if you can't set up a "proper" VPN then you can always look
> at an SSH VPN (e.g. Ubuntu instructions:
> https://help.ubuntu.com/community/SSH_VPN), and if you can't be bothered
> with the ssh_config "Tunnel"/"TunnelDevice" options (ssh's "-w" flag) then
> even a simple ssh port redirection with "ssh -NT" and autossh could do.
>
> Log concentration - look at Logstash (http://logstash.net/) for proper
> log collection and analysis.
>
> Hope this gives you some ideas.
>
> --Amos
>
> On 16 Jun 2014 09:13, "Ori Berger" <linux...@orib.net> wrote:
>
>> I'm looking for a single system that can track all of a remote server's
>> health and performance status, and which stores a detailed
>> every-few-seconds history. So far, I haven't found one comprehensive
>> system that does it all, including triggering alarms in "bad" situations
>> (such as no disk space, etc). Things I'm interested in (in parentheses -
>> how I track them at the moment; note shinken is a nagios-compatible
>> thing):
>>
>> Free disk space (shinken)
>> Server load (shinken)
>> Debian package and security updates (shinken)
>> NTP drift (shinken)
>> Service ping/reply time (shinken)
>> Upload/download rates per interface (mrtg)
>> Temperatures (sensord, hddtemp)
>> Security logs, warnings and alerts, e.g. fail2ban, auth.log (rsync of
>> log files)
>>
>> I have a few tens of servers to monitor, which I would like to do with
>> one software package and one console. Those servers are not all
>> physically on the same network, nor do they have a VPN (so, no UDP), but
>> tcp and ssh are mostly reliable even though they are low bandwidth.
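(The "ssh -NT plus autossh" port redirection Amos mentions comes down to one long-running command. A sketch that builds the argv - the hostnames are placeholders, and forwarding to port 2003, carbon's plaintext listener, is just one plausible use; -N runs no remote command, -T allocates no tty, and autossh's -M 0 disables its extra monitoring port in favour of ssh's own keepalives:)

```python
def ssh_forward_cmd(remote_host, local_port, remote_port, use_autossh=True):
    """Build the argv for a persistent ssh port forward: traffic to
    localhost:local_port on the monitored box is carried over ssh to
    remote_port on the monitoring server, so no VPN or open UDP path
    is needed. autossh restarts the tunnel whenever it drops."""
    base = ["autossh", "-M", "0"] if use_autossh else ["ssh"]
    return base + [
        "-NT",                             # no remote command, no tty
        "-o", "ServerAliveInterval=30",    # detect dead tunnels promptly
        "-L", "%d:localhost:%d" % (local_port, remote_port),
        remote_host,
    ]
```

Run the result under subprocess.Popen or a systemd unit and point collectd/statsd at localhost instead of the remote collector.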
>> Please note that shinken (much like nagios) doesn't really give a good
>> visible history of the things it measures - only alerts. Also, it can't
>> really sample things every few seconds - the lowest reasonable update
>> interval (given shinken's architecture) is ~5 minutes for the things it
>> measures above.
>>
>> Any recommendations?
>>
>> Thanks in advance,
>> Ori
>>
>> _______________________________________________
>> Linux-il mailing list
>> Linux-il@cs.huji.ac.il
>> http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il

--
[image: View my profile on LinkedIn] <http://www.linkedin.com/in/gliderflyer>
_______________________________________________
Linux-il mailing list
Linux-il@cs.huji.ac.il
http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il