Re: [lopsa-discuss] Monitoring Sucks!

Paul Graydon Tue, 26 Jul 2011 13:50:14 -0700

Right, which is why he states clearly in the blog he's looking for appropriate times, wanting to hold multiple meetings if necessary to try and cover all timezones.


Paul


On 7/25/2011 6:00 PM, da...@lang.hm wrote:

OK, it was a meeting on IRC.
That's even worse, as it means that the only people who can participate are the ones who can arrange to be available at the right time.
David Lang

On Mon, 25 Jul 2011, da...@lang.hm wrote:
Date: Mon, 25 Jul 2011 20:56:51 -0700 (PDT)
From: da...@lang.hm
To: Christopher R Webber <christopher.web...@ucr.edu>
Cc: "discuss@lists.lopsa.org" <discuss@lists.lopsa.org>
Subject: Re: [lopsa-discuss] Monitoring Sucks!
Sorry, going and visiting various blogs/forums on a regular basis to participate in discussions there just isn't practical for me (there are too many blogs/forums I want to follow now that I can't keep up with).
I'll take a look and see if I can participate via e-mail (something very few forum packages support), but otherwise web forums are just too cumbersome to deal with.
David Lang

On Mon, 25 Jul 2011, Christopher R Webber wrote:
Date: Mon, 25 Jul 2011 04:23:08 +0000
From: Christopher R Webber <christopher.web...@ucr.edu>
To: "discuss@lists.lopsa.org" <discuss@lists.lopsa.org>
Subject: Re: [lopsa-discuss] Monitoring Sucks!
Really, this is why people should be participating in the #monitoringsucks discussion. The goal is to work as a community to come up with a few standard ideas that we can all build on. Many of us only have need for parts of the stack, others need a stack to do very different things. If we can work together to come up with how these things come together, we can all start contributing to the solution instead of bitching about how much the state of #monitoringsucks.
-- cwebber

Christopher Webber
Computing Infrastructure and Security
University of California, Riverside


On Jul 24, 2011, at 4:34 PM, <da...@lang.hm> wrote:
On Fri, 22 Jul 2011, Tom Limoncelli wrote:
Part of the problem is that there are four ponies here not one.


 - Historical monitoring: Gathering statistics via SNMP or similar, storing them, and drawing pretty graphs.
 - Real-time monitoring: ping and other "is it up/down?" queries.

These two things are so different that I rarely see software that can do both very well. Real-time should keep the last n minutes of results in RAM for fast calculations. Historical monitoring should stash things on disk and move on.
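
A minimal sketch of that split, assuming samples arrive as (timestamp, value) pairs. The names `RealTimeWindow` and `append_to_history` are made up for illustration and are not taken from any tool mentioned in this thread.

```python
import time
from collections import deque


class RealTimeWindow:
    """Real-time side: keep only the last `window_seconds` of samples in RAM."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, value, now=None):
        now = time.time() if now is None else now
        self.samples.append((now, value))
        # Drop everything older than the window so memory stays bounded.
        cutoff = now - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def average(self):
        if not self.samples:
            return None
        return sum(value for _, value in self.samples) / len(self.samples)


def append_to_history(path, timestamp, value):
    """Historical side: append the sample to a file on disk and move on."""
    with open(path, "a") as fh:
        fh.write(f"{timestamp},{value}\n")
```

The two halves share nothing but the sample itself, which is roughly the point: either side could be swapped out without touching the other.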

There are at least two more components:
 - Alerting: Say you know something is "wrong", the alerting system has to decide who to contact (based on a pager rotation schedule, etc.) and how to contact them (email or pager depending on ToD, urgency, and so on), and implements the escalation policy.
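
As a rough sketch of those three decisions (who, how, and escalation), assuming a weekly rotation and a simple office-hours policy. The rotation list, the 9-to-17 window, and every name below are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical pager rotation; the primary changes every week.
ROTATION = ["alice", "bob", "carol"]


@dataclass
class Alert:
    message: str
    urgent: bool
    hour: int  # local hour of day, 0-23


def on_call(week_number, level=0):
    """Who to contact: the primary for the week, or the next person
    in the rotation for each escalation level."""
    return ROTATION[(week_number + level) % len(ROTATION)]


def contact_method(alert):
    """How to contact them (assumed policy): page for anything urgent,
    email during office hours, otherwise hold until morning."""
    if alert.urgent:
        return "pager"
    if 9 <= alert.hour < 17:
        return "email"
    return "hold_until_morning"


def route(alert, week_number, level=0):
    """Combine both decisions; call again with level + 1 to escalate."""
    return on_call(week_number, level), contact_method(alert)
```

For example, `route(Alert("disk full", True, 3), week_number=1)` picks the second person in the rotation and the pager, and the same call with `level=1` moves on to the third.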
Alerting is made even more complex by the fact that you really want to be able to alert on things that your applications and systems log, not just on what your monitoring probes return.
Logging and alerting really do overlap a lot, but I don't know of any tools that take advantage of this rather than trying to partition it.
I've come to the conclusion that the best way to do alerting is to get all the logs into a syslog stream to a central server farm and have an alerting engine watch that (simple event correlator is a good starting point).
The monitoring system should look for things and generate log entries to pass on to the alerting system.
Trying to do everything in one system will run you into a lot of problems.
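
To make the "alerting engine watches the log stream" idea concrete, here is a hand-rolled stand-in for the kind of threshold rule such an engine applies. It does not use simple event correlator's configuration syntax or any real tool's API; `ThresholdRule` and its parameters are invented for this sketch.

```python
import re
from collections import deque


class ThresholdRule:
    """Fire when `count` matching log lines arrive within `window_seconds`."""

    def __init__(self, pattern, count, window_seconds):
        self.pattern = re.compile(pattern)
        self.count = count
        self.window_seconds = window_seconds
        self.hits = deque()  # timestamps of recent matching lines

    def feed(self, timestamp, line):
        """Return True when this line pushes the rule over its threshold."""
        if not self.pattern.search(line):
            return False
        self.hits.append(timestamp)
        # Forget matches that have fallen out of the time window.
        while self.hits and self.hits[0] <= timestamp - self.window_seconds:
            self.hits.popleft()
        if len(self.hits) >= self.count:
            self.hits.clear()  # reset so one burst raises one alert
            return True
        return False
```

Because the rule only sees timestamps and text, it does not care whether a line came from an application, a system log, or a monitoring probe that logged its result.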
 - Graphing/dashboard: The system that draws the dashboards and pretty graphs mentioned above.

It would be nice if we had well-defined interfaces between these components so that we could mix and match.
and I think this is the key to it all.
Right now in my company we have the situation "you aren't in the monitoring group, so your opinion doesn't matter. Besides, we've just spent $big_bucks to buy $professional_tool, that will solve all monitoring issues", but if I were able to work on this, I would do something along the following lines.
Note that when I say 'system' this could be a process, a server, or a farm of servers depending on your scale.
Set up one system with high performance disks running the MRTG network service.
Set up a second system with something like Nagios receiving passive checks, but modify the passive check receiver to push a copy of its data into MRTG. When MRTG sees something 'interesting', log it.
Set up a third system with something like SEC to watch logs (both from Nagios and from other log data) to do the alerting.
Set up a fourth system with something along the lines of splunk for ad-hoc queries of the logs.
Set up a fifth system to generate periodic reports from the data.
Set up a sixth system to generate real-time dashboards from the data.
Nagios would do the up/down checks, do dependency resolution, etc. (so that one router going down doesn't generate 1000 alerts from all the services on all the servers on the other side of the router), although it may be that that belongs in the alerting engine stage of things.
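
The dependency resolution step can be sketched roughly as follows, assuming each node is reached through exactly one parent and the topology has no cycles. The `PARENT` map and host names are hypothetical, and this is not how Nagios itself implements it.

```python
# Hypothetical topology: each node maps to the node it is reached through.
PARENT = {
    "web1": "router1",
    "web2": "router1",
    "router1": "core",
    "db1": "core",
}


def root_causes(down, parent=PARENT):
    """Return only the down nodes that have no down ancestor.

    Anything behind a down router is unreachable rather than
    independently broken, so it is left out of the result.
    """
    down = set(down)
    causes = set()
    for node in down:
        ancestor = parent.get(node)
        # Walk upward until we hit a down ancestor or run out of parents.
        while ancestor is not None and ancestor not in down:
            ancestor = parent.get(ancestor)
        if ancestor is None:
            causes.add(node)
    return causes
```

With `router1`, `web1`, and `web2` all reported down, only `router1` comes back, so one alert goes out instead of three. The same function could sit in the checking stage or in the alerting engine, since it needs only the set of down nodes and the topology.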
David Lang
Tom
P.S. Has anyone tried http://opentsdb.net/ ? It looks very interesting.
_______________________________________________
Discuss mailing list
Discuss@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/

