On Tue, 13 Oct 2009, Stefan wrote:
An infrastructure (servers, storage, desktops/workstations/laptops, network,
etc.) is worth nothing in itself if it does not provide proper application
response time or QoE. I have since decided to take apart all critical
business apps (identify all the components making them up) and then decide
what to monitor, and how, in the underlying systems/processes/components
that make up the transaction "paths" end-to-end, to the best of my ability
and tool availability. (I use the term "transaction" loosely: VoIP is among
the apps requiring infrastructure monitoring, of course.) I would thus
answer your question by saying that it depends completely on what you are
supporting ... and here is where I would start:
In most cases you really end up needing both views.
You need the end-to-end view to tell if you are up as far as your users
are concerned (and what your response time looks like).
You need the detailed view to anticipate problems, and to find out what's
really wrong when you have HA or load balancing hiding internal flaws from
your users.
David Lang
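To make the "both views" point concrete, here is a minimal sketch in Python. It assumes a hypothetical pool of redundant backends whose health has already been probed and reduced to booleans; the names (web1, web2) and the functions are illustrative only and do not come from any particular monitoring tool. A real end-to-end check would time an actual user transaction rather than read a flag.

```python
def end_to_end_ok(backends):
    """User-facing view: the service answers if any backend is healthy."""
    return any(backends.values())


def failed_components(backends):
    """Detailed view: every backend that is down, even if users are fine."""
    return sorted(name for name, healthy in backends.items() if not healthy)


def summarize(backends):
    """Combine both views into one status line."""
    down = failed_components(backends)
    if not end_to_end_ok(backends):
        return "OUTAGE: users affected, down: " + ", ".join(down)
    if down:
        # HA or load balancing is hiding the failure from users.
        return "DEGRADED: users fine, redundancy lost on: " + ", ".join(down)
    return "OK"


# Hypothetical pool: one of two web servers has failed.
print(summarize({"web1": True, "web2": False}))
```

The DEGRADED case is the one the end-to-end view alone would miss: users see a working service while the redundancy is already gone.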
http://www.netqos.com/resourceroom/whitepapers/forms/handbook.html
Stefan Mititelu
http://twitter.com/netfortius
http://www.linkedin.com/in/netfortius
On Tue, Oct 13, 2009 at 3:28 PM, Atom Powers <atom.pow...@gmail.com> wrote:
On Tue, Oct 13, 2009 at 1:19 PM, Rob Cherry <lo...@lxrb.com> wrote:
I am starting a site from scratch and thus have no helpful historical
configurations to build from. I have just finished implementing a
monitoring solution and now I need to tell it what I care about. There is
obvious stuff like availability and response times, but other metrics
have become trickier.
I just put together a simple list for a friend. Of course, what you
monitor depends on what you want to know. In this case, I want to know
if there is a problem I need to take care of so that I don't need to
drive in to work an hour later.
--
I have centralized remote monitoring for most of my systems. I trigger
an alert if three consecutive values are out of range (checked every
60s); these are triggers that indicate an immediate problem.
ICMP host unreachable
CPU process queue, >3
CPU idle, <10%
Free Swap space, <10MB
Free Disk Space, <20%
Available memory, <10MB
Network queue, >5
Network errors, >3
*process not running
*process TCP port unreachable
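
The "three consecutive out-of-range values" rule above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the class name, the metric keys, and the check() helper are made up here, and the thresholds are copied from three of the entries in the list. Collecting the samples every 60s is left to whatever poller is in use.

```python
from collections import deque


class ConsecutiveTrigger:
    """Fire when the last `needed` samples all breach the threshold."""

    def __init__(self, breached, needed=3):
        self.breached = breached            # predicate: value -> bool
        self.recent = deque(maxlen=needed)  # sliding window of breach flags

    def observe(self, value):
        """Record one sample; return True if an alert should fire."""
        self.recent.append(self.breached(value))
        return len(self.recent) == self.recent.maxlen and all(self.recent)


# Thresholds from the list above; the key names are hypothetical.
TRIGGERS = {
    "cpu_run_queue": ConsecutiveTrigger(lambda v: v > 3),
    "cpu_idle_pct": ConsecutiveTrigger(lambda v: v < 10),
    "free_disk_pct": ConsecutiveTrigger(lambda v: v < 20),
}


def check(samples):
    """Feed one polling round; return sorted names of metrics that alert."""
    return sorted(name for name, value in samples.items()
                  if name in TRIGGERS and TRIGGERS[name].observe(value))
```

Calling check() once per polling interval with the latest readings returns an empty list until a metric has been out of range three rounds in a row. A single in-range sample resets the count, which filters out short spikes. As written, the trigger keeps firing on every further bad sample; suppressing repeat alerts would need extra state.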
I keep a history of those, plus several more metrics for capacity planning.
CPU system/user/wait time
Network traffic in/out/total
Disk read/write/queue
number of processes
CPU temperature
UPS load
--
Perfection is just a word I use occasionally with mustard.
--Atom Powers--
_______________________________________________
Discuss mailing list
Discuss@lopsa.org
http://lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/