If you use RAID on any of your Windows servers, I would suggest monitoring the "Disk Idle Time" counter. Some RAID configurations can give misleading stats: the write cache can make it seem like the array is writing much faster than it actually is.
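A rough sketch of how an alert on that counter might look once the monitoring system is collecting it. This assumes the counter arrives as a percentage between 0 and 100; the function name, the 10% threshold and the 5-sample window are made up for illustration and would need tuning per site.

```python
from statistics import mean


def disk_is_saturated(idle_pct_samples, min_idle_pct=10.0, window=5):
    """Return True when the most recent samples show the disk is rarely idle.

    idle_pct_samples: idle-time percentages (0-100), oldest first.
    min_idle_pct:     hypothetical alert threshold, not a vendor recommendation.
    window:           number of recent samples to average, so a single busy
                      sample does not trigger an alert.
    """
    if len(idle_pct_samples) < window:
        return False  # not enough data to judge yet
    recent = idle_pct_samples[-window:]
    return mean(recent) < min_idle_pct


if __name__ == "__main__":
    # Illustrative numbers only: the disk goes from mostly idle to almost never idle.
    samples = [85.0, 40.0, 6.0, 4.0, 3.0, 5.0, 2.0]
    print(disk_is_saturated(samples))
```

Averaging over a window is a design choice here: idle time is independent of what the write cache reports as throughput, but it is also spiky, so a single sample near zero is not very informative.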
If you are using a NAT'ing router, also look to see if you can monitor how big your NAT table is. These tables have a finite size; once they are full, your network will be very "slow". ARP and MAC table sizes on local switches may also prove useful. At least you'll be able to tell if a server or service just barfed on Layer 2.

On Tue, Oct 13, 2009 at 9:41 PM, Doug Hughes <d...@will.to> wrote:
> Rob Cherry wrote:
>> I am in a situation of starting a site from scratch and thus have no
>> historical helpful configurations to build from. I have just finished
>> implementing a monitoring solution and now I need to tell it what I
>> care about. There is obvious stuff like availability and response
>> times, but other metrics have become more tricky.
>>
>> My historical knowledge all seems increasingly irrelevant too. For
>> example, checking free memory on a modern Solaris 10 box makes no
>> sense - all the memory is "stolen" by the kernel for aggressive ZFS
>> caching and given up when applications need it.
>>
>> Given this and other considerations, I will kick off a list of what I
>> intend to monitor, but I would be very curious to know what everyone
>> else is doing and whether they agree/disagree with the list -
>>
>>
>> Solaris/Linux
>> - / % usage
>> - /tmp % usage
>> - swap % usage
>> - CPU load
>> - Overall response times for various services such as http/ssh etc.
>>
>> Windows
>> - %SYSTEMROOT% % free
>> - memory % free
>> - CPU % free
>> - Response times on well known ports
>>
>> Cisco/Networking equipment
>> - Internal temperatures
>> - CPU/Memory utilization
>>
>> UPS
>> - Output load average
>> - Battery % Charge
>> - Minutes remaining / am I on battery
>>
>> Any glaring omissions?
>>
>> Rob
> maybe not glaring, but if you're collecting host stats, I'd put a plug
> in for memory paging stats (page-in, page-out). If you're worried about
> swap usage on Solaris, you might as well look for desperation free (de)
> in memory stats. Anything above 0 is considered bad.
> Also scan rate (sr).
>
> Cisco networking gear - why not collect interface stats like in/out
> octets and/or line usage (Cisco has a private MIB for that), also drops
> and errors per interface
>
>
> _______________________________________________
> Discuss mailing list
> Discuss@lopsa.org
> http://lopsa.org/cgi-bin/mailman/listinfo/discuss
> This list provided by the League of Professional System Administrators
> http://lopsa.org/
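
On the in/out octet counters mentioned in the quoted message: the raw values are running totals, so line usage has to be derived from the difference between two polls. Below is a minimal sketch of that arithmetic, assuming the poller hands back plain integers. The function name and parameters are hypothetical, and the wrap handling assumes the counter rolled over at most once between polls (32-bit counters on fast links can wrap more often than that, in which case 64-bit counters are the safer source).

```python
def link_utilization_pct(octets_prev, octets_now, interval_s, speed_bps,
                         counter_bits=32):
    """Estimate link utilization (percent) from two octet-counter readings.

    octets_prev, octets_now: counter values from consecutive polls.
    interval_s:              seconds between the two polls.
    speed_bps:               nominal link speed in bits per second.
    counter_bits:            width of the counter (32 or 64).
    """
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval_s and speed_bps must be positive")
    delta = octets_now - octets_prev
    if delta < 0:
        # Counter wrapped between polls; assume it wrapped exactly once.
        delta += 2 ** counter_bits
    bits_per_second = delta * 8 / interval_s
    return bits_per_second * 100.0 / speed_bps


if __name__ == "__main__":
    # Illustrative numbers only: a 32-bit counter that wrapped between polls
    # on a 100 Mbit/s link polled every 10 seconds.
    print(link_utilization_pct(4294967000, 1249704, 10, 100_000_000))
```

Drops and errors per interface can be treated the same way: alert on the per-interval delta rather than on the absolute counter value.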