Hi all,

On Thu, Feb 06, 2003 at 09:13:06PM +0800, Jason Lim wrote:
> Hi all,
>
> I was wondering what kind of failures you experience with long-running
> hardware.
Mostly mechanical parts like fans and hard disks. CPUs can normally run
around 10 years without problems, as far as I know.

> Most of us run servers with very long uptimes (we've got a server here
> with uptime approaching 3 years, which is not long compared to some, but
> we think it is pretty good!).
>
> We're looking at "extending" the life of some of these servers, but are
> reluctant to replace all the hardware, especially since what is there
> "works"...
>
> Most of these servers either have 3ware RAID cards, or have some other
> sort of RAID (scsi, ide, software, etc.). The hard disks are replaced as
> they fail, so by now some RAID 1 drives are actually 40Gb when only about
> 20Gb is used, because the RAID hardware cannot "extend" to use the extra
> size (but this is a different issue).

You can detect indications of an imminent failure with smartmontools. This
tool reads the SMART values/logs that most modern hard disks provide (a few
example invocations are in the P.S. below). Often there are also messages
in /var/log/messages that indicate hard disk problems.

> Now... we can replace all the fans in the systems (eg. CPU fan, case fans,
> etc.). Some even suggested we jimmy on an extra fan going sideways on the
> CPU heatsink, so if the top fan fails at least airflow is still being
> pushed around which is better than nothing (sort of like a redundant CPU
> fan system).

You can monitor CPU/case temperature with the sensors package, as well as
the mainboard voltages (power supply) and the fan speeds (often fans get
slower and slower before they fail).

> But how about the motherboards themselves? Is it often for something on
> the motherboard to fail, after 3-4 years continuous operation without
> failure?
>
> Or is there some other part(s) we should look out for instead... would the
> CPU itself die after 3 years continuous operation? Or maybe RAM? Or even
> the LAN cards?

RAM failures are also not very common. NICs fail more often (voltage peaks
or something like that?). You can monitor them with mii-tool, and you can
build failover with the bonding driver of the kernel, as far as I know.
Not all cards/drivers supply correct MII information.

> We keep the systems at between 18-22 degrees Celsius (tending towards the
> lower end) as we've heard/read somewhere that for every degree drop in
> temperature, hardware lifetime is extended by X number of years. Not sure
> if that is still true?

I don't think modifying the cooling system of a server is a good idea,
because most systems are already optimized for good airflow.

> Any input/suggestions would be greatly appreciated.

It's always good to monitor your systems. There are a lot more things you
can monitor (UPS, network, ...). For bigger installations you can use one
or more centralized monitoring servers. They can normally run all the
previous checks and notify you by mail, pager, SMS, ...

A few monitoring servers:
  Nagios (NetSaint)   (GPL)
  BigBrother          (commercial)
  BigSister           (a GPL clone)

Markus

-- 
Markus Benning <[EMAIL PROTECTED]>
http://www.w3r3wolf.de

Open Source is a philosophy, not a price tag!
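
P.S. In case it helps, here is roughly what the disk/temperature/NIC checks
look like from the command line. The device names, mail address and config
path are just examples, adjust them for your own machines:

  # one-shot SMART health check and full attribute dump for a disk
  smartctl -H /dev/hda
  smartctl -a /dev/hda

  # or let the smartd daemon watch the disks and mail on problems,
  # e.g. with a line like this in /etc/smartd.conf:
  #   /dev/hda -a -m root@localhost

  # lm-sensors: temperatures, voltages and fan speeds
  # (run sensors-detect once first to load the right modules)
  sensors

  # link status of a NIC, if the card/driver reports MII correctly
  mii-tool eth0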