Dear Red Hatters,

This is a follow-up with the responses to, and results from, my first
post on this list. You all helped me navigate a very troublesome
issue -- thanks, and may this discussion point other forlorn users
toward happy resolutions.
I will provide only the outline of my original message, the entirety of
which is presumably available in the archives.

On Mon, 19 Aug 2002 12:49:00 -0500, I wrote:
> <snip>
>
> I am a bad spot. <snip>
>
> The REAL problem is that this machine has been crashing periodically.
> It does not always crash in the same way. It does consistently crash
> on Saturday mornings, toward the end of a lengthy Amanda amdump run.
>
> The system was up and running since the installation in early May. A
> 2.4.9-31mppe kernel has been in use since the third week of May.
> Amanda backups of local drives began at the end of May, with the
> addition of NT server shares in early June. There was a lengthy power
> outage June 14th - 15th, but this system was powered down before the
> UPS gave out. The RH 6.0 network server and firewall have more
> recently been added as Amanda client systems.
>
> Since the first two anomalies were under heavy load and completely
> different, I guessed there was a heat issue (see system specs below for
> the logic of this). There was a silent, hard crash the first time
> (June 29, a little after 1:30 am), and hard drive errors the second
> time (July 20).
>
> Logs from hard drive errors:
>
> <snip>
>
> After I removed and added /dev/hda7, I ran a CVS update of /etc (like
> the author of the recent Linux Journal article, I keep my life in a CVS
> archive). More disk errors:
>
> <snip>
>
> I removed and added /hda5 and all was well.
>
> These drive errors were completely transient; I had no more disk errors
> afterward although we continued to run in this state through the end of
> July, when I rebooted after updating the openssl RPMs. Weird, isn't
> it? Surely something was overheating, right? We changed the office
> thermostat to leave the fans running 24/7, though the air conditioners
> are still at 78F except between 6am and 10pm weekdays, when it cools
> down to 74F.
>
> After a third crash under the same circumstances (Aug 10), involving a
> long run of "kernel: Oops" messages this time, I ordered additional
> fans and pulled the cover off the case to let it breathe freely until I
> could take it down and install the fans.
>
> Guess what -- it crashed again last Saturday morning. More "kernel:
> Oops" messages. I guess it probably isn't a heat dissipation
> problem... <:-(
>
> I won't include all the "kernel: Oops" dumps, but here are the initial
> ones from the August 10 and 17 crashes:
>
> <snip>
>
> <snip>
>
> Before I drone on with more data, some thoughts I have had:
>
> - Could the power supply be inadequate?

The consensus: no.

> - Does the custom kernel have a problem (there _are_ newer kernels
>   out there, but I've avoided building my own up to this point and we
>   need the MPPE patches)?

Somewhat suspicious -- see below.

> - What's the problem with Amanda runs? Sure the CPU, disk and
>   network are busy, and there's lots of activity on the SCSI tape,
>   but that's life, buddy!

Details of system activity logging below.

> HARDWARE:
>
> Motherboard: Tyan Trinity K7 (S2380)
> CPU:         AMD Athlon Slot A 750 MHz
> Case/PS:     InWin ATX Full Tower Case Q500 w/300w PS and
>              added front intake fan
> Memory:      128 Mb

This was unsupported ECC RAM -- see questions below.

> Storage:     Promise (PDC20267) PCI IDE controller
>              Tekram SCSI controller (sym53c8xx: 53c875
>              detected with Tekram NVRAM)
>              4 IBM-DTLA-307030 (30 Gb) drives (hd[aceg])
>              Pioneer DVD-ROM ATAPIModel DVD-106S 012 (hdb)
>              Sony SDX-300C AIT SCSI tape
>              Exabyte EXB-8200 (tried, unsuccessfully, to
>              reuse 8mm dump tapes from the Sun server)
> Networking:  SMC1211TX EZCard 10/100 (RealTek RTL8139)
>
> SOFTWARE:
>
> This is a Red Hat 7.2 system, with all RPMS directly from install or
> Red Hat updates, with the exception of MPPE RPMS from
> ftp://ftp.planetmirror.com/pub/mppe:
>
>   kernel-2.4.9-31mppe.i386.rpm
>   kernel-doc-2.4.9-31mppe.i386.rpm
>   kernel-headers-2.4.9-31mppe.i386.rpm
>   kernel-source-2.4.9-31mppe.i386.rpm
>   ppp-2.4.1-3mppe.i386.rpm
>   pptpd-1.1.3-1.i386.rpm
>
> Kernel:      Linux version 2.4.9-31mppe (root@richard) (gcc
>              version 2.96 20000731 (Red Hat Linux 7.1
>              2.96-98)) #1 Tue Mar 5 18:47:37 CET 2002
> Filesystems:
> <snip>
>
> SERVICES:
>
> SysVInit at runlevel 5:
>   anacron apmd atd autofs crond gpm ipchains iptables isdn keytable
>   kudzu lpd netfs network nfs nfslock ntpd p4d portmap pptpd random
>   rawdevices sendmail smb sshd syslog wine xfs xinetd
>
> Via xinetd:
>   amanda amanda amandaidx amidxtape imap ipop3 sgi_fam talk telnet
>   wu-ftpd (I don't know why chkconfig shows amanda twice...)
>
> MISC. KERNEL INFO:
>
> <snip>
>
> Thanks in advance, especially if you actually read this far!! Only a
> true Linux fan would have stayed awake to this, the 378th line of this
> message. :)

I posted my message to both redhat-list and amanda-users.

The ranked responses were (most messages had more than one suggestion):

  6  Flaky RAM
  4  Not enough RAM
  3  Buggy version of the kernel
  2  Disk/tape controller problem
  1  BIOS settings issue
  1  Motherboard cache problem
  1  CPU problem

It was observed that one of the kernel "Oops" messages (the Linux
equivalent of a blue screen of death, except it doesn't always
die...immediately) was specifically related to allocation of a memory
page.

Other recommended resources:

  http://www.bitwizard.nl/sig11/ (about intermittent segmentation
    "SIG11" faults and what causes them)
  news.linux-sxs.org (ask the resident expert(s))
  Linux kernel mailing list
  http://www.linuxmanagers.org/

Recommended tests:

  memtest (an off-line memory tester you run from a boot floppy, but
    impractical for deployed servers)
  measure voltages under load (they didn't suggest _how_)
  kernel-compile loop (build and rebuild the Linux kernel ad
    nauseam...or is that ad crasheum... to load the system heavily
    and stress the memory)

Someone also pointed out that removing the system cover prevents the
fans from producing forced air flow, possibly _contributing_ to heat
problems rather than solving them. However, pegasus (the troubled
server) was in the path of a blower vent that (I thought) is always on.

A priceless quote from one response:

  "Very un-nice problem. Poor you :( I do not wish this to any
  sysadmin."

In light of all this input, I started some kernel building. Ten
iterations of building went flawlessly under my casual observation, so
at about 5:45, before I logged out, I started a set of 50 to run over
several hours, spanning the evening Amanda run. Never got there. :)

At 6:08pm the 4th of 50 kernel builds was interrupted by a Segmentation
fault, the very symptom I was looking for. Many services on pegasus
went down at the same instant. The 10:30pm scheduled Amanda run was
forgotten. Finally, the house of cards completely collapsed at about
11:15pm, when all contact with pegasus was lost.

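For anyone who wants to try the same kind of repeated-build test, here
is a minimal sketch in Python. It is not the script I ran; the function
name stress_loop and its parameters are made up for illustration, and
the build command is left to the caller because it depends on your
kernel tree. It simply reruns a command until the first failure and
reports which iteration broke.

```python
import subprocess
import sys
from typing import Optional, Sequence


def stress_loop(cmd: Sequence[str], iterations: int,
                cwd: Optional[str] = None) -> Optional[int]:
    """Run cmd up to `iterations` times.

    Returns the 1-based number of the first failing run, or None if
    every run exited with status 0. (Hypothetical helper, for
    illustration only.)
    """
    for i in range(1, iterations + 1):
        # Discard the build output; only the exit status matters here.
        result = subprocess.run(cmd, cwd=cwd,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            # Any non-zero status stops the loop. That includes a build
            # that gave up because the compiler crashed part way through.
            print(f"run {i}/{iterations} FAILED "
                  f"(status {result.returncode})", file=sys.stderr)
            return i
        print(f"run {i}/{iterations} ok")
    return None
```

A call might look like stress_loop(["make"], 50, cwd="/usr/src/linux"),
where both the command and the directory are assumptions to adjust for
your own system. On flaky hardware the interesting result is a failure
at a different iteration each time with unchanged sources.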
Replacing the existing 128 MB SDRAM with two 256 MB SDRAM modules was a
perfect fix. I compiled the kernel 100 times without encountering any
problems, while running other processes to elevate I/O and CPU loads.
Two Amanda runs have completed flawlessly with full dumps of all disks.
It's also fun to have almost a gigabyte of virtual memory to play with,
and all the RAM is allowing us to cache a few hundred megabytes of
data, making most repetitive disk operations (like dump estimates) fly
like the wind. Even so, last Tuesday night we used some swap
space... :)

BTW, the original 128 MB SDRAM was registered ECC memory, but the Tyan
Trinity K7 motherboard does NOT support ECC RAM. I assume that this
means that the ECC feature will not be used, but is it possible that
this unsupported RAM flavor was actually _causing_ part of the problem?
Should I assume that this RAM is bad, or start using it in a system
that is designed to use ECC RAM?

A review of /var/log/sa/sar* files shows that the critical moment in
the Amanda runs brings the highest sustained levels of context switches
(> 5000 cswch/s), CPU activity (< 0.5% idle over a 30-minute period)
and paging activity (> 5000 combined pgpgin/s and pgpgout/s) seen on
this system. So that's the problem with Amanda runs -- they stress the
system as much as, if not more than, building kernels.

Adding two more case fans had the desired effect of lowering the
system's running temperature. The CPU now stays at about 98 F during
business hours, 101 F evenings and weekends (see the thermostat
settings description earlier). Significantly, though, at the most
intensive time of archiving/compressing/taping, the CPU temp does top
104 F briefly. Without the fans I'd guess that would have been closer
to 110 F or more.

By the way, I considered making one new fan (less than halfway up from
the bottom) an intake fan, and the other (above the power supply) an
exhaust fan.
More than one "Build the perfect Linux box" article had warned of
creating negative pressure in the case, implying that air flow in the
power supply might decrease and put the power supply at risk. After
pondering this for a bit, I decided that since the case has LARGE HOLES
in the front and sides, designed to allow air to flow in and cool
internal drives, this was probably a non-issue. Now I can easily feel
the air being drawn into the vents and I am more confident that no
devices will overheat with the triple exhaust in the back.

Finally, I did an rpm -Va to test the possibility that the flaky memory
might have produced corrupted files during installation and upgrades.
No signs of this, though.

Again, thanks for your help. Another Linux support success story for
the archives. :)

Truly,
Jonathan
--
 / Jonathan R. Johnson       | "Every word of God is flawless."  \
 | Minnetonka Software, Inc. |     -- Proverbs 30:5              |
 \ [EMAIL PROTECTED]         | My own words only speak for me.   /

--
redhat-list mailing list
unsubscribe mailto:[EMAIL PROTECTED]?subject=unsubscribe
https://listman.redhat.com/mailman/listinfo/redhat-list