What are the possible causes of a temporary, partial system hang/freeze? Specifically, the system (a web server):
* stops responding to most requests for up to one or two minutes, then comes back to life;
* hangs only (as far as I can tell) when there is some load on the server;
* stops responding to requests from the Apache daemon -- Apache waits and waits, then sends out an 'internal error' message;
* responds to a few simple command-line requests from root (things like ls, cat, and file) but hangs when a command like "ls -l" is used;
* continues to respond to queries made through an NFS connection, including reading files from the server (even though root is frozen at the console at the exact same time);
* hangs equally on IDE and SCSI drives;
* generates no error messages to the console or to the logs (I have remote logging set to another server, so it isn't a matter of not being able to write the messages out to the drives).

My first guess was that this had something to do with my SCSI Adaptec 2940U2W controller or my LVD disks (Seagate Cheetahs). However, I've kicked away at that idea for three months, trying many solutions.

It may have something to do with running software RAID-1, but it hangs the same when the system is in degraded mode with only one SCSI disk operational. According to the developer/maintainer of the RAID-1 code, Mingo, it is highly unlikely to be a problem specifically with the RAID code. And if I can 'cat' a file during the freeze, then it isn't a disk system freeze (otherwise, how could I read from disk?).

I suspect it has something to do with taking a Debian Potato system and putting a vanilla kernel on it patched with the aic7xxx driver and the RAID code. Many people are running these patched vanilla kernels on Red Hat systems, so it seems this is something specific to Debian.

My system is:

ASUS P3B-F motherboard (Intel 440BX AGPset); CPU Bus/PCI Freq - 100.3/33.43 MHz
Single PII 400MHz (Deschutes)
512MB 100MHz ECC RAM
Adaptec 2940U2W (2.20.0 BIOS) -- Tagged Command Queueing enabled; max 32 commands per device; reset delay 5 sec.
SCSI cable is 3' internal, teflon, custom-made Ultra2-LVD with active negation terminator
2 x Seagate ST39103LW (Ultra2-LVD drives)
NE2000 clone (ISA)

I'd appreciate any suggestions. So far, the main things I've tried are:

* tested kernels 2.2.10 to 2.2.16 with appropriate RAID patches;
* replaced the original Adaptec cable and terminator with a custom-built, high-end, teflon cable and terminator ($125US);
* set up remote logging to catch any error message (none logged);
* added verbose debugging (nothing);
* upgraded the AIC7xxx driver from 5.1.21 to 5.1.30 (performance improved but it still hangs);
* removed the IDE drives and kernel support for IDE, based on someone's hunch;
* lowered the front-side bus speed from 100.3MHz to 88.3MHz (underclocking the CPU at the same time), based on another hunch;
* checked IRQs for conflicts -- none;
* compiled the kernel with and without tagged command queuing, with more and fewer max commands;
* added hard disk fans;
* reduced the SCSI controller speed for the Seagate drives from 80MB/s to 40MB/s.

Again, any suggestions appreciated.

Thanks in advance,

Jeff Hill

------------------------------------------------------------
------ HR On-Line: The Network for Workplace Issues ------
http://www.hronline.com - Ph:416-604-7251 - Fax:416-604-4708
------------------------------------------------------------