On 2011-10-19 21:53, Alex wrote:
> Hi,
>
>>> kernel: [73788.355981] [Hardware Error]: Machine check events logged
>>> kernel: [73914.635576] CPU4: Package temperature above threshold, cpu
>>> clock throttled (total events = 5538406)
>>> kernel: [73914.635581] CPU0: Package temperature above threshold, cpu
>>> clock throttled (total events = 5538398)
>>
>> Since your CPU had thermal protection, it's supposed to take effect before
>> the hardware is permanently damaged, but the thermal stress might have
>> affected it, or other components like memory or the PSU.
>
>>> [29016.445470] clamd[1110] general protection ip:30df2c3981
>>> sp:7fffa08f4fe0 error:0 in libclamav.so.6.1.11[30df200000+9ce000]
>
> I've now switched the hard disks to the old server (also an x86_64
> arch) and it has been running fine with no 'general protection' errors
> for more than twelve hours. I think it's safe to assume there is no
> software bug causing these errors?
>
> I've also been stress testing the new hardware separately. It
> succeeded through two full passes of memtest86 without any errors.
> It's now been running mprime for more than twelve hours and has not
> failed.
>
> When these 'general protection' errors were produced, the system was
> typically under high load and high IO.
>
> I realize this may be a hardware issue, but does anyone have any ideas
> how to determine what is really going on?

There are some packages for stress-testing, like cpuburn. cpuburn in MMX
mode is quite good at raising your CPU temperature, so I suggest you keep
an eye on the CPU sensors (sensors -l) if you do run it.
Try running one cpuburn on each CPU core for a while.
Of course it's also possible that your hardware was fine before and that
you'll damage it by running the stress tests (if you have inadequate
cooling, for example), so you do so at your own risk!

>
> Is there a way to stress-test clamav on the new hardware, to try and
> induce an error through high IO?

For high I/O try this: run updatedb to update your locate database, and
at the same time launch a clamd multiscan:
clamdscan -m /

Another test that you can do is to compile some large pieces of software
(Linux kernel, OpenOffice, etc.) with make -j N, where N = nr_cores * 2.
GCC uses a _lot_ of pointer manipulation and will randomly crash on
faulty hardware, although in that case memtest usually detects the
errors too.

Best regards,
--Edwin
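
P.S. The tests suggested above could be wrapped in a small script,
roughly like the sketch below. It makes several assumptions that are not
from this thread: the MMX burner from the cpuburn package is installed
as burnMMX, lm-sensors provides a plain `sensors` command, and the
function names (make_jobs, burn_all_cores, io_stress) and the 10-second
polling interval are made up for illustration. Nothing runs until you
call one of the functions, and the same at-your-own-risk warning applies.

```shell
#!/bin/sh
# Sketch only: defines helper functions, starts nothing by itself.

# Parallel make jobs for a given core count: N = nr_cores * 2.
make_jobs() {
    echo $(( $1 * 2 ))
}

# Start one burner per core, print sensor readings every 10 seconds
# for roughly the given number of seconds, then stop the burners.
# BURN_CMD is an assumption; override it if your burner is named differently.
burn_all_cores() {
    cores=$1
    seconds=$2
    burn_cmd=${BURN_CMD:-burnMMX}
    pids=""
    i=0
    while [ "$i" -lt "$cores" ]; do
        "$burn_cmd" &
        pids="$pids $!"
        i=$((i + 1))
    done
    elapsed=0
    while [ "$elapsed" -lt "$seconds" ]; do
        sensors            # watch the temperatures while the burners run
        sleep 10
        elapsed=$((elapsed + 10))
    done
    kill $pids 2>/dev/null
}

# High I/O test: rebuild the locate database while clamd does a multiscan.
# updatedb usually needs root, and clamd must already be running.
io_stress() {
    updatedb &
    clamdscan -m /
    wait
}

# Example calls (commented out on purpose):
#   burn_all_cores "$(nproc)" 600
#   io_stress
#   make -j "$(make_jobs "$(nproc)")"
```

Start with a short burn duration and stop immediately if the reported
temperatures keep climbing.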