Anssi Saari wrote:
> Dan Ritter <d...@randomstring.org> writes:
>
> > We see ECC errors irregularly and infrequently on both Intel and
> > AMD CPUs.
>
> How/where do you see those on a Debian system? I looked into this
> briefly but didn't get anywhere.

The kernel announces readiness during boot with:

dmesg:[ 18.331561] EDAC amd64: Node 0: DRAM ECC enabled.

and then an event looks like this:

Message from syslogd@HOSTNAME at Jan 25 15:05:51 ...
 kernel:[5964975.397283] [Hardware Error]: Corrected error, no action required.

Message from syslogd@HOSTNAME at Jan 25 15:05:51 ...
 kernel:[5964975.406226] [Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c04400040080a13

Message from syslogd@HOSTNAME at Jan 25 15:05:51 ...
 kernel:[5964975.418574] [Hardware Error]: Error Addr: 0x0000001ed405ef50

Message from syslogd@HOSTNAME at Jan 25 15:05:51 ...
 kernel:[5964975.426919] [Hardware Error]: MC4 Error (node 0): DRAM ECC error detected on the NB.

Message from syslogd@HOSTNAME at Jan 25 15:05:51 ...
 kernel:[5964975.437370] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

If you see a bunch of these, you want to install edac-utils and
run it to see if you have a bad DIMM.

-dsr-
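
P.S. To judge whether you are seeing "a bunch" of the events shown
above, a rough Python sketch like the following can tally them from
saved kernel log text. The function name summarize_ecc_events and the
two regular expressions are my own illustration, modelled only on the
sample lines in this message; other CPUs and kernel versions may word
these messages differently, so treat it as a starting point rather
than a reliable detector.

```python
import re
from collections import Counter

# Patterns modelled on the sample log lines in this message (assumption:
# your kernel words its corrected-error reports the same way).
CORRECTED = re.compile(r"\[Hardware Error\]: Corrected error")
ERROR_ADDR = re.compile(r"\[Hardware Error\]: Error Addr: (0x[0-9a-fA-F]+)")


def summarize_ecc_events(lines):
    """Count corrected-error reports and tally the reported addresses.

    `lines` is any iterable of log lines, e.g. an open file object.
    Returns (number_of_corrected_events, Counter of address -> count).
    """
    corrected = 0
    addresses = Counter()
    for line in lines:
        if CORRECTED.search(line):
            corrected += 1
        match = ERROR_ADDR.search(line)
        if match:
            addresses[match.group(1)] += 1
    return corrected, addresses


# Example use (the log path is an assumption; it depends on your syslog setup):
#
#   with open("/var/log/kern.log") as log:
#       count, addrs = summarize_ecc_events(log)
#   print(count, addrs.most_common(5))
```

Repeated hits on the same or nearby addresses are a hint that one DIMM
is at fault, but the tally does not map an address to a physical
slot; that is what edac-utils is for.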