Dear Debian community,
We recently started using AMD Ryzen CPUs, ASRock Rack motherboards and Kingston
unbuffered ECC DIMMs for our small business servers. All the servers run on
ZFS, for which ECC memory is recommended, so I naively tried to test that ECC
actually works. I read EVERY discussion on EVERY forum I was able to find (and
there are a lot of them, believe me), but I did not find a satisfying answer.
According to the legendary tweet from AMD (which is linked in every
discussion), Ryzen CPUs should support ECC memory, but it is not an officially
tested feature since they are consumer CPUs. The funny thing is that,
according to AMD's spec sheets, even EPYC-class CPUs do not state ECC support
(the only CPUs with stated ECC support I found are the Ryzen Embedded ones,
for example the V1605B in the UDOO Bolt).
Nevertheless, the system reports that ECC works: dmidecode and lshw show it,
the kernel loads the EDAC driver, an EDAC memory controller is present in
/sys/devices/system/edac/mc, and even memtest86+ v6.0 and above reports ECC
memory. In forum discussions, Intel users say that correctable ECC errors are
relatively common; the stated counts vary, but I got the impression that at
least one per week should appear. Yet our hypervisor, which has been running
for over half a year with more than 80% memory utilization, has not logged a
single one, neither in sysfs nor in the UEFI event log. I understand that the
error rate rises with altitude above mean sea level due to cosmic radiation,
and we are at 246 m, but at least one error would be nice.
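For reference, the sysfs counters mentioned above can be read with a short
script. This is a minimal sketch, assuming the standard Linux EDAC layout in
which each memory controller directory mcN under /sys/devices/system/edac/mc
exposes ce_count (corrected errors) and ue_count (uncorrected errors); the
function names are hypothetical, chosen only for illustration.

```python
"""Print EDAC memory-controller error counters from sysfs (Linux)."""
from __future__ import annotations

from pathlib import Path

# Standard EDAC sysfs location; each controller appears as mc0, mc1, ...
EDAC_MC = Path("/sys/devices/system/edac/mc")


def read_counter(path: Path) -> int | None:
    """Return the integer in a sysfs counter file, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def edac_counts(root: Path = EDAC_MC) -> dict[str, dict[str, int | None]]:
    """Map each memory controller name to its corrected/uncorrected counts."""
    counts: dict[str, dict[str, int | None]] = {}
    for mc in sorted(root.glob("mc[0-9]*")):
        counts[mc.name] = {
            name: read_counter(mc / name) for name in ("ce_count", "ue_count")
        }
    return counts


if __name__ == "__main__":
    counts = edac_counts()
    if not counts:
        print(f"no memory controllers found under {EDAC_MC}")
    for mc, c in counts.items():
        print(f"{mc}: corrected={c['ce_count']} uncorrected={c['ue_count']}")
```

Note that these counters only cover the time since the EDAC driver was loaded
(the current boot), and a value that stays at 0 only shows that nothing has
been reported, not that reporting works.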
The only thing I had success with was memory overclocking: I lowered the
timings as far as possible while the system would still POST, and once Debian
was running, it reported correctable errors from different memory regions (13
in 30 minutes). Raising the memory frequency did not work. However, all of
this was done on an Asus motherboard, albeit with the same memory and CPU. When
I change any memory-related setting on the ASRock Rack motherboard, it will not
POST.
The kernel documentation describes that Intel CPUs have the ability to inject
errors for driver testing, but I did not find anything similar for AMD. Does
anyone know a way to test that ECC works without having to break the system
first?
Thank you for your answers.

PS: Some commercial memory testers are allegedly able to inject ECC errors
(for example the one from PassMark); has anyone tried those?

Best regards,
Kryštof
