Dear Debian community,

we recently started using AMD Ryzen CPUs, ASRock Rack motherboards and Kingston unbuffered ECC DIMMs for our small business servers. All the servers run on ZFS, for which ECC memory is recommended, so I naively tried to test that ECC actually works. I read EVERY discussion on EVERY forum I was able to find (and there are a lot of them, believe me), but I did not find a satisfying answer.

According to the legendary tweet from AMD (which is linked in every discussion), Ryzen CPUs should support ECC memory, but it is not a validated feature since they are consumer CPUs. The funny thing is that, according to their spec sheets, even EPYC-class CPUs do not list ECC support (the only CPUs with stated ECC support I found are the Ryzen Embedded ones, for example the V1605B in the UDOO Bolt). Nevertheless, the system reports that it works: dmidecode and lshw show ECC, the kernel loads the EDAC driver, the memory controller is present in /sys/devices/system/edac/mc, and even memtest86+ v6.0 and above reports ECC memory.

In forum discussions, Intel users say that correctable ECC errors are relatively common. The stated counts vary, but I got the impression that at least one per week should appear. Our hypervisor has been running for over half a year with more than 80% memory utilization and has not logged a single one, neither in sysfs nor in the UEFI event log. I understand that the error count rises with altitude due to cosmic radiation and that we are at only 246 m above sea level, but at least one error would be nice.

The only thing I had success with was memory overclocking. I tightened the timings as far as the system would still POST, and once Debian was running it reported correctable errors from different memory regions (13 in 30 minutes). Raising the memory frequency did not work. However, all of this was done on an Asus motherboard, although with the same memory and CPU. When I change any memory-related setting on the ASRock Rack motherboard, it will not POST.
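For anyone who wants to compare counters, here is a minimal sketch of how the EDAC error counts under /sys/devices/system/edac/mc could be read. It assumes the usual sysfs layout with one `mcN` directory per memory controller, each containing `ce_count` (correctable) and `ue_count` (uncorrectable) files. The function name `read_edac_counts` and its `root` parameter are only illustrative and not part of any existing tool.

```python
from pathlib import Path

# Default location of the EDAC memory controller entries in sysfs.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {"mcN": {"ce_count": int, "ue_count": int}} for every
    memory controller directory found under root.

    A counter that is missing or unreadable is reported as None.
    """
    counts = {}
    root = Path(root)
    if not root.is_dir():
        # No EDAC directory at all, e.g. the driver is not loaded.
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("no EDAC memory controllers found (is the EDAC driver loaded?)")
    for mc, entry in result.items():
        print(f"{mc}: correctable={entry['ce_count']} "
              f"uncorrectable={entry['ue_count']}")
```

Run periodically (for example from cron), this only shows whether the counters ever move. It does not prove that error reporting works, which is exactly the open question here.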
The kernel documentation describes that Intel CPUs have the ability to inject errors for driver testing, but I did not find anything like that for AMD. Does anyone know a way to test that ECC works without having to break the system first? Thank you for your answers.
PS: Some commercial memory testers are allegedly able to inject ECC errors (for example the one from PassMark). Has anyone tried those?

Best regards,
Kryštof