Re: [OpenIndiana-discuss] How do I verify that fmd is actually able to detect and log ECC errors?

Reginald Beardsley via openindiana-discuss Tue, 16 Mar 2021 14:59:14 -0700

 
Guys, thanks for the comments.

I have confirmed through a friend who has designed and built a Zynq-based DSO 
with DDR3 memory that the ECC is only checked on a read operation, which is 
what I had always assumed. So if a pointer table initialized at boot time only 
gets accessed when you do a scrub, a latent error in it would exactly match the 
symptoms. The kernel panic would prevent fmd from logging an ECC event. If all 
the bits fade to zero and you dereference the pointer in the kernel, there is 
no way to prevent a kernel panic.
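
To illustrate the "only checked on a read" point, here is a minimal toy 
sketch in Python. It uses a textbook Hamming(7,4) code, which is an assumption 
for illustration only and not what a DDR3 controller or fmd actually 
implements. The class name ToyEccMemory and its methods are hypothetical. The 
point is simply that a flipped bit sits unnoticed until something reads that 
location.

```python
def encode(nibble):
    """Hamming(7,4): encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8                      # index 0 unused so indices match positions
    c[3], c[5], c[6], c[7] = d       # data bits
    c[1] = c[3] ^ c[5] ^ c[7]        # parity over positions 1,3,5,7
    c[2] = c[3] ^ c[6] ^ c[7]        # parity over positions 2,3,6,7
    c[4] = c[5] ^ c[6] ^ c[7]        # parity over positions 4,5,6,7
    return c[1:]


def check(codeword):
    """Return (nibble, syndrome). A nonzero syndrome is the flipped bit position."""
    c = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos
    if syndrome:
        c[syndrome] ^= 1             # correct the single-bit error
    d = [c[3], c[5], c[6], c[7]]
    return sum(bit << i for i, bit in enumerate(d)), syndrome


class ToyEccMemory:
    """Hypothetical model: errors are only noticed (and logged) by read()."""

    def __init__(self):
        self.cells = {}
        self.error_log = []

    def write(self, addr, nibble):
        self.cells[addr] = encode(nibble)

    def decay(self, addr, bit_pos):
        # A bit flips silently; nothing is checked or logged here.
        self.cells[addr][bit_pos - 1] ^= 1

    def read(self, addr):
        nibble, syndrome = check(self.cells[addr])
        if syndrome:
            self.error_log.append((addr, syndrome))
            self.cells[addr] = encode(nibble)   # write back the corrected word
        return nibble


mem = ToyEccMemory()
mem.write(0x10, 0b1011)              # "initialized at boot"
mem.decay(0x10, 5)                   # bit fades while the word is never touched
log_before_read = list(mem.error_log)
value = mem.read(0x10)               # first access, e.g. during a scrub
```

In this toy, log_before_read is empty and the error only appears in error_log 
after the read. A multi-bit failure is beyond what this small code can 
correct, which is the case that would matter for the panic scenario.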


I have completed a scrub of all 3 pools in the Sol 10 u8 system. All pools and 
vdevs are clean, no errors. I am going to leave it idling until tomorrow and 
then start another scrub of the root pool, which is the smallest. My 
expectation is that it will kernel panic. I'll save the core file and see what 
I can divine from that.

I'm also going to make up a cable that will let me look at the PSU rails under 
load with a scope. As the Z400s have been such fine machines I don't see myself 
getting rid of them. Though I am flirting with getting a Z820 with 10-20 cores 
and 256-512 GB of RAM. Not that long ago I could have made a very good living 
off of such a machine processing seismic data. But now I shall need another job 
to justify feeding it electricity and cooling. It blows my mind that 20x 3 GHz 
cores, 512 GB of RAM and 30+ TB of triple parity RAIDZ now cost less than what 
I paid for my Ultra 20.

I've got a cheap Chinese PSU tester, but a DSO will do a much better job. With 
a modest amount of fiddling I can set up a repeatable PSU test to be done once 
a year. I also have an EDS-88A in-circuit cap tester and HP 4884A & 4285A LCR 
meters. So I'm rather heavy on the T&M kit. In the mid-90s my lab gear would 
have cost around $500k. All bought for pennies on the dollar via eBay and the 
T&M repair lists. I'm rather in awe of what it can do.

I know I can record at least 20 million samples, possibly more. I have several 
DSOs as well as analog scopes. So I'll set up a DSO to capture a single-shot 
trace when I start the scrub. Just in case there is a transient event.
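
As a rough guide to what 20 million samples buys, the capture window is just 
memory depth divided by sample rate. The small Python sketch below works that 
out. The sample rates listed are assumptions chosen for illustration, not 
settings from any particular scope.

```python
def capture_window_seconds(memory_depth_samples, sample_rate_hz):
    """Length of a single-shot record: depth divided by sample rate."""
    return memory_depth_samples / sample_rate_hz


DEPTH = 20_000_000                        # samples, the figure mentioned above

# Hypothetical sample rates, for illustration only.
for rate_hz in (1e6, 10e6, 100e6, 1e9):
    window = capture_window_seconds(DEPTH, rate_hz)
    print(f"{rate_hz:>14,.0f} Sa/s -> {window:g} s")
```

The trade-off is that a longer window means coarser time resolution, so a 
fast transient on a rail may need a faster rate and therefore a much shorter 
record.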

I think it is worth noting that the most recent ECC error was from 3 years 
ago. I don't recall ever having a kernel panic on this system, which is running 
Hipster 2017.10 and is what I am typing this on. And I have never had a POST 
error reported on any system. Strangely, the long POST times went away on at 
least one of the 2x 4 slot machines, which is now my dedicated OS test and 
Windows/Linux machine.

The HP BIOS is maddeningly opaque. I am planning to build the EFI 2.0 shell on 
a USB stick as that will give me the functionality of a traditional ROM monitor 
program. 

I'm actually planning to replace the 2 GB DIMMs with 4 or 8 GB DIMMs in at 
least a couple of the Z400s. But it's become a grudge match with the machine. I 
want to find the bad DIMM. And I want to be able to do it easily and reliably 
at any time in the future even though I've run into the issue 2 times in 30 
years and am statistically unlikely to live long enough to have it happen a 3rd 
time. I don't like letting machines mess with me.

Have Fun!
Reg
  
_______________________________________________
openindiana-discuss mailing list
openindiana-discuss@openindiana.org
https://openindiana.org/mailman/listinfo/openindiana-discuss
