Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

Hawa, Hanna Mon, 17 Jun 2019 06:01:21 -0700

+static void al_a57_edac_l2merrsr(void *arg)
+{

+    edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");


How do we know this is corrected?

If looks like L2CTLR_EL1[20] might force fatal 1/0 to map to 
uncorrected/corrected. Is
this what you are depending on here?

No - not on this. Reporting all the errors as corrected seems to be bad.

Can i be depends on fatal field?


That is described as "set to 1 on the first memory error that caused a Data 
Abort". I
assume this is one of the parity-error external-aborts.

If the repeat counter shows, say, 2, and fatal is set, you only know that at 
least one of
these errors caused an abort. But it could have been all three. The repeat 
counter only
matches against the RAMID and friends, otherwise the error is counted in 
'other'.

I don't think there is a right thing to do here, (other than increase the 
scrubbing
frequency). As you can only feed one error into edac at a time then:

if (fatal)
     edac_device_handle_ue(edac_dev, 0, 0, "L2 Error");
else
     edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");


seems reasonable. You're reporting the most severe, and 'other/repeat' counter 
values just
go missing.

I had print the values of 'other/repeat' to be noticed.

How can L2CTLR_EL1[20] force fatal?


I don't think it can, on a second reading, it looks to be even more complicated 
than I
thought! That bit is described as disabling forwarding of uncorrected data, but 
it looks
like the uncorrected data never actually reaches the other end. (I'm unsure 
what 'flush'
means in this context.)
I was looking for reasons you could 'know' that any reported error was 
corrected. This was
just a bad suggestion!

Is there interrupt for un-correctable error?
Does 'asynchronous errors' in L2 used to report UE?

In case no interrupt, can we use die-notifier subsystem to check if anyerror had occur while system shutdown?

+        cluster = topology_physical_package_id(cpu);


Hmm, I'm not sure cluster==package is guaranteed to be true forever.

If you describe the L2MERRSR_EL1 cpu mapping in your DT you could use that. 
Otherwise
pulling out the DT using something like the arch code's parse_cluster().

I rely on that it's alpine SoC specific driver.


... and that the topology code hasn't changed to really know what a package is:
https://lore.kernel.org/lkml/20190529211340.17087-2-atish.pa...@wdc.com/T/#u

As what you really want to know is 'same L2?', and you're holding the 
cpu_read_lock(),
would struct cacheinfo's shared_cpu_map be a better fit?

This would be done by something like a cpu-mask of cache:shared_cpu_map's for 
the L2's
you've visited. It removes the dependency on package==L2, and insulates you 
from the
cpu-numbering not being exactly as you expect.

I'll add dt property that point to L2-cache node (phandle), then it'llbe easy to create cpu-mask with all cores that point to same l2 cache.


Thanks,
Hanna

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

Reply via email to