+static void al_a57_edac_l2merrsr(void *arg)
+{
+ edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
How do we know this is corrected?
If looks like L2CTLR_EL1[20] might force fatal 1/0 to map to
uncorrected/corrected. Is
this what you are depending on here?
No - not on this. Reporting all the errors as corrected seems to be bad.
Can i be depends on fatal field?
That is described as "set to 1 on the first memory error that caused a Data
Abort". I
assume this is one of the parity-error external-aborts.
If the repeat counter shows, say, 2, and fatal is set, you only know that at
least one of
these errors caused an abort. But it could have been all three. The repeat
counter only
matches against the RAMID and friends, otherwise the error is counted in
'other'.
I don't think there is a right thing to do here, (other than increase the
scrubbing
frequency). As you can only feed one error into edac at a time then:
if (fatal)
edac_device_handle_ue(edac_dev, 0, 0, "L2 Error");
else
edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
seems reasonable. You're reporting the most severe, and 'other/repeat' counter
values just
go missing.
I had print the values of 'other/repeat' to be noticed.
How can L2CTLR_EL1[20] force fatal?
I don't think it can, on a second reading, it looks to be even more complicated
than I
thought! That bit is described as disabling forwarding of uncorrected data, but
it looks
like the uncorrected data never actually reaches the other end. (I'm unsure
what 'flush'
means in this context.)
I was looking for reasons you could 'know' that any reported error was
corrected. This was
just a bad suggestion!
Is there interrupt for un-correctable error?
Does 'asynchronous errors' in L2 used to report UE?
In case no interrupt, can we use die-notifier subsystem to check if any
error had occur while system shutdown?
+ cluster = topology_physical_package_id(cpu);
Hmm, I'm not sure cluster==package is guaranteed to be true forever.
If you describe the L2MERRSR_EL1 cpu mapping in your DT you could use that.
Otherwise
pulling out the DT using something like the arch code's parse_cluster().
I rely on that it's alpine SoC specific driver.
... and that the topology code hasn't changed to really know what a package is:
https://lore.kernel.org/lkml/20190529211340.17087-2-atish.pa...@wdc.com/T/#u
As what you really want to know is 'same L2?', and you're holding the
cpu_read_lock(),
would struct cacheinfo's shared_cpu_map be a better fit?
This would be done by something like a cpu-mask of cache:shared_cpu_map's for
the L2's
you've visited. It removes the dependency on package==L2, and insulates you
from the
cpu-numbering not being exactly as you expect.
I'll add dt property that point to L2-cache node (phandle), then it'll
be easy to create cpu-mask with all cores that point to same l2 cache.
Thanks,
Hanna