On Fri, Apr 19, 2019 at 02:29:11AM +0200, Borislav Petkov wrote:
> On Thu, Apr 18, 2019 at 05:07:45PM -0700, Luck, Tony wrote:
> > On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote:
> > > Which reminds me, Tony, I think all those debugging files "pfn"
> > > and "array" and the one you add now, should all be under a
> > > CONFIG_RAS_CEC_DEBUG which is default off and used only for development.
> > > Mind adding that too pls?
> >
> > Patch below, on top of previous patch. Note that I didn't move "enable"
> > into the RAS_CEC_DEBUG code. I think it has some value even on
> > production systems.
>
> And that value is? Usecase?

Suppose that an entire device on a DIMM fails. Systems with the right
type of DIMM (X4) and a memory controller that implements
https://en.wikipedia.org/wiki/Chipkill (Intel calls this "SDDC") can
continue running ... but there will be a lot of corrected errors from
a vast range of different pages.

After fifteen or so errors Linux will trigger storm mode and the user
will see:

	mce: CMCI storm detected: switching to poll mode

on the console.

As we poll we'll find errors and hand them to CEC. But because the
errors come from far more than 512 distinct pages, CEC will never manage
to get a count above 1 for any page before it drops that entry to make
space for a new one.

So the only indication the user gets that something is wrong is that
storm warning. Only the lack of a following "storm subsided" message
tells them that errors are still happening.

This amounts to a serviceability failure ... lack of useful diagnostics
about a problem.

Now there isn't really anything better that CEC can do in this
situation. It won't help to have a bigger array. Taking pages offline
wouldn't solve the problem (though if that did happen, at least it
would break the silence).

Same situation for other DRAM failure modes that affect a wide range
of pages (rank, bank, perhaps row ... though all the errors from a
single row failure might fit in the CEC array).

Allowing the user to bypass CEC (without a reboot ... cloud folks hate
to reboot their systems) would allow the sysadmin to see what is
happening (either via /dev/mcelog, or via the EDAC driver).

-Tony
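
A rough sketch of the eviction problem described above, as a toy Python
model. This is not the code in the CEC driver: the function name
max_count_seen() is made up for illustration, and the "drop the least
recently seen entry when full" policy is a simplifying assumption. Only
the 512-entry array size comes from the discussion above.

```python
from collections import OrderedDict

CEC_SLOTS = 512  # array size discussed above


def max_count_seen(pfns, slots=CEC_SLOTS):
    """Feed a stream of page frame numbers into a toy fixed-size table
    and return the highest per-page error count ever reached.

    Assumption: when the table is full, the least recently seen entry
    is dropped to make room. The real driver's policy differs in detail.
    """
    table = OrderedDict()  # pfn -> count, least recently seen first
    highest = 0
    for pfn in pfns:
        if pfn in table:
            table[pfn] += 1
            table.move_to_end(pfn)      # mark as most recently seen
        else:
            if len(table) >= slots:
                table.popitem(last=False)  # drop least recently seen
            table[pfn] = 1
        highest = max(highest, table[pfn])
    return highest


# Errors spread round-robin over one page more than the table holds:
# every entry is dropped before its page is hit again.
wide = max_count_seen(i % (CEC_SLOTS + 1) for i in range(100_000))

# Same number of errors confined to pages that all fit in the table.
narrow = max_count_seen(i % CEC_SLOTS for i in range(100_000))

print(wide, narrow)
```

In the wide case the count stays at 1 no matter how many errors are
logged, while in the narrow case the counts climb steadily. The
round-robin pattern is a deliberate worst case; errors from a failed
device would be less regular, but with far more pages than slots the
effect should be much the same.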