On Fri, Apr 19, 2019 at 08:04:01AM -0700, Luck, Tony wrote:
> Now there isn't really anything better that CEC can do in
> this situation. It won't help to have a bigger array. Taking
> pages offline wouldn't solve the problem (though if that
> did happen at least it would break the silence).
>
> Same situation for other DRAM failure modes that affect a
> wide range of pages (rank, bank, perhaps row ... though all
> the errors from a single row failure might fit in the CEC array).
>
> Allowing the user to bypass CEC (without a reboot ... cloud folks
> hate to reboot their systems) would allow the sysadmin to see
> what is happening (either via /dev/mcelog, or via EDAC driver).

Err, this all sounds to me like the storm detection code should
*automatically* disable the CEC in such cases, I'd say. Because I don't
see a cloud admin going into debugfs and turning it off. Rather, if
the detection heuristic we use is smart enough, disabling it
automatically should be a much better serviceability action.

Hmmm?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.