On Thu, Apr 18, 2019 at 04:58:22PM -0700, Cong Wang wrote:
> No, it is all about whether we should break users' expectation.

What user expectation?

> This doesn't sounds like a valid reason for us to break users'
> expectation.

I think it is *you* who has some sort of "expectation" but that
"expectation" is wrong.

> Prior to CONFIG_RAS, mcelog just works fine for users (at least Intel
> users). Suddenly after enabling CONFIG_RAS in kernel, mcelog will
> no longer receive any correctable memory errors _silently_.

That is, of course, wrong too.

> What's more, we don't even have rasdaemon running in our system, so

Are you saying "we" to mean "we the users" or some company "we"?

And that is wrong too, there's at least one rasdaemon:

http://git.infradead.org/users/mchehab/rasdaemon.git

> there is no consumer of RAS CEC,

RAS CEC doesn't need a consumer. You're misunderstanding the whole
concept of the error collector.

> these errors just simply disappear from users' expected place.

They "disappear" because you have CONFIG_RAS_CEC enabled. But they don't
really disappear - they're collected by the thing to filter out only the
pages which keep generating errors constantly and those get
soft-offlined.

The sporadic ones simply get ignored because they don't happen again and
are only result of alpha particles or overheating conditions or
whatever.

Now here's the CEC help text:

config RAS_CEC
        bool "Correctable Errors Collector"
        depends on X86_MCE && MEMORY_FAILURE && DEBUG_FS
        ---help---
          This is a small cache which collects correctable memory errors per 4K
          page PFN and counts their repeated occurrence. Once the counter for a
          PFN overflows, we try to soft-offline that page as we take it to mean
          that it has reached a relatively high error count and would probably
          be best if we don't use it anymore.

          Bear in mind that this is absolutely useless if your platform doesn't
          have ECC DIMMs and doesn't have DRAM ECC checking enabled in the BIOS.

you can tell me what in that text is not clear so that I can make it
more clear and obvious what that thing is.

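To make the filtering idea from the help text concrete, here is a toy
userspace sketch of a per-PFN counter. It is an illustration only and not
the kernel's implementation: the class name ToyCollector, the threshold
of 3 and the "offlined" list are all assumptions made up for this
example. The only things taken from the help text are the 4K page
granularity and the "count repeats, then soft-offline" behaviour.

```python
from collections import defaultdict

PAGE_SHIFT = 12        # 4K pages, as stated in the help text
COUNT_THRESHOLD = 3    # hypothetical value, chosen only for this example


class ToyCollector:
    """Toy model of a per-PFN correctable-error counter (not kernel code)."""

    def __init__(self, threshold=COUNT_THRESHOLD):
        self.threshold = threshold
        self.counts = defaultdict(int)  # PFN -> errors seen so far
        self.offlined = []              # PFNs we would ask to soft-offline

    def report(self, phys_addr):
        """Record one correctable error.

        Returns True if the page reached the threshold and was marked
        for soft-offlining, False if the error was merely counted.
        """
        pfn = phys_addr >> PAGE_SHIFT
        self.counts[pfn] += 1
        if self.counts[pfn] < self.threshold:
            return False
        # Repeat offender: stop tracking it and mark it for offlining.
        del self.counts[pfn]
        self.offlined.append(pfn)
        return True
```

In this sketch a page that reports a single error simply sits in the
table with a count of 1 and nothing else happens, while a page that
keeps reporting errors ends up in "offlined". Nothing here forwards the
individual errors to a userspace logger, which mirrors the point that
the collector needs no consumer.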
> I know CONFIG_RAS is new feature supposed to replace MCELOG,

No, it isn't. CONFIG_RAS is supposed to collect all the RAS-related
functionality in the kernel and it looks like you have some
misconceptions about it.

> but they can co-exist in kernel config, which means mcelog should
> continue to work as before until it gets fully replaced.

For that you need to enable X86_MCELOG_LEGACY.

And let me repeat it again - if you want to collect errors in userspace,
do not enable RAS_CEC at all.

> Even the following PoC change could make this situation better,
> because with this change when we enable CONFIG_RAS,mcelog
> will break _loudly_ rather than just silently, users will notice mcelog
> is no longer supported and will look for its alternative choice.

You have somehow put in your head that CONFIG_RAS is the counterpart of
CONFIG_X86_MCELOG_LEGACY. Which is *simply* *not* *true*. And the moment
you realize that, then you'll be a step further in the right direction.

So enable X86_MCELOG_LEGACY and you can collect all the errors you wish.
And there's a rasdaemon which you can use too, as I pointed out above,
if you don't want mcelog.

CEC is something *completely* different and its purpose is to run in the
kernel and prevent users and admins from getting upset unnecessarily
over every sporadic correctable error and, just because an alpha
particle flew through their DIMMs, starting to run in headless chicken
mode, trying to RMA perfectly good hardware.

Now, if any of that above still doesn't make it clear, please state what
you're trying to achieve and I'll try to help.

--
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.