On Wed, 2014-05-28 at 10:49 +0800, Chen Yucong wrote:
> > From: Borislav Petkov <b...@suse.de>
> >
> > Hi all,
> >
> > this is something Tony and I have been working on behind the curtains
> > recently. Here it is in a RFC form, it passes quick testing in kvm. Let
> > me send it out before I start hammering on it on a real machine.
> >
> > More indepth info about what it is and what it does is in patch 1/3.
> >
> > As always, comments and suggestions are most welcome.
> >
> > Thanks.
>
> What's the point of this patch set?
> My understanding is that if there are some (COUNT_MASK) corrected DRAM
> ECC errors for a specific page frame, we can believe that the page frame
> is so ill that it should be isolated as soon as possible.
>
> The question is: memory_failure() cannot be used to isolate a page
> frame that is in use by the kernel, because it just poisons the page
> and the error is IGNORED. memory_failure() is mostly used for handling
> AR/AO-type errors on page frames that userspace tasks are currently
> using.
>
> Although the affected page frame is very ill, it is not dead and can
> still work. However, memory_failure() may kill userspace tasks,
> especially for page frames holding dynamic data rather than
> file-backed (file/swap) data.
>
> So I do not think it is a good idea to use memory_failure() directly
> in this patch set.
>
I second that. You can't poison a page and potentially kill an
application just because an arbitrarily chosen number of corrected
errors has been exceeded. That would be an anti-RAS feature: less
reliability and availability.

A possible alternative would be to soft-offline the page. This is
currently done in APEI code when corrected memory error thresholds are
exceeded and reported by UEFI via a generic hardware error source
(GHES). The example is in ghes_handle_memory_failure(), where we call
memory_failure_queue(pfn, 0, flags) with flags = MF_SOFT_OFFLINE.

- Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/