On Wed, 2014-05-28 at 10:49 +0800, Chen Yucong wrote:
> > From: Borislav Petkov <b...@suse.de>
> > 
> > Hi all,
> > 
> > this is something Tony and I have been working on behind the curtains
> > recently. Here it is in a RFC form, it passes quick testing in kvm. Let
> > me send it out before I start hammering on it on a real machine.
> > 
> > More indepth info about what it is and what it does is in patch 1/3.
> > 
> > As always, comments and suggestions are most welcome.
> > 
> > Thanks.
> 
> What's the point of this patch set?
> My understanding is that if a page frame accumulates a certain number
> (COUNT_MASK) of corrected DRAM ECC errors, we can assume the page frame
> is so ill that it should be isolated as soon as possible.
> 
> The question is: memory_failure() cannot be used to isolate a page
> frame that is in use by the kernel, because it just poisons the page
> and the error is IGNORED. memory_failure() is mostly used for handling
> AR/AO type errors affecting page frames that userspace tasks are
> currently using.
> 
> Although such a page frame is very ill, it is not dead and can still
> work. However, memory_failure() may kill userspace tasks, especially
> for page frames holding dynamic data rather than file-backed
> (file/swap) data.
> 
> So I do not think it is a good idea to use memory_failure() directly
> in this patch set.
> 

I second that. You can't poison a page and potentially kill an
application just because an arbitrarily chosen number of corrected
errors has been exceeded. That would be an anti-RAS feature: less
reliability and availability.
A possible alternative would be to soft-offline the page. This is
currently done in APEI code when corrected memory error thresholds are
exceeded and reported by UEFI via a generic hardware error source
(GHES). 
See ghes_handle_memory_failure(), where we call
memory_failure_queue(pfn, 0, flags) with flags = MF_SOFT_OFFLINE.

- Max
