On Thu, Mar 28, 2019 at 12:09:05PM +0530, Aravinda Prasad wrote: > > > On Thursday 28 March 2019 06:18 AM, David Gibson wrote: > > On Tue, Mar 26, 2019 at 12:22:50PM +0530, Aravinda Prasad wrote: > >> > >> > >> On Tuesday 26 March 2019 09:49 AM, David Gibson wrote: > >>> On Mon, Mar 25, 2019 at 02:31:10PM +0530, Aravinda Prasad wrote: > >>>> > >>>> > >>>> On Monday 25 March 2019 12:03 PM, David Gibson wrote: > >>>>> On Fri, Mar 22, 2019 at 12:04:25PM +0530, Aravinda Prasad wrote: > >>>>>> Block VM migration requests until the machine check > >>>>>> error handling is complete as (i) these errors are > >>>>>> specific to the source hardware and is irrelevant on > >>>>>> the target hardware, (ii) these errors cause data > >>>>>> corruption and should be handled before migration. > >>>>>> > >>>>>> Signed-off-by: Aravinda Prasad <aravi...@linux.vnet.ibm.com> > >>>>>> --- > >>>>>> hw/ppc/spapr_events.c | 17 +++++++++++++++++ > >>>>>> hw/ppc/spapr_rtas.c | 4 ++++ > >>>>>> include/hw/ppc/spapr.h | 3 +++ > >>>>>> 3 files changed, 24 insertions(+) > >>>>>> > >>>>>> diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c > >>>>>> index d7cc0a4..6356a38 100644 > >>>>>> --- a/hw/ppc/spapr_events.c > >>>>>> +++ b/hw/ppc/spapr_events.c > >>>>>> @@ -41,6 +41,7 @@ > >>>>>> #include "qemu/bcd.h" > >>>>>> #include "hw/ppc/spapr_ovec.h" > >>>>>> #include <libfdt.h> > >>>>>> +#include "migration/blocker.h" > >>>>>> > >>>>>> #define RTAS_LOG_VERSION_MASK 0xff000000 > >>>>>> #define RTAS_LOG_VERSION_6 0x06000000 > >>>>>> @@ -866,6 +867,22 @@ static void spapr_mce_build_elog(PowerPCCPU *cpu, > >>>>>> bool recovered) > >>>>>> void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered) > >>>>>> { > >>>>>> SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine()); > >>>>>> + int ret; > >>>>>> + Error *local_err = NULL; > >>>>>> + > >>>>>> + error_setg(&spapr->migration_blocker, > >>>>>> + "Live migration not supported during machine check > >>>>>> handling"); > >>>>>> + ret = migrate_add_blocker(spapr->migration_blocker, &local_err); > >>>>>> + if (ret < 0) { > >>>>>> + /* > >>>>>> + * We don't want to abort and let the migration to continue. > >>>>>> In a > >>>>>> + * rare case, the machine check handler will run on the target > >>>>>> + * hardware. Though this is not preferable, it is better than > >>>>>> aborting > >>>>>> + * the migration or killing the VM. > >>>>> > >>>>> Can't you just discard the error in that case? > >>>> > >>>> I am actually discarding the error. Do you mean I don't have to print > >>>> the below warning or am I missing something? > >>> > >>> Um.. I guess we should keep the warning. The tricky thing here is > >>> that the information on the machine check is no longer relevant after > >>> the migration, however presumably the event which triggered it might > >>> have already corrupted guest memory, which *would* be relevant to the > >>> guest after migration. Not sure what to do about that. > >> > >> I think the information is still relevant after the migration, because > >> it includes the effective address in error and the context. If the > >> corruption is in the guest kernel context then the machine check handler > >> inside the guest kernel will call panic(). If the corruption is in user > >> space, then the application is sent a SIGBUS. > > > > Good point. > > > >> But the problem is when the machine check handler hardware poisons the > >> page frame backing the corrupt memory to avoid reuse of that page frame. > >> After migration, the page frame backing the corrupt memory is clean, but > >> we poison the clean page frame. > > > > This is the guest side handler? It sounds like we we need two > > variants of the notification that say whether or not the area is still > > bad. > > Yes this is guest side handler. > > Can we just let the guest machine check handler poison a clean page as a > migration during a machine check is a rare occurrence. We may at most > lose a clean page which is a trade-off for additional complexity in the > machine check handler.
Yeah, that seems reasonable. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
signature.asc
Description: PGP signature