Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?

Markus Armbruster Mon, 19 Jul 2021 00:19:05 -0700

David Gibson <da...@gibson.dropbear.id.au> writes:

> On Thu, Jul 15, 2021 at 03:32:06PM +0200, Markus Armbruster wrote:
>> Commit 2500fb423a "migration: Include migration support for machine
>> check handling" adds this:
>> 
>>     ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, &local_err);
>>     if (ret == -EBUSY) {
>>         /*
>>          * We don't want to abort so we let the migration to continue.
>>          * In a rare case, the machine check handler will run on the target.
>>          * Though this is not preferable, it is better than aborting
>>          * the migration or killing the VM.
>>          */
>>         warn_report("Received a fwnmi while migration was in progress");
>>     }
>> 
>> migrate_add_blocker() can fail in two ways:
>> 
>> 1. -EBUSY: migration is already in progress
>> 
>>    Ignoring this one is clearly intentional.  The comment explains why.
>>    I'm taking it at face value (I'm a spapr ignoramus).
>
> Right.  The argument isn't really about papr particularly, except
> insofar as understanding what fwnmi is.  fwnmi (FirmWare assisted NMI)
> is a reporting mechanism for certain low-level hardware failures
> (think memory ECC or cpu level faults, IIRC).  If we migrate between
> detecting and reporting the error, then the particulars we report will
> be mostly meaningless since they relate to hardware we're no longer
> running on.  Hence the migration blocker.
>
> However, migrating away from a (non-fatal) fwnmi error is a pretty
> reasonable response, so we don't want to actually fail a migration if
> its already in progress.
>
>>    Aside: I doubt
>>    the warning is going to help users.
>
> You're probably right, but it's not very clear how to do better.  It
> might possibly help someone in tech support explain why the reported
> fwnmi doesn't seem to match the hardware the guest is (now) running
> on.

Perhaps pointing to the actual problem could help: the FWNMI's
information is mostly meaningless.

>> 2. -EACCES: we're running with -only-migratable
>> 
>>    Why may we ignore -only-migratable here?
>
> Short answer: because I didn't think about that case.  Long answer:
> I think we probably shoud ignore it anyway.  As above, receiving a
> fwnmi doesn't really prevent migration, it just means that if you're
> unlucky it can report stale information.  Since migrating away from a
> possibly-dubious host would be a reasonable response to a non-fatal
> fwnmi, I don't think we want to simply prohibit fwnmi entirely with
> -only-migratable.

I think the comment text and placement could be improved to make clear
ignoring this failure is intentional, too.  How do you like the
following?

diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c
index a8f2cc6bdc..54d8e856d3 100644
--- a/hw/ppc/spapr_events.c
+++ b/hw/ppc/spapr_events.c
@@ -911,16 +911,14 @@ void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered)
         }
     }
 
+    /*
+     * Try to block migration while FWNMI is being handled, so the
+     * machine check handler runs where the information passed to it
+     * actually makes sense.  This won't actually block migration,
+     * only delay it slightly.  If the attempt fails, carry on.
+     */
     ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, NULL);
     if (ret == -EBUSY) {
-        /*
-         * We don't want to abort so we let the migration to continue.
-         * In a rare case, the machine check handler will run on the target.
-         * Though this is not preferable, it is better than aborting
-         * the migration or killing the VM. It is okay to call
-         * migrate_del_blocker on a blocker that was not added (which the
-         * nmi-interlock handler would do when it's called after this).
-         */
         warn_report("Received a fwnmi while migration was in progress");
     }
 
>> By the way, we leak @local_err on failure.  I'll post a patch, but I'd
>> like my question answered first.

Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?

Reply via email to