On Mon, Jul 19, 2021 at 12:41:09PM +0200, Markus Armbruster wrote:
> David Gibson <da...@gibson.dropbear.id.au> writes:
> 
> > On Mon, Jul 19, 2021 at 09:18:07AM +0200, Markus Armbruster wrote:
> >> David Gibson <da...@gibson.dropbear.id.au> writes:
> >> 
> >> > On Thu, Jul 15, 2021 at 03:32:06PM +0200, Markus Armbruster wrote:
> >> >> Commit 2500fb423a "migration: Include migration support for machine
> >> >> check handling" adds this:
> >> >> 
> >> >>     ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, 
> >> >> &local_err);
> >> >>     if (ret == -EBUSY) {
> >> >>         /*
> >> >>          * We don't want to abort so we let the migration to continue.
> >> >>          * In a rare case, the machine check handler will run on the 
> >> >> target.
> >> >>          * Though this is not preferable, it is better than aborting
> >> >>          * the migration or killing the VM.
> >> >>          */
> >> >>         warn_report("Received a fwnmi while migration was in progress");
> >> >>     }
> >> >> 
> >> >> migrate_add_blocker() can fail in two ways:
> >> >> 
> >> >> 1. -EBUSY: migration is already in progress
> >> >> 
> >> >>    Ignoring this one is clearly intentional.  The comment explains why.
> >> >>    I'm taking it at face value (I'm a spapr ignoramus).
> >> >
> >> > Right.  The argument isn't really about papr particularly, except
> >> > insofar as understanding what fwnmi is.  fwnmi (FirmWare assisted NMI)
> >> > is a reporting mechanism for certain low-level hardware failures
> >> > (think memory ECC or cpu level faults, IIRC).  If we migrate between
> >> > detecting and reporting the error, then the particulars we report will
> >> > be mostly meaningless since they relate to hardware we're no longer
> >> > running on.  Hence the migration blocker.
> >> >
> >> > However, migrating away from a (non-fatal) fwnmi error is a pretty
> >> > reasonable response, so we don't want to actually fail a migration if
> >> > its already in progress.
> >> >
> >> >>    Aside: I doubt
> >> >>    the warning is going to help users.
> >> >
> >> > You're probably right, but it's not very clear how to do better.  It
> >> > might possibly help someone in tech support explain why the reported
> >> > fwnmi doesn't seem to match the hardware the guest is (now) running
> >> > on.
> >> 
> >> Perhaps pointing to the actual problem could help: the FWNMI's
> >> information is mostly meaningless.
> >
> > Sorry, I don't follow what you're suggesting.
> 
> We warn
> 
>     warning: Received a fwnmi while migration was in progress
> 
> when we fail to block migration because it's already in progress.
> But what does this mean?  Perhaps warn like this:
> 
>     warning: FWNMI while migration is in progress
>     The guest's report for this may be less than useful.
> 
> My phrasing may well be off, but I hope you get the idea.

I see your point.  It may be some time before this reaches the top of
priority list, however.

> Note that we keep quiet when we fail to block migration due to
> -only-migrate.  I agree with that.  The failure makes a difference only
> when migration gets triggered in a narrow time window, which should be
> quite rare.  Would be nice to warn when migration does get triggered in
> that time window, though.  Not sure it's worth the trouble, in
> particular if we'd have to create infrastructure first.
> 
> >
> >> 
> >> >> 2. -EACCES: we're running with -only-migratable
> >> >> 
> >> >>    Why may we ignore -only-migratable here?
> >> >
> >> > Short answer: because I didn't think about that case.  Long answer:
> >> > I think we probably shoud ignore it anyway.  As above, receiving a
> >> > fwnmi doesn't really prevent migration, it just means that if you're
> >> > unlucky it can report stale information.  Since migrating away from a
> >> > possibly-dubious host would be a reasonable response to a non-fatal
> >> > fwnmi, I don't think we want to simply prohibit fwnmi entirely with
> >> > -only-migratable.
> >> 
> >> I think the comment text and placement could be improved to make clear
> >> ignoring this failure is intentional, too.  How do you like the
> >> following?
> >
> > That's fair..
> >
> >> 
> >> diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c
> >> index a8f2cc6bdc..54d8e856d3 100644
> >> --- a/hw/ppc/spapr_events.c
> >> +++ b/hw/ppc/spapr_events.c
> >> @@ -911,16 +911,14 @@ void spapr_mce_req_event(PowerPCCPU *cpu, bool 
> >> recovered)
> >>          }
> >>      }
> >>  
> >> +    /*
> >> +     * Try to block migration while FWNMI is being handled, so the
> >> +     * machine check handler runs where the information passed to it
> >> +     * actually makes sense.  This won't actually block migration,
> >> +     * only delay it slightly.  If the attempt fails, carry on.
> >> +     */
> >>      ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, NULL);
> >>      if (ret == -EBUSY) {
> >> -        /*
> >> -         * We don't want to abort so we let the migration to continue.
> >> -         * In a rare case, the machine check handler will run on the 
> >> target.
> >> -         * Though this is not preferable, it is better than aborting
> >> -         * the migration or killing the VM. It is okay to call
> >> -         * migrate_del_blocker on a blocker that was not added (which the
> >> -         * nmi-interlock handler would do when it's called after this).
> >> -         */
> >>          warn_report("Received a fwnmi while migration was in progress");
> >>      }
> >
> > LGTM.
> 
> Thanks, I'll post this.
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson

Attachment: signature.asc
Description: PGP signature

Reply via email to