On Mon, Sep 19, 2022 at 9:58 AM Jan Beulich <jbeul...@suse.com> wrote:
> On 19.09.2022 15:24, Tamas K Lengyel wrote:
> > On Mon, Sep 19, 2022 at 9:21 AM Jan Beulich <jbeul...@suse.com> wrote:
> >>
> >> On 19.09.2022 14:25, Tamas K Lengyel wrote:
> >>> On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich <jbeul...@suse.com> wrote:
> >>>>
> >>>> On 16.09.2022 23:35, Boris Ostrovsky wrote:
> >>>>>
> >>>>> On 9/16/22 8:52 AM, Jan Beulich wrote:
> >>>>>> On 15.09.2022 16:01, Tamas K Lengyel wrote:
> >>>>>>> While experimenting with the vPMU subsystem an ASSERT failure was
> >>>>>>> observed in vmx_find_msr because the vcpu_runnable state was true.
> >>>>>>>
> >>>>>>> The root cause of the bug appears to be the fact that the vPMU subsystem
> >>>>>>> doesn't save its state on context_switch.
> >>
> >> For the further reply below - is this actually true? What is the
> >> vpmu_switch_from() (resolving to vpmu_save()) doing then early
> >> in context_switch()?
> >>
> >>>>>>> The vpmu_load function will attempt
> >>>>>>> to gather the PMU state if its still loaded two different ways:
> >>>>>>>     1. if the current pcpu is not where the vcpu ran before doing a remote save
> >>>>>>>     2. if the current pcpu had another vcpu active before doing a local save
> >>>>>>>
> >>>>>>> However, in case the prev vcpu is being rescheduled on another pcpu its state
> >>>>>>> has already changed and vcpu_runnable is returning true, thus #2 will trip the
> >>>>>>> ASSERT. The only way to avoid this race condition is to make sure the
> >>>>>>> prev vcpu is paused while being checked and its context saved. Once the prev
> >>>>>>> vcpu is resumed and does #1 it will find its state already saved.
> >>>>>> While I consider this explanation plausible, I'm worried:
> >>>>>>
> >>>>>>> --- a/xen/arch/x86/cpu/vpmu.c
> >>>>>>> +++ b/xen/arch/x86/cpu/vpmu.c
> >>>>>>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest)
> >>>>>>>          vpmu = vcpu_vpmu(prev);
> >>>>>>>
> >>>>>>>          /* Someone ran here before us */
> >>>>>>> +        vcpu_pause(prev);
> >>>>>>>          vpmu_save_force(prev);
> >>>>>>>          vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
> >>>>>>> +        vcpu_unpause(prev);
> >>>>>>>
> >>>>>>>          vpmu = vcpu_vpmu(v);
> >>>>>>>      }
> >>>>>> We're running with IRQs off here, yet vcpu_pause() waits for the vcpu
> >>>>>> to actually be de-scheduled. Even with IRQs on this is already a
> >>>>>> relatively heavy operation (also including its impact on the remote
> >>>>>> side). Additionally the function is called from context_switch(), and
> >>>>>> I'm unsure of the usability of vcpu_pause() on such a path. In
> >>>>>> particular: Is there a risk of two CPUs doing this mutually to one
> >>>>>> another? If so, is deadlocking excluded?
> >>>>>>
> >>>>>> Hence at the very least I think the description wants extending, to
> >>>>>> discuss the safety of the change.
> >>>>>>
> >>>>>> Boris - any chance you could comment here? Iirc that's code you did
> >>>>>> introduce.
> >>>>>
> >>>>>
> >>>>> Is the assertion in vmx_find_msr() really needs to be for runnable
> >>>>> vcpu or can it be a check on whether vcpu is actually running
> >>>>> (e.g. RUNSTATE_running)?
> >>>>
> >>>> You cannot safely check for "running", as "runnable" may transition
> >>>> to/from "running" behind your back.
> >>>
> >>> The more I look at this the more I think the only sensible solution is
> >>> to have the vPMU state be saved on vmexit for all vCPUs.
> >>
> >> Do you really mean vmexit? It would suffice if state was reliably
> >> saved during context-switch-out, wouldn't it? At that point the
> >> vCPU can't be resumed on another pCPU, yet.
> >>
> >>> That way all
> >>> this having to figure out where and when a context needs saving during
> >>> scheduling goes away. Yes, it adds a bit of overhead for cases where
> >>> the vCPU will resume on the same pCPU and that context saved could
> >>> have been skipped,
> >>
> >> If you really mean vmexit, then I'm inclined to say that's more
> >> than just "a bit of overhead". I'd agree if you really meant
> >> context-switch-out, but as said further up it looks to me as if
> >> that was already occurring. Apparently I'm overlooking something
> >> crucial ...
> >
> > Yes, the current setup is doing exactly that, saving the vPMU context
> > on context-switch-out, and that's where the ASSERT failure occurs
> > because the vCPU it needs to save the context for is already runnable
> > on another pCPU.
>
> Well, if that's the scenario (sorry for not understanding it that
> way earlier on), then the assertion is too strict: While in context
> switch, the vCPU may be runnable, but certainly won't actually run
> anywhere. Therefore I'd be inclined to suggest to relax the
> condition just enough to cover this case, without actually going to
> checking for "running".

What ensures the vCPU won't actually run anywhere if it's in the
runnable state? And how do I relax the condition just enough to cover
just this case?

Tamas