domain: limit window where curr_vcpu != current on context switch

Jan Beulich Thu, 09 Jan 2025 01:00:17 -0800

On 08.01.2025 15:26, Roger Pau Monne wrote:
> On x86 Xen will perform lazy context switches to the idle vCPU, where the
> previously running vCPU context is not overwritten, and only current is 
> updated
> to point to the idle vCPU.  The state is then disjunct between current and
> curr_vcpu: current points to the idle vCPU, while curr_vcpu points to the vCPU
> whose context is loaded on the pCPU.
> 
> While on that lazy context switched state, certain calls (like
> map_domain_page()) will trigger a full synchronization of the pCPU state by
> forcing a context switch.  Note however how calling any of such functions
> inside the context switch code itself is very likely to trigger an infinite
> recursion loop.
> 
> Attempt to limit the window where curr_vcpu != current in the context switch
> code, as to prevent and infinite recursion loop around sync_local_execstate().
> 
> This is required for using map_domain_page() in the vCPU context switch code,
> otherwise using map_domain_page() in that context ends up in a recursive
> sync_local_execstate() loop:


Question is whether it's a good idea in the first place to start using
map_domain_page() from the context switch path. Surely there are possible
alternatives.

> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1982,16 +1982,16 @@ static void load_default_gdt(unsigned int cpu)
>      per_cpu(full_gdt_loaded, cpu) = false;
>  }
>  
> -static void __context_switch(void)
> +static void __context_switch(struct vcpu *n)
>  {
>      struct cpu_user_regs *stack_regs = guest_cpu_user_regs();
>      unsigned int          cpu = smp_processor_id();
>      struct vcpu          *p = per_cpu(curr_vcpu, cpu);
> -    struct vcpu          *n = current;
>      struct domain        *pd = p->domain, *nd = n->domain;
>  
>      ASSERT(p != n);
>      ASSERT(!vcpu_cpu_dirty(n));
> +    ASSERT(p == current);
>  
>      if ( !is_idle_domain(pd) )
>      {
> @@ -2036,6 +2036,18 @@ static void __context_switch(void)
>  
>      write_ptbase(n);
>  
> +    /*
> +     * It's relevant to set both current and curr_vcpu back-to-back, to 
> avoid a
> +     * window where calls to mapcache_current_vcpu() during the context 
> switch
> +     * could trigger a recursive loop.
> +     *
> +     * Do the current switch immediately after switching to the new guest
> +     * page-tables, so that current is (almost) always in sync with the
> +     * currently loaded page-tables.
> +     */
> +    set_current(n);
> +    per_cpu(curr_vcpu, cpu) = n;

The latter paragraph of the comment states something that so far wasn't 
intended,
and imo also shouldn't be going forward. It's curr_vcpu which wants to be in 
sync
with the loaded page tables. (Whether pulling ahead its updating is okay is a
separate question. All of these actions used to be be very carefully placed they
way they are. Which isn't to say that I can exclude things having gone stale 
...)
And yes, that has always meant that mapcache_current_vcpu()'s condition for
calling sync_local_execstate() was building upon the fact that it won't be 
called
from context switching contexts.

Did you consider updating that condition (evaluating curr_cpu) instead?

> @@ -2048,8 +2060,6 @@ static void __context_switch(void)
>      if ( pd != nd )
>          cpumask_clear_cpu(cpu, pd->dirty_cpumask);
>      write_atomic(&p->dirty_cpu, VCPU_CPU_CLEAN);
> -
> -    per_cpu(curr_vcpu, cpu) = n;
>  }
>  
>  void context_switch(struct vcpu *prev, struct vcpu *next)
> @@ -2081,16 +2091,36 @@ void context_switch(struct vcpu *prev, struct vcpu 
> *next)
>  
>      local_irq_disable();
>  
> -    set_current(next);
> -
>      if ( (per_cpu(curr_vcpu, cpu) == next) ||
>           (is_idle_domain(nextd) && cpu_online(cpu)) )
>      {
> +        /*
> +         * Lazy context switch to the idle vCPU, set current == idle.  Full
> +         * context switch happens if/when sync_local_execstate() is called.
> +         */
> +        set_current(next);
>          local_irq_enable();

The comment is misleading as far as the first half of the if() condition goes:
No further switching is going to happen in that case, aiui.

>      }
>      else
>      {
> -        __context_switch();
> +        /*
> +         * curr_vcpu will always point to the currently loaded vCPU context, 
> as
> +         * it's not updated when doing a lazy switch to the idle vCPU.
> +         */
> +        struct vcpu *prev_ctx = per_cpu(curr_vcpu, cpu);
> +
> +        if ( prev_ctx != current )
> +        {
> +            /*
> +             * Doing a full context switch to a non-idle vCPU from a lazy
> +             * context switched state.  Adjust current to point to the
> +             * currently loaded vCPU context.
> +             */
> +            ASSERT(current == idle_vcpu[cpu]);
> +            ASSERT(!is_idle_vcpu(next));
> +            set_current(prev_ctx);

This feels wrong, as in "current" then not representing what it should 
represent,
for a certain time window. I may be dense, but neither comment not description
clarify to me why this might be needed. I can see that it's needed to please the
ASSERT() you add to __context_switch(), yet then I might ask why that assertion
is put there.

> +        }
> +        __context_switch(next);
>  
>          /* Re-enable interrupts before restoring state which may fault. */
>          local_irq_enable();
> @@ -2156,15 +2186,23 @@ int __sync_local_execstate(void)
>  {
>      unsigned long flags;
>      int switch_required;
> +    unsigned int cpu = smp_processor_id();
> +    struct vcpu *p;
>  
>      local_irq_save(flags);
>  
> -    switch_required = (this_cpu(curr_vcpu) != current);
> +    p = per_cpu(curr_vcpu, cpu);
> +    switch_required = (p != current);
>  
>      if ( switch_required )
>      {
> -        ASSERT(current == idle_vcpu[smp_processor_id()]);
> -        __context_switch();
> +        ASSERT(current == idle_vcpu[cpu]);
> +        /*
> +         * Restore current to the previously running vCPU, __context_switch()
> +         * will update current together with curr_vcpu.
> +         */
> +        set_current(p);

Similarly here.

> +        __context_switch(idle_vcpu[cpu]);
>      }
>  
>      local_irq_restore(flags);
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -2232,8 +2232,6 @@ void __init trap_init(void)
>  
>  void activate_debugregs(const struct vcpu *curr)
>  {
> -    ASSERT(curr == current);
> -
>      write_debugreg(0, curr->arch.dr[0]);
>      write_debugreg(1, curr->arch.dr[1]);
>      write_debugreg(2, curr->arch.dr[2]);

Why would this assertion go away? If it suddenly triggers, the parameter name
would now end up being wrong.

Jan

Re: [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch

Reply via email to