One of the follow-on items I had from the bitops clean-up is this:

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 648d6dd475ba..9c3a017606ed 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3425,7 +3425,7 @@ static int vcpumask_to_pcpumask(
             unsigned int cpu;
 
             vcpu_id = ffsl(vmask) - 1;
-            vmask &= ~(1UL << vcpu_id);
+            vmask &= vmask - 1;
             vcpu_id += vcpu_bias;
             if ( (vcpu_id >= d->max_vcpus) )
                 return 0;

which yields the following improvement:

  add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-34 (-34)
  Function                                     old     new   delta
  vcpumask_to_pcpumask                         519     485     -34


While I (the programmer) can reason that the two expressions are
equivalent (ffsl() returns the lowest set bit, so clearing bit
ffsl(vmask) - 1 is exactly clearing the lowest set bit, which is what
x & (x - 1) does), the compiler can't, so we really are saving a SHL to
re-calculate (1 << vcpu_id) and swapping it for a LEA -1(vmask), which
happens to be hoisted above the ffsl() call.

However, the majority of the savings came from the fact that the old
code used to hold 1 in %r15 (across the entire function!) so it could
calculate (1 << vcpu_id) on each loop iteration.  With %r15 now free for
other use, we have one fewer thing spilt to the stack.

Anyway - while this goes to show that while/ffs loops should be extra
careful to use x &= x - 1 for their loop condition logic, that's not all.
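
For reference, the pattern in isolation (a standalone sketch, using
__builtin_ffsl() in place of Xen's ffsl(), with printf() standing in
for the real per-bit work):

  #include <stdio.h>

  /* Visit each set bit of mask, lowest first. */
  static void visit_set_bits(unsigned long mask)
  {
      while ( mask )
      {
          unsigned int bit = __builtin_ffsl(mask) - 1;

          mask &= mask - 1; /* Clears exactly the bit ffsl just found. */
          printf("bit %u\n", bit);
      }
  }

  int main(void)
  {
      visit_set_bits(0xa5UL); /* bits 0, 2, 5, 7 */
      return 0;
  }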

The rest of this function is crazy.  We're reading a guest word at a
time in order to make a d->max_vcpus sized bitmap (with a reasonable
amount of open-coded logic to reassemble the fragments back into a vcpu
number).

PVH dom0 can reach here, and because it's not pv32, it will be deemed to
have a guest word size of 64.  Also, word-at-a-time for any HVM guest is
an insane overhead in terms of the pagewalks behind the copy_from_hvm()
infrastructure.

Instead, we should calculate the size of the bitmap as
DIV_ROUND_UP(d->max_vcpus, 8), copy the bitmap in one whole go, and then
just have a single for_each_set_bit() loop over it.  Importantly, this
avoids needing to know or care about the guest word size.
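
To make the shape concrete, here's a standalone sketch (not the Xen
code itself: copy_bitmap() stands in for one bulk copy_from_guest(), the
set-bit walk is open-coded rather than using the real
for_each_set_bit(), and a little-endian bitmap layout is assumed, as on
x86):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
  #define BITS_PER_LONG      (8 * sizeof(unsigned long))
  #define BITS_TO_LONGS(n)   DIV_ROUND_UP(n, BITS_PER_LONG)
  #define MAX_VCPUS          128 /* Illustrative bound. */

  /* Stand-in for one bulk copy_from_guest() of the whole bitmap. */
  static int copy_bitmap(void *dst, const uint8_t *guest_bmap,
                         unsigned int bytes)
  {
      memcpy(dst, guest_bmap, bytes);
      return 0;
  }

  static void vcpumask_demo(const uint8_t *guest_bmap,
                            unsigned int max_vcpus)
  {
      unsigned long vmask[BITS_TO_LONGS(MAX_VCPUS)] = { 0 };
      unsigned int bytes = DIV_ROUND_UP(max_vcpus, 8), i;

      /* One copy, independent of the guest word size. */
      if ( copy_bitmap(vmask, guest_bmap, bytes) )
          return;

      for ( i = 0; i < BITS_TO_LONGS(max_vcpus); i++ )
      {
          unsigned long word = vmask[i];

          while ( word )
          {
              unsigned int vcpu_id =
                  i * BITS_PER_LONG + __builtin_ffsl(word) - 1;

              word &= word - 1;
              printf("vcpu %u\n", vcpu_id);
          }
      }
  }

  int main(void)
  {
      uint8_t bmap[MAX_VCPUS / 8] = { 0x05, 0x00, 0x80 }; /* vcpus 0, 2, 23 */

      vcpumask_demo(bmap, 32);
      return 0;
  }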

This removes 4 of the 6 lfences hiding in here: the pair for calculating
is_native to start with, and then one of the two pairs behind the use of
is_native to select the type of the access.

The only complication is this:  Right now, PV max vCPUs is 8k, so we
could even get away with this being an on-stack bitmap.  However, the
vast majority of PV guests are small, and a 64-bit bitmap would do fine.
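
In Xen terms, that could be as simple as (sizes illustrative, error
handling and the xfree() on the exit path elided):

  unsigned long small_vmask[2] = {}; /* 128 bits covers the common case. */
  unsigned long *vmask = small_vmask;

  if ( d->max_vcpus > ARRAY_SIZE(small_vmask) * BITS_PER_LONG )
      vmask = xzalloc_array(unsigned long, BITS_TO_LONGS(d->max_vcpus));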

But, given this is just one example, wouldn't it be better if we just
unconditionally had a marshalling buffer for hypercalls?  There's the
XLAT area, but it doesn't exist for PV64, and we've got various other
pieces of scratch per-cpu data.
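
e.g. something as simple as (names hypothetical):

  typedef uint8_t hypercall_buf_t[PAGE_SIZE];
  DEFINE_PER_CPU(hypercall_buf_t, hypercall_scratch);

  /* ... with users doing: */
  uint8_t *buf = this_cpu(hypercall_scratch);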

If not, having a 128/256-bit bitmap on the stack will be good enough in
practice, and still ok at amortising the PVH dom0 costs.

Thoughts?

~Andrew

