One of the follow-on items I had from the bitops clean-up is this:

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 648d6dd475ba..9c3a017606ed 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3425,7 +3425,7 @@ static int vcpumask_to_pcpumask(
             unsigned int cpu;
 
             vcpu_id = ffsl(vmask) - 1;
-            vmask &= ~(1UL << vcpu_id);
+            vmask &= vmask - 1;
             vcpu_id += vcpu_bias;
             if ( (vcpu_id >= d->max_vcpus) )
                 return 0;
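(As a quick standalone sanity check that the two forms clear the same bit - __builtin_ctzl() standing in for ffsl()-1 purely so it compiles outside of Xen; this isn't part of the patch:)

#include <assert.h>
#include <stdio.h>

int main(void)
{
    unsigned long samples[] = { 1, 0x80, 0xdeadbeef, ~0UL };
    unsigned int i;

    for ( i = 0; i < sizeof(samples) / sizeof(samples[0]); i++ )
    {
        unsigned long x = samples[i];
        unsigned int bit = __builtin_ctzl(x);     /* == ffsl(x) - 1 */

        /* Old form masks out bit 'bit'; new form clears the lowest set bit. */
        assert((x & ~(1UL << bit)) == (x & (x - 1)));
        printf("%#lx ok\n", x);
    }

    return 0;
}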
That change yields the following improvement:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-34 (-34)
Function                                     old     new   delta
vcpumask_to_pcpumask                         519     485     -34

While I (the programmer) can reason that the two expressions are equivalent, the compiler can't, so we really are saving a SHL to re-calculate (1 << vcpu_id) and swapping it for a LEA -1(vmask) which happens to be hoisted above the ffsl() call.

However, the majority of the savings comes from the fact that the old code used to hold 1 in %r15 (across the entire function!) so it could calculate (1 << vcpu_id) on each loop iteration.  With %r15 now free for other use, we have one fewer thing spilt to the stack.

Anyway - while it goes to show that while/ffs logic should be extra careful to use x &= x - 1 for its loop condition logic, that's not all.  The rest of this function is crazy.

We're reading a guest word at a time in order to make a d->max_vcpus sized bitmap (with a reasonable amount of open-coded logic to reassemble the fragments back into a vcpu number).

PVH dom0 can reach here, and because it's not pv32, will be deemed to have a guest word size of 64.  Also, word-at-a-time for any HVM guest is an insane overhead in terms of the pagewalks behind the copy_from_hvm() infrastructure.

Instead, we should calculate the size of the bitmap as DIV_ROUND_UP(d->max_vcpus, 8), copy the bitmap in one whole go, and then just have a single for_each_set_bit() looping over it (there's a rough standalone sketch of the shape at the end of this mail).  Importantly, this avoids needing to know or care about the guest word size.

This removes 4 of the 6 hiding lfences; the pair for calculating is_native to start with, and then one of the two pairs behind the use of is_native to select the type of the access.

The only complication is this:  Right now, PV max vCPUs is 8k, so we could even get away with this being an on-stack bitmap.  However, the vast majority of PV guests are small, and a 64-bit bitmap would do fine.

But, given this is just one example, wouldn't it be better if we just unconditionally had a marshalling buffer for hypercalls?  There's the XLAT area, but it doesn't exist for PV64, and we've got various other pieces of scratch per-cpu data.

If not, having a 128/256-bit bitmap on the stack will still be good enough in practice, and still ok at amortising the PVH dom0 costs.

Thoughts?

~Andrew
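P.S. The promised sketch.  This is roughly the shape I have in mind, modelled as standalone C rather than as the actual patch: memcpy() stands in for the single copy_from_guest(), printf() stands in for the vcpu->pcpu mapping, the open-coded inner loop is what for_each_set_bit() would be doing for us, and the 8k bound plus little-endian byte layout are illustrative assumptions rather than what the final code would necessarily use.

#include <limits.h>
#include <stdio.h>
#include <string.h>

#define BITS_PER_LONG      (CHAR_BIT * sizeof(unsigned long))
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Size the bitmap from max_vcpus, pull it in with one copy, then walk the
 * set bits a word at a time using the x &= x - 1 idiom.  Nothing here
 * depends on the guest's word size.
 */
static void walk_vcpu_bitmap(const void *guest_bmap, unsigned int max_vcpus)
{
    unsigned long bmap[DIV_ROUND_UP(8192, BITS_PER_LONG)] = { 0 };
    unsigned int i, bytes = DIV_ROUND_UP(max_vcpus, 8);

    memcpy(bmap, guest_bmap, bytes);     /* one copy, not word-at-a-time */

    for ( i = 0; i < DIV_ROUND_UP(max_vcpus, BITS_PER_LONG); i++ )
    {
        unsigned long word = bmap[i];

        while ( word )
        {
            unsigned int vcpu_id = i * BITS_PER_LONG + __builtin_ctzl(word);

            word &= word - 1;            /* clear the lowest set bit */
            if ( vcpu_id >= max_vcpus )
                return;

            printf("vcpu %u is set\n", vcpu_id);
        }
    }
}

int main(void)
{
    unsigned char guest[16] = { 0x81, 0, 0x40 };   /* vcpus 0, 7 and 22 */

    walk_vcpu_bitmap(guest, 24);

    return 0;
}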