On 06.01.2023 03:03, Andrew Cooper wrote:
> On 05/01/2023 4:05 pm, Jan Beulich wrote:
>> The "n" input is a GFN value and hence bounded by the physical address
>> bits in use on a system.
> 
> The one case where this isn't obviously true is in sh_audit().  It comes
> from a real MFN in the system, not a GFN, which will have the same
> property WRT PADDR_BITS.

I'm afraid I was more wrong with that than just for the audit case. Only
FL1 shadows use GFNs. All other shadows us MFNs. I'll update the sentence.

>>  The hash quality won't improve by also
>> including the upper always-zero bits in the calculation. To keep things
>> as compile-time-constant as they were before, use PADDR_BITS (not
>> paddr_bits) for loop bounding. This reduces loop iterations from 8 to 5.
> 
> While this is all true, you'll get a much better improvement by not
> forcing 'n' onto the stack just to access it bytewise.  Right now, the
> loop looks like:
> 
> <shadow_hash_insert>:
>     48 83 ec 10                 sub    $0x10,%rsp
>     49 89 c9                    mov    %rcx,%r9
>     41 89 d0                    mov    %edx,%r8d
>     48 8d 44 24 08              lea    0x8(%rsp),%rax
>     48 8d 4c 24 10              lea    0x10(%rsp),%rcx
>     48 89 74 24 08              mov    %rsi,0x8(%rsp)
>     0f 1f 80 00 00 00 00        nopl   0x0(%rax)
> /-> 0f b6 10                    movzbl (%rax),%edx
> |   48 83 c0 01                 add    $0x1,%rax
> |   45 69 c0 3f 00 01 00        imul   $0x1003f,%r8d,%r8d
> |   41 01 d0                    add    %edx,%r8d
> |   48 39 c1                    cmp    %rax,%rcx
> \-- 75 ea                       jne    ffff82d0402efda0
> <shadow_hash_insert+0x20>
> 
> 
> which doesn't even have a compile-time constant loop bound.  It's
> runtime calculated by the second lea constructing the upper pointer bound.
> 
> Given this further delta:
> 
> diff --git a/xen/arch/x86/mm/shadow/common.c
> b/xen/arch/x86/mm/shadow/common.c
> index 4a8bcec10fe8..902c749f2724 100644
> --- a/xen/arch/x86/mm/shadow/common.c
> +++ b/xen/arch/x86/mm/shadow/common.c
> @@ -1397,13 +1397,12 @@ static unsigned int shadow_get_allocation(struct
> domain *d)
>  typedef u32 key_t;
>  static inline key_t sh_hash(unsigned long n, unsigned int t)
>  {
> -    unsigned char *p = (unsigned char *)&n;
>      key_t k = t;
>      int i;
>  
>      BUILD_BUG_ON(PADDR_BITS > BITS_PER_LONG + PAGE_SHIFT);
> -    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++ )
> -        k = p[i] + (k << 6) + (k << 16) - k;
> +    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++, n >>= 8 )
> +        k = (uint8_t)n + (k << 6) + (k << 16) - k;
>  
>      return k % SHADOW_HASH_BUCKETS;
>  }
> 
> the code gen becomes:
> 
> <shadow_hash_insert>:
>     41 89 d0                    mov    %edx,%r8d
>     49 89 c9                    mov    %rcx,%r9
>     b8 05 00 00 00              mov    $0x5,%eax
> /-> 45 69 c0 3f 00 01 00        imul   $0x1003f,%r8d,%r8d
> |   40 0f b6 d6                 movzbl %sil,%edx
> |   48 c1 ee 08                 shr    $0x8,%rsi
> |   41 01 d0                    add    %edx,%r8d
> |   83 e8 01                    sub    $0x1,%eax
> \-- 75 e9                       jne    ffff82d0402efd8b
> <shadow_hash_insert+0xb>
> 
> with an actual constant loop bound, and not a memory operand in sight. 
> This form (even at 8 iterations) will easily execute faster than the
> stack-spilled form.

Oh, yes, good idea. Will adjust.

Jan

Reply via email to