On 05/01/2023 4:05 pm, Jan Beulich wrote:
> The "n" input is a GFN value and hence bounded by the physical address
> bits in use on a system.

The one case where this isn't obviously true is in sh_audit().  It comes
from a real MFN in the system, not a GFN, which will have the same
property WRT PADDR_BITS.

>  The hash quality won't improve by also
> including the upper always-zero bits in the calculation. To keep things
> as compile-time-constant as they were before, use PADDR_BITS (not
> paddr_bits) for loop bounding. This reduces loop iterations from 8 to 5.

While this is all true, you'll get a much better improvement by not
forcing 'n' onto the stack just to access it bytewise.  Right now, the
loop looks like:

<shadow_hash_insert>:
    48 83 ec 10                 sub    $0x10,%rsp
    49 89 c9                    mov    %rcx,%r9
    41 89 d0                    mov    %edx,%r8d
    48 8d 44 24 08              lea    0x8(%rsp),%rax
    48 8d 4c 24 10              lea    0x10(%rsp),%rcx
    48 89 74 24 08              mov    %rsi,0x8(%rsp)
    0f 1f 80 00 00 00 00        nopl   0x0(%rax)
/-> 0f b6 10                    movzbl (%rax),%edx
|   48 83 c0 01                 add    $0x1,%rax
|   45 69 c0 3f 00 01 00        imul   $0x1003f,%r8d,%r8d
|   41 01 d0                    add    %edx,%r8d
|   48 39 c1                    cmp    %rax,%rcx
\-- 75 ea                       jne    ffff82d0402efda0
<shadow_hash_insert+0x20>


which doesn't even have a compile-time constant loop bound.  It's
runtime calculated by the second lea constructing the upper pointer bound.

Given this further delta:

diff --git a/xen/arch/x86/mm/shadow/common.c
b/xen/arch/x86/mm/shadow/common.c
index 4a8bcec10fe8..902c749f2724 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1397,13 +1397,12 @@ static unsigned int shadow_get_allocation(struct
domain *d)
 typedef u32 key_t;
 static inline key_t sh_hash(unsigned long n, unsigned int t)
 {
-    unsigned char *p = (unsigned char *)&n;
     key_t k = t;
     int i;
 
     BUILD_BUG_ON(PADDR_BITS > BITS_PER_LONG + PAGE_SHIFT);
-    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++ )
-        k = p[i] + (k << 6) + (k << 16) - k;
+    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++, n >>= 8 )
+        k = (uint8_t)n + (k << 6) + (k << 16) - k;
 
     return k % SHADOW_HASH_BUCKETS;
 }

the code gen becomes:

<shadow_hash_insert>:
    41 89 d0                    mov    %edx,%r8d
    49 89 c9                    mov    %rcx,%r9
    b8 05 00 00 00              mov    $0x5,%eax
/-> 45 69 c0 3f 00 01 00        imul   $0x1003f,%r8d,%r8d
|   40 0f b6 d6                 movzbl %sil,%edx
|   48 c1 ee 08                 shr    $0x8,%rsi
|   41 01 d0                    add    %edx,%r8d
|   83 e8 01                    sub    $0x1,%eax
\-- 75 e9                       jne    ffff82d0402efd8b
<shadow_hash_insert+0xb>

with an actual constant loop bound, and not a memory operand in sight. 
This form (even at 8 iterations) will easily execute faster than the
stack-spilled form.

~Andrew

Reply via email to