> 
> Hi,
> yes, that seems a bit weird. It is a bit better as it does not modify common
> code, but still... Maybe going back to your original idea of replacing memcpy,
> try replacing it with readq? It should generate a single-instruction read
> (although only on x86_64; for a 32-bit kernel we would still need to do
> something else).
> 
> Thanks,
> Amadeusz

Hi,

I've compared the assembly to see if there is a clue. Both kernels use a 64-bit
mov to read the register; the only difference is whether the function is
optimized or not. Both implementations look fine to me. Currently I don't have
an answer for why the slower kernel hits the problem while the optimized one
does not.
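
For reference, the two styles of read under discussion can be sketched in plain userspace C as below. This is only an illustration: read64_memcpy and read64_single are hypothetical names, the backing store is ordinary memory rather than a device register, and a real driver would use readq() on an __iomem pointer instead of the volatile cast.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the memcpy-based read. The compiler is free
 * to inline this into one load or to emit a real call to memcpy.
 */
static uint64_t read64_memcpy(const void *base, uint32_t offset)
{
	uint64_t val;

	memcpy(&val, (const unsigned char *)base + offset, sizeof(val));
	return val;
}

/*
 * Hypothetical stand-in for a readq-style access: one volatile 64-bit
 * load on a 64-bit target. Assumes the address is naturally aligned.
 */
static uint64_t read64_single(const void *base, uint32_t offset)
{
	const unsigned char *p = (const unsigned char *)base + offset;

	assert(((uintptr_t)p & 7) == 0);
	return *(const volatile uint64_t *)p;
}
```

On plain memory both helpers return the same value; the difference is only in how many accesses the generated code may issue.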

1. Old kernel: the code is optimized, and I am not able to reproduce the issue
on this kernel.

(gdb) disas sst_shim32_read64
Dump of assembler code for function sst_shim32_read64:
   0x000000000000096c <+0>:     call   0x971 <sst_shim32_read64+5>
=> call __fentry__
   0x0000000000000971 <+5>:     push   rbp
   0x0000000000000972 <+6>:     mov    rbp,rsp
   0x0000000000000975 <+9>:     mov    eax,esi
   0x0000000000000977 <+11>:    mov    rax,QWORD PTR [rdi+rax*1]
=> perform 64-bit mov
   0x000000000000097b <+15>:    pop    rbp
   0x000000000000097c <+16>:    ret
End of assembler dump.

2. New kernel: optimization is apparently disabled, and it calls memcpy to do
the read operation.

(gdb) disas sst_shim32_read64
Dump of assembler code for function sst_shim32_read64:
   0x00000000000009a8 <+0>:     call   0x9ad <sst_shim32_read64+5>
=> call __fentry__
   0x00000000000009ad <+5>:     push   rbp
   0x00000000000009ae <+6>:     mov    rbp,rsp
   0x00000000000009b1 <+9>:     push   rbx
   0x00000000000009b2 <+10>:    sub    rsp,0x10
   0x00000000000009b6 <+14>:    mov    rax,QWORD PTR gs:0x28
   0x00000000000009bf <+23>:    mov    QWORD PTR [rbp-0x10],rax
   0x00000000000009c3 <+27>:    movabs rax,0xaaaaaaaaaaaaaaaa
   0x00000000000009cd <+37>:    lea    rbx,[rbp-0x18]
   0x00000000000009d1 <+41>:    mov    QWORD PTR [rbx],rax
   0x00000000000009d4 <+44>:    mov    esi,esi
   0x00000000000009d6 <+46>:    add    rsi,rdi
   0x00000000000009d9 <+49>:    mov    edx,0x8
   0x00000000000009de <+54>:    mov    rdi,rbx
   0x00000000000009e1 <+57>:    call   0x9e6 <sst_shim32_read64+62>
=> call memcpy

The memcpy is implemented in arch/x86/lib/memcpy_64.S

(gdb) disas memcpy
Dump of assembler code for function memcpy:
   0xffffffff813519c0 <+0>:     jmp    0xffffffff813519f0 <memcpy_orig>
=> jump to memcpy_orig function

X86_FEATURE_ERMS is disabled so it jumps to memcpy_orig

(gdb) disas memcpy_orig
Dump of assembler code for function memcpy_orig:
   0xffffffff813519f0 <+0>:     mov    rax,rdi
   0xffffffff813519f3 <+3>:     cmp    rdx,0x20
   0xffffffff813519f7 <+7>:     jb     0xffffffff81351a77 <memcpy_orig+135>
=> jump because our read size is 8
...
   0xffffffff81351a77 <+135>:   cmp    edx,0x10
   0xffffffff81351a7a <+138>:   jb     0xffffffff81351aa0 <memcpy_orig+176>
=> jump because our read size is 8
...
   0xffffffff81351aa0 <+176>:   cmp    edx,0x8
   0xffffffff81351aa3 <+179>:   jb     0xffffffff81351ac0 <memcpy_orig+208>
   0xffffffff81351aa5 <+181>:   mov    r8,QWORD PTR [rsi]
   0xffffffff81351aa8 <+184>:   mov    r9,QWORD PTR [rsi+rdx*1-0x8]
   0xffffffff81351aad <+189>:   mov    QWORD PTR [rdi],r8
   0xffffffff81351ab0 <+192>:   mov    QWORD PTR [rdi+rdx*1-0x8],r9
=> with rdx=0x8, rsi+rdx*1-0x8 == rsi, so the same 8 bytes are loaded twice and stored twice
   0xffffffff81351ab5 <+197>:   ret
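
To make the last branch easier to follow, here is a rough userspace model of the 8..15 byte path from the listing above. It is a sketch, not the kernel source: copy_8_to_15, load64, store64 and load_count are hypothetical names, and the counter exists only to show how many 64-bit loads the path issues from the source. Whether the second load is related to the issue is not established here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Counts 64-bit loads from the source, for illustration only. */
static unsigned int load_count;

/* Stands in for: mov reg, QWORD PTR [p] */
static uint64_t load64(const unsigned char *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	load_count++;
	return v;
}

/* Stands in for: mov QWORD PTR [p], reg */
static void store64(unsigned char *p, uint64_t v)
{
	memcpy(p, &v, sizeof(v));
}

/* Model of the branch at <+181>: copy the first and the last 8 bytes. */
static void copy_8_to_15(unsigned char *dst, const unsigned char *src, size_t n)
{
	uint64_t head, tail;

	assert(n >= 8 && n < 16);
	head = load64(src);		/* mov r8,[rsi] */
	tail = load64(src + n - 8);	/* mov r9,[rsi+rdx*1-0x8] */
	store64(dst, head);		/* mov [rdi],r8 */
	store64(dst + n - 8, tail);	/* mov [rdi+rdx*1-0x8],r9 */
}
```

For n between 9 and 15 the two loads overlap only partially; for n == 8 they hit exactly the same address, which matches the two movs from [rsi] in the dump.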

Regards,
Brent
