On 2/15/24 09:28, Max Chou wrote:
Hi all,
When glibc is built with RVV support [1], the memcpy benchmark runs 2x to 60x
slower on QEMU than the scalar equivalent, which hurts developer
productivity.
From the performance analysis, we can observe that the glibc memcpy
spends most of its time in the vector unit-stride load/store helper
functions.
Samples: 465K of event 'cycles:u', Event count (approx.): 1707645730664
  Children    Self  Command       Shared Object  Symbol
+   28.46%  27.85%  qemu-riscv64  qemu-riscv64   [.] vext_ldst_us
+   26.92%   0.00%  qemu-riscv64  [unknown]      [.] 0x00000000000000ff
+   14.41%  14.41%  qemu-riscv64  qemu-riscv64   [.] qemu_plugin_vcpu_mem_cb
+   13.85%  13.85%  qemu-riscv64  qemu-riscv64   [.] lde_b
+   13.64%  13.64%  qemu-riscv64  qemu-riscv64   [.] cpu_stb_mmu
+    9.25%   9.19%  qemu-riscv64  qemu-riscv64   [.] cpu_ldb_mmu
+    7.81%   7.81%  qemu-riscv64  qemu-riscv64   [.] cpu_mmu_lookup
+    7.70%   7.70%  qemu-riscv64  qemu-riscv64   [.] ste_b
+    5.53%   0.00%  qemu-riscv64  qemu-riscv64   [.] adjust_addr (inlined)
So this patchset tries to improve the performance of the RVV version of
glibc memcpy on QEMU by improving the quality of the corresponding helper
functions.
The overall performance improvement reaches the following numbers
(depending on the size).
Average: 2.86X / Smallest: 1.15X / Largest: 4.49X
PS: This RFC patchset only focuses on the vle8.v & vse8.v instructions;
the next version or next series will cover the remaining vector ld/st
instructions.
You are still not tackling the root problem, which is over-use of the full out-of-line
load/store routines. The reason that cpu_mmu_lookup is in that list is that you are
performing the full virtual address resolution for each and every byte.
The only way to make a real improvement is to perform virtual address resolution *once*
for the entire vector. I refer to my previous advice:
https://gitlab.com/qemu-project/qemu/-/issues/2137#note_1757501369
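Roughly, the shape of the fix is something like the following. This is only an
untested sketch, not a patch: the helper name is made up, the exact signatures
of probe_access/cpu_mmu_index may differ between QEMU versions, and masking,
vstart, watchpoints etc. are all omitted. The point is to translate each guest
page once and then access host RAM directly for the whole chunk, instead of
doing a full MMU lookup per byte:

/* Untested sketch: per-page translation for a unit-stride vle8.v/vse8.v.
 * vext_ldst_us_page is a hypothetical name. */
static void vext_ldst_us_page(CPURISCVState *env, void *vd, target_ulong addr,
                              uint32_t evl, bool is_load, uintptr_t ra)
{
    uint32_t done = 0;

    while (done < evl) {
        /* Bytes left on the current guest page. */
        target_ulong left_on_page = -(addr | TARGET_PAGE_MASK);
        uint32_t chunk = MIN(evl - done, left_on_page);
        void *host = probe_access(env, addr, chunk,
                                  is_load ? MMU_DATA_LOAD : MMU_DATA_STORE,
                                  cpu_mmu_index(env, false), ra);

        if (host) {
            /* RAM: one translation covers the whole chunk. */
            if (is_load) {
                memcpy((uint8_t *)vd + done, host, chunk);
            } else {
                memcpy(host, (uint8_t *)vd + done, chunk);
            }
        } else {
            /* MMIO or otherwise not directly accessible: fall back to
             * the slow per-byte path for this chunk only. */
            for (uint32_t i = 0; i < chunk; i++) {
                if (is_load) {
                    ((uint8_t *)vd)[done + i] =
                        cpu_ldub_data_ra(env, addr + i, ra);
                } else {
                    cpu_stb_data_ra(env, addr + i,
                                    ((uint8_t *)vd)[done + i], ra);
                }
            }
        }

        done += chunk;
        addr += chunk;
    }
}

With something along those lines, the per-byte cpu_mmu_lookup / cpu_ldb_mmu /
cpu_stb_mmu calls should drop out of the profile for the common RAM case.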
r~