Hi all,

When glibc is built with RVV support [1], the memcpy benchmark runs 2x to 60x
slower than the scalar equivalent on QEMU, which hurts developer productivity.
From the performance analysis results, we can observe that the glibc memcpy
spends most of its time in the vector unit-stride load/store helper functions.

Samples: 465K of event 'cycles:u', Event count (approx.): 1707645730664

  Children      Self  Command       Shared Object  Symbol
+   28.46%    27.85%  qemu-riscv64  qemu-riscv64   [.] vext_ldst_us
+   26.92%     0.00%  qemu-riscv64  [unknown]      [.] 0x00000000000000ff
+   14.41%    14.41%  qemu-riscv64  qemu-riscv64   [.] qemu_plugin_vcpu_mem_cb
+   13.85%    13.85%  qemu-riscv64  qemu-riscv64   [.] lde_b
+   13.64%    13.64%  qemu-riscv64  qemu-riscv64   [.] cpu_stb_mmu
+    9.25%     9.19%  qemu-riscv64  qemu-riscv64   [.] cpu_ldb_mmu
+    7.81%     7.81%  qemu-riscv64  qemu-riscv64   [.] cpu_mmu_lookup
+    7.70%     7.70%  qemu-riscv64  qemu-riscv64   [.] ste_b
+    5.53%     0.00%  qemu-riscv64  qemu-riscv64   [.] adjust_addr (inlined)

So this patchset tries to improve the performance of the RVV version of the
glibc memcpy on QEMU by optimizing the corresponding helper functions.

The overall speedup, depending on the copy size, is:

  Average:  2.86X
  Smallest: 1.15X
  Largest:  4.49X

PS: This RFC patchset only focuses on the vle8.v & vse8.v instructions. The
next version or the next series will cover the remaining vector ld/st
instructions.

Regards,
Max.

[1] https://inbox.sourceware.org/libc-alpha/20230504074851.38763-1-hau....@sifive.com

Max Chou (6):
  target/riscv: Seperate vector segment ld/st instructions
  accel/tcg: Avoid uncessary call overhead from qemu_plugin_vcpu_mem_cb
  target/riscv: Inline vext_ldst_us and coressponding function for
    performance
  accel/tcg: Inline cpu_mmu_lookup function
  accel/tcg: Inline do_ld1_mmu function
  accel/tcg: Inline do_st1_mmu function

 accel/tcg/ldst_common.c.inc             |  40 ++++++--
 accel/tcg/user-exec.c                   |  17 ++--
 target/riscv/helper.h                   |   4 +
 target/riscv/insn32.decode              |  11 +-
 target/riscv/insn_trans/trans_rvv.c.inc |  61 +++++++++++
 target/riscv/vector_helper.c            | 130 +++++++++++++++++++-----
 6 files changed, 221 insertions(+), 42 deletions(-)

-- 
2.34.1
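
For readers less familiar with why per-element helpers show up so heavily in
the profile above, here is a minimal standalone sketch of the general idea.
It is not QEMU code: elem_fn, load_byte, ldst_us_per_elem and ldst_us_inlined
are hypothetical names invented for illustration, and the real helpers also
deal with masking, address translation and plugin callbacks, which are omitted
here. The sketch only contrasts one indirect call per byte with a loop whose
body the compiler can see and inline.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical per-element callback type. This is an assumption made for
 * the sketch, not the signature used by the real helpers.
 */
typedef void elem_fn(uint8_t *vreg, size_t idx, const uint8_t *host);

/* Copies a single byte from host memory into the vector register image. */
static void load_byte(uint8_t *vreg, size_t idx, const uint8_t *host)
{
    vreg[idx] = host[idx];
}

/*
 * Slow shape: one indirect call per element. The compiler cannot inline
 * through the function pointer, so every byte pays call overhead.
 */
static void ldst_us_per_elem(uint8_t *vreg, const uint8_t *host,
                             size_t evl, elem_fn *fn)
{
    for (size_t i = 0; i < evl; i++) {
        fn(vreg, i, host);
    }
}

/*
 * Faster shape: the per-byte operation is visible in the loop body, so the
 * compiler is free to inline it and possibly vectorize the whole loop.
 */
static inline void ldst_us_inlined(uint8_t *vreg, const uint8_t *host,
                                   size_t evl)
{
    for (size_t i = 0; i < evl; i++) {
        vreg[i] = host[i];
    }
}
```

Both functions produce the same result for a unit-stride byte load; only the
amount of call overhead per element differs. How much of the measured speedup
comes from this effect in the actual helpers is not something this sketch can
show.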