Hi,

As some of you are already aware, the current RVV emulation could be faster. We have at least one commit (bc0ec52eb2, "target/riscv/vector_helper.c: skip set tail when vta is zero") that tried to address part of the problem.
Running a simple program like this:

-------
#include <stdlib.h>

#define SZ 10000000

int main()
{
    int *a = malloc(SZ * sizeof(int));
    int *b = malloc(SZ * sizeof(int));
    int *c = malloc(SZ * sizeof(int));

    for (int i = 0; i < SZ; i++)
        c[i] = a[i] + b[i];

    return c[SZ - 1];
}
-------

and then compiling it without RVV support runs in 50 ms or so:

$ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128 ./foo-novect.out

real	0m0.043s
user	0m0.025s
sys	0m0.018s

Building the same program with RVV support slows it down 4-5 times:

$ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=1024 ./foo.out

real	0m0.196s
user	0m0.177s
sys	0m0.018s

Using the lowest 'vlen' value allowed (128) slows things down even further, taking it to ~0.260s.

'perf record' shows the following profile for the aforementioned binary:

  23.27%  qemu-riscv64  qemu-riscv64  [.] do_ld4_mmu
  21.11%  qemu-riscv64  qemu-riscv64  [.] vext_ldst_us
  14.05%  qemu-riscv64  qemu-riscv64  [.] cpu_ldl_le_data_ra
  11.51%  qemu-riscv64  qemu-riscv64  [.] cpu_stl_le_data_ra
   8.18%  qemu-riscv64  qemu-riscv64  [.] cpu_mmu_lookup
   8.04%  qemu-riscv64  qemu-riscv64  [.] do_st4_mmu
   2.04%  qemu-riscv64  qemu-riscv64  [.] ste_w
   1.15%  qemu-riscv64  qemu-riscv64  [.] lde_w
   1.02%  qemu-riscv64  [unknown]     [k] 0xffffffffb3001260
   0.90%  qemu-riscv64  qemu-riscv64  [.] cpu_get_tb_cpu_state
   0.64%  qemu-riscv64  qemu-riscv64  [.] tb_lookup
   0.64%  qemu-riscv64  qemu-riscv64  [.] riscv_cpu_mmu_index
   0.39%  qemu-riscv64  qemu-riscv64  [.]
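For reference, the two binaries above can be produced with a riscv64 cross compiler; the exact compiler name and flags below are an assumption (any RVV-capable GCC or Clang should do, and the autovectorization behavior depends on the toolchain version):

-------
# Hypothetical build commands, assuming a riscv64 Linux cross GCC.
# foo-novect.out: scalar-only baseline ("rv64gc", no "v" extension).
riscv64-linux-gnu-gcc -O3 -march=rv64gc  -static foo.c -o foo-novect.out

# foo.out: same program with RVV enabled ("rv64gcv"), letting the
# compiler autovectorize the addition loop.
riscv64-linux-gnu-gcc -O3 -march=rv64gcv -static foo.c -o foo.out
-------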
object_dynamic_cast_assert

The first thing that caught my attention is vext_ldst_us from target/riscv/vector_helper.c:

-------
/* load bytes from guest memory */
for (i = env->vstart; i < evl; i++, env->vstart++) {
    k = 0;
    while (k < nf) {
        target_ulong addr = base + ((i * nf + k) << log2_esz);
        ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
        k++;
    }
}
env->vstart = 0;
-------

Given that this is a unit-stride load that accesses contiguous elements in memory, it seems that this loop could be optimized/removed, since it is loading/storing one element at a time. I didn't find any TCG op to do that, though. I assume that ARM SVE might have something of the sort. Richard, care to comment?

The current support we have is good enough for booting a kernel and running tests, but things degrade fast if one attempts to run an x264 SPEC workload with it. With a SPEC run we have other insns showing up as hot, but for now it would be good to see if we can optimize these loads and stores.

Any ideas on how to tackle this?

Thanks,

Daniel
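Stripped of the QEMU plumbing, the optimization I have in mind is just the usual per-element-loop vs. bulk-copy trade-off: when nf == 1 and the access is unmasked and contiguous, the whole load could be one block access instead of evl separate ones. A minimal standalone sketch of that idea (function names are mine, not QEMU's, and this ignores page-crossing and endianness concerns that the real helper would still have to handle):

-------
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Per-element emulation of a unit-stride load, mirroring the shape of
 * the vext_ldst_us loop for the nf == 1 case: one access per element. */
static void load_per_element(uint32_t *vd, const uint32_t *mem, int evl)
{
    for (int i = 0; i < evl; i++) {
        vd[i] = mem[i];
    }
}

/* Bulk variant: a contiguous, unmasked unit-stride load reduces to a
 * single memcpy-style block access over the same region. */
static void load_bulk(uint32_t *vd, const uint32_t *mem, int evl)
{
    memcpy(vd, mem, (size_t)evl * sizeof(uint32_t));
}

int main(void)
{
    uint32_t mem[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    uint32_t a[8], b[8];

    load_per_element(a, mem, 8);
    load_bulk(b, mem, 8);

    /* Both produce identical register contents; prints 1. */
    printf("%d\n", memcmp(a, b, sizeof(a)) == 0);
    return 0;
}
-------

The functional result is identical either way; the win in the emulator would come from doing the MMU lookup and bounds handling once per vector instead of once per element.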