BALATON Zoltan <bala...@eik.bme.hu> writes:

> On Mon, 28 Apr 2025, BALATON Zoltan wrote:
>> On Mon, 28 Apr 2025, BALATON Zoltan wrote:
>>> On Thu, 24 Apr 2025, BALATON Zoltan wrote:
>>>>> The test case I've used came out of a discussion about very slow
>>>>> access to VRAM of a graphics card passed through with vfio the reason
>>>>> for which is still not clear but it was already known that dcbz is
>>>>> often used by MacOS and AmigaOS for clearing memory and to avoid
>>>>> reading values about to be overwritten which is faster on real CPU but
>>>>> was found to be slower on QEMU. The optimised copy routines were
>>>>> posted here:
<snip>
>
> I have tried profiling the dst in real card vfio vram with dcbz case
> (with 100 iterations instead of 10000 in above tests) but I'm not sure
> I understand the results. vperm and dcbz show up but not too high. Can
> somebody explain what is happening here and where the overhead likely
> comes from? Here is the profile result I got:
>
> Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
>   Children      Self  Command          Shared Object    Symbol
> -   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc  [.] cpu_exec_loop
>    - 98.49% cpu_exec_loop
>       - 98.48% cpu_tb_exec
>          - 90.95% 0x7f4e705d8f15
>               helper_ldub_mmu
>               do_ld_mmio_beN
>             - cpu_io_recompile

This looks like the dcbz instructions are being used to clear device
memory and tripping over the can_do_io check (normally the translator
tries to ensure all device access is at the end of a block). You could
try ending the block on dcbz instructions and seeing if that helps.
Normally I would expect the helper to be more efficient, as it can probe
the whole address range once and then use host insns to blat the memory.

>                - 45.79% cpu_loop_exit_noexc
>                   - cpu_loop_exit
>                        __longjmp_chk
>                        cpu_exec_setjmp
>                      - cpu_exec_loop
>                         - 45.78% cpu_tb_exec
>                              42.35% 0x7f4e6f3f0000
>                            - 0.72% 0x7f4e99f37037
>                                 helper_VPERM
>                            - 0.68% 0x7f4e99f3716d
>                                 helper_VPERM
>                - 45.16% rr_cpu_thread_fn

Hmm, you seem to be running in icount mode here for some reason.

>                   - 45.16% tcg_cpu_exec
>                      - 45.15% cpu_exec
>                         - 45.15% cpu_exec_setjmp
>                            - cpu_exec_loop
>                               - 45.14% cpu_tb_exec
>                                    42.08% 0x7f4e6f3f0000
>                                  - 0.72% 0x7f4e99f37037
>                                       helper_VPERM
>                                  - 0.67% 0x7f4e99f3716d
>                                       helper_VPERM
<snip>

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro