On Tue, 29 Apr 2025, Alex Bennée wrote:
BALATON Zoltan <bala...@eik.bme.hu> writes:
On Mon, 28 Apr 2025, BALATON Zoltan wrote:
On Mon, 28 Apr 2025, BALATON Zoltan wrote:
On Thu, 24 Apr 2025, BALATON Zoltan wrote:
The test case I've used came out of a discussion about very slow
access to the VRAM of a graphics card passed through with vfio. The
reason for that is still not clear, but it was already known that dcbz
is often used by MacOS and AmigaOS for clearing memory and for
avoiding reads of values that are about to be overwritten, which is
faster on a real CPU but was found to be slower on QEMU. The optimised
copy routines were posted here:
<snip>
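The general pattern such routines use is roughly the following (a
generic illustration only, not the routines that were posted; 32-byte
cache lines and a cache-line-aligned, cacheable destination are
assumed):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 32  /* assumed: G3/G4-class 32-byte lines */

/* Zero each destination cache line with dcbz before filling it, so the
 * CPU never has to fetch the old contents it is about to overwrite. */
static void copy_with_dcbz(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i = 0;

    /* dst is assumed cache-line aligned and backed by cacheable RAM */
    for (; i + CACHE_LINE <= len; i += CACHE_LINE) {
        __asm__ volatile("dcbz 0,%0" : : "r"(dst + i) : "memory");
        memcpy(dst + i, src + i, CACHE_LINE);  /* real code uses FPU/AltiVec stores */
    }
    memcpy(dst + i, src + i, len - i);  /* tail shorter than one line */
}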
I have tried profiling the case where the destination is the real
card's vfio VRAM and dcbz is used (with 100 iterations instead of the
10000 in the above tests), but I'm not sure I understand the results.
VPERM and dcbz show up, but not too high. Can somebody explain what is
happening here and where the overhead likely comes from? Here is the
profile result I got:
Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
Children Self Command Shared Object Symbol
- 99.44% 0.95% qemu-system-ppc qemu-system-ppc [.]
cpu_exec_loop
- 98.49% cpu_exec_loop
- 98.48% cpu_tb_exec
- 90.95% 0x7f4e705d8f15
helper_ldub_mmu
do_ld_mmio_beN
- cpu_io_recompile
This looks like the dcbz instructions are being used to clear device
memory and tripping over the can_do_io check (normally the translator
tries to ensure all device access is at the end of a block).
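Roughly, and from memory rather than checked against current QEMU, the
check being tripped in the cputlb MMIO slow path is something like:

    /* in the MMIO path (io_prepare and friends): a device access that
     * is not the last instruction of its TB gets thrown away and
     * retranslated as a single-instruction TB */
    if (!cpu->neg.can_do_io) {
        cpu_io_recompile(cpu, retaddr);   /* longjmps back to cpu_exec */
    }

which would explain the cpu_io_recompile and cpu_loop_exit entries in
the profile.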
If you look at the benchmark results I posted earlier in this thread in
https://lists.nongnu.org/archive/html/qemu-ppc/2025-04/msg00326.html
I also tried using dcba instead of dcbz in the CopyFromVRAM* functions,
but that only helped very little, so I'm not sure it's because of dcbz.
Then I thought it might be VPERM, but the NoAltivec variants are also
only a little faster. It could be that using 64-bit accesses instead of
128-bit ones (the NoAltivec functions use FPU regs) makes it slower
while avoiding VPERM makes it faster, and the two cancel each other out,
but the profile also shows VPERM not being high, and somebody else also
tested this with -cpu g3 and only got a 1% faster result, so maybe it's
not primarily because of VPERM either but there's a bigger overhead
before these.
You could try ending the block on dcbz instructions and seeing if that
helps. Normally I would expect the helper to be more efficient, as it
can probe the whole address range once and then use host insns to blat
the memory.
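A minimal way to try the first suggestion might be something like this
in target/ppc/translate.c (a sketch only, assuming the handler is still
called gen_dcbz and that the generic DISAS_TOO_MANY status makes the
tb_stop hook simply close the block after the current instruction):

/*
 * Experiment: force dcbz to be the last instruction of its translation
 * block, so a store that turns out to hit MMIO no longer fails the
 * can_do_io check and bounces through cpu_io_recompile every time.
 */
static void gen_dcbz(DisasContext *ctx)
{
    /* ... existing EA calculation and gen_helper_dcbz() call ... */

    /* stop translating here; the next insn starts a new TB */
    ctx->base.is_jmp = DISAS_TOO_MANY;
}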
Maybe I could try that if I can do it the same way as it is done in
io_prepare.
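For reference, the probe-once-and-blat fast path Alex describes would
look roughly like the sketch below. probe_write() and
cpu_stq_be_mmuidx_ra() are existing cputlb/ldst APIs, but the names
used here for the cache-line size field and the mmu index lookup are
assumptions rather than the actual target/ppc code:

#include "qemu/osdep.h"
#include "cpu.h"
#include "exec/exec-all.h"
#include "exec/cpu_ldst.h"

/* Clear one data cache line: probe the whole line once and, when it is
 * backed by RAM, zero it through the host mapping with a single
 * memset; otherwise (MMIO, watchpoints) fall back to guest stores. */
void helper_dcbz_sketch(CPUPPCState *env, target_ulong addr)
{
    const int line = env->dcache_line_size;        /* assumed field name */
    int mmu_idx = ppc_env_mmu_index(env, false);   /* assumed helper name */
    uintptr_t ra = GETPC();
    void *haddr;

    addr &= ~(target_ulong)(line - 1);             /* align to the line */
    haddr = probe_write(env, addr, line, mmu_idx, ra);
    if (haddr) {
        memset(haddr, 0, line);                    /* RAM: one host memset */
    } else {
        /* MMIO: per-doubleword stores through the normal slow path */
        for (int i = 0; i < line; i += 8) {
            cpu_stq_be_mmuidx_ra(env, addr + i, 0, mmu_idx, ra);
        }
    }
}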
- 45.79% cpu_loop_exit_noexc
- cpu_loop_exit
__longjmp_chk
cpu_exec_setjmp
- cpu_exec_loop
- 45.78% cpu_tb_exec
42.35% 0x7f4e6f3f0000
- 0.72% 0x7f4e99f37037
helper_VPERM
- 0.68% 0x7f4e99f3716d
helper_VPERM
- 45.16% rr_cpu_thread_fn
Hmm you seem to be running in icount mode here for some reason.
No idea why. I had no such options, compiled without --enable-debug,
and had nothing special on the QEMU command line, just default options.
How can I check if icount is enabled? Can profiling with the perf tool
interfere? I thought it only reads CPU performance counters and does
not otherwise attach to the process.
Regards,
BALATON Zoltan
- 45.16% tcg_cpu_exec
- 45.15% cpu_exec
- 45.15% cpu_exec_setjmp
- cpu_exec_loop
- 45.14% cpu_tb_exec
42.08% 0x7f4e6f3f0000
- 0.72% 0x7f4e99f37037
helper_VPERM
- 0.67% 0x7f4e99f3716d
helper_VPERM
<snip>