On Tue, 29 Apr 2025, Alex Bennée wrote:
BALATON Zoltan <bala...@eik.bme.hu> writes:
On Mon, 28 Apr 2025, BALATON Zoltan wrote:
On Mon, 28 Apr 2025, BALATON Zoltan wrote:
On Thu, 24 Apr 2025, BALATON Zoltan wrote:
The test case I've used came out of a discussion about very slow access
to the VRAM of a graphics card passed through with vfio. The reason for
that is still not clear, but it was already known that dcbz is often
used by MacOS and AmigaOS to clear memory and to avoid reading values
that are about to be overwritten, which is faster on a real CPU but was
found to be slower on QEMU. The optimised copy routines were posted
here:
<snip>
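
(For context, the gist of the dcbz trick is roughly the following; this
is only a sketch assuming a 32-byte cache block, not the routines that
were posted.)

#include <stddef.h>

/*
 * Sketch only (not the posted routines): clear a buffer with dcbz so
 * the CPU never has to fetch cache lines it is about to overwrite
 * entirely.  Assumes a 32-byte cache block and block-aligned dst/len.
 */
static void clear_with_dcbz(void *dst, size_t len)
{
    const size_t block = 32;        /* assumed cache block size */
    char *p = dst;

    for (size_t i = 0; i < len; i += block) {
        /* dcbz zeroes the whole cache block containing the address */
        __asm__ volatile ("dcbz 0,%0" : : "r" (p + i) : "memory");
    }
}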

I have tried profiling the case where the destination is the real
card's vfio VRAM and dcbz is used (with 100 iterations instead of the
10000 in the above tests), but I'm not sure I understand the results.
vperm and dcbz show up, but not too high. Can somebody explain what is
happening here and where the overhead likely comes from? Here is the
profile result I got:

Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
  Children      Self  Command          Shared Object            Symbol
-   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec_loop
   - 98.49% cpu_exec_loop
      - 98.48% cpu_tb_exec
         - 90.95% 0x7f4e705d8f15
              helper_ldub_mmu
              do_ld_mmio_beN
            - cpu_io_recompile

This looks like the dcbz instructions are being used to clear device
memory and are tripping over the can_do_io check (normally the
translator tries to ensure all device access is at the end of a block).

If you look at the benchmark results I posted earlier in this thread in https://lists.nongnu.org/archive/html/qemu-ppc/2025-04/msg00326.html, I also tried using dcba instead of dcbz in the CopyFromVRAM* functions, but that helped only very little, so I'm not sure it's because of dcbz. Then I thought it might be VPERM, but the NoAltivec variants are also only a little faster. It could be that using 64-bit access instead of 128-bit (the NoAltivec functions use FPU regs) makes it slower while avoiding VPERM makes it faster, and the two cancel each other out; but the profile also shows VPERM not being high, and somebody else tested this with -cpu g3 and only got a 1% faster result. So maybe it's also not primarily because of VPERM, and there's a bigger overhead before these.

You could try ending the block on dcbz instructions and seeing if that
helps. Normally I would expect the helper to be more efficient as it can
probe the whole address range once and then use host insns to blat the
memory.
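
Roughly what I mean (a sketch from memory of what the target/ppc dcbz
helper does against QEMU's internal softmmu API, so take the exact
names with a grain of salt; it is not standalone code):

/*
 * Translate the whole cache block once, then let the host clear it
 * directly; only fall back to per-word guest stores when the block is
 * not backed by ordinary RAM (e.g. MMIO).  Helper names are from
 * memory and may not match the current tree exactly.
 */
static void dcbz_sketch(CPUPPCState *env, target_ulong addr,
                        int mmu_idx, uintptr_t retaddr)
{
    target_ulong dcbz_size = env->dcache_line_size;
    void *haddr;

    addr &= ~(dcbz_size - 1);                   /* align to the block */

    /* one softmmu probe for the whole block */
    haddr = probe_write(env, addr, dcbz_size, mmu_idx, retaddr);
    if (haddr) {
        memset(haddr, 0, dcbz_size);            /* host insns blat it */
    } else {
        /* slow path: MMIO or watchpointed memory */
        for (target_ulong i = 0; i < dcbz_size; i += 8) {
            cpu_stq_be_mmuidx_ra(env, addr + i, 0, mmu_idx, retaddr);
        }
    }
}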

Maybe I could try that, if I can do it the same way as it's done in io_prepare.
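
Something like this, perhaps (an untested sketch; the names and
arguments are from my recollection of target/ppc/translate.c and the
generic translator loop, so treat them as assumptions):

/*
 * Untested sketch: end the translation block right after dcbz so the
 * store is the last thing in the TB, like the translator already
 * arranges for known device accesses.
 */
static void gen_dcbz(DisasContext *ctx)
{
    TCGv addr = tcg_temp_new();

    gen_set_access_type(ctx, ACCESS_CACHE);
    gen_addr_reg_index(ctx, addr);
    gen_helper_dcbz(tcg_env, addr);   /* real helper may take more args */

    /* ask the generic translator loop to end the TB here */
    ctx->base.is_jmp = DISAS_TOO_MANY;
}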

               - 45.79% cpu_loop_exit_noexc
                  - cpu_loop_exit
                    __longjmp_chk
                    cpu_exec_setjmp
                  - cpu_exec_loop
                     - 45.78% cpu_tb_exec
                          42.35% 0x7f4e6f3f0000
                        - 0.72% 0x7f4e99f37037
                             helper_VPERM
                        - 0.68% 0x7f4e99f3716d
                             helper_VPERM
               - 45.16% rr_cpu_thread_fn

Hmm, you seem to be running in icount mode here for some reason.

No idea why. I had no such option, compiled without --enable-debug, and there was nothing special on the QEMU command line, just default options. How can I check if icount is enabled? Can profiling with the perf tool interfere? I thought it only reads the CPU performance counters and does not otherwise attach to the process.

Regards,
BALATON Zoltan

                  - 45.16% tcg_cpu_exec
                     - 45.15% cpu_exec
                        - 45.15% cpu_exec_setjmp
                           - cpu_exec_loop
                              - 45.14% cpu_tb_exec
                                   42.08% 0x7f4e6f3f0000
                                 - 0.72% 0x7f4e99f37037
                                      helper_VPERM
                                 - 0.67% 0x7f4e99f3716d
                                      helper_VPERM
<snip>
