BALATON Zoltan <bala...@eik.bme.hu> writes:

> On Mon, 28 Apr 2025, BALATON Zoltan wrote:
>> On Mon, 28 Apr 2025, BALATON Zoltan wrote:
>>> On Thu, 24 Apr 2025, BALATON Zoltan wrote:
>>>>> The test case I've used came out of a discussion about very slow
>>>>> access to the VRAM of a graphics card passed through with vfio. The
>>>>> reason for that is still not clear, but it was already known that
>>>>> dcbz is often used by MacOS and AmigaOS to clear memory and to avoid
>>>>> reading values that are about to be overwritten, which is faster on
>>>>> a real CPU but was found to be slower on QEMU. The optimised copy
>>>>> routines were posted here:
<snip>
>
> I have tried profiling the "dst in real card vfio vram with dcbz" case
> (with 100 iterations instead of the 10000 in the above tests) but I'm
> not sure I understand the results. vperm and dcbz show up but not too
> high. Can somebody explain what is happening here and where the
> overhead likely comes from? Here is the profile result I got:
>
> Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
>   Children      Self  Command          Shared Object            Symbol
> -   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.] 
> cpu_exec_loop
>    - 98.49% cpu_exec_loop
>       - 98.48% cpu_tb_exec
>          - 90.95% 0x7f4e705d8f15
>               helper_ldub_mmu
>               do_ld_mmio_beN
>             - cpu_io_recompile

This looks like the dcbz instructions are being used to clear device
memory and tripping over the can_do_io check (normally the translator
tries to ensure all device access is at the end of a block). When the
MMIO access happens mid-block, cpu_io_recompile retranslates a shorter
TB and longjmps back to the execution loop, which is the
cpu_loop_exit_noexc/cpu_exec_setjmp churn in your profile.

You could try ending the translation block on dcbz instructions and
seeing if that helps. Normally I would expect the helper to be more
efficient as it can probe the whole address range once and then use
host insns to blat the memory.
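For the first idea, a minimal sketch, assuming the decode still lives
in gen_dcbz() in target/ppc/translate.c and the DISAS_* and
gen_update_nip() conventions there haven't drifted (the helper's
signature has changed over time, so treat that part as an assumption):

    /* target/ppc/translate.c: end of gen_dcbz(), sketch only */
    gen_helper_dcbz(tcg_env, addr);         /* existing helper call */

    /*
     * New: close the TB here so the (potentially MMIO) stores done
     * by the helper are the last access in the block and pass the
     * can_do_io check instead of triggering cpu_io_recompile.
     */
    gen_update_nip(ctx, ctx->base.pc_next);
    ctx->base.is_jmp = DISAS_CHAIN;

By "probe the whole address range once" I mean the usual pattern,
roughly what target/ppc/mem_helper.c already does for dcbz (again
from memory, so double-check the names against the tree):

    /* probe the whole cache line once; returns a host pointer for
     * plain RAM, NULL if the page is backed by MMIO */
    void *haddr = probe_write(env, addr, dcbz_size, mmu_idx, retaddr);

    if (haddr) {
        memset(haddr, 0, dcbz_size);        /* fast host-side clear */
    } else {
        /* MMIO: fall back to individual guest-visible stores */
        for (int i = 0; i < dcbz_size; i += 8) {
            cpu_stq_mmuidx_ra(env, addr + i, 0, mmu_idx, retaddr);
        }
    }

Of course the memset path never fires for memory QEMU is treating as
MMIO (which your do_ld_mmio_beN hits say this VRAM is), which is why
the mid-block I/O handling dominates the profile.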

>                - 45.79% cpu_loop_exit_noexc
>                   - cpu_loop_exit
>                     __longjmp_chk
>                     cpu_exec_setjmp
>                   - cpu_exec_loop
>                      - 45.78% cpu_tb_exec
>                           42.35% 0x7f4e6f3f0000
>                         - 0.72% 0x7f4e99f37037
>                              helper_VPERM
>                         - 0.68% 0x7f4e99f3716d
>                              helper_VPERM
>                - 45.16% rr_cpu_thread_fn

Hmm, you seem to be running in icount mode here for some reason.
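If you didn't ask for icount explicitly, check the command line for
something like

    -icount shift=auto

because icount forces the single-threaded round-robin loop (that
rr_cpu_thread_fn in your profile) and is classically what makes a
mid-block MMIO access bounce through cpu_io_recompile.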

>                   - 45.16% tcg_cpu_exec
>                      - 45.15% cpu_exec
>                         - 45.15% cpu_exec_setjmp
>                            - cpu_exec_loop
>                               - 45.14% cpu_tb_exec
>                                    42.08% 0x7f4e6f3f0000
>                                  - 0.72% 0x7f4e99f37037
>                                       helper_VPERM
>                                  - 0.67% 0x7f4e99f3716d
>                                       helper_VPERM
<snip>

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro
