On Wed, 30 Apr 2025, Nicholas Piggin wrote:
On Wed Apr 30, 2025 at 7:09 AM AEST, BALATON Zoltan wrote:
On Tue, 29 Apr 2025, Alex Bennée wrote:
BALATON Zoltan <bala...@eik.bme.hu> writes:
On Tue, 29 Apr 2025, Alex Bennée wrote:
BALATON Zoltan <bala...@eik.bme.hu> writes:
On Mon, 28 Apr 2025, Richard Henderson wrote:
On 4/28/25 06:26, BALATON Zoltan wrote:
I have tried profiling the dst in real card vfio VRAM with dcbz case (with
100 iterations instead of the 10000 in the tests above), but I'm not sure I
understand the results. vperm and dcbz show up, but not too high. Can
somebody explain what is happening here and where the overhead likely comes
from? Here is the profile result I got:
Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
  Children      Self  Command          Shared Object    Symbol
-   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc  [.] cpu_exec_loop
   - 98.49% cpu_exec_loop
      - 98.48% cpu_tb_exec
         - 90.95% 0x7f4e705d8f15
              helper_ldub_mmu
              do_ld_mmio_beN
            - cpu_io_recompile
               - 45.79% cpu_loop_exit_noexc
I think the real problem is the number of loop exits due to i/o. If
I'm reading this rightly, 45% of execution is in cpu_io_recompile.
I/O can only happen as the last insn of a translation block.
I'm not sure I understand this. A comment above cpu_io_recompile says
"In deterministic execution mode, instructions doing device I/Os must
be at the end of the TB." Is that wrong? Otherwise shouldn't this only
apply if running with icount or something like that?
That comment should be fixed. It used to only be the case for icount
mode but there was another race bug that meant we need to honour device
access as the last insn for both modes.
When we detect that it has happened in the middle of a translation
block, we abort the block, compile a new one, and restart execution.
Where does that happen? The calls of cpu_io_recompile in this case
seem to come from io_prepare which is called from do_ld16_mmio_beN if
(!cpu->neg.can_do_io) but I don't see how can_do_io is set.
Inline by set_can_do_io()
That one I've found, but I don't know where the cpu_loop_exit at the end of
cpu_io_recompile returns to.
cpu_loop_exit longjmp's back to the top of the execution loop.
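In case it helps to picture it, here is a minimal, self-contained sketch of
that setjmp/longjmp pattern. It is illustrative only, not QEMU's actual code:
in QEMU the jump buffer lives in the per-CPU state and the real functions are
cpu_exec() and cpu_loop_exit(); all of the names below are made up.

/*
 * Minimal sketch of the longjmp pattern described above (illustrative
 * only, not QEMU code).
 */
#include <setjmp.h>
#include <stdio.h>

static jmp_buf exec_loop;            /* stand-in for the per-CPU jump buffer */

static void loop_exit(void)          /* stand-in for cpu_loop_exit() */
{
    longjmp(exec_loop, 1);           /* unwinds straight back to the loop top */
}

static void run_tb(int attempt)      /* stand-in for executing one TB */
{
    if (attempt == 0) {
        printf("MMIO mid-block: bail out and restart\n");
        loop_exit();                 /* does not return */
    }
    printf("TB ran to completion\n");
}

int main(void)
{
    volatile int attempt = 0;

    if (setjmp(exec_loop)) {         /* loop_exit() lands back here */
        attempt++;                   /* pick a different (shorter) TB next */
    }
    run_tb(attempt);
    return 0;
}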
Where this becomes a bottleneck is when this same translation block
is in a loop. Exactly this case of memset/memcpy of VRAM. This
could be addressed by invalidating the previous translation block
and creating a new one which always ends with the i/o.
And where to do that? cpu_io_recompile just exits the TB, but what
generates the new TB? I need some more clues to understand how to do
this.
cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | CF_NOIRQ | n;
sets the cflags for the next TB, which the lookup will typically fail to find
and then regenerate. Normally cflags_next_tb is empty.
Shouldn't this only regenerate the next TB on the first loop iteration
and not afterwards?
If we've been here before (needing n insns from the base address) we will
have a cached translation we can re-use. It doesn't stop the longer TB
being called again as we re-enter the loop.
So then maybe it should at least check if there's already a cached TB
where it can continue before calling cpu_io_recompile in io_prepare and
only recompile if needed?
It basically does do that AFAIKS. The cpu_io_recompile() name is misleading:
it does not cause a recompile, it just updates cflags and exits. The next
entry will look up the TB that has just 1 insn and enter that.
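To make that concrete, here is a rough standalone sketch of the
lookup-keyed-by-(pc, cflags) behaviour being described. This is my own
simplification, not QEMU code: the tiny cache and the flag value used for
CF_MEMI_ONLY are made up for illustration, and the real lookup lives in
accel/tcg/cpu-exec.c.

/*
 * Rough sketch (not QEMU code): TBs are cached keyed by (pc, cflags),
 * so the "recompile" is really just a request, via cflags_next_tb, for
 * a block with a different instruction count. Once that 1-insn block
 * exists it is found in the cache and reused; no retranslation happens.
 */
#include <stdint.h>
#include <stdio.h>

#define CF_MEMI_ONLY 0x00010000u     /* illustrative value only */

struct tb {
    uint64_t pc;
    uint32_t cflags;
    int valid;
};

static struct tb cache[16];

static struct tb *lookup_or_translate(uint64_t pc, uint32_t cflags)
{
    for (int i = 0; i < 16; i++) {
        if (cache[i].valid && cache[i].pc == pc && cache[i].cflags == cflags) {
            printf("  cache hit: pc=%#x cflags=%#x\n",
                   (unsigned)pc, (unsigned)cflags);
            return &cache[i];
        }
    }
    for (int i = 0; i < 16; i++) {
        if (!cache[i].valid) {
            cache[i] = (struct tb){ .pc = pc, .cflags = cflags, .valid = 1 };
            printf("  translate: pc=%#x cflags=%#x\n",
                   (unsigned)pc, (unsigned)cflags);
            return &cache[i];
        }
    }
    return NULL;
}

int main(void)
{
    uint64_t pc = 0x1000;            /* guest loop keeps hitting the same pc */
    uint32_t cflags_next_tb = 0;

    for (int iter = 0; iter < 4; iter++) {
        uint32_t cflags = cflags_next_tb;
        cflags_next_tb = 0;

        printf("loop iteration %d\n", iter);
        struct tb *tb = lookup_or_translate(pc, cflags);

        /* Full-size block hits MMIO mid-way: request a 1-insn block next
         * (the low bits of cflags carry the insn count, hence the "| 1"). */
        if (!(tb->cflags & CF_MEMI_ONLY)) {
            cflags_next_tb = CF_MEMI_ONLY | 1;
        }
    }
    return 0;
}

Running that shows only two translations across all iterations; after the
first pass both forms of the block come straight from the cache, which
matches the point that the cost is the repeated loop exit, not recompilation.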
After reading it I came to the same conclusion, but then I don't understand
what causes the problem. Is it just that it will exit the loop for every
I/O to look up the recompiled TB? It looks like it tries to chain TBs, so why
does that not work here?
I was thinking maybe we need a flag or counter to see if cpu_io_recompile is
called more than once, and after a limit invalidate the TB and create two new
ones: the first ending at the I/O, and then what cpu_io_recompile does now.
As I understood it, that is what Richard suggested, but I don't know how to
do that.
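In case it makes the idea clearer, a purely hypothetical sketch of such a
counter follows. Nothing like this exists in QEMU today; the threshold is
arbitrary and where a per-TB counter would actually live is an open question.

/*
 * Hypothetical sketch of the counter idea above (not existing QEMU code).
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define IO_SPLIT_THRESHOLD 8          /* arbitrary, would need tuning */

struct tb_io_stats {
    uint32_t mid_block_io_exits;      /* bumped each time this block takes
                                         the cpu_io_recompile() path */
};

/* Return true when the block should be invalidated and retranslated so
 * that it ends at the I/O instruction (and can then chain normally). */
static bool note_mid_block_io(struct tb_io_stats *s)
{
    return ++s->mid_block_io_exits >= IO_SPLIT_THRESHOLD;
}

int main(void)
{
    struct tb_io_stats stats = { 0 };

    for (int i = 1; i <= 10; i++) {
        if (note_mid_block_io(&stats)) {
            printf("exit %d: split the TB at the I/O insn\n", i);
            break;
        }
        printf("exit %d: keep the current behaviour\n", i);
    }
    return 0;
}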
memset/cpy routines had kind of the same problem with real hardware.
They wanted to use vector instructions for best performance, but when
those are used on MMIO they would trap and be very slow.
Why do those trap on MMIO on a real machine? These routines were tested on
real machines, and the reasoning for using the widest possible access was
that a PCI transfer has overhead, which is minimised by transferring more
bits in one operation. I think they also verified that it works at least for
the 32-bit CPUs up to the G4 that were used in real AmigaNG machines. There
are some benchmark results here:
https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS?start=60 which is
also where the benchmark I used comes from, so this should be similar. I
think on that page the MemCopy test lists the plain unoptimised copy as
"Copy to/from VRAM" and routines optimised similarly to this benchmark as
"Read/Write Pixel Array", but it's not easy to search. Some of the machines,
like the Pegasos II and AmigaOne XE, were made with either G3 or G4 CPUs, so
if I find results from those with the same graphics card, that could show
whether AltiVec is faster (although the G4s were also clocked higher, so it's
not directly comparable). Some results there are also from QEMU, mostly those
with the SiliconMotion 502, but that does not have this problem; only
vfio-pci passthrough does. So maybe it's something with how vfio-pci maps PCI
memory BARs?
Problem is we don't know ahead of time if some routine will access
MMIO or not. You could recompile it with fewer instructions but then
it will be slow when used for regular memory.
Heuristics are tough because you could have, e.g., one initial big memset
that clears an MMIO region by iterating many times over an inner loop of
dcbz instructions, but is then never used again for MMIO while remaining
important for regular page clearing. Making something that dynamically
decays, or periodically recompiles back to the non-I/O case, could work
perhaps, but then complexity goes up.
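If it helps, a hypothetical sketch of what that decay might look like on top
of the same kind of per-TB counter (again not existing code, just the shape
of the idea; the halving interval and threshold are made up):

/*
 * Hypothetical decaying heuristic (not existing QEMU code): the I/O-exit
 * counter is halved on a periodic tick, so a block that was split for one
 * big MMIO memset drifts back under the threshold and can be retranslated
 * in its normal, fast form.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define IO_SPLIT_THRESHOLD 8     /* arbitrary */

struct tb_io_stats {
    uint32_t io_exits;
    bool split;                  /* currently translated in "I/O last" form */
};

static void io_exit_tick(struct tb_io_stats *s)   /* on each mid-block I/O */
{
    if (++s->io_exits >= IO_SPLIT_THRESHOLD) {
        s->split = true;         /* retranslate ending at the I/O insn */
    }
}

static void decay_tick(struct tb_io_stats *s)     /* on a periodic timer */
{
    s->io_exits /= 2;
    if (s->split && s->io_exits == 0) {
        s->split = false;        /* retranslate back to the full fast block */
    }
}

int main(void)
{
    struct tb_io_stats s = { 0 };

    for (int i = 0; i < 10; i++) {      /* burst of MMIO use */
        io_exit_tick(&s);
    }
    printf("after MMIO burst: split=%d\n", s.split);

    for (int i = 0; i < 5; i++) {       /* block now only touches RAM */
        decay_tick(&s);
    }
    printf("after decay: split=%d\n", s.split);
    return 0;
}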
I would prefer not to do that just for a microbenchmark, but if you think it
is a reasonable overall win for average workloads of your users, then
perhaps.
I'm still trying to understand what to optimise. So far it looks like dcbz
has the least impact, vperm a bit more (but still only a few percent), and
where the biggest impact comes from is still not known for sure. We see
faster access on real machines that run on slower PCIe (only 4x at best),
while CPU benchmarks don't show slower performance on QEMU; only accessing
the passed-through card's VRAM is slower than expected. If there's a trap
involved, I've found before that exceptions are slower with QEMU, but I did
not see evidence of that in the profile.
Regards,
BALATON Zoltan