On Wed, 30 Apr 2025, Nicholas Piggin wrote:
On Wed Apr 30, 2025 at 7:09 AM AEST, BALATON Zoltan wrote:
On Tue, 29 Apr 2025, Alex Bennée wrote:
BALATON Zoltan <bala...@eik.bme.hu> writes:
On Tue, 29 Apr 2025, Alex Bennée wrote:
BALATON Zoltan <bala...@eik.bme.hu> writes:
On Mon, 28 Apr 2025, Richard Henderson wrote:
On 4/28/25 06:26, BALATON Zoltan wrote:
I have tried profiling the dst in real card vfio VRAM with dcbz case (with
100 iterations instead of the 10000 in the tests above), but I'm not sure I
understand the results. vperm and dcbz show up, but not too high. Can
somebody explain what is happening here and where the overhead likely comes
from? Here is the profile result I got:
Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
  Children      Self  Command          Shared Object    Symbol
-   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc  [.] cpu_exec_loop
   - 98.49% cpu_exec_loop
      - 98.48% cpu_tb_exec
         - 90.95% 0x7f4e705d8f15
              helper_ldub_mmu
              do_ld_mmio_beN
            - cpu_io_recompile
               - 45.79% cpu_loop_exit_noexc
I think the real problem is the number of loop exits due to i/o. If
I'm reading this rightly, 45% of execution is in cpu_io_recompile.
I/O can only happen as the last insn of a translation block.
I'm not sure I understand this. A comment above cpu_io_recompile says
"In deterministic execution mode, instructions doing device I/Os must
be at the end of the TB." Is that wrong? Otherwise shouldn't this only
apply if running with icount or something like that?
That comment should be fixed. It used to only be the case for icount
mode but there was another race bug that meant we need to honour device
access as the last insn for both modes.
When we detect that it has happened in the middle of a translation
block, we abort the block, compile a new one, and restart execution.
Where does that happen? The calls of cpu_io_recompile in this case
seem to come from io_prepare which is called from do_ld16_mmio_beN if
(!cpu->neg.can_do_io) but I don't see how can_do_io is set.
Inline by set_can_do_io()
That one I've found, but I don't know where the cpu_loop_exit at the end of
cpu_io_recompile returns to.
cpu_loop_exit longjmp's back to the top of the execution loop.
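In case it helps to picture it, here is a minimal, self-contained sketch of
that setjmp/longjmp pattern. It is illustrative only, not QEMU's actual code:
in QEMU the jump buffer lives in the per-CPU state and the real functions are
cpu_exec() and cpu_loop_exit(); all of the names below are made up.

/*
 * Minimal sketch of the longjmp pattern described above (illustrative
 * only, not QEMU code).
 */
#include <setjmp.h>
#include <stdio.h>

static jmp_buf exec_loop;            /* stand-in for the per-CPU jump buffer */

static void loop_exit(void)          /* stand-in for cpu_loop_exit() */
{
    longjmp(exec_loop, 1);           /* unwinds straight back to the loop top */
}

static void run_tb(int attempt)      /* stand-in for executing one TB */
{
    if (attempt == 0) {
        printf("MMIO mid-block: bail out and restart\n");
        loop_exit();                 /* does not return */
    }
    printf("TB ran to completion\n");
}

int main(void)
{
    volatile int attempt = 0;

    if (setjmp(exec_loop)) {         /* loop_exit() lands back here */
        attempt++;                   /* pick a different (shorter) TB next */
    }
    run_tb(attempt);
    return 0;
}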
Where this becomes a bottleneck is when this same translation block
is in a loop. Exactly this case of memset/memcpy of VRAM. This
could be addressed by invalidating the previous translation block
and creating a new one which always ends with the i/o.
And where to do that? cpu_io_recompile just exits the TB, but what
generates the new TB? I need some more clues to understand how to do
this.
cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | CF_NOIRQ | n;
sets the cflags for the next TB, which the lookup will typically fail to find
and then regenerate. Normally cflags_next_tb is empty.
Shouldn't this only regenerate the next TB on the first loop iteration
and not afterwards?
If we've been here before (needing n insns from the base address) we will
have a cached translation we can re-use. It doesn't stop the longer TB
being called again as we re-enter the loop.
So then maybe it should at least check if there's already a cached TB
where it can continue before calling cpu_io_recompile in io_prepare and
only recompile if needed?
It basically does do that AFAIKS. The cpu_io_recompile() name is misleading:
it does not cause a recompile, it just updates cflags and exits. The next
entry will look up the TB that has just 1 insn and enter that.
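To make that concrete, here is a rough standalone sketch of the
lookup-keyed-by-(pc, cflags) behaviour being described. This is my own
simplification, not QEMU code: the tiny cache and the flag value used for
CF_MEMI_ONLY are made up for illustration, and the real lookup lives in
accel/tcg/cpu-exec.c.

/*
 * Rough sketch (not QEMU code): TBs are cached keyed by (pc, cflags),
 * so the "recompile" is really just a request, via cflags_next_tb, for
 * a block with a different instruction count. Once that 1-insn block
 * exists it is found in the cache and reused; no retranslation happens.
 */
#include <stdint.h>
#include <stdio.h>

#define CF_MEMI_ONLY 0x00010000u     /* illustrative value only */

struct tb {
    uint64_t pc;
    uint32_t cflags;
    int valid;
};

static struct tb cache[16];

static struct tb *lookup_or_translate(uint64_t pc, uint32_t cflags)
{
    for (int i = 0; i < 16; i++) {
        if (cache[i].valid && cache[i].pc == pc && cache[i].cflags == cflags) {
            printf("  cache hit: pc=%#x cflags=%#x\n",
                   (unsigned)pc, (unsigned)cflags);
            return &cache[i];
        }
    }
    for (int i = 0; i < 16; i++) {
        if (!cache[i].valid) {
            cache[i] = (struct tb){ .pc = pc, .cflags = cflags, .valid = 1 };
            printf("  translate: pc=%#x cflags=%#x\n",
                   (unsigned)pc, (unsigned)cflags);
            return &cache[i];
        }
    }
    return NULL;
}

int main(void)
{
    uint64_t pc = 0x1000;            /* guest loop keeps hitting the same pc */
    uint32_t cflags_next_tb = 0;

    for (int iter = 0; iter < 4; iter++) {
        uint32_t cflags = cflags_next_tb;
        cflags_next_tb = 0;

        printf("loop iteration %d\n", iter);
        struct tb *tb = lookup_or_translate(pc, cflags);

        /* Full-size block hits MMIO mid-way: request a 1-insn block next
         * (the low bits of cflags carry the insn count, hence the "| 1"). */
        if (!(tb->cflags & CF_MEMI_ONLY)) {
            cflags_next_tb = CF_MEMI_ONLY | 1;
        }
    }
    return 0;
}

Running that shows only two translations across all iterations; after the
first pass both forms of the block come straight from the cache, which
matches the point that the cost is the repeated loop exit, not recompilation.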
After reading it I came to the same conclusion, but then I don't understand
what causes the problem. Is it just that it will exit the loop for every
I/O to look up the recompiled TB? It looks like it tries to chain TBs, so why
does that not work here?
I was thinking maybe we need a flag or counter to see if cpu_io_recompile is
called more than once, and after a limit invalidate the TB and create two new
ones: the first ending at the I/O, and then what cpu_io_recompile does now.
As I understood it, that is what Richard suggested, but I don't know how to
do that.
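In case it makes the idea clearer, a purely hypothetical sketch of such a
counter follows. Nothing like this exists in QEMU today; the threshold is
arbitrary and where a per-TB counter would actually live is an open question.

/*
 * Hypothetical sketch of the counter idea above (not existing QEMU code).
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define IO_SPLIT_THRESHOLD 8          /* arbitrary, would need tuning */

struct tb_io_stats {
    uint32_t mid_block_io_exits;      /* bumped each time this block takes
                                         the cpu_io_recompile() path */
};

/* Return true when the block should be invalidated and retranslated so
 * that it ends at the I/O instruction (and can then chain normally). */
static bool note_mid_block_io(struct tb_io_stats *s)
{
    return ++s->mid_block_io_exits >= IO_SPLIT_THRESHOLD;
}

int main(void)
{
    struct tb_io_stats stats = { 0 };

    for (int i = 1; i <= 10; i++) {
        if (note_mid_block_io(&stats)) {
            printf("exit %d: split the TB at the I/O insn\n", i);
            break;
        }
        printf("exit %d: keep the current behaviour\n", i);
    }
    return 0;
}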
memset/cpy routines had kind of the same problem with real hardware.
They wanted to use vector instructions for best performance, but when
those are used on MMIO they would trap and be very slow.
Why do those trap on MMIO on a real machine? These routines were tested on
real machines, and the reasoning for using the widest possible access was
that a PCI transfer has overhead, which is minimised by transferring more
bits in one operation. I think they also verified that it works at least for
the 32-bit CPUs up to the G4 that were used in real AmigaNG machines. There
are some benchmark results here:
https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS?start=60 which is
also where the benchmark I used comes from, so this should be similar. I
think on that page the MemCopy test lists the plain unoptimised copy as
"Copy to/from VRAM" and routines optimised similarly to this benchmark as
"Read/Write Pixel Array", but it's not easy to search. Some of the machines,
like the Pegasos II and AmigaOne XE, were made with either G3 or G4 CPUs, so
if I find results from those with the same graphics card, that could show
whether AltiVec is faster (although the G4s were also clocked higher, so it's
not directly comparable). Some results there are also from QEMU, mostly those
with the SiliconMotion 502, but that does not have this problem; only
vfio-pci passthrough does. So maybe it's something with how vfio-pci maps PCI
memory BARs?
Problem is we don't know ahead of time if some routine will access
MMIO or not. You could recompile it with fewer instructions but then
it will be slow when used for regular memory.
Heuristics are tough because you could have, e.g., one initial big memset
that clears an MMIO region by iterating many times over an inner loop of
dcbz instructions, but is then never used again for MMIO while remaining
important for regular page clearing. Making something that dynamically
decays, or periodically recompiles back to the non-I/O case, could work
perhaps, but then complexity goes up.
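If it helps, a hypothetical sketch of what that decay might look like on top
of the same kind of per-TB counter (again not existing code, just the shape
of the idea; the halving interval and threshold are made up):

/*
 * Hypothetical decaying heuristic (not existing QEMU code): the I/O-exit
 * counter is halved on a periodic tick, so a block that was split for one
 * big MMIO memset drifts back under the threshold and can be retranslated
 * in its normal, fast form.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define IO_SPLIT_THRESHOLD 8     /* arbitrary */

struct tb_io_stats {
    uint32_t io_exits;
    bool split;                  /* currently translated in "I/O last" form */
};

static void io_exit_tick(struct tb_io_stats *s)   /* on each mid-block I/O */
{
    if (++s->io_exits >= IO_SPLIT_THRESHOLD) {
        s->split = true;         /* retranslate ending at the I/O insn */
    }
}

static void decay_tick(struct tb_io_stats *s)     /* on a periodic timer */
{
    s->io_exits /= 2;
    if (s->split && s->io_exits == 0) {
        s->split = false;        /* retranslate back to the full fast block */
    }
}

int main(void)
{
    struct tb_io_stats s = { 0 };

    for (int i = 0; i < 10; i++) {      /* burst of MMIO use */
        io_exit_tick(&s);
    }
    printf("after MMIO burst: split=%d\n", s.split);

    for (int i = 0; i < 5; i++) {       /* block now only touches RAM */
        decay_tick(&s);
    }
    printf("after decay: split=%d\n", s.split);
    return 0;
}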
I would prefer not to do that just for a microbenchmark, but if you think it
is a reasonable overall win for average workloads of your users, then
perhaps.
I'm still trying to understand what to optimise. So far it looks like dcbz
has the least impact, vperm a bit more (but still only a few percent), and
where the biggest impact comes from is still not known for sure. We see
faster access on real machines that run on slower PCIe (only 4x at best),
while CPU benchmarks don't show slower performance on QEMU; only accessing
the passed-through card's VRAM is slower than expected. If there's a trap
involved, I've found before that exceptions are slower with QEMU, but I did
not see evidence of that in the profile.
Regards,
BALATON Zoltan