Adding some vfio people who I hope can give some more insight. The question I'm trying to answer is why accessing the VRAM of a graphics card passed through with vfio-pci to a PPC guest running on an x86_64 host with TCG is slow.

On Wed, 30 Apr 2025, BALATON Zoltan wrote:
On Wed, 30 Apr 2025, Alex Bennée wrote:
BALATON Zoltan <bala...@eik.bme.hu> writes:
On Wed, 30 Apr 2025, Nicholas Piggin wrote:
Any MMIO access has to come via the slow path. Any MMIO also currently
has to be the last instruction in a block in case the operation triggers
a change in the translation regime that needs to be picked up by the
next instruction you execute.

This is a pathological case when modelling VRAM on a device because it's
going to be slow either way. At least if you model the multiple-byte
access with a helper you can amortise some of the cost of the MMU lookup
with a single probe_() call.
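
If I understand that suggestion correctly, it would mean something like the sketch below (hypothetical code, not anything that exists in QEMU; the helper name, the 32-byte line size and passing mmu_idx in as an argument are all assumptions for illustration): one probe_access() per line, then a plain host-side access if the line turns out to be backed by RAM.

#include "qemu/osdep.h"
#include "cpu.h"
#include "exec/exec-all.h"
#include "exec/cpu_ldst.h"

/*
 * Hypothetical helper sketch: one TLB probe for a whole cache line instead
 * of one per 4- or 16-byte access.  probe_access() returns a host pointer
 * when the address is backed by RAM and NULL when it has to go through the
 * MMIO path.
 */
void helper_zero_line(CPUPPCState *env, target_ulong addr, uint32_t mmu_idx)
{
    const int line_size = 32;                    /* assumed cache line size */
    void *host = probe_access(env, addr, line_size, MMU_DATA_STORE,
                              mmu_idx, GETPC());

    if (host) {
        /* Fast path: the line is ordinary RAM, touch it directly. */
        memset(host, 0, line_size);
    } else {
        /* Slow path: MMIO, fall back to individual 32-bit stores. */
        for (int i = 0; i < line_size; i += 4) {
            cpu_stl_mmuidx_ra(env, addr + i, 0, mmu_idx, GETPC());
        }
    }
}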

I think there is some mix-up here because of all the different scenarios I benchmarked, so let me try to clear that up. The goal is to find out why access to the VRAM of a graphics card passed through with vfio-pci is slower than expected, given that the host should be faster than the mostly embedded or old PPC CPUs used in real machines, which only have 4x PCIe or PCIe-to-PCI bridges. In this case we are not emulating VRAM but mapping the framebuffer of the real card and accessing that. To find where the slowdown comes from I've benchmarked all the cases upthread, but here are the relevant parts again for easier comparison:

First, both src and dst are in RAM (just malloc'ed buffers, so this is the baseline):

src 0xb79c8008 dst 0xb78c7008
byte loop: 21.16 sec
memset: 3.85 sec
memcpy: 5.07 sec
copyToVRAMNoAltivec: 2.52 sec
copyToVRAMAltivec: 2.42 sec
copyFromVRAMNoAltivec: 6.39 sec
copyFromVRAMAltivec: 7.02 sec

The FromVRAM cases use dcbz to avoid loading into cache the RAM contents that are about to be overwritten on a real machine, so dcbz is never applied to MMIO. (Arguably they should use dcba, but for some reason nobody remembers why they use dcbz instead.) The ToVRAM cases use dcbt, which is a no-op in QEMU. I guess the difference we see here comes from the probe_access in dcbz, as previous profiling showed. Replacing it with dcba (which is a no-op in QEMU) makes ToVRAM and FromVRAM run at about the same speed (you can find that case in the original message). FromVRAM is still a bit slower for some reason, but most of this overhead can be attributed to dcbz.
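
To make the dcbz trick concrete, here is a minimal sketch of the pattern (not the benchmark's actual code; the 32-byte cache line size is an assumption that happens to hold for the G3/G4):

#include <string.h>

#define CACHE_LINE 32   /* assumed 32-byte line, as on G3/G4 */

/*
 * Copy one cache line from (uncacheable) VRAM into ordinary RAM.  dst must
 * be cache-line aligned and must be normal, cacheable RAM: dcbz only zeroes
 * the destination line in the cache so its old contents never have to be
 * fetched from memory, and applying it to MMIO/non-cacheable space would be
 * wrong.  That is why the routines only ever use it on the RAM side.
 */
static void copy_line_from_vram(void *dst, const void *src)
{
    __asm__ volatile ("dcbz 0,%0" : : "r" (dst) : "memory");
    memcpy(dst, src, CACHE_LINE);
}

On QEMU that dcbz is where the probe_access cost shows up, which is consistent with the profiling mentioned above.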

In the second test, dst is mmapped from the emulated ati-vga framebuffer BAR. We could say we emulate VRAM here, but it's just a RAM memory region created in vga.c as:

memory_region_init_ram_nomigrate(&s->vram, obj, "vga.vram", s->vram_size, &local_err);

It also has dirty tracking enabled; I don't know if that has any effect. This is shown in the left column here:

dst in emulated ati-vga               | dst in real card vfio vram
mapping 0x80800000                    | mapping 0x80800000
src 0xb78e0008 dst 0xb77de000         | src 0xb7ec5008 dst 0xb7dc3000
byte loop: 21.2 sec                   | byte loop: 563.98 sec
memset: 3.89 sec                      | memset: 39.25 sec
memcpy: 5.07 sec                      | memcpy: 140.49 sec
copyToVRAMNoAltivec: 2.53 sec         | copyToVRAMNoAltivec: 72.03 sec
copyToVRAMAltivec: 12.22 sec          | copyToVRAMAltivec: 78.12 sec
copyFromVRAMNoAltivec: 6.43 sec       | copyFromVRAMNoAltivec: 728.52 sec
copyFromVRAMAltivec: 35.33 sec        | copyFromVRAMAltivec: 754.95 sec

Here we see that the AltiVec cases have additional overhead, which I think is related to vperm, as that's the only op that does not seem to be compiled to something sensible but calls an unoptimised helper (although that's also there for RAM, so I'm not sure why it is slower here). But this shows no other overhead due to MMIO being involved, as the NoAltivec cases are the same as with RAM.
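
As far as I can tell, the emulated case behaves like RAM because the vram region shown above is plain RAM registered directly as a prefetchable memory BAR, so guest accesses end up hitting host memory without going through MemoryRegionOps callbacks. Roughly (paraphrased from hw/display/ati.c; variable and field names are approximate, not the literal code):

/*
 * Sketch of how ati-vga exposes its framebuffer: the RAM region created in
 * vga.c is registered as a prefetchable memory BAR, so the guest mapping is
 * backed by a host pointer rather than by I/O callbacks.
 */
pci_register_bar(dev, 0, PCI_BASE_ADDRESS_MEM_PREFETCH, &s->vga.vram);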

The last case, shown in the right column above, is when, instead of ati-vga, I have a real ATI card passed through with vfio-pci. This is much slower than can be explained by PCI overhead alone, and I'm trying to find out the source of that slowdown.
[...]
Why do those trap on MMIO on a real machine? These routines were tested
on real machines, and the reasoning for using the widest possible access
was that a PCI transfer has overhead, which is minimised by transferring
more bits in one op. I think they also verified that it works at least
for the 32-bit CPUs up to the G4 that were used on real AmigaNG
machines. There are some benchmark results here:
https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS?start=60 which is
also where the benchmark I used comes from, so this should be similar.
I think the MemCopy on that page has the plain unoptimised copy as Copy
to/from VRAM and optimised routines similar to this benchmark as
Read/Write Pixel Array, but it's not easy to search. Some of the
machines, like the Pegasos II and AmigaOne XE, were made with either G3
or G4 CPUs, so if I find results from those with the same graphics card,
that could show whether AltiVec is faster (although the G4s were also
clocked higher, so not directly comparable). Some results there are also
from QEMU, mostly those with the SiliconMotion 502, but that does not
have this problem; only vfio-pci passthrough does.

They don't - what we need is to have a RAM-like-device model for QEMU
where we can relax the translation rules because we know we are writing
to RAM-like things that don't have registers or other state-changing
behaviour.

The poor behaviour is because QEMU currently treats all MMIO as
potentially system-state-altering, whereas for VRAM it doesn't need to.

This does not seem to be the case with the emulated ati-vga, and with vfio-pci it should also be mapped memory from the graphics card, which technically is MMIO, but how does QEMU decide that, when it does not seem to consider ati-vga as I/O? Typically in QEMU, MMIO means an I/O memory region that goes through memops, and that's understandably slow, but here we should be reading/writing mapped memory space. Maybe I should try to find out what vfio-pci actually does here, but it is used for gaming with KVM, where people get near-native performance, so I don't think the overhead is in vfio-pci itself.

After looking at how vfio creates the BARs, it seems to use memory_region_init_io but then also mmaps the region, though maybe not always (see the rough sketch below). I could not find out exactly how this works, but there may be MMIO involved here. With KVM this may not be a problem, but it causes TCG to break the TB at the op accessing the I/O region, and if this is in a loop it can cause frequent exits and looking up a new TB; at least that's what I think now. Is this correct, and how could it be avoided? Can TBs be chained here to avoid the exit from the loop? Or why do we have I/O memory regions for the PCI memory BARs in the first place, when these are then mapped in the guest address space and accessed directly?

In hw/vfio/region.c the vfio_region_{read,write} memop functions seem to do two things: endian conversion and calling eoi to maybe ack an interrupt. Why is the endian conversion needed when the guest should be aware it's talking to a PCI device (and the memops are also defined as DEVICE_LITTLE_ENDIAN, so there may be a double conversion here)? I don't know how the interrupts work, but this may only make sense for MMIO BARs that contain registers, not for VRAM. Typically VRAM is marked as prefetchable, so maybe for those BARs we could use RAM memory regions instead of I/O ones to avoid this problem?
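
For reference, this is roughly the structure I see, heavily simplified and paraphrased from hw/vfio/region.c (not the literal code, and the field names are approximate): an I/O container region backed by vfio_region_read/write, with the mmap-able parts layered on top as ram_device subregions pointing at memory mmapped from the device fd.

/*
 * Simplified/paraphrased sketch of how vfio appears to build a BAR: an I/O
 * container that traps into vfio_region_read/write, plus "ram_device"
 * subregions for the parts that could be mmapped from the device fd.
 */
memory_region_init_io(region->mem, obj, &vfio_region_ops,
                      region, name, region->size);

for (i = 0; i < region->nr_mmaps; i++) {
    VFIOMmap *map = &region->mmaps[i];

    map->mmap = mmap(NULL, map->size, PROT_READ | PROT_WRITE, MAP_SHARED,
                     vbasedev->fd, region->fd_offset + map->offset);
    memory_region_init_ram_device_ptr(&map->mem, obj, "mmap",
                                      map->size, map->mmap);
    memory_region_add_subregion(region->mem, map->offset, &map->mem);
}

So at least the mmap-able parts seem to end up as memory-backed subregions rather than pure I/O, but I'm not sure how TCG treats those.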

More details can be found in the original thread here:
https://lists.nongnu.org/archive/html/qemu-ppc/2025-04/msg00326.html

Regards,
BALATON Zoltan
