On Thu, 24 Apr 2025, BALATON Zoltan wrote:
The test case I used came out of a discussion about very slow access to
the VRAM of a graphics card passed through with vfio. The reason for that
is still not clear, but it was already known that dcbz is often used by
MacOS and AmigaOS for clearing memory and for avoiding reads of values
that are about to be overwritten. This is faster on a real CPU but was
found to be slower on QEMU. The optimised copy routines were posted here:
https://www.amigans.net/modules/newbb/viewtopic.php?post_id=149123#forumpost149123
and the rest of the code, which I wrote to turn it into a test case, is here:
http://zero.eik.bme.hu/~balaton/qemu/vramcopy.tar.xz
Replace the body of has_altivec() with just "return false". Sorry for
only giving pieces, but the code posted above has a copyright that does
not allow me to include it in the test. This no longer measures VRAM
access, just memory copy, but it shows the effect of dcbz. I got these
results with this patch:
Linux user master:                  Linux user patch:
byte loop: 2.2 sec                  byte loop: 2.2 sec
memcpy: 2.19 sec                    memcpy: 2.19 sec
copyToVRAMNoAltivec: 1.7 sec        copyToVRAMNoAltivec: 1.71 sec
copyToVRAMAltivec: 2.13 sec         copyToVRAMAltivec: 2.12 sec
copyFromVRAMNoAltivec: 5.11 sec     copyFromVRAMNoAltivec: 2.79 sec
copyFromVRAMAltivec: 5.87 sec       copyFromVRAMAltivec: 3.26 sec

Linux system master:                Linux system patch:
byte loop: 5.86 sec                 byte loop: 5.9 sec
memcpy: 5.45 sec                    memcpy: 5.47 sec
copyToVRAMNoAltivec: 2.51 sec       copyToVRAMNoAltivec: 2.53 sec
copyToVRAMAltivec: 3.84 sec         copyToVRAMAltivec: 3.85 sec
copyFromVRAMNoAltivec: 6.11 sec     copyFromVRAMNoAltivec: 3.92 sec
copyFromVRAMAltivec: 7.22 sec       copyFromVRAMAltivec: 5.51 sec
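The numbers above are wall-clock times for repeated copies. The actual harness is in vramcopy.tar.xz and is not reproduced here; the following is only a minimal sketch of how such a measurement could be taken, assuming a POSIX system. The names copy_fn, byte_loop, memcpy_wrap and time_copy are hypothetical and not taken from the test case.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime() */
#endif
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical common signature for the routines being compared. */
typedef void (*copy_fn)(void *dst, const void *src, size_t len);

/* Plain byte-by-byte copy; volatile keeps the compiler from
 * turning the loop into a memcpy call. */
static void byte_loop(void *dst, const void *src, size_t len)
{
    volatile unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < len; i++)
        d[i] = s[i];
}

/* Thin wrapper so memcpy matches copy_fn. */
static void memcpy_wrap(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Run fn reps times and return the elapsed time in seconds. */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t len, int reps)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        fn(dst, src, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

A call such as time_copy(memcpy_wrap, dst, src, len, reps) would then give one row of a table like the ones above; the buffer size and repeat count are left to the caller.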
I did some more benchmarking to identify what slows it down. I noticed
that memset uses dcbz too, so I added a test for that. I also added a
parameter to allow testing actual VRAM, and now that I have a card working
with vfio-pci passthrough I could test that as well. The updated
vramcopy.tar.xz is at the same URL as above. These tests were run with the
amigaone machine under Linux, booted as described here:
https://www.qemu.org/docs/master/system/ppc/amigang.html
I compiled the benchmark twice, once as in the tar and once replacing dcbz
in the copyFromVRAM* routines with dcba (which is a no-op on QEMU). The
first two sets of results are with both src and dst in RAM, the second two
are with dst in VRAM (mapped from physical address 0x80800000, where the
card's framebuffer is mapped). The left column shows results with the
emulated ati-vga as in the amigang.html docs. The right column is with a
real ATI X550 card (old and slow, but it works with this old PPC Linux)
passed through with vfio-pci.
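For readers less familiar with the two instructions, here is a portable C model of what the two builds do differently. It is only a sketch, not the copyrighted copy routines: dcbz zeroes the whole cache block containing the given address, while dcba is modelled as having no visible effect, matching the QEMU behaviour described above. The 32-byte block size is an assumption (it varies between CPU models), and model_dcbz, model_dcba and copy_blocks are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Assumed cache block size; implementation dependent on real CPUs. */
#define CACHE_BLOCK 32u

/* Model of dcbz: zero the aligned cache block that contains addr. */
static void model_dcbz(void *addr)
{
    uintptr_t base = (uintptr_t)addr & ~(uintptr_t)(CACHE_BLOCK - 1);

    memset((void *)base, 0, CACHE_BLOCK);
}

/* Model of dcba as a no-op: a hint without visible effect. */
static void model_dcba(void *addr)
{
    (void)addr;
}

/*
 * Copy len bytes block by block, preparing each destination block with
 * prep() before it is overwritten. Assumes dst is CACHE_BLOCK aligned
 * and len is a multiple of CACHE_BLOCK.
 */
static void copy_blocks(void *dst, const void *src, size_t len,
                        void (*prep)(void *))
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t off = 0; off < len; off += CACHE_BLOCK) {
        prep(d + off);                          /* dcbz or dcba */
        memcpy(d + off, s + off, CACHE_BLOCK);  /* the actual copy */
    }
}
```

Both variants produce the same destination contents; the only difference is the extra zeroing store per block, which is what the dcba build removes.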
with ati-vga                        | with vfio-pci
src 0xb79c8008 dst 0xb78c7008       | src 0xb7c92008 dst 0xb7b91008
byte loop: 21.16 sec                | byte loop: 21.16 sec
memset: 3.85 sec                    | memset: 3.87 sec
memcpy: 5.07 sec                    | memcpy: 5.07 sec
copyToVRAMNoAltivec: 2.52 sec       | copyToVRAMNoAltivec: 2.53 sec
copyToVRAMAltivec: 2.42 sec         | copyToVRAMAltivec: 2.37 sec
copyFromVRAMNoAltivec: 6.39 sec     | copyFromVRAMNoAltivec: 6.38 sec
copyFromVRAMAltivec: 7.02 sec       | copyFromVRAMAltivec: 7 sec

using dcba instead of dcbz          | using dcba instead of dcbz
src 0xb7b69008 dst 0xb7a68008       | src 0xb7c44008 dst 0xb7b43008
byte loop: 21.14 sec                | byte loop: 21.14 sec
memset: 3.85 sec                    | memset: 3.88 sec
memcpy: 5.06 sec                    | memcpy: 5.07 sec
copyToVRAMNoAltivec: 2.53 sec       | copyToVRAMNoAltivec: 2.52 sec
copyToVRAMAltivec: 2.3 sec          | copyToVRAMAltivec: 2.3 sec
copyFromVRAMNoAltivec: 2.59 sec     | copyFromVRAMNoAltivec: 2.59 sec
copyFromVRAMAltivec: 2.95 sec       | copyFromVRAMAltivec: 2.95 sec
dst in emulated ati-vga             | dst in real card vfio vram
mapping 0x80800000                  | mapping 0x80800000
src 0xb78e0008 dst 0xb77de000       | src 0xb7ec5008 dst 0xb7dc3000
byte loop: 21.2 sec                 | byte loop: 563.98 sec
memset: 3.89 sec                    | memset: 39.25 sec
memcpy: 5.07 sec                    | memcpy: 140.49 sec
copyToVRAMNoAltivec: 2.53 sec       | copyToVRAMNoAltivec: 72.03 sec
copyToVRAMAltivec: 12.22 sec        | copyToVRAMAltivec: 78.12 sec
copyFromVRAMNoAltivec: 6.43 sec     | copyFromVRAMNoAltivec: 728.52 sec
copyFromVRAMAltivec: 35.33 sec      | copyFromVRAMAltivec: 754.95 sec

dst in emulated ati-vga using dcba  | dst in real card vfio vram using dcba
mapping 0x80800000                  | mapping 0x80800000
src 0xb7ba7008 dst 0xb7aa5000       | src 0xb77f4008 dst 0xb76f2000
byte loop: 21.15 sec                | byte loop: 577.42 sec
memset: 3.85 sec                    | memset: 39.52 sec
memcpy: 5.06 sec                    | memcpy: 142.8 sec
copyToVRAMNoAltivec: 2.53 sec       | copyToVRAMNoAltivec: 71.71 sec
copyToVRAMAltivec: 12.2 sec         | copyToVRAMAltivec: 78.09 sec
copyFromVRAMNoAltivec: 2.6 sec      | copyFromVRAMNoAltivec: 727.23 sec
copyFromVRAMAltivec: 35.03 sec      | copyFromVRAMAltivec: 753.15 sec
The results show that dcbz has some effect, but an even bigger slowdown is
caused by using AltiVec, which is supposed to do wider accesses to reduce
the overhead; maybe it is not translated to host vector instructions
correctly. The host in the above tests was an Intel i7-9700K. So to solve
this, maybe AltiVec should be improved more than dcbz, but I don't know
what or how.
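To put rough numbers on that comparison, the ratios can be worked out directly from the "dst in emulated ati-vga" tables above. This is just arithmetic on the posted timings, sketched in C; slowdown and print_ratios are hypothetical helper names.

```c
#include <stdio.h>

/* How many times slower `slow` is than `fast` (both in seconds). */
static double slowdown(double slow, double fast)
{
    return slow / fast;
}

/* Print a few ratios using timings from the emulated ati-vga column. */
static void print_ratios(void)
{
    /* dcbz build vs dcba build, non-AltiVec copy from VRAM */
    printf("dcbz vs dcba (copyFromVRAMNoAltivec): %.2fx\n",
           slowdown(6.43, 2.6));
    /* AltiVec vs non-AltiVec copy to VRAM, dcbz build */
    printf("AltiVec vs no AltiVec (copyToVRAM):   %.2fx\n",
           slowdown(12.22, 2.53));
    /* AltiVec vs non-AltiVec copy from VRAM, dcba build */
    printf("AltiVec vs no AltiVec (copyFromVRAM): %.2fx\n",
           slowdown(35.03, 2.6));
}
```

The same helper can be pointed at the vfio-pci column, where every routine is much slower and the difference between the dcbz and dcba builds is comparatively small.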
Regards,
BALATON Zoltan