On Thu, 24 Apr 2025, BALATON Zoltan wrote:
The test case I've used came out of a discussion about very slow
access to the VRAM of a graphics card passed through with vfio. The
reason for that is still not clear, but it was already known that
dcbz is often used by MacOS and AmigaOS for clearing memory and for
avoiding reads of values that are about to be overwritten, which is
faster on a real CPU but was found to be slower on QEMU. The
optimised copy routines were posted here:
https://www.amigans.net/modules/newbb/viewtopic.php?post_id=149123#forumpost149123
and the rest, which I've written to turn it into a test case, is here:
http://zero.eik.bme.hu/~balaton/qemu/vramcopy.tar.xz
Replace the body of has_altivec() with just "return false". Sorry for
only giving pieces, but the code posted above has a copyright that
does not allow me to include it in the test. This does not measure
VRAM access now, just memory copy, but it shows the effect of dcbz.
I got these results with this patch:

Linux user master:                  Linux user patch:
byte loop: 2.2 sec                  byte loop: 2.2 sec
memcpy: 2.19 sec                    memcpy: 2.19 sec
copyToVRAMNoAltivec: 1.7 sec        copyToVRAMNoAltivec: 1.71 sec
copyToVRAMAltivec: 2.13 sec         copyToVRAMAltivec: 2.12 sec
copyFromVRAMNoAltivec: 5.11 sec     copyFromVRAMNoAltivec: 2.79 sec
copyFromVRAMAltivec: 5.87 sec       copyFromVRAMAltivec: 3.26 sec

Linux system master:                Linux system patch:
byte loop: 5.86 sec                 byte loop: 5.9 sec
memcpy: 5.45 sec                    memcpy: 5.47 sec
copyToVRAMNoAltivec: 2.51 sec       copyToVRAMNoAltivec: 2.53 sec
copyToVRAMAltivec: 3.84 sec         copyToVRAMAltivec: 3.85 sec
copyFromVRAMNoAltivec: 6.11 sec     copyFromVRAMNoAltivec: 3.92 sec
copyFromVRAMAltivec: 7.22 sec       copyFromVRAMAltivec: 5.51 sec
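
For context, the dcbz trick those copy routines rely on boils down to
something like the sketch below (my own code, not the copyrighted
routines from the forum post): zero-allocate each destination cache
line with dcbz right before storing to it, so the CPU never fetches
the old contents it is about to overwrite. This assumes 32-byte cache
lines and a cacheable, line-aligned destination.

#include <stdint.h>
#include <string.h>

#define CACHE_LINE 32  /* assumption: G3/G4-style 32-byte cache lines */

static void copy_dcbz(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    while (len >= CACHE_LINE) {
        /* Zero-allocate the destination line so no line fill is needed. */
        __asm__ volatile ("dcbz 0,%0" : : "r"(d) : "memory");
        /* Real routines use unrolled word or AltiVec stores here. */
        memcpy(d, s, CACHE_LINE);
        d += CACHE_LINE;
        s += CACHE_LINE;
        len -= CACHE_LINE;
    }
    if (len) {
        memcpy(d, s, len);
    }
}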

I did some more benchmarking to identify what slows it down. I noticed that memset uses dcbz too, so I added a test for that. I've also added a parameter to allow testing actual VRAM, and now that I have a card working with vfio-pci passthrough I could test that as well. The updated vramcopy.tar.xz is at the same URL as above. These tests were run with the amigaone machine under Linux, booted as described here:
https://www.qemu.org/docs/master/system/ppc/amigang.html
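
For those who want to reproduce the VRAM case: one way to get a user
space mapping of the framebuffer at physical address 0x80800000 is via
/dev/mem, for example as below (a simplified sketch, not necessarily
what the benchmark parameter does; the size is just a placeholder, and
on 32-bit it needs -D_FILE_OFFSET_BITS=64 so the offset fits in off_t).

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define VRAM_PHYS 0x80800000UL
#define VRAM_SIZE (16 * 1024 * 1024)  /* assumption: 16 MB test window */

static void *map_vram(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open /dev/mem");
        return NULL;
    }
    void *p = mmap(NULL, VRAM_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, VRAM_PHYS);
    close(fd);  /* the mapping stays valid after closing the fd */
    return p == MAP_FAILED ? NULL : p;
}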

I compiled the benchmark twice, once as in the tarball and once replacing dcbz in the copyFromVRAM* routines with dcba (which is a no-op on QEMU). The first two result sets are with both src and dst in RAM; the second two are with dst in VRAM (mapped from physical address 0x80800000, where the card's framebuffer is mapped). The left column shows results with the emulated ati-vga as in the amigang.html docs. The right column is with a real ATI X550 card (old and slow, but it works with this old PPC Linux) passed through with vfio-pci.
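
Schematically, the dcbz/dcba swap amounts to something like this (just
to show the idea, not the actual build mechanism used for the two
binaries):

#ifdef USE_DCBA
/* dcba is only a hint and QEMU treats it as a no-op, so this build
 * shows the cost of the copy without the dcbz fast path. */
#  define CACHE_PREP(p) __asm__ volatile ("dcba 0,%0" : : "r"(p) : "memory")
#else
/* dcbz zero-allocates the cache line, avoiding the line fill on a
 * write miss. */
#  define CACHE_PREP(p) __asm__ volatile ("dcbz 0,%0" : : "r"(p) : "memory")
#endif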

with ati-vga                          | with vfio-pci

src 0xb79c8008 dst 0xb78c7008         | src 0xb7c92008 dst 0xb7b91008
byte loop: 21.16 sec                  | byte loop: 21.16 sec
memset: 3.85 sec                      | memset: 3.87 sec
memcpy: 5.07 sec                      | memcpy: 5.07 sec
copyToVRAMNoAltivec: 2.52 sec         | copyToVRAMNoAltivec: 2.53 sec
copyToVRAMAltivec: 2.42 sec           | copyToVRAMAltivec: 2.37 sec
copyFromVRAMNoAltivec: 6.39 sec       | copyFromVRAMNoAltivec: 6.38 sec
copyFromVRAMAltivec: 7.02 sec         | copyFromVRAMAltivec: 7 sec

using dcba instead of dcbz            | using dcba instead of dcbz
src 0xb7b69008 dst 0xb7a68008         | src 0xb7c44008 dst 0xb7b43008
byte loop: 21.14 sec                  | byte loop: 21.14 sec
memset: 3.85 sec                      | memset: 3.88 sec
memcpy: 5.06 sec                      | memcpy: 5.07 sec
copyToVRAMNoAltivec: 2.53 sec         | copyToVRAMNoAltivec: 2.52 sec
copyToVRAMAltivec: 2.3 sec            | copyToVRAMAltivec: 2.3 sec
copyFromVRAMNoAltivec: 2.59 sec       | copyFromVRAMNoAltivec: 2.59 sec
copyFromVRAMAltivec: 2.95 sec         | copyFromVRAMAltivec: 2.95 sec

dst in emulated ati-vga               | dst in real card vfio vram
mapping 0x80800000                    | mapping 0x80800000
src 0xb78e0008 dst 0xb77de000         | src 0xb7ec5008 dst 0xb7dc3000
byte loop: 21.2 sec                   | byte loop: 563.98 sec
memset: 3.89 sec                      | memset: 39.25 sec
memcpy: 5.07 sec                      | memcpy: 140.49 sec
copyToVRAMNoAltivec: 2.53 sec         | copyToVRAMNoAltivec: 72.03 sec
copyToVRAMAltivec: 12.22 sec          | copyToVRAMAltivec: 78.12 sec
copyFromVRAMNoAltivec: 6.43 sec       | copyFromVRAMNoAltivec: 728.52 sec
copyFromVRAMAltivec: 35.33 sec        | copyFromVRAMAltivec: 754.95 sec

dst in emulated ati-vga using dcba    | dst in real card vfio vram using dcba
mapping 0x80800000                    | mapping 0x80800000
src 0xb7ba7008 dst 0xb7aa5000         | src 0xb77f4008 dst 0xb76f2000
byte loop: 21.15 sec                  | byte loop: 577.42 sec
memset: 3.85 sec                      | memset: 39.52 sec
memcpy: 5.06 sec                      | memcpy: 142.8 sec
copyToVRAMNoAltivec: 2.53 sec         | copyToVRAMNoAltivec: 71.71 sec
copyToVRAMAltivec: 12.2 sec           | copyToVRAMAltivec: 78.09 sec
copyFromVRAMNoAltivec: 2.6 sec        | copyFromVRAMNoAltivec: 727.23 sec
copyFromVRAMAltivec: 35.03 sec        | copyFromVRAMAltivec: 753.15 sec

The results show that dcbz has some effect, but an even bigger slowdown is caused by using AltiVec, which is supposed to do wider accesses to reduce the overhead but maybe is not translated to host vector instructions efficiently. The host in the above test was an Intel i7-9700K. So to solve this, maybe AltiVec should be improved even more than dcbz, but I don't know what to change or how.
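
By wider accesses I mean that the AltiVec variants move 16 bytes per
load/store with lvx/stvx, roughly like the sketch below (my own code,
not the posted routines), so each iteration should in theory do a
quarter of the memory operations of a word-sized loop:

/* Compile with -maltivec.  Assumes dst, src and len are all multiples
 * of 16 bytes. */
#include <altivec.h>
#include <stddef.h>

static void copy_altivec(unsigned char *dst, const unsigned char *src,
                         size_t len)
{
    for (size_t i = 0; i < len; i += 16) {
        vector unsigned char v = vec_ld(i, src);  /* lvx  */
        vec_st(v, i, dst);                        /* stvx */
    }
}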

Regards,
BALATON Zoltan
