On Tue, 31 Jan 2023, Akihiko Odaki wrote:
On 2023/01/31 8:58, BALATON Zoltan wrote:
On Sat, 28 Jan 2023, Akihiko Odaki wrote:
On 2023/01/23 8:28, BALATON Zoltan wrote:
On Thu, 19 Jan 2023, Akihiko Odaki wrote:
On 2023/01/15 3:11, BALATON Zoltan wrote:
On Sat, 14 Jan 2023, Akihiko Odaki wrote:
On 2023/01/13 22:43, BALATON Zoltan wrote:
On Thu, 5 Jan 2023, BALATON Zoltan wrote:
Hello,
I got reports from several users trying to run AmigaOS4 on sam460ex
on Apple silicon Macs that they get missing graphics that I can't
reproduce on x86_64. With help from the users who get the problem
we've narrowed it down to the following:
It looks like data written to the sm501's RAM in
qemu/hw/display/sm501.c::sm501_2d_operation() is then not seen by
sm501_update_display() in the same file. The sm501_2d_operation()
function is called when the guest accesses the emulated card, so it
may run in a different thread than sm501_update_display(), which is
called by the ui backend, but I'm not sure how QEMU calls these. Does
device code run in the iothread and the display update in the main thread?
The problem is also independent of the display backend and was
reproduced with both -display cocoa and -display sdl.
We have confirmed it's not the pixman routines that
sm501_2d_operation() uses, as the same issue is also seen with QEMU
4.x where pixman wasn't used, and with all versions up to 7.2, so it's
also not some bisectable change in QEMU. It also happens with
--enable-debug, so it doesn't seem to be related to optimisation
either, and I don't get it on x86_64, but even x86_64 QEMU builds run
on an Apple M1 with Rosetta 2 show the problem. It also only seems to
affect graphics written from sm501_2d_operation(), which AmigaOS4
uses extensively but other OSes don't; they just render graphics with
the vcpu, which works without problem also on the M1 Macs that show
this problem with AmigaOS4. Theoretically this could be some missing
synchronisation, which is something ARM and PPC may need while x86
doesn't, but I don't know if that is really the reason or, if so,
where and how to fix it. Any idea what may cause this and what
could be a fix to try?
Any idea anyone? At least some explanation of whether the above is
plausible, or an option to disable the iothread and run everything in
a single thread to verify the theory, could help. I've got reports
from at least 3 people getting this problem but I can't do much to
fix it without some help.
(Info on how to run it is here:
http://zero.eik.bme.hu/~balaton/qemu/amiga/#amigaos
but AmigaOS4 is not freely distributable so it's a bit hard to
reproduce. Some Linux X servers that support sm501/sm502 may also
use the card's 2d engine but I don't know about any live CDs that
readily run on sam460ex.)
Thank you,
BALATON Zoltan
Sorry, I missed the email.
Indeed the ui backend should call sm501_update_display() in the main
thread, which should be different from the thread calling
sm501_2d_operation(). However, if I understand it correctly, both of
the functions should be called with iothread lock held so there should
be no race condition in theory.
But there is an exception: memory_region_snapshot_and_clear_dirty()
releases iothread lock, and that broke raspi3b display device:
https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=wn+k8dqneb_...@mail.gmail.com/T/
It is unexpected that the gfx_update() callback releases the iothread
lock, so it may break things in peculiar ways.
Peter, is there any change in the situation regarding the race
introduced by memory_region_snapshot_and_clear_dirty()?
For now, to work around the issue, I think you can create another mutex
and make the entirety of sm501_2d_engine_write() and sm501_update_display()
critical sections.
Interesting thread, but I'm not sure it's the same problem, so this
workaround may not be enough to fix my issue. Here's a video posted by
one of the people who reported it, showing the problem on an M1 Mac:
https://www.youtube.com/watch?v=FDqoNbp6PQs
and here's how it looks on other machines:
https://www.youtube.com/watch?v=ML7-F4HNFKQ
There are also videos showing it running on an RPi 4 and a G5 Mac without
this issue, so it seems to only happen on Apple Silicon M1 Macs. What's
strange is that graphics elements are not just delayed, which is what I'd
expect with missing thread synchronisation: the update callback would miss
some pixels rendered while it's running, but subsequent update callbacks
would eventually draw those, wouldn't they?
Also, setting full_update to 1 in the sm501_update_display() callback to
disable dirty tracking does not fix the problem. So it looks as if
sm501_2d_operation() running on one CPU core only writes data to the
local cache of that core, which sm501_update_display() running on another
core can't see. Maybe some cache synchronisation is needed in
memory_region_set_dirty(), or if that's already there, maybe I should
call it for all changes, not only those in the visible display area? I'm
still not sure I understand the problem and don't know what could be a
fix for it, so anything to test to identify the issue better might also
bring us closer to a solution.
Regards,
BALATON Zoltan
If you set full_update to 1, you may also comment out
memory_region_snapshot_and_clear_dirty() and
memory_region_snapshot_get_dirty() to avoid the iothread mutex being
unlocked. The iothread mutex should ensure cache coherency as well.
But as you say, it's weird that the rendered result is not just delayed
but missed. That may imply other possibilities (e.g., the results are
overwritten by someone else). If the problem persists after commenting
out memory_region_snapshot_and_clear_dirty() and
memory_region_snapshot_get_dirty(), I think you can assume the
inter-thread coherency between sm501_2d_operation() and
sm501_update_display() is not causing the problem.
I've asked the people who reported this and can reproduce it to test
that, but it did not change anything, which confirms it's not that race
condition; it looks more like some cache inconsistency. Any other ideas?
Regards,
BALATON Zoltan
I can come up with two important differences between x86 and Arm which can
affect the execution of QEMU:
1. Memory model. Arm uses a memory model more relaxed than x86's, so it is
more sensitive to synchronization failures among threads.
2. Different instructions. TCG uses JIT, so differences in instructions
matter.
We should be able to exclude 1) as a potential cause of the problem. The
iothread mutex should take care of race conditions and even cache coherency
problems; a mutex includes memory barrier functionality.
Where is this barrier in the QEMU code? Does it also ensure cache coherency
between different cores, or only memory synchronisation within one core?
From the testing I suspect it's probably not because of the weak ordering
of ARM but something to do with different threads writing and reading the
memory area. Is there a way to disable the separate vcpu thread and run
everything in a single thread to verify this theory? (We only have one
vcpu so it's not an MTTCG issue, but something between the vcpu and main
thread maybe.)
QEMU uses pthread_mutex for macOS, and pthread_mutex (or any sane mutex
implementation for SMP systems) should also ensure memory synchronization
across different cores.
That said, it is still possible that we are missing something that prevents
memory synchronization. Ideally the theory should be confirmed by
experiments, but that is not easy on a Mac.
The easiest option is to run QEMU/sam460ex on Linux on QEMU/hvf. Running the
entire Linux system without the -smp option may be too slow, so you may use
the taskset command on Linux to pin the QEMU/sam460ex process to a particular
vCPU. This is somewhat incomplete, as virtualization interferes with caches
and may hide problems or trigger other bugs. The difference in operating
systems is also a concern.
Another option is to use taskset command on Asahi Linux. Installing Asahi
Linux is easy, but uninstalling it is a bit complicated.
The m1n1 hypervisor from the Asahi Linux project allows restricting which
CPUs are used, and I think it also allows changing the memory model to x86
TSO. Unlike QEMU/hvf on macOS, it is very minimalistic, so its interference
with e.g. caches is limited. It is very useful for debugging XNU or Linux,
but hard to set up and requires another computer to control it.
Finally, you can patch XNU kernel, but this is obviously not easy.
Yes, that is getting too difficult. I don't have an M1 Mac myself, so I rely
on the users who reported the problem to test, which limits this to something
simple like trying a QEMU option to disable threads. At least in the past
this was possible but I don't know how to do it now. Is anybody else reading
this thread, or should I ask that separately so somebody who knows the
answer notices?
For difference 2), you may try to use TCI. You can find details of TCI in
tcg/tci/README.
This was tested, and with TCI we got the same results, just much slower.
Common sense tells us, however, that the memory model is usually the cause
of the problem when you see behavioral differences between x86 and Arm, and
TCG should work fine on both x86 and Arm as both should have been tested
well.
It's not only between x86 and ARM but also between different ARM CPUs, it
seems, as there are videos of this test case running on a Raspberry Pi 4
while all QEMU versions failed on an Apple M1, so maybe it's something
specific to that CPU.
It is likely that the combination of Apple's microarchitecture and the Arm
instruction set causes the problem. For example, even though the memory model
of Arm is weaker than that of x86, such a difference may not surface depending
on the design of the load/store unit or the size of the load/store buffers.
Fortunately, macOS provides Rosetta 2 for x86 emulation on the Apple M1,
which makes it possible to compare x86 and Arm without the difference in
microarchitecture being a concern.
We've tried that before, and even running x86 QEMU on the M1 with Rosetta 2
gave the same result, so it's probably not something about the code itself
but how it's being run by that CPU. I just don't see how this can fail while
it works elsewhere. The ati-vga device has similar code and other guest OSes
can use it, so I've asked for more testing with that; maybe it will reveal
some more details. One difference might be that the sm501 driver in AmigaOS
uses a 16-bit display, so it may also be something with converting bit
depths, but I'm not sure.
Regards,
BALATON Zoltan