Investigating https://bugs.launchpad.net/qemu/+bug/1777777 I found what seems to be a design flaw in the ptimer code.
The ptimer API is basically a down-counter (with a lot of bells and whistles), with functions to do things like "read current count" and "write current count". When the counter hits zero it typically reloads, and the ptimer code notifies a callback function in the device that's using ptimer. In the current implementation this is done using a QEMU bottom-half handler, so ptimer_trigger() calls replay_bh_schedule_event() on a QEMUBH that it was passed by the device when it was initialized. The comment says this is "to avoid reentrancy issues". That's a bit vague, but I assume the idea is to avoid the code of the device's 'triggered' callback being called from within eg ptimer_set_count(), because the device might be in the middle of changing multiple parts of the ptimer's state, its own state might not be consistent, etc. Unfortunately, use of the bottom-half mechanism might have worked back when the ptimer was first written, but these days we don't run the iothread in lockstep with a single vcpu thread, so you can get races, like: * the QEMU timer ptimer uses calls the ptimer_tick() function * ptimer code updates its own state for the counter rollover, and schedules the bottom-half handler to run * the vcpu thread executes guest code that does "read the counter value", which calls ptimer_get_count() and gets the new rolled-over value * the vcpu thread executes guest code that does "read an interrupt-pending register", which looks at a bit of state that's updated by the device's bottom-half handler, and incorrectly gets a value indicating "no rollover happened" * ...then the bottom-half handler runs and the device code updates its interrupt state, but too late. I'm not sure what the best fix here is, but it's hard to see how we can really continue to use bottom-half handlers here. Possibilities I thought of: (1) make ptimer_trigger() just directly call the device's callback function. We'd need to audit all the devices to fix them up to handle the cases where the callback gets called while the device is in the middle of reconfiguring the timer. (2) call the device's callback function directly when the ptimer triggers from the QEMU timer expiry. But for the case of "a call to ptimer_set_count() etc caused the timer to trigger", don't call the callback, instead return a boolean from those functions which tells the caller that the timer triggered, and they need to deal with it (by calling their callback function when they've finished messing with the timer). In either case we would need to gradually convert all (~30) of the devices currently using ptimer over to use the new mechanism, which in all cases would require some examination and modification of the timer code to deal with the new semantics. (I'm thinking of a patch series structure along the lines of "patch 1 renames ptimer_init to ptimer_init_bh, patch 2 introduces new ptimer_init function with new semantics, patches 3..n change one device at a time, patch n+1 deletes the now-unused ptimer_init_bh".) I think overall I favour option 2, which is a bit more syntactically invasive in terms of changing API signatures etc, but semantically easier to reason about (because the callback-function in the device is still not called when the device might be partway through doing an update to the ptimer state that changes multiple parameters of the ptimer). Is there another cleverer fix that I haven't thought of? thanks -- PMM