Dave Bort <[email protected]> writes:

> We use qemu (4.0.0, about to flip the switch to 5.0.0) to test our aarch64
> images, running in linux containers on x86_64 alongside other workloads.
>
> We've recently run into issues where it looks like an emulated CPU (out of
> four) sometimes stops making progress for ten or more seconds, and we're
> trying to characterize the problem. When this
> happens, the other emulated CPUs run just fine, though sometimes two will
> stall out at the same time.
>
> Any suggestions for how to tell if an emulated CPU stopped doing work?
>
> Based on our experiments, the guest-visible clocks and cycle counters
> continue to run when a qemu CPU thread is suspended, so it's hard to tell
> whether the emulation paused, or if our code is
> spinning with interrupts disabled (though evidence is mounting that that's
> not the case). We're adding a bunch more instrumentation to our code, but
> maybe qemu has some features that will help
> us out.
>
> I tried to find a way to count the number of TBs executed by an
> emulated core over time, but I didn't see a cheap way to do that with
> the plugin APIs.

It should be pretty cheap to do. You just need to extend the example bb
plugin to take cpu_index into account and do the proper locking to
update the instruction counter in vcpu_tb_exec. The
qemu_plugin_register_vcpu_idle_cb and
qemu_plugin_register_vcpu_resume_cb functions allow you to register
callbacks for every time we exit the main run loop and sleep for
whatever reason. You could even dump the total instruction counts there.

> We could maybe turn on instruction tracing, but this problem happens pretty
> rarely (<1%), we don't have a repro case yet, and we can't really afford the
> cost of slowing down every test run.
>
> There's a decent chance that this is caused by an overloaded host, but our
> host-side investigations haven't turned up anything concrete either.
>
> Any advice?
>
> --dbort

--
Alex Bennée
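
For illustration, here is a rough sketch of the per-vCPU bookkeeping suggested above. It is not the bb plugin itself and deliberately leaves out qemu-plugin.h so it stands alone. MAX_VCPUS, VcpuStats, count_tb_exec, count_idle and format_stats are made-up names for this sketch. It also swaps the "proper locking" for one atomic slot per vCPU, which is only one way to keep the hot path cheap.

```c
#include <inttypes.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption for this sketch: a small fixed upper bound on guest CPUs. */
#define MAX_VCPUS 8

/* One slot per vCPU, so vCPU threads never contend on a shared lock. */
typedef struct {
    atomic_uint_fast64_t tb_count;   /* translation blocks executed */
    atomic_uint_fast64_t insn_count; /* guest instructions executed */
    atomic_uint_fast64_t idle_count; /* times the vCPU went idle    */
} VcpuStats;

static VcpuStats stats[MAX_VCPUS];

/*
 * What a vcpu_tb_exec callback body could do. In a real plugin the
 * callback receives cpu_index, and the block's instruction count would
 * have to be captured at translation time and passed in as user data.
 */
static void count_tb_exec(unsigned int cpu_index, uint64_t n_insns)
{
    if (cpu_index >= MAX_VCPUS) {
        return; /* ignore vCPUs beyond the assumed bound */
    }
    atomic_fetch_add_explicit(&stats[cpu_index].tb_count, 1,
                              memory_order_relaxed);
    atomic_fetch_add_explicit(&stats[cpu_index].insn_count, n_insns,
                              memory_order_relaxed);
}

/* What an idle callback body could do: note that this vCPU went to sleep. */
static void count_idle(unsigned int cpu_index)
{
    if (cpu_index >= MAX_VCPUS) {
        return;
    }
    atomic_fetch_add_explicit(&stats[cpu_index].idle_count, 1,
                              memory_order_relaxed);
}

/* Format one vCPU's totals, e.g. to dump them from the idle/resume hooks. */
static int format_stats(unsigned int cpu_index, char *buf, size_t len)
{
    if (cpu_index >= MAX_VCPUS) {
        return -1;
    }
    return snprintf(buf, len,
                    "cpu %u: tbs=%" PRIu64 " insns=%" PRIu64
                    " idles=%" PRIu64 "\n",
                    cpu_index,
                    (uint64_t)atomic_load(&stats[cpu_index].tb_count),
                    (uint64_t)atomic_load(&stats[cpu_index].insn_count),
                    (uint64_t)atomic_load(&stats[cpu_index].idle_count));
}
```

In an actual plugin these bodies would hang off the TB-exec, idle and resume registrations. If you sample the counters periodically, a vCPU whose tb_count stops advancing without a matching idle event is a candidate for a stalled thread rather than a sleeping guest CPU.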
