On Fri, Mar 21, 2025 at 11:17 AM Miles Glenn <[email protected]> wrote:
>
> On Thu, 2025-03-20 at 16:09 -0400, Stefan Hajnoczi wrote:
> > On Thu, Mar 20, 2025 at 12:34 PM Miles Glenn <[email protected]> wrote:
> > > Hello,
> > >
> > > I am attempting to simulate a system with multiple CPU
> > > architectures. To do this I am starting a unique QEMU process for each
> > > CPU architecture that is needed. I'm also developing some QEMU code
> > > that aids in transporting MMIO transactions across the process
> > > boundaries using sockets.
> >
> > I have CCed Phil. He has been working on heterogenous target emulation
> > and might be interested.
> >
> > > The design takes MMIO request messages off of a socket, services the
> > > request by calling address_space_ldq_be(), then sends a response
> > > message (containing the requested data) over the same
> > > socket. Currently, this is all done inside the socket IOReadHandler
> > > callback function.
> >
> > At a high level this is similar to the vfio-user feature where a PCI
> > device is emulated in a separate process. This also involves sending
> > messages describing QEMU's MemoryRegion accesses. See the "remote"
> > machine type in QEMU to look at the code.
> >
> > > This works as long as the targeted register exists in the same QEMU
> > > process that received the request. However, If the register exists in
> > > another QEMU process, then the call to address_space_ldq_be() results
> > > in another socket message being sent to that QEMU process, requesting
> > > the data, and then waiting (blocking) for the response message
> > > containing the data. In other words, it ends up blocking inside the
> > > event handler and even though the QEMU process containing the target
> > > register was able to receive the request and send the response, the
> > > originator of the request is unable to receive the response until it
> > > eventually times out and stops blocking. Once it times out and stops
> > > blocking, it does receive the response, but now it is too late.
> > >
> > > Here's a summary of the stack up to where the code blocks:
> > >
> > > IOReadHandler callback
> > > calls address_space_ldq_be()
> > > resolves to mmio read op of a remote device
> > > sends request over socket and waits (blocks) for response
> > >
> > > So, I'm looking for a way to handle the work of calling
> > > address_space_ldq_be(), which might block when attempting to read a
> > > register of a remote device, without blocking inside the IOReadHandler
> > > callback context.
> > >
> > > I've done a lot of searches and reading about how to do this on the web
> > > and in the QEMU code but it's still not really clear to me how this
> > > should be done in QEMU. I've seen a lot about using coroutines to
> > > handle cases like this. Is that what I should be using here?
> >
> > The fundamental problem is that address_space_ldq_be() is synchronous,
> > so there is no way to return back to the caller until the response has
> > been received.
> >
> > vfio-user didn't solve this problem. It simply blocks until the
> > response is received, but it does drop the Big QEMU Lock during this
> > time so that other vCPU threads can run. For example, see
> > hw/remote/proxy.c:send_bar_access_msg() and
> > mpqemu_msg_send_and_await_reply().
> >
> > QEMU supports nested event loops, but they come with their own set of
> > gotchas. The way a nested event loop might help here is to send the
> > request and then call aio_poll() to receive the response in another
> > IOReadHandler. This way other event loop processing can take place
> > while waiting in address_space_ldq_be().
> >
> > The second problem is that this approach where QEMU processes send
> > requests to each other needs to be implemented carefully to avoid
> > deadlocks. For example, devices that do DMA could load/store memory
> > belonging to another device handled by another QEMU. Once there is an
> > A -> B -> A situation it could deadlock.
> >
> > Both vfio-user and vhost-user have similar issues with their
> > bi-directional communication where a device emulation process can send
> > a message to QEMU while processing a message from QEMU. Deadlock can
> > be avoided if the code is structured so that QEMU is able to receive
> > new requests during the time when it is waiting for a response.
> >
> > Stefan
>
> Stefan, Thank you for the quick response and great information!
>
> I'm not sure if this is the best way, but I was able to get things
> working today using the coroutine approach.
>
> Now, the aforementioned stack looks like this:
>
> IOReadHandler callback receives request
> enters coroutine
> calls address_space_ldq_be()
> resolves to mmio read op of a remote device
> sends request over socket
> detects coroutine context and
> calls qemu_coroutine_yield() instead of blocking
> returns to callback
>
> <time passes>
>
> IOReadHandler callback receives response
> re-enters coroutine
> mmio read op returns data received in response message
> address_space_ldq_be() returns
> coroutine completes and returns to callback
>
> While this works, I couldn't help but notice that the coroutine concept
> seems to be like a form of multithreading. Is there some advantage to
> using coroutines over doing the work in another thread? Does QEMU
> offer an interface that allows for a callback to queue up work that can
> be handled by another thread or a pool of threads?

Coroutines make it easier to write concurrent code in an event loop. The
alternative is to write asynchronous callback functions, which is
tedious for sequences with multiple steps that need to wait for I/O.

Coroutines do not offer parallelism, so they are not a replacement for
multi-threading. QEMU is mostly event-driven rather than multi-threaded.
Usually only computation in QEMU that really needs its own CPU runs in
its own thread (vCPUs, compression, blocking syscalls when there is no
alternative, etc).

There are advantages to using coroutines: less synchronization is
necessary than with threads (you can be sure no other coroutine will run
in the same thread while your code is running) and this eliminates most
thread-safety issues. Also, event loops are seen as more scalable than
threads (lots of historical resources, for example
http://www.kegel.com/c10k.html).

One QEMU-specific advantage of coroutines: coroutine code has access to
all of QEMU's APIs that require the event loop whereas threads need to
take extra steps to interact with the rest of QEMU.

Stefan
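
To make the "concurrent but not parallel" point concrete, here is a
minimal sketch of the yield-instead-of-block pattern discussed in this
thread. It is not QEMU code: it uses Python generators as stand-in
coroutines, and every name in it (ToyLoop, handle_request, remote_read,
the message tuples, the "reply is addr * 2" fake peer) is hypothetical
and chosen only for illustration. The point it tries to show is that a
second request is serviced while the first one is still waiting for its
response, all in one thread.

```python
from collections import deque


class ToyLoop:
    """Toy single-threaded event loop (illustrative only, not QEMU's)."""

    def __init__(self):
        self.inbox = deque()   # pending (kind, req_id, payload) messages
        self.waiting = {}      # req_id -> coroutine suspended on that reply
        self.log = []          # record of what ran, in order

    def post(self, msg):
        self.inbox.append(msg)

    def _step(self, co, value):
        """Run a coroutine until it yields (waits) or finishes."""
        try:
            wait_id = co.send(value)
            self.waiting[wait_id] = co   # park it until the reply arrives
        except StopIteration:
            pass                         # coroutine ran to completion

    def run(self):
        # Plays the role of the read handler: dispatch one message at a time.
        while self.inbox:
            kind, req_id, payload = self.inbox.popleft()
            if kind == "request":
                self._step(handle_request(self, req_id, payload), None)
            else:  # "response": resume whoever was waiting for it
                self._step(self.waiting.pop(req_id), payload)


def remote_read(loop, req_id, addr):
    """Stand-in for an MMIO read that must be forwarded to a remote peer."""
    loop.log.append(("sent", req_id))
    # Fake peer: its reply (addr * 2) shows up in the inbox later.
    loop.post(("response", req_id, addr * 2))
    data = yield req_id    # suspend here instead of blocking the loop
    return data


def handle_request(loop, req_id, addr):
    loop.log.append(("recv", req_id))
    data = yield from remote_read(loop, req_id, addr)
    loop.log.append(("done", req_id, data))


def demo():
    loop = ToyLoop()
    loop.post(("request", 1, 0x10))
    loop.post(("request", 2, 0x20))
    loop.run()
    return loop.log


if __name__ == "__main__":
    for entry in demo():
        print(entry)
```

In this sketch request 2 is received and forwarded before request 1
completes, yet no locking is needed because only one coroutine runs at
a time. A thread-based version would get the same interleaving but
would need to protect the shared state (here, the log and the waiting
table) explicitly.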
