We've implemented a block driver that exposes storage to QEMU VMs. Our block driver (O) interposes on writes to another storage backend (B): O performs low-latency replication and then asynchronously issues the write to the backing block driver, B, using bdrv_aio_writev(). Our problem is that the write latencies seen by the workload in the guest should be those imposed by O plus the guest I/O and QEMU stack (around 25us total, based on our measurements), but we're actually seeing much higher latencies (around 120us). We suspect this is because the backing block driver B's coroutines block our coroutines.
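
To make the structure concrete, here is a stripped-down sketch of O's write path. It is only a sketch: OWriteState, o_write_state_new(), repl_write_async(), o_repl_done() and o_backing_done() are placeholders for our own code and transport library, not QEMU APIs, and the exact bdrv_aio_writev()/bs->file types and signatures differ between QEMU versions.

#include "qemu/osdep.h"
#include "block/block_int.h"

/* Simplified sketch of O's .bdrv_aio_writev implementation; error handling
 * and reference counting are omitted. */
static BlockAIOCB *o_aio_writev(BlockDriverState *bs, int64_t sector_num,
                                QEMUIOVector *qiov, int nb_sectors,
                                BlockCompletionFunc *cb, void *opaque)
{
    /* Per-request state (placeholder type); copies the guest iovecs so they
     * outlive the guest request -- freed in o_backing_done(), step 3.a. */
    OWriteState *ws = o_write_state_new(bs, qiov, cb, opaque);

    /* Step 2: replicate to the fast device via our transport library.
     * o_repl_done() fires later in a transport-library thread and
     * acknowledges the write to the guest (steps 2.a/2.b). */
    repl_write_async(ws->repl_conn, qiov, sector_num, nb_sectors,
                     o_repl_done, ws);

    /* Step 3: forward the write to the backing driver B. o_backing_done()
     * (step 3.a) only frees the copied iovecs. */
    bdrv_aio_writev(bs->file->bs, sector_num, &ws->qiov_copy, nb_sectors,
                    o_backing_done, ws);

    return &ws->common;
}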
The sequence of events is as follows (see diagram: https://docs.google.com/drawings/d/12h1QbecvxzlKxSFvGKYAzvAJ18kTW6AVTwDR6VA8hkw/pub?w=576&h=565 ):

1. The write is issued to our block driver O through QEMU's asynchronous block driver interface.

2. The write is replicated to a fast device asynchronously.

2.a. In a different thread, the fast device invokes a callback on completion, which schedules a coroutine to run in the QEMU iothread; that coroutine acknowledges completion of the write to the guest OS (see the P.S. below for a sketch of how we do this).

2.b. The coroutine scheduled in (2.a) is executed.

3. The write is issued asynchronously to the backing block driver, B.

3.a. The backing block driver B invokes the completion function we supplied, which frees any memory associated with the write (e.g. copies of I/O vectors).

Steps (1), (2), and (3) are performed in the same coroutine (our driver's bdrv_aio_writev() implementation). (2.a) runs in a thread belonging to our transport library, which O links against; (2.b) and (3.a) run as coroutines in the QEMU iothread.

We've tried improving performance by using separate iothreads for the two devices (the P.P.S. shows the kind of configuration we mean), but that only lowered the latency to around 100us and caused stability issues. What's the best way to create a separate iothread for the backing driver to do all of its work in?

-Adrian
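
P.S. In case it helps clarify (2.a)/(2.b), this is roughly how the ack gets bounced from the transport-library thread into the QEMU iothread. Again only a sketch: OWriteState and the o_* functions are our placeholders, and the coroutine create/enter signatures vary slightly across QEMU versions; the QEMU primitives used are aio_bh_new()/qemu_bh_schedule() and the coroutine API.

static void o_ack_bh(void *opaque);
static void coroutine_fn o_ack_co(void *opaque);

/* Step 2.a: runs in a transport-library thread once the fast device has
 * persisted the write. qemu_bh_schedule() is safe to call from outside the
 * iothread, so we use a bottom half to hop into the AioContext that owns
 * the device. */
static void o_repl_done(void *opaque, int ret)
{
    OWriteState *ws = opaque;

    ws->ack_ret = ret;
    ws->ack_bh = aio_bh_new(bdrv_get_aio_context(ws->bs), o_ack_bh, ws);
    qemu_bh_schedule(ws->ack_bh);
}

/* Step 2.b, first half: runs in the QEMU iothread; spawn the ack coroutine. */
static void o_ack_bh(void *opaque)
{
    OWriteState *ws = opaque;

    qemu_bh_delete(ws->ack_bh);
    qemu_coroutine_enter(qemu_coroutine_create(o_ack_co, ws));
}

/* Step 2.b, second half: acknowledge completion of the write to the guest. */
static void coroutine_fn o_ack_co(void *opaque)
{
    OWriteState *ws = opaque;

    ws->common.cb(ws->common.opaque, ws->ack_ret);
}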
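
P.P.S. For reference, by "separate iothreads for the two devices" we mean the usual dataplane-style configuration along these lines (the ids and drive names are made up for illustration; drive0 is the drive backed by our driver O):

-object iothread,id=iothread0 \
-device virtio-blk-pci,drive=drive0,iothread=iothread0 \
-object iothread,id=iothread1 \
-device virtio-blk-pci,drive=drive1,iothread=iothread1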