On Wed, 02/08 11:00, Adrian Suarez wrote:
> > Do you only start submitting request to B (step 3) after the fast device
> > I/O completes (step 2.a)? The fact that they are serialized incurs extra
> > latency. Have you tried to do 2 and 3 in parallel with AIO?
>
> In step 2, we perform an asynchronous call to the fast device, supplying a
> callback that calls aio_bh_schedule_oneshot() to schedule the completion in
> the AioContext of the block driver. Step 3 uses bdrv_aio_writev(), but I'm
> not sure if this is actually causing the write to be performed
> synchronously to the backing device. What I'm expecting is that
> bdrv_aio_writev() issues the write and then yields so that we don't
> serialize all writes to the backing device.
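
If I'm reading that right, the flow is roughly the sketch below. The names
(OWriteState, replication_done(), and so on) are made up for illustration
rather than taken from your driver, and the AIOCB and error handling are
simplified.

#include "qemu/osdep.h"
#include "block/block_int.h"

/* Illustrative per-request state; not the real driver O's structures. */
typedef struct OWriteState {
    BlockDriverState *bs;       /* driver O's BDS */
    QEMUIOVector *qiov;         /* copy of the guest's I/O vector */
    int64_t sector_num;
    int nb_sectors;
    BlockCompletionFunc *cb;    /* guest-side completion */
    void *cb_opaque;
} OWriteState;

/* Step 3.a: backing driver B is done, free the copied vectors. */
static void backing_write_done(void *opaque, int ret)
{
    OWriteState *s = opaque;

    qemu_iovec_destroy(s->qiov);
    g_free(s->qiov);
    g_free(s);
}

/* Steps 2.b and 3: runs in the BDS's AioContext (the QEMU iothread). */
static void replication_done_bh(void *opaque)
{
    OWriteState *s = opaque;

    /* Acknowledge the write to the guest. */
    s->cb(s->cb_opaque, 0);

    /* Only now is the write submitted to the backing driver B. */
    bdrv_aio_writev(s->bs->file->bs, s->sector_num, s->qiov,
                    s->nb_sectors, backing_write_done, s);
}

/* Step 2.a: called by the transport library on its own thread. */
static void replication_done(void *opaque)
{
    OWriteState *s = opaque;

    /* Hop back into the AioContext that owns the BDS. */
    aio_bh_schedule_oneshot(bdrv_get_aio_context(s->bs),
                            replication_done_bh, s);
}
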
OK, what I'm wondering is why call bdrv_aio_writev() in a BH instead of right
away. IOW, have you traced how much time is spent before even calling
bdrv_aio_writev()?

> Thanks,
> Adrian
>
> On Wed, Feb 8, 2017 at 6:30 AM, Fam Zheng <f...@redhat.com> wrote:
>
> > On Wed, 02/08 14:59, Max Reitz wrote:
> > > CC-ing qemu-block, Stefan, Fam.
> > >
> > > On 08.02.2017 03:38, Adrian Suarez wrote:
> > > > We’ve implemented a block driver that exposes storage to QEMU VMs. Our
> > > > block driver (O) is interposing on writes to some other type of
> > > > storage (B). O performs low latency replication and then
> > > > asynchronously issues the write to the backing block driver, B, using
> > > > bdrv_aio_writev(). Our problem is that the write latencies seen by the
> > > > workload in the guest should be those imposed by O plus the guest I/O
> > > > and QEMU stack (around 25us total based on our measurements), but
> > > > we're actually seeing much higher latencies (around 120us). We suspect
> > > > that this is due to the backing block driver B’s coroutines blocking
> > > > our coroutines. The sequence of events is as follows (see diagram:
> > > > https://docs.google.com/drawings/d/12h1QbecvxzlKxSFvGKYAzvAJ18kTW6AVTwDR6VA8hkw/pub?w=576&h=565
> >
> > I cannot open this, so just trying to understand from steps below..
> >
> > > > ):
> > > >
> > > > 1. Write is issued to our block driver O using the asynchronous
> > > > interface for QEMU block driver.
> > > > 2. Write is replicated to a fast device asynchronously.
> > > > 2.a. In a different thread, the fast device invokes a callback on
> > > > completion that causes a coroutine to be scheduled to run in the QEMU
> > > > iothread that acknowledges completion of the write to the guest OS.
> > > > 2.b. The coroutine scheduled in (2.a) is executed.
> > > > 3. Write is issued asynchronously to the backing block driver, B.
> > > > 3.a. The backing block driver, B, invokes the completion function
> > > > supplied by us, which frees any memory associated with the write
> > > > (e.g. copies of IO vectors).
> >
> > Do you only start submitting request to B (step 3) after the fast device
> > I/O completes (step 2.a)? The fact that they are serialized incurs extra
> > latency. Have you tried to do 2 and 3 in parallel with AIO?
> >
> > > > Steps (1), (2), and (3) are performed in the same coroutine (our
> > > > driver's bdrv_aio_writev() implementation). (2.a) is executed in a
> > > > thread that is part of our transport library linked by O, and (2.b)
> > > > and (3.a) are executed as coroutines in the QEMU iothread.
> > > >
> > > > We've tried improving the performance by using separate iothreads for
> > > > the two devices, but this only lowered the latency to around 100us and
> > > > caused stability issues. What's the best way to create a separate
> > > > iothread for the backing driver to do all of its work in?
> > >
> > > I don't think it's possible to use different AioContexts for
> > > BlockDriverStates in the same BDS chain, at least not currently. But
> > > others may know more about this.
> >
> > This may change in the future but currently all the BDSes in a chain need
> > to stay on the same AioContext.
> >
> > Fam
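
For what it's worth, below is a minimal sketch of the "do 2 and 3 in
parallel" idea from above. Again, OParallelWrite and fast_dev_write_async()
are invented stand-ins for your driver's state and transport API, and the
AIOCB and error handling are simplified.

#include "qemu/osdep.h"
#include "block/block_int.h"

/* Stand-in for whatever the transport library exposes; not a real API. */
void fast_dev_write_async(int64_t sector_num, QEMUIOVector *qiov,
                          int nb_sectors, void (*done)(void *), void *opaque);

typedef struct OParallelWrite {
    BlockDriverState *bs;
    QEMUIOVector qiov;          /* private copy of the guest's vector */
    BlockCompletionFunc *cb;    /* guest-side completion */
    void *cb_opaque;
    int pending;                /* replication + backing write in flight */
} OParallelWrite;

/* Both completions run in the BDS's AioContext, so a plain counter is fine. */
static void o_parallel_put(OParallelWrite *s)
{
    if (--s->pending == 0) {
        qemu_iovec_destroy(&s->qiov);
        g_free(s);
    }
}

/* Ack the guest as soon as the replication write is durable (step 2.b). */
static void repl_done_bh(void *opaque)
{
    OParallelWrite *s = opaque;

    s->cb(s->cb_opaque, 0);
    o_parallel_put(s);
}

/* Transport-thread callback (step 2.a). */
static void repl_done(void *opaque)
{
    OParallelWrite *s = opaque;

    aio_bh_schedule_oneshot(bdrv_get_aio_context(s->bs), repl_done_bh, s);
}

/* Backing write completion (error handling omitted for brevity). */
static void backing_done(void *opaque, int ret)
{
    o_parallel_put(opaque);
}

static BlockAIOCB *o_aio_writev(BlockDriverState *bs, int64_t sector_num,
                                QEMUIOVector *qiov, int nb_sectors,
                                BlockCompletionFunc *cb, void *opaque)
{
    OParallelWrite *s = g_new0(OParallelWrite, 1);

    s->bs = bs;
    s->cb = cb;
    s->cb_opaque = opaque;
    s->pending = 2;
    qemu_iovec_init(&s->qiov, qiov->niov);
    qemu_iovec_concat(&s->qiov, qiov, 0, qiov->size);

    /* Step 2: replicate to the fast device. */
    fast_dev_write_async(sector_num, &s->qiov, nb_sectors, repl_done, s);

    /* Step 3, issued immediately so it overlaps the replication I/O.  A real
     * driver would allocate its own AIOCB with qemu_aio_get(); returning B's
     * AIOCB here just keeps the sketch short. */
    return bdrv_aio_writev(bs->file->bs, sector_num, &s->qiov, nb_sectors,
                           backing_done, s);
}

The difference from the current flow is that the backing write no longer
waits for the replication BH; the guest is still acknowledged as soon as the
fast device completes, and the shared state is freed only once both
completions have run.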