On Mon, Sep 26, 2011 at 3:21 PM, Stefan Hajnoczi <stefa...@linux.vnet.ibm.com> wrote:
> On Mon, Sep 26, 2011 at 09:35:01AM -0300, Marcelo Tosatti wrote:
>> On Fri, Sep 23, 2011 at 04:57:26PM +0100, Stefan Hajnoczi wrote:
>> > Here is my generic image streaming branch, which aims to provide a way
>> > to copy the contents of a backing file into an image file of a running
>> > guest without requiring specific support in the various block drivers
>> > (e.g. qcow2, qed, vmdk):
>> >
>> > http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/image-streaming-api
>> >
>> > The tree does not provide full image streaming yet but I'd like to
>> > discuss the approach taken in the code. Here are the main points:
>> >
>> > The image streaming API is available through HMP and QMP commands. When
>> > streaming is started on a block device a coroutine is created to do the
>> > background I/O work. The coroutine can be cancelled.
>> >
>> > While the coroutine copies data from the backing file into the image
>> > file, the guest may be performing I/O to the image file. Guest reads do
>> > not conflict with streaming but guest writes require special handling.
>> > If the guest writes to a region of the image file that we are currently
>> > copying, then there is the potential to clobber the guest write with old
>> > data from the backing file.
>> >
>> > Previously I solved this in a QED-specific way by taking advantage of
>> > the serialization of allocating write requests. In order to do this
>> > generically we need to track in-flight requests and have the ability to
>> > queue I/O. Guest writes that affect an in-flight streaming copy
>> > operation must wait for that operation to complete before being issued.
>> > Streaming copy operations must skip overlapping regions of guest writes.
>> >
>> > One big difference to the QED image streaming implementation is that
>> > this generic implementation is not based on copy-on-read operations.
>> > Instead we do a sequence of bdrv_is_allocated() to find regions for
>> > streaming, followed by bdrv_co_read() and bdrv_co_write() in order to
>> > populate the image file.
>> >
>> > It turns out that generic copy-on-read is not an attractive operation
>> > because it requires using bounce buffers for every request.
>>
>> Isnt COR essential for a decent read performance on the
>> image-stream-from-slow-remote-origin case?
>
> It is essential for re-read performance from a slow backing file. With
> images over internet HTTP it most definitely is worth doing
> copy-on-read.
>
> In the case of an NFS server the performance depends on the network and
> server. It might be similar speed or faster to read from NFS.
>
> I will think some more about how to implement generic copy-on-read.

I've sketched out how generic copy-on-read can work. It's probably not
much extra effort since we need request tracking and the ability to
queue/hold requests anyway. I hope to have patches implementing this by
the end of the week:

1. When CoR is enabled, overlapping requests get queued so that only one
   of them is actually being issued to the host at a time. This prevents
   race conditions where a guest write request is clobbered by a
   copy-on-read. Note that only overlapping requests are queued;
   non-overlapping requests proceed in parallel.

2. The read operation uses bdrv_is_allocated() first to see whether a
   copy-on-read needs to be performed or whether we can go down the fast
   path. The fast path is the normal read straight into the guest
   buffer. The copy-on-read path reads into a bounce buffer, writes
   into the image file, and then copies the bounce buffer into the guest
   buffer.

3. The .bdrv_is_allocated() implementations will be audited and improved
   to make them aio/coroutine-friendly where necessary.

Stefan
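
To make points 1 and 2 above concrete, here is a toy model in Python. It is not QEMU code and not the planned patches: asyncio tasks stand in for block-layer coroutines, a dict stands in for the image file, and ToyImage, CLUSTER, in_flight and cor_copies are names made up for illustration. Only the idea behind bdrv_is_allocated() (ask whether the image already holds the data) is taken from the description above.

```python
import asyncio

CLUSTER = 4  # toy cluster size in bytes (assumption; real formats use far larger clusters)


class ToyImage:
    """Hypothetical stand-in for an image file on top of a read-only backing file."""

    def __init__(self, backing: bytes):
        self.backing = backing
        self.clusters = {}    # cluster index -> data held by the image file
        self.in_flight = []   # (first, end) cluster ranges currently being serviced
        self.changed = asyncio.Condition()
        self.cor_copies = 0   # number of clusters that took the copy-on-read path

    def is_allocated(self, index: int) -> bool:
        # Plays the role of bdrv_is_allocated() for a single cluster.
        return index in self.clusters

    def _overlaps(self, first: int, end: int) -> bool:
        return any(s < end and first < e for s, e in self.in_flight)

    async def _begin(self, first: int, end: int) -> None:
        # Point 1: queue this request while an overlapping one is in flight.
        async with self.changed:
            await self.changed.wait_for(lambda: not self._overlaps(first, end))
            self.in_flight.append((first, end))

    async def _finish(self, first: int, end: int) -> None:
        async with self.changed:
            self.in_flight.remove((first, end))
            self.changed.notify_all()  # let queued requests re-check for overlap

    async def read(self, first: int, count: int) -> bytes:
        await self._begin(first, first + count)
        try:
            guest_buf = bytearray()
            for i in range(first, first + count):
                if self.is_allocated(i):
                    # Point 2, fast path: data is already in the image file.
                    guest_buf += self.clusters[i]
                else:
                    # Point 2, CoR path: backing file -> bounce buffer ->
                    # image file, then bounce buffer -> guest buffer.
                    bounce = self.backing[i * CLUSTER:(i + 1) * CLUSTER]
                    await asyncio.sleep(0)  # yield, as real I/O would
                    self.clusters[i] = bounce
                    self.cor_copies += 1
                    guest_buf += bounce
            return bytes(guest_buf)
        finally:
            await self._finish(first, first + count)

    async def write(self, first: int, data: bytes) -> None:
        count = len(data) // CLUSTER  # toy model: whole clusters only
        await self._begin(first, first + count)
        try:
            for k in range(count):
                self.clusters[first + k] = data[k * CLUSTER:(k + 1) * CLUSTER]
        finally:
            await self._finish(first, first + count)


async def demo():
    img = ToyImage(b"AAAABBBBCCCC")
    read_data, _, other = await asyncio.gather(
        img.read(0, 2),         # copy-on-read of clusters 0 and 1
        img.write(1, b"XXXX"),  # overlaps the read above, so it is queued
        img.read(2, 1),         # no overlap, so it is not held back
    )
    return img, read_data, other


if __name__ == "__main__":
    image, first_read, second_read = asyncio.run(demo())
    print(first_read, second_read, image.clusters)
```

In this model the guest write to cluster 1 is only applied after the overlapping copy-on-read has finished, so the old backing data cannot overwrite it. The read of cluster 2 does not overlap anything and is never queued.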