On 12/25/2014 08:31 PM, Yang Hongyang wrote:
> This is the initial design of block replication.
> The blkcolo block driver enables disk replication for continuous
> checkpoints. It is designed for COLO that Secondary VM is running.
> It can also be applied for FT/HA scene that Secondary VM is not
> running.
>
> Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
> Signed-off-by: Lai Jiangshan <la...@cn.fujitsu.com>
> Signed-off-by: Yang Hongyang <yan...@cn.fujitsu.com>
> ---
>  docs/blkcolo.txt | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 85 insertions(+)
>  create mode 100644 docs/blkcolo.txt

Grammar review only (I'll leave the technical review to others)

>
> diff --git a/docs/blkcolo.txt b/docs/blkcolo.txt
> new file mode 100644
> index 0000000..41c2a05
> --- /dev/null
> +++ b/docs/blkcolo.txt
> @@ -0,0 +1,85 @@
> +Disk replication using blkcolo
> +----------------------------------------
> +Copyright Fujitsu, Corp. 2014

Visually, the separator line should match the length of the line above,
and maybe have a blank line after.

> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +The blkcolo block driver enables disk replication for continuous checkpoints.
> +It is designed for COLO that Secondary VM is running. It can also be applied

Similar comments as for Wen's RFC COLO v2 series for
docs/block-replication.txt (in fact, do we need two files, or should all
this information be merged into a single file?):

s/for COLO that/for COLO (COarse-grain LOck-stepping replication), where/

> +for FT/HA scene that Secondary VM is not running.

s/for FT/HA scene that/to FT/HA (Fault-tolerance/High availability) scenarios, where/

> +
> +This document gives an overview of blkcolo's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is

s/checkpoint/checkpoints/

> +identical right after a VM checkpoint, but becomes different as the VM
> +executes till the next checkpoint. To support disk contents checkpoint,
> +the modified disk contents in the Secondary VM must be buffered, and are
> +only dropped at next checkpoint time. To reduce the network transportation
> +effort at the time of checkpoint, the disk modification operations of
> +Primary disk are asynchronously forwarded to the Secondary node.
> +
> +== Disk Buffer ==
> +The following is the image of Disk buffer:
> +
> +        +----------------------+            +------------------------+
> +        |Primary Write Requests|            |Secondary Write Requests|
> +        +----------------------+            +------------------------+
> +                  |                                       |
> +                  |                                      (4)
> +                  |                                       V
> +                  |                              /-------------\
> +                  |      Copy and Forward        |             |
> +                  |---------(1)----------+       | Disk Buffer |
> +                  |                      |       |             |
> +                  |                     (3)      \-------------/
> +                  |                 speculative      ^
> +                  |                write through    (2)
> +                  |                      |           |
> +                  V                      V           |
> +           +--------------+           +----------------+
> +           | Primary Disk |           | Secondary Disk |
> +           +--------------+           +----------------+
> +    1) Primary write requests will be copied and forwarded to Secondary
> +       QEMU.
> +    2) Before Primary write requests are written to Secondary disk, the
> +       original sector content will be read from Secondary disk and
> +       buffered in the Disk buffer, but it will not overwrite the existing
> +       sector content in the Disk buffer.
> +    3) Primary write requests will be written to Secondary disk.
> +    4) Secondary write requests will be bufferd in the Disk buffer and it

s/bufferd/buffered/

> +       will overwrite the existing sector content in the buffer.
> +
> +== Capture I/O request ==
> +The blkcolo is a new block driver protocol, so all I/O requests can be
> +captured in the driver interface bdrv_co_readv()/bdrv_co_writev().
> +
> +== Checkpoint & failover ==
> +The blkcolo buffers the write requests in Secondary QEMU. And the buffer
> +should be dropped at a checkpoint, or be flushed to Secondary disk when

s/when/on/

> +failover. We add four block driver interfaces to do this:
> +a. bdrv_prepare_checkpoint()
> +   This interface may block, and return when all Primary write

s/return/returns/

> +   requests are forwarded to Secondary QEMU.
> +b. bdrv_do_checkpoint()
> +   This interface is called after all VM state is transfered to

s/transfered/transferred/

> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
> +c. bdrv_get_sent_data_size()
> +   This is used on Primary node.
> +   It should be called by migration/checkpoint thread in order
> +   to decide whether to start a new checkpoint or not. If the data
> +   amount being sent is too large, we should start a new checkpoint.
> +d. bdrv_stop_replication()
> +   It is called when failover. We will flush the Disk buffer into

s/when/on/

> +   Secondary Disk and stop disk replication.
> +
> +== Usage ==
> +On both Primary/Secondary host, invoke QEMU with the following parameters:
> +  "-drive file=blkcolo:host:port:/path/to/image"
> +a. host
> +   Hostname or IP of the Secondary host.
> +b. port
> +   The Secondary QEMU will listen on this port, and the Primary QEMU
> +   will connect to this port.
>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
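
For anyone who wants to see Disk buffer rules 2)-4) and the
checkpoint/failover behavior from the quoted document in executable form,
here is a minimal Python model. It is only an illustration, not QEMU code:
the class DiskBufferModel and all of its method names are hypothetical,
sectors are modeled as list slots, and the choice that Secondary reads
consult the buffer first is my assumption (the patch does not say how reads
are served).

```python
class DiskBufferModel:
    """Toy model of the Secondary side: one disk plus the Disk buffer.

    Hypothetical illustration only; none of these names exist in QEMU.
    """

    def __init__(self, sectors):
        self.disk = list(sectors)   # Secondary disk, index = sector number
        self.buffer = {}            # Disk buffer: sector number -> content

    def primary_write(self, sector, data):
        # Rule 2: buffer the original Secondary sector before it is
        # overwritten, but never replace content already in the buffer.
        if sector not in self.buffer:
            self.buffer[sector] = self.disk[sector]
        # Rule 3: write the forwarded Primary request to the Secondary disk.
        self.disk[sector] = data

    def secondary_write(self, sector, data):
        # Rule 4: Secondary writes go to the buffer only and overwrite
        # whatever the buffer already holds for that sector.
        self.buffer[sector] = data

    def secondary_read(self, sector):
        # Assumption (not stated in the patch): reads see the buffer first.
        return self.buffer.get(sector, self.disk[sector])

    def do_checkpoint(self):
        # Checkpoint: the Disk buffer is dropped.
        self.buffer.clear()

    def stop_replication(self):
        # Failover: the Disk buffer is flushed into the Secondary disk.
        for sector, data in self.buffer.items():
            self.disk[sector] = data
        self.buffer.clear()


# Checkpoint path: the Secondary view is discarded, Primary content remains.
ckpt = DiskBufferModel(["a0", "b0", "c0"])
ckpt.primary_write(0, "a1")
ckpt.secondary_write(1, "b-sec")
view_before = [ckpt.secondary_read(i) for i in range(3)]
ckpt.do_checkpoint()
view_after = [ckpt.secondary_read(i) for i in range(3)]

# Failover path: the Secondary view is made permanent on the disk.
fail = DiskBufferModel(["a0", "b0", "c0"])
fail.primary_write(0, "a1")
fail.secondary_write(1, "b-sec")
fail.stop_replication()
```

In this model, the Secondary keeps seeing its own state between
checkpoints even though Primary writes already reached its disk; a
checkpoint throws that private state away, while failover writes it back.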