On 02/12/2015 04:44 PM, Fam Zheng wrote:
> On Thu, 02/12 15:40, Wen Congyang wrote:
>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>> Hi Congyang,
>>>
>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>> +== Workflow ==
>>>> +The following is the image of block replication workflow:
>>>> +
>>>> +        +----------------------+            +------------------------+
>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>> +        +----------------------+            +------------------------+
>>>> +                  |                                       |
>>>> +                  |                                      (4)
>>>> +                  |                                       V
>>>> +                  |                              /-------------\
>>>> +                  |    Copy and Forward          |             |
>>>> +                  |--------(1)--------+          | Disk Buffer |
>>>> +                  |                   |          |             |
>>>> +                  |                  (3)         \-------------/
>>>> +                  |              speculative            ^
>>>> +                  |             write through          (2)
>>>> +                  |                   |                 |
>>>> +                  V                   V                 |
>>>> +         +--------------+        +------------------------+
>>>> +         | Primary Disk |        |     Secondary Disk     |
>>>> +         +--------------+        +------------------------+
>>>> +
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
>>>
>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>> reading them as "s/will be/are/g".
>>>
>>> Why do you need this buffer?
>>
>> We only sync the disk until the next checkpoint. Before the next checkpoint,
>> the secondary VM writes to the buffer.
>>
>>>
>>> If both primary and secondary write to the same sector, what is saved in the
>>> buffer?
>>
>> The primary content will be written to the secondary disk, and the secondary
>> content is saved in the buffer.
>
> I wonder if alternatively this is possible with an imaginary "writable backing
> image" feature, as described below.
>
> When we have a normal backing chain,
>
>                {virtio-blk dev 'foo'}
>                          |
>                          |
>                          |
>    [base] <- [mid] <- (foo)
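To make the Disk Buffer semantics above concrete, here is a small illustrative model in Python. This is not QEMU code: disks and the buffer are plain dicts mapping sector number to content, and all names are invented for the sketch.

```python
# Toy model of the "Disk Buffer" scheme from the patch (steps 1-4).
# Disks and the buffer are dicts mapping sector number -> content.

ZERO = b"\0"  # stand-in for an untouched (all-zero) sector

primary_disk = {}
secondary_disk = {}
disk_buffer = {}  # holds the secondary VM's view until the next checkpoint


def primary_write(sector, data):
    """A primary write: steps (1), (2) and (3)."""
    primary_disk[sector] = data  # local write on the primary
    # (1) the request is copied and forwarded to the secondary QEMU.
    # (2) save the original secondary content first, but never overwrite
    #     an entry the secondary VM has already placed in the buffer.
    if sector not in disk_buffer:
        disk_buffer[sector] = secondary_disk.get(sector, ZERO)
    # (3) speculative write-through to the secondary disk.
    secondary_disk[sector] = data


def secondary_write(sector, data):
    """A secondary write: step (4) - it DOES overwrite buffer entries."""
    disk_buffer[sector] = data


def secondary_read(sector):
    """The secondary VM sees buffered content before the secondary disk."""
    return disk_buffer.get(sector, secondary_disk.get(sector, ZERO))


def checkpoint():
    """At a checkpoint the disks are in sync again, so drop the buffer."""
    disk_buffer.clear()
```

With this model, Wen's answer below falls out directly: if both sides write the same sector, the primary content ends up on the secondary disk while the secondary VM keeps reading its own content from the buffer.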
foo's backing is mid, and mid's backing is base? Is foo a snapshot of base?

Thanks,
Wen Congyang

> Where [base] and [mid] are read-only, (foo) is writable. When we add an
> overlay to an existing image on top,
>
>                {virtio-blk dev 'foo'}    {virtio-blk dev 'bar'}
>                          |                         |
>                          |                         |
>                          |                         |
>    [base] <- [mid] <- (foo) <--------------------- (bar)
>
> It's important to make sure that writes to 'foo' don't break data for 'bar'.
> We can utilize an automatic hidden drive-backup target:
>
>                {virtio-blk dev 'foo'}                  {virtio-blk dev 'bar'}
>                          |                                       |
>                          |                                       |
>                          v                                       v
>
>    [base] <- [mid] <- (foo) <------- (hidden target) <--------- (bar)
>
>                         v                  ^
>                         v                  ^
>                         v                  ^
>                         v                  ^
>                         >>>> drive-backup sync=none >>>>
>
> So when the guest writes to 'foo', the old data is moved to (hidden target),
> which remains unchanged from (bar)'s PoV.
>
> The drive in the middle is called hidden because QEMU creates it
> automatically; the naming is arbitrary.
>
> It is interesting because it is a more generalized case of image fleecing,
> where the (hidden target) is exposed via an NBD server for (read-only) data
> scanning purposes.
>
> More interestingly, with the above facility, it is also possible to create a
> guest-visible live snapshot (disk 'bar') of an existing device (disk 'foo')
> very cheaply. Or call it a shadow copy if you will.
>
> Back to the COLO case, the configuration will be very similar:
>
>        {primary wr}                              {secondary vm}
>             |                                          |
>             |                                          |
>             |                                          |
>             v                                          v
>
>    [what] <- [ever] <- (nbd target) <--- (hidden buf disk) <--- (active disk)
>
>                            v                   ^
>                            v                   ^
>                            v                   ^
>                            v                   ^
>                            >>>> drive-backup sync=none >>>>
>
> The workflow analogue is:
>
> >>>> + 1) Primary write requests will be copied and forwarded to Secondary
> >>>> +    QEMU.
>
> Primary write requests are forwarded to secondary QEMU as well.
>
> >>>> + 2) Before Primary write requests are written to Secondary disk, the
> >>>> +    original sector content will be read from Secondary disk and
> >>>> +    buffered in the Disk buffer, but it will not overwrite the existing
> >>>> +    sector content in the Disk buffer.
>
> Before Primary write requests are written to (nbd target), aka the Secondary
> disk, the original sector content is read from it and copied to (hidden buf
> disk) by drive-backup. It obviously will not overwrite the data in (active
> disk).
>
> >>>> + 3) Primary write requests will be written to Secondary disk.
>
> Primary write requests are written to (nbd target).
>
> >>>> + 4) Secondary write requests will be buffered in the Disk buffer and it
> >>>> +    will overwrite the existing sector content in the buffer.
>
> Secondary write requests are written to (active disk) as usual.
>
> Finally, when a checkpoint arrives, if you want to sync with the primary, just
> drop the data in (hidden buf disk) and (active disk); when failover happens,
> if you want to promote the secondary VM, you can commit (active disk) to
> (nbd target) and drop the data in (hidden buf disk).
>
> Fam
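Fam's alternative can be modelled the same way. The sketch below is illustrative Python only (dicts stand in for images; names follow the diagram and are otherwise invented), implementing the overlay read order plus the checkpoint and failover actions from his last paragraph:

```python
# Toy model of the backing-chain alternative: (nbd target) receives forwarded
# primary writes, drive-backup sync=none copies old sectors to (hidden buf
# disk), and the secondary VM's own writes land in the (active disk) overlay.

nbd_target = {}   # the Secondary disk
hidden_buf = {}   # drive-backup target: original sectors are CoW'd here
active_disk = {}  # top overlay: the secondary VM's own writes


def primary_forwarded_write(sector, data):
    # drive-backup sync=none copies the old sector out before it is
    # overwritten; it never clobbers an already-saved sector.
    if sector not in hidden_buf:
        hidden_buf[sector] = nbd_target.get(sector, b"\0")
    nbd_target[sector] = data


def secondary_vm_write(sector, data):
    active_disk[sector] = data


def secondary_vm_read(sector):
    # Overlay semantics: active disk first, then the hidden buffer (which
    # hides primary writes made since the last checkpoint), then the base.
    for image in (active_disk, hidden_buf, nbd_target):
        if sector in image:
            return image[sector]
    return b"\0"


def checkpoint():
    # Sync with the primary: discard the secondary's divergent state.
    hidden_buf.clear()
    active_disk.clear()


def failover():
    # Promote the secondary: commit (active disk) down to (nbd target)
    # and drop the data in (hidden buf disk).
    nbd_target.update(active_disk)
    active_disk.clear()
    hidden_buf.clear()
```

Note how the hidden buffer keeps forwarded primary writes invisible to the secondary VM until a checkpoint drops it, exactly as the drive-backup-based description above requires.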