On 06/08/2012 09:53 AM, Stefan Hajnoczi wrote: > On Fri, Jun 8, 2012 at 2:19 PM, Jeff Cody <jc...@redhat.com> wrote: >> On 06/08/2012 08:42 AM, Stefan Hajnoczi wrote: >>> On Thu, Jun 7, 2012 at 3:14 PM, Jeff Cody <jc...@redhat.com> wrote: >>>> On 06/07/2012 02:19 AM, Taisuke Yamada wrote: >>> We want to commit snap1.qcow2 down into vm001.img while the guest is >>> running: >>> >>> vm001.img <-- snap2.qcow2 >>> >>> This means copying allocated blocks from snap1.qcow2 and writing them >>> into vm001.img. Once this process is complete it is safe to delete >>> snap1.qcow2 since all data is now in vm001.img. >> >> Yes, this is the same as what we are wanting to accomplish. The trick >> here is open vm001.img r/w, in a safe manner (by safe, I mean able to >> abort in case of error while keeping the guest running live). >> >> My thoughts on this has revolved around something similar to what was >> done in bdrv_append(), where a duplicate BDS is created, a new file-open >> performed with the appropriate access mode flags, and if successful >> swapped out for the originally opened BDS for vm001.img. If there is an >> error, the new BDS is abandoned without modifying the BDS list. > > Yes, there needs to be an atomic way to try opening the image read/write. > >>> >>> As a result we have made the backing file chain shorter. This is >>> improtant because otherwise incremental backup would grow the backing >>> file chain forever - each time it takes a new snapshot the chain >>> becomes longer and I/O accesses can become slower! >>> >>> The task is to add a new block job type called "commit". It is like >>> the qemu-img commit command except it works while the guest is >>> running. >>> >>> The new QMP command should look like this: >>> >>> { 'command': 'block-commit', 'data': { 'device': 'str', 'image': >>> 'str', 'base': 'str', '*speed': 'int' } >> >> This is very similar to what I was thinking as well - I think the only >> difference in the command is that I what you called 'image' I called >> 'top', and the argument order was after base. >> >> Here is what I had for the command: >> >> { 'command': 'block-commit', 'data': { 'device': 'str', '*base': 'str', >> '*top': 'str', '*speed': 'int' } } >> >> I don't think I have a strong preference for either of our proposed >> commands - they are essentially the same. > > Yes, they are basically the same. I don't mind which one we use. > >>> >>> Note that block-commit cannot work on the top-most image since the >>> guest is still writing to that image and we might never be able to >>> copy all the data into the base image (the guest could write new data >>> as quickly as we copy it to the base). The command should check for >>> this and reject the top-most image. >> >> By this you mean that you would like to disallow committing the >> top-level image to the base? Perhaps there is a way to attempt to >> converge, and adaptively give more time to the co-routine if we are able >> to detect divergence. This may require violating the 'speed' parameter, >> however, and make the commit less 'live'. > > Yes, I think we should disallow merging down the topmost image. The > convergence problem is the same issue that live migration has. It's a > hard problem and not something that is essential for block-commit. It > would add complexity into the block-commit implementation - the only > benefit would be that you can merge down the last COW snapshot. We > already have an implementation that does this: the "commit" command > which stops the guest :).
OK. And, it is better to get this implemented now, and if desired, I suppose we can always revisit the convergence at a later date (or not at all, if there is no pressing case for it). > > If we take Taisuke's scenario, just create a snapshot on the SSD > immediately before merging down: > > /big-slow-cheap-disk/master.img <- /small-fast-expensive-ssd/cow.qcow2 > > (qemu) snapshot_blkdev ... > > /big-slow-cheap-disk/master.img <- > /small-fast-expensive-ssd/cow.qcow2 <- > /small-fast-expensive-ssd/newcow.qcow2 > > (qemu) block-commit cow.qcow2 > > /big-slow-cheap-disk/master.img <- /small-fast-expensive-ssd/newcow.qcow2 > >>> >>> Let's figure out how to specify block-commit so we're all happy, that >>> way we can avoid duplicating work. Any comments on my notes above? >>> >> >> I think we are almost completely on the same page - devil is in the >> details, of course (for instance, on how to convert the destination base >> from r/o to r/w). > > Great. The atomic r/o -> r/w transition and the commit coroutine can > be implemented on in parallel. Are you happy to implement the atomic > r/o -> r/w transition since you wrote bdrv_append()? Then Zhi Hui can > assume that part already works and focus on implementing the commit > coroutine in the meantime. I'm just suggesting a way to split up the > work, please let me know if you think this is good. I am happy to do it that way. I'll shift my focus to the atomic image reopen in r/w mode. I'll go ahead and post my diagrams and other info for block-commit on the wiki, because I don't think it conflicts with we discussed above (although I will modify my diagrams to not show commit from the top-level image). Of course, feel free to change it as necessary. Thanks, Jeff