On 06/07/2013 11:22 PM, Kevin Wolf wrote:
> Am 07.06.2013 um 15:48 hat Liu Yuan geschrieben:
>> On 06/07/2013 03:31 PM, Kevin Wolf wrote:
>>> Am 06.06.2013 um 15:09 hat Liu Yuan geschrieben:
>>>> On 06/06/2013 08:46 PM, Kevin Wolf wrote:
>>>>> Am 06.06.2013 um 13:57 hat Liu Yuan geschrieben:
>>>> Only when the write comes from the VM do we do the following:
>>>> - delete active vdi A
>>>> - create a new inode based on the previously reloaded As1's inode
>>>
>>> Thanks, this is the part that I missed.
>>>
>>> I'm not sure however why the actual switch is delayed until the first
>>> write. This seems inconsistent with qcow2 snapshots.
>>>
>>> Do you know if there is a reason why we can't always do this already
>>> during bdrv_snapshot_goto()?
>>>
>>
>> I think the reason is that sd_load_vmstate() needs to load the VM state
>> objects with the correct inode object.
>>
>> I tried to remove
>>
>>     if (!s->inode.vm_state_size)
>>
>> and make sd_create_branch() unconditional. This means the 'loadvm'
>> command will call sd_create_branch() inside sd_snapshot_goto(). But it
>> failed by reading the wrong snapshot, because the vdi's inode object is
>> changed by sd_create_branch().
>
> Ok, I think I start to understand how these things work. Basically,
> qemu's savevm functionality is designed to work with qcow2 like this:
>
> 1. Save the VM state to the active layer
> 2. create_snapshot saves the active layer including VM state
> 3. [ Forget to remove the VM state from the active layer :-) ]
> 4. loadvm loads the snapshot again, including VM state
> 5. VM state is restored from the active layer
>
> So for Sheepdog, the problem is that step 2 doesn't include the VM state,
> right? So our options are either to change Sheepdog so that step 2 does
> involve moving the VM state (might end up rather ugly) or you need to
> swap steps 1 and 2, so that you first switch to the new snapshot and
> then write the VM state.
>
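To make the ordering concrete, here is a toy Python model of the qcow2 flow
Kevin describes above. This is not QEMU code and all names are made up; it
only shows why the snapshot automatically carries the VM state when the
state is written to the active layer first:

```python
class Qcow2Like:
    """Toy model: a snapshot copies the active layer, VM state included."""

    def __init__(self):
        self.active = {"disk": "data", "vmstate": None}
        self.snapshots = {}

    def savevm(self, tag, vmstate):
        # step 1: save the VM state to the active layer
        self.active["vmstate"] = vmstate
        # step 2: create_snapshot saves the active layer, VM state included
        self.snapshots[tag] = dict(self.active)

    def loadvm(self, tag):
        # steps 4-5: load the snapshot and restore the VM state from it
        self.active = dict(self.snapshots[tag])
        return self.active["vmstate"]


img = Qcow2Like()
img.savevm("A", "cpu+ram@A")
assert img.loadvm("A") == "cpu+ram@A"
```

Swapping the two steps, as Kevin suggests for Sheepdog, would mean switching
to the new snapshot first and only then writing the VM state.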
Sorry, I didn't fully understand the above example. When sheepdog takes a
snapshot, the VM state goes together with the current active disk into the
cluster storage. For example:

1. We take two snapshots of the disk 'test', with tags A and B respectively.
2. disk(A) + vmstate(A) are stored, indexed by A's vdi;
   disk(B) + vmstate(B) are stored, indexed by B's vdi.

The chain is A --> B --> C, where C is the current active vdi. Note that A,
B and C all have different vdi_ids.

The problem shows up when we do 'loadvm A'. The process is:

1. reload the inode of A (in snapshot_goto)
2. read the VM state via A's vdi_id (loadvm_state)
3. delete C and create C1, reload the inode of C1 (sd_create_branch, on write)

So if we called sd_create_branch() at stage 1, the inode would already point
to C1 and stage 2 would fail to read the VM state, because the vdi_id would
be C1's. This is why we rely on the first write to call sd_create_branch().
I am not sure we can fix sheepdog's loadvm without touching the core code
of QEMU.

> Because I think what we're doing currently is not only hard to
> understand, but actually wrong. Consider this:
>
> 1. savevm with a Sheepdog image
> 2. Exit qemu before any write request is sent
> 3. Restart qemu with the Sheepdog image and notice how the wrong
>    snapshot is active. Oops.

Sheepdog indeed has this problem.

> Or doesn't Sheepdog have any notion of an "active snapshot" and this is
> only a runtime decision?
>
>>>> The chain will look like:
>>>>
>>>> As1 --> As2
>>>>  |
>>>>  V
>>>>  A
>>>>
>>>> This is how sheepdog handles savevm/loadvm.
>>>>
>>>> So for 'qemu-img snapshot -a', we should create the branch in the
>>>> .bdrv_snapshot_goto.
>>>>
>>>> As you pointed out, we need to consider vm state even for
>>>> 'snapshot -a', so I need to rework the patch 2/2.
>>>
>>> Yes, the presence of VM state is independent from whether you're using
>>> qemu-img or loadvm.
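To illustrate the three stages, here is a toy Python model (hypothetical
names, not the actual sheepdog driver) of why the branch must be created
only after the VM state has been read: the state is indexed by the
snapshot's vdi_id, and branching moves the inode to a freshly created
vdi_id that has no state stored under it:

```python
import itertools

_ids = itertools.count(1)


class SheepdogLike:
    """Toy model: VM state is stored per vdi_id in the cluster."""

    def __init__(self):
        self.vdi_id = next(_ids)            # active vdi, e.g. "C"
        self.store = {}                     # vdi_id -> vmstate

    def snapshot(self, vmstate):
        snap_id = self.vdi_id
        self.store[snap_id] = vmstate       # state indexed by snapshot's vdi
        self.vdi_id = next(_ids)            # new active vdi
        return snap_id

    def snapshot_goto(self, snap_id):
        self.vdi_id = snap_id               # stage 1: reload snapshot inode

    def load_vmstate(self):
        return self.store.get(self.vdi_id)  # stage 2: read via current vdi_id

    def create_branch(self):
        self.vdi_id = next(_ids)            # stage 3: delete C, create C1


sd = SheepdogLike()
a = sd.snapshot("vmstate@A")

# Correct order: goto, read the VM state, then branch on the first write.
sd.snapshot_goto(a)
assert sd.load_vmstate() == "vmstate@A"
sd.create_branch()

# Branching before reading loses the state: the inode now points at C1,
# whose fresh vdi_id has no VM state stored under it.
sd.snapshot_goto(a)
sd.create_branch()
assert sd.load_vmstate() is None
```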
>>> And it actually goes both ways: qemu-img can revert to snapshots that
>>> have a VM state, and loadvm can be used with images that don't have a
>>> VM state (if you have multiple images, only one of them has the VM
>>> state).
>>>
>>
>> That seems not to be true of the current code. If I 'loadvm' a snapshot
>> without a VM state, I get 'qemu-system-x86_64: This is a disk-only
>> snapshot. Revert to it offline using qemu-img'.
>
> Try again with multiple disks. Only one disk gets the VM state, the
> other ones get only the disk snapshot.
>
>> But 'qemu-img snapshot -a' works as you said: it can roll back to the
>> snapshot regardless of the VM state.
>>
>> Also, this is a difference in how the VM state is stored for sheepdog
>> images: *every* snapshot image can have its own VM state stored in the
>> sheepdog cluster. That is, each sheepdog snapshot can have its own
>> private VM state.
>
> Sorry, can't parse. :-(
>

I misunderstood your statements, please ignore it.

Thanks,
Yuan
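Kevin's multi-disk point can be sketched with another toy model (again
hypothetical, not QEMU code): on savevm, every disk gets a disk snapshot,
but only one of them carries the VM state, so loadvm has to accept
snapshots without a VM state on the other disks:

```python
def savevm(disks, tag, vmstate):
    """Toy model: snapshot every disk, store the VM state on only one."""
    for i, disk in enumerate(disks):
        disk["snapshots"][tag] = {
            "disk": disk["data"],
            # only the first disk stores the VM state
            "vmstate": vmstate if i == 0 else None,
        }


disks = [
    {"data": "d0", "snapshots": {}},
    {"data": "d1", "snapshots": {}},
]
savevm(disks, "A", "cpu+ram@A")

assert disks[0]["snapshots"]["A"]["vmstate"] == "cpu+ram@A"
# the second disk has a disk-only snapshot, which loadvm must tolerate
assert disks[1]["snapshots"]["A"]["vmstate"] is None
```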