Am 07.06.2013 um 18:14 hat Liu Yuan geschrieben:
> On 06/07/2013 11:22 PM, Kevin Wolf wrote:
> > Am 07.06.2013 um 15:48 hat Liu Yuan geschrieben:
> >> On 06/07/2013 03:31 PM, Kevin Wolf wrote:
> >>> Am 06.06.2013 um 15:09 hat Liu Yuan geschrieben:
> >>>> On 06/06/2013 08:46 PM, Kevin Wolf wrote:
> >>>>> Am 06.06.2013 um 13:57 hat Liu Yuan geschrieben:
> >>>> Only when the write comes from the VM do we do the following:
> >>>> - delete the active vdi A
> >>>> - create a new inode based on the previously reloaded As1's inode
> >>>
> >>> Thanks, this is the part that I missed.
> >>>
> >>> I'm not sure, however, why the actual switch is delayed until the
> >>> first write. This seems inconsistent with qcow2 snapshots.
> >>>
> >>> Do you know if there is a reason why we can't always do this already
> >>> during bdrv_snapshot_goto()?
> >>
> >> I think the reason is that sd_load_vmstate() needs to load the VM
> >> state objects with the correct inode object.
> >>
> >> I tried to remove
> >>
> >>     if (!s->inode.vm_state_size)
> >>
> >> and make sd_create_branch() unconditional. This means the 'loadvm'
> >> command will try to call sd_create_branch() inside sd_snapshot_goto(),
> >> but it failed by reading the wrong snapshot, because the vdi's inode
> >> object is changed by sd_create_branch().
> >
> > Ok, I think I start to understand how these things work. Basically,
> > qemu's savevm functionality is designed to work with qcow2 like this:
> >
> > 1. Save the VM state to the active layer
> > 2. create_snapshot saves the active layer including the VM state
> > 3. [ Forget to remove the VM state from the active layer :-) ]
> > 4. loadvm loads the snapshot again, including the VM state
> > 5. The VM state is restored from the active layer
> >
> > So for Sheepdog, the problem is that step 2 doesn't include the VM
> > state, right? So our options are either to change Sheepdog so that
> > step 2 does involve moving the VM state (might end up rather ugly),
> > or you need to swap steps 1 and 2, so that you first switch to the
> > new snapshot and then write the VM state.
>
> Sorry, I didn't fully understand the above example. If sheepdog takes
> snapshots, the vmstate goes with the exact current active disk into the
> cluster storage, e.g.:
>
> 1. We take two snapshots of the disk 'test' with tags A and B
>    respectively.
> 2. disk(A) + vmstate(A) will be stored indexed by A's vdi;
>    disk(B) + vmstate(B) will be stored indexed by B's vdi.
>
> The chain is A --> B --> C, where C is the current active vdi. Note
> that A, B and C all have different vdi_ids.
>
> The problem shows up when we do 'loadvm A'; the process is:
>
> 1. reload the inode of A (in snapshot_goto)
> 2. read the vmstate via A's vdi_id (loadvm_state)
> 3. delete C and create C1, reload the inode of C1 (sd_create_branch on
>    write)
>
> So if at stage 1 we call sd_create_branch(), the inode will point to C1
> and stage 2 will fail to read the vmstate because the vdi_id is C1's.
>
> This is why we rely on the write to call sd_create_branch(). I am not
> sure if we can fix sheepdog's loadvm without touching the core code of
> QEMU.
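To make the ordering constraint above concrete, here is a tiny self-contained
model of the flow, not the actual sheepdog.c code: the function names and vdi
id values are made up for illustration, but the point is the same as in the
three stages described above -- the vmstate read only finds snapshot A's data
if it happens before the branch to C1 switches the loaded vdi_id.

/*
 * Toy model only -- not the real sheepdog.c code.  It just demonstrates
 * why the vmstate read must happen while the loaded inode still carries
 * snapshot A's vdi_id, i.e. before the branch that creates C1.
 */
#include <stdio.h>

static unsigned current_vdi_id;          /* stands in for s->inode.vdi_id */

/* stage 1: 'loadvm A' reloads A's inode, no branch is created yet */
static void snapshot_goto(unsigned vdi_id)
{
    current_vdi_id = vdi_id;
}

/* stage 2: the vmstate objects are read via the currently loaded vdi_id */
static unsigned read_vmstate_vdi(void)
{
    return current_vdi_id;
}

/* stage 3: delete the old active vdi, create C1 and switch to it */
static void create_branch(unsigned new_vdi_id)
{
    current_vdi_id = new_vdi_id;
}

int main(void)
{
    enum { VDI_A = 0xA0, VDI_C1 = 0xC1 };

    /* Deferred branch (current Sheepdog behaviour): the vmstate is read
     * while A's vdi_id is still loaded -- correct. */
    snapshot_goto(VDI_A);
    printf("deferred branch: vmstate read from vdi %#x\n", read_vmstate_vdi());
    create_branch(VDI_C1);

    /* Branching already inside snapshot_goto: the later vmstate read
     * would use C1's vdi_id -- the wrong snapshot. */
    snapshot_goto(VDI_A);
    create_branch(VDI_C1);
    printf("eager branch:    vmstate read from vdi %#x\n", read_vmstate_vdi());
    return 0;
}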
Hm, yes, I was confused. :-)

Anyway, the options stay the same: either C1 must somehow inherit the VM
state from A on the Sheepdog level, or we must make sure to get this
order:

1. Switch to the (read-only) snapshot A
2. Load the VM state
3. Create the (writable) snapshot C1 and switch to it

This is a different order than the one required for qcow2 (where C1 would
inherit the VM state from A, so our current "first 1, then 3, then 2"
works), but if we can't fix it on the Sheepdog side, we'll have to touch
the core code of qemu.

I don't much like touching the block.c code for this, but if it's
required to fix a bug, let's just do it.

> > Because I think what we're doing currently is not only hard to
> > understand, but actually wrong. Consider this:
> >
> > 1. savevm with a Sheepdog image
> > 2. Exit qemu before any write request is sent
> > 3. Restart qemu with the Sheepdog image and notice how the wrong
> >    snapshot is active. Oops.
>
> Sheepdog indeed has this problem.

Thanks for confirming.

Kevin
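For reference, the ordering proposed in steps 1-3 above might roughly look
like the following if done in qemu's core loadvm path. This is only a
sketch under that assumption: bdrv_snapshot_goto(), qemu_fopen_bdrv(),
qemu_loadvm_state() and qemu_fclose() are existing qemu functions, but
bdrv_snapshot_activate() is an invented placeholder for the missing
"create writable C1 and switch to it" step, and the usual includes and
error reporting of savevm.c are omitted.

/*
 * Sketch of the proposed load order -- not existing qemu code.
 * bdrv_snapshot_activate() is a placeholder and does not exist.
 */
static int load_vmstate_sheepdog_order(BlockDriverState *bs, const char *name)
{
    QEMUFile *f;
    int ret;

    /* 1. Switch to the (read-only) snapshot A */
    ret = bdrv_snapshot_goto(bs, name);
    if (ret < 0) {
        return ret;
    }

    /* 2. Load the VM state while the snapshot's vdi is still the one
     *    the driver has loaded */
    f = qemu_fopen_bdrv(bs, 0);
    if (!f) {
        return -EINVAL;
    }
    ret = qemu_loadvm_state(f);
    qemu_fclose(f);
    if (ret < 0) {
        return ret;
    }

    /* 3. Only now create the writable branch (C1) and switch to it */
    return bdrv_snapshot_activate(bs);   /* placeholder */
}

For qcow2 the third step would presumably be a no-op, since C1 already
inherits the VM state from A, so the same order could in principle serve
both drivers.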