On 08.07.2015 at 12:43, Dr. David Alan Gilbert wrote:
> * Christian Borntraeger (borntrae...@de.ibm.com) wrote:
>> On 08.07.2015 at 12:14, Dr. David Alan Gilbert wrote:
>>> * Christian Borntraeger (borntrae...@de.ibm.com) wrote:
>>>> On 07.07.2015 at 15:08, Juan Quintela wrote:
>>>>> This includes a new section that for now just stores the current qemu
>>>>> state.
>>>>>
>>>>> Right now, there is only one way to control what is the state of the
>>>>> target after migration.
>>>>>
>>>>> - If you run the target qemu with -S, it would start stopped.
>>>>> - If you run the target qemu without -S, it would run just after
>>>>>   migration finishes.
>>>>>
>>>>> The problem here is what happens if we start the target without -S and
>>>>> there happens one error during migration that puts current state as
>>>>> -EIO. Migration would end (notice that the error happened doing block
>>>>> IO, network IO, i.e. nothing related with migration), and when
>>>>> migration finishes, we would just "continue" running on destination,
>>>>> probably hanging the guest/corrupting data, whatever.
>>>>>
>>>>> Signed-off-by: Juan Quintela <quint...@redhat.com>
>>>>> Reviewed-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
>>>>
>>>> This is bisected to cause a regression on s390.
>>>>
>>>> A guest restarts (booting) after managedsave/start instead of continuing.
>>>>
>>>> Do you have any idea what might be wrong?
>>>
>>> I'd add some debug to the pre_save and post_load to see what state value is
>>> being saved/restored.
>>>
>>> Also, does that regression happen when doing the save/restore using the
>>> same/latest git, or is it a load from an older version?
>>
>> Seems to happen only with some guest definitions, but I can't really pinpoint
>> it yet.
>> e.g. removing queues='4' from my network card solved it for a reduced xml,
>> but doing the same on a bigger xml was not enough :-/
>
> Nasty; Still the 'paused' value in the pre-save/post-load feels right.
> I've read through the patch again and it still feels right to me, so I don't
> see anything obvious.
>
> Perhaps it's worth turning on the migration tracing on both sides and seeing
> what's different with that 'queues=4' ?

Reducing the number of virtio disks also seems to help. I am wondering whether
some devices use the runstate somehow and this change triggers a race.