* Hailiang Zhang (zhang.zhanghaili...@huawei.com) wrote:
> On 2016/2/27 0:36, Dr. David Alan Gilbert wrote:
> >* Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
> >>* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:
> >>>From: root <root@localhost.localdomain>
> >>>
> >>>This is the 15th version of COLO (still only supports periodic checkpoints).
> >>>
> >>>Here is only the COLO frame part; you can get the whole code from github:
> >>>https://github.com/coloft/qemu/commits/colo-v2.6-periodic-mode
> >>>
> >>>There are few changes in this series except the network related part.
> >>
> >>I was looking at the time the guest is paused during COLO and
> >>was surprised to find one of the larger chunks was the time to reset
> >>the guest before loading each checkpoint; I've traced it part way, and the
> >>biggest contributors for my test VM seem to be:
> >>
> >>   3.8ms  pcibus_reset: VGA
> >>   1.8ms  pcibus_reset: virtio-net-pci
> >>   1.5ms  pcibus_reset: virtio-blk-pci
> >>   1.5ms  qemu_devices_reset: piix4_reset
> >>   1.1ms  pcibus_reset: piix3-ide
> >>   1.1ms  pcibus_reset: virtio-rng-pci
> >>
> >>I've not looked deeper yet, but some of these are very silly;
> >>I'm running with -nographic, so why it takes 3.8ms to reset VGA is
> >>going to be interesting.
> >>Also, my only block device is the virtio-blk, so while I understand that the
> >>standard PC machine has the IDE controller, why does it take over a ms
> >>to reset an unused device?
> >
> >OK, so I've dug a bit deeper, and it appears that it's the changes to the
> >PCI BARs that actually take the time; every time we do a reset we
> >reset all the BARs, which causes a pci_update_mappings and
> >ends up doing a memory_region_del_subregion.
> >Then we load the config space of the PCI device as we do the vmstate_load,
> >and this recreates all the mappings again.
> >
> >I'm not sure what the fix is, but it sounds like it would
> >speed up the checkpoints usefully if we can avoid the unmap/remap when
> >the mappings are the same.
>
> Interesting, and thanks for your report.
>
> We already know qemu_system_reset() is a time-consuming function and we
> shouldn't call it here, but if we don't, there is a bug which we
> reported before in a previous COLO series; below is a copy of the related
> patch comment:
>
> COLO VMstate: Load VM state into qsb before restore it
>
> We should not destroy the state of the secondary until we have received the
> whole state from the primary, in case the primary fails in the middle of
> sending the state; so here we cache the device state on the secondary
> before restoring it.
>
> Besides, we should call qemu_system_reset() before loading the VM state,
> which ensures the data is consistent.
> Note: if we discard qemu_system_reset(), there will be some odd errors.
> For example, qemu on the slave side crashes and reports:
>
> KVM: entry failed, hardware error 0x7
> EAX=00000000 EBX=0000e000 ECX=00009578 EDX=0000434f
> ESI=0000fc10 EDI=0000434f EBP=00000000 ESP=00001fca
> EIP=00009594 EFL=00010246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0040 00000400 0000ffff 00009300
> CS =f000 000f0000 0000ffff 00009b00
> SS =434f 000434f0 0000ffff 00009300
> DS =434f 000434f0 0000ffff 00009300
> FS =0000 00000000 0000ffff 00009300
> GS =0000 00000000 0000ffff 00009300
> LDT=0000 00000000 0000ffff 00008200
> TR =0000 00000000 0000ffff 00008b00
> GDT=     0002dcc8 00000047
> IDT=     00000000 0000ffff
> CR0=00000010 CR2=ffffffff CR3=00000000 CR4=00000000
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000000
> Code=c0 74 0f 66 b9 78 95 00 00 66 31 d2 66 31 c0 e9 47 e0 fb 90 <f3> 90 fa fc 66 c3 66 53 66 89 c3
> ERROR: invalid runstate transition: 'internal-error' -> 'colo'
>
> The reason is that some of the device state is ignored when saving the
> device state to the slave if the corresponding data is still at its
> initial value, such as 0.
> But the device state on the slave may already have been initialised, so
> after a loop of checkpoints the device state values become inconsistent.
> This will happen when the PVM reboots or the SVM runs ahead of the PVM
> in the startup process.
>
> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
> Signed-off-by: Yang Hongyang <yan...@cn.fujitsu.com>
> Signed-off-by: Gonglei <arei.gong...@huawei.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
>
> As described above, some values of the device state are zero and are
> ignored during migration. That is not a problem for normal migration,
> because on the destination VM the initial values are zero too. But COLO
> does more than one round of migration: a value may change from non-zero
> back to zero, be ignored again in the next checkpoint, and the VMState
> then becomes inconsistent on the SVM.
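For reference, the load sequence that the quoted patch comment describes can be
sketched roughly as below. This is only an illustrative sketch, not the series'
actual code: colo_receive_state_into_qsb() is a hypothetical stand-in for the
receive path, while qemu_bufopen(), qemu_loadvm_state(), qemu_fclose() and
qemu_system_reset(VMRESET_SILENT) follow the QEMU APIs of that era (exact
signatures and buffer-ownership details may differ).

    /* Secondary side of a checkpoint: cache first, then reset, then load. */
    static int colo_incoming_load_checkpoint(QEMUFile *from_primary,
                                             QEMUSizedBuffer *qsb)
    {
        QEMUFile *cached;
        int ret;

        /* 1. Receive the complete device state into the qsb without touching
         *    the running SVM, so a primary failure mid-transfer leaves the
         *    secondary's current state intact (hypothetical helper). */
        ret = colo_receive_state_into_qsb(from_primary, qsb);
        if (ret < 0) {
            return ret;
        }

        /* 2. Reset the machine so that fields whose subsections were skipped
         *    on the wire (their needed() callback saw a zero value on the PVM)
         *    cannot keep stale non-zero values from an earlier checkpoint. */
        qemu_system_reset(VMRESET_SILENT);

        /* 3. Replay the cached state into the freshly reset machine. */
        cached = qemu_bufopen("r", qsb);
        if (!cached) {
            return -1;
        }
        ret = qemu_loadvm_state(cached);
        qemu_fclose(cached);
        return ret;
    }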
Yes, this doesn't really surprise me; a lot of the migration code does weird
things like this, so reset is the safest way.

> The above error is caused directly by a wrong value of 'async_pf_en_msr'.
>
> static const VMStateDescription vmstate_async_pf_msr = {
>     .name = "cpu/async_pf_msr",
>     .version_id = 1,
>     .minimum_version_id = 1,
>     .needed = async_pf_msr_needed,
>     .fields = (VMStateField[]) {
>         VMSTATE_UINT64(env.async_pf_en_msr, X86CPU),
>         VMSTATE_END_OF_LIST()
>     }
> };
>
> static bool async_pf_msr_needed(void *opaque)
> {
>     X86CPU *cpu = opaque;
>
>     return cpu->env.async_pf_en_msr != 0;
> }
>
> Some other VMState registers in CPUX86State have the same problem; we can't
> be sure they won't cause problems if their values are wrong.
> So here we simply call qemu_system_reset() to avoid the inconsistency.
> Besides, compared with the most time-consuming operation (flushing RAM from
> the COLO cache into the SVM), the cost of qemu_system_reset() seems
> acceptable ;)

I've got a patch where I've tried to multithread the flush - it's made it a
little faster, but not as much as I hoped (~20ms down to ~16ms using 4 cores);
a sketch of one possible threading pattern appears at the end of this mail.

> Another choice to fix the problem is to save the VMState ignoring the
> needed() return value, but that method is not so graceful.
>
> diff --git a/migration/vmstate.c b/migration/vmstate.c
> index e5388f0..7d15bba 100644
> --- a/migration/vmstate.c
> +++ b/migration/vmstate.c
> @@ -409,7 +409,7 @@ static void vmstate_subsection_save(QEMUFile *f, const VMStateDescription *vmsd,
>      bool subsection_found = false;
>
>      while (sub && sub->needed) {
> -        if (sub->needed(opaque)) {
> +        if (sub->needed(opaque) || migrate_in_colo_state()) {
>              const VMStateDescription *vmsd = sub->vmsd;
>              uint8_t len;

Maybe, but I suspect this would also find other strange cases in devices.
For example, without the reset I wouldn't really trust devices to load the
new state in; they wouldn't have been tested with all the states they might
have left themselves in previously.

Dave

> Thanks,
> Hailiang
>
> >Dave
> >
> >>I guess reset is normally off anyone's radar since it's outside
> >>the time anyone cares about, but I guess the guys trying
> >>to make qemu start really quickly might be interested.
> >>
> >>Dave
> >>
> >>>Patch status:
> >>>Unreviewed: patch 21,27,28,29,33,38
> >>>Updated: patch 31,34,35,37
> >>>
> >>>TODO:
> >>>1. Checkpoint based on proxy in qemu
> >>>2. The capability of continuous FT
> >>>3. Optimize the VM's downtime during checkpoint
> >>>
> >>>v15:
> >>> - Go on with the shutdown process if an error is encountered while
> >>>   sending the shutdown message to the SVM. (patch 24)
> >>> - Rename qemu_need_skip_netfilter to qemu_netfilter_can_skip and remove
> >>>   some useless comments. (patch 31, Jason)
> >>> - Call object_new_with_props() directly to add the filter in
> >>>   colo_add_buffer_filter. (patch 34, Jason)
> >>> - Re-implement colo_set_filter_status() based on the COLOBufferFilters
> >>>   list. (patch 35)
> >>> - Re-implement colo_flush_filter_packets() based on the COLOBufferFilters
> >>>   list. (patch 37)
> >>>
> >>>v14:
> >>> - Re-implement the network processing based on netfilter (Jason Wang)
> >>> - Rename 'COLOCommand' to 'COLOMessage'. (Markus's suggestion)
> >>> - Split two new patches (patch 27/28) out of patch 29
> >>> - Fix some other comments from Dave and Markus.
> >>>
> >>>v13:
> >>> - Refactor the colo_*_cmd helper functions to use an 'Error **errp'
> >>>   parameter instead of a return value to indicate success or failure.
> >>>   (patch 10)
> >>> - Remove the optional error message for the COLO_EXIT event. (patch 25)
> >>> - Use a semaphore to notify the colo/colo incoming loop that the
> >>>   failover work is finished. (patch 26)
> >>> - Move the COLO shutdown related code to the colo.c file. (patch 28)
> >>> - Fix a memory leak bug in the colo incoming loop. (new patch 31)
> >>> - Re-use some existing helper functions to realize the process of
> >>>   saving/loading ram and device state. (patch 32)
> >>> - Fix some other comments from Dave and Markus.
> >>>
> >>>zhanghailiang (38):
> >>>  configure: Add parameter for configure to enable/disable COLO support
> >>>  migration: Introduce capability 'x-colo' to migration
> >>>  COLO: migrate colo related info to secondary node
> >>>  migration: Integrate COLO checkpoint process into migration
> >>>  migration: Integrate COLO checkpoint process into loadvm
> >>>  COLO/migration: Create a new communication path from destination to source
> >>>  COLO: Implement colo checkpoint protocol
> >>>  COLO: Add a new RunState RUN_STATE_COLO
> >>>  QEMUSizedBuffer: Introduce two help functions for qsb
> >>>  COLO: Save PVM state to secondary side when do checkpoint
> >>>  COLO: Load PVM's dirty pages into SVM's RAM cache temporarily
> >>>  ram/COLO: Record the dirty pages that SVM received
> >>>  COLO: Load VMState into qsb before restore it
> >>>  COLO: Flush PVM's cached RAM into SVM's memory
> >>>  COLO: Add checkpoint-delay parameter for migrate-set-parameters
> >>>  COLO: synchronize PVM's state to SVM periodically
> >>>  COLO failover: Introduce a new command to trigger a failover
> >>>  COLO failover: Introduce state to record failover process
> >>>  COLO: Implement failover work for Primary VM
> >>>  COLO: Implement failover work for Secondary VM
> >>>  qmp event: Add COLO_EXIT event to notify users while exited from COLO
> >>>  COLO failover: Shutdown related socket fd when do failover
> >>>  COLO failover: Don't do failover during loading VM's state
> >>>  COLO: Process shutdown command for VM in COLO state
> >>>  COLO: Update the global runstate after going into colo state
> >>>  savevm: Introduce two helper functions for save/find loadvm_handlers entry
> >>>  migration/savevm: Add new helpers to process the different stages of loadvm
> >>>  migration/savevm: Export two helper functions for savevm process
> >>>  COLO: Separate the process of saving/loading ram and device state
> >>>  COLO: Split qemu_savevm_state_begin out of checkpoint process
> >>>  net/filter: Add a 'status' property for filter object
> >>>  filter-buffer: Accept zero interval
> >>>  net: Add notifier/callback for netdev init
> >>>  COLO/filter: add each netdev a buffer filter
> >>>  COLO: manage the status of buffer filters for PVM
> >>>  filter-buffer: make filter_buffer_flush() public
> >>>  COLO: flush buffered packets in checkpoint process or exit COLO
> >>>  COLO: Add block replication into colo process
> >>>
> >>> configure                     |  11 +
> >>> docs/qmp-events.txt           |  16 +
> >>> hmp-commands.hx               |  15 +
> >>> hmp.c                         |  15 +
> >>> hmp.h                         |   1 +
> >>> include/exec/ram_addr.h       |   1 +
> >>> include/migration/colo.h      |  42 ++
> >>> include/migration/failover.h  |  33 ++
> >>> include/migration/migration.h |  16 +
> >>> include/migration/qemu-file.h |   3 +-
> >>> include/net/filter.h          |   5 +
> >>> include/net/net.h             |   4 +
> >>> include/sysemu/sysemu.h       |   9 +
> >>> migration/Makefile.objs       |   2 +
> >>> migration/colo-comm.c         |  76 ++++
> >>> migration/colo-failover.c     |  83 ++++
> >>> migration/colo.c              | 866 ++++++++++++++++++++++++++++++++++++++++++
> >>> migration/migration.c         | 109 +++++-
> >>> migration/qemu-file-buf.c     |  61 +++
> >>> migration/ram.c               | 175 ++++++++-
> >>> migration/savevm.c            | 114 ++++--
> >>> net/filter-buffer.c           |  14 +-
> >>> net/filter.c                  |  40 ++
> >>> net/net.c                     |  33 ++
> >>> qapi-schema.json              | 104 ++++-
> >>> qapi/event.json               |  15 +
> >>> qemu-options.hx               |   4 +-
> >>> qmp-commands.hx               |  23 +-
> >>> stubs/Makefile.objs           |   1 +
> >>> stubs/migration-colo.c        |  54 +++
> >>> trace-events                  |   8 +
> >>> vl.c                          |  31 +-
> >>> 32 files changed, 1908 insertions(+), 76 deletions(-)
> >>> create mode 100644 include/migration/colo.h
> >>> create mode 100644 include/migration/failover.h
> >>> create mode 100644 migration/colo-comm.c
> >>> create mode 100644 migration/colo-failover.c
> >>> create mode 100644 migration/colo.c
> >>> create mode 100644 stubs/migration-colo.c
> >>>
> >>>--
> >>>1.8.3.1
> >>>
> >>--
> >>Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> >
> >--
> >Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
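On the multithreaded flush mentioned above: the following is only a rough
sketch of one possible threading pattern, not Dave's actual patch. FlushJob,
flush_worker() and colo_flush_ram_cache_mt() are hypothetical stand-ins rather
than QEMU APIs; only plain pthreads and memcpy() are assumed.

    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define FLUSH_THREADS 4

    typedef struct FlushJob {
        uint8_t *cache;              /* COLO RAM cache (source of dirty pages)   */
        uint8_t *guest_ram;          /* SVM RAM (destination)                    */
        const unsigned long *dirty;  /* indices of pages dirtied this checkpoint */
        size_t first, last;          /* half-open range of 'dirty' handled here  */
        size_t page_size;
    } FlushJob;

    /* Copy this worker's share of the dirty pages from the cache into SVM RAM. */
    static void *flush_worker(void *opaque)
    {
        FlushJob *job = opaque;
        size_t i;

        for (i = job->first; i < job->last; i++) {
            size_t off = job->dirty[i] * job->page_size;
            memcpy(job->guest_ram + off, job->cache + off, job->page_size);
        }
        return NULL;
    }

    /* Flush all cached dirty pages into the SVM's RAM with FLUSH_THREADS workers. */
    static void colo_flush_ram_cache_mt(uint8_t *cache, uint8_t *guest_ram,
                                        const unsigned long *dirty, size_t ndirty,
                                        size_t page_size)
    {
        pthread_t threads[FLUSH_THREADS];
        FlushJob jobs[FLUSH_THREADS];
        size_t chunk = (ndirty + FLUSH_THREADS - 1) / FLUSH_THREADS;
        int t;

        for (t = 0; t < FLUSH_THREADS; t++) {
            size_t first = (size_t)t * chunk;
            size_t last = first + chunk;

            jobs[t] = (FlushJob) {
                .cache = cache, .guest_ram = guest_ram, .dirty = dirty,
                .first = first < ndirty ? first : ndirty,
                .last  = last  < ndirty ? last  : ndirty,
                .page_size = page_size,
            };
            pthread_create(&threads[t], NULL, flush_worker, &jobs[t]);
        }
        for (t = 0; t < FLUSH_THREADS; t++) {
            pthread_join(threads[t], NULL);
        }
    }

Splitting the dirty-page list evenly keeps the workers independent, but the
copy is largely memory-bandwidth bound, which may be why extra cores give only
a modest win (as observed above, roughly 20ms down to 16ms with 4 cores).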