* Hailiang Zhang (zhang.zhanghaili...@huawei.com) wrote:
> On 2016/2/27 0:36, Dr. David Alan Gilbert wrote:
> >* Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
> >>* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:
> >>>From: root <root@localhost.localdomain>
> >>>
> >>>This is the 15th version of COLO (still only supports periodic checkpoints).
> >>>
> >>>Here is only the COLO frame part; you can get the whole code from github:
> >>>https://github.com/coloft/qemu/commits/colo-v2.6-periodic-mode
> >>>
> >>>There are few changes in this series except the network related part.
> >>
> >>I was looking at the time the guest is paused during COLO and
> >>was surprised to find one of the larger chunks was the time to reset
> >>the guest before loading each checkpoint; I've traced it part way, and the
> >>biggest contributors for my test VM seem to be:
> >>
> >>   3.8ms  pcibus_reset: VGA
> >>   1.8ms  pcibus_reset: virtio-net-pci
> >>   1.5ms  pcibus_reset: virtio-blk-pci
> >>   1.5ms  qemu_devices_reset: piix4_reset
> >>   1.1ms  pcibus_reset: piix3-ide
> >>   1.1ms  pcibus_reset: virtio-rng-pci
> >>
> >>I've not looked deeper yet, but some of these are very silly;
> >>I'm running with -nographic, so why it takes 3.8ms to reset VGA is
> >>going to be interesting.
> >>Also, my only block device is the virtio-blk, so while I understand that the
> >>standard PC machine has the IDE controller, why does it take over a ms
> >>to reset an unused device?
> >
> >OK, so I've dug a bit deeper, and it appears that it's the changes to the
> >PCI BARs that actually take the time; every time we do a reset we
> >reset all the BARs, which causes a pci_update_mappings and
> >ends up doing a memory_region_del_subregion.
> >Then we load the config space of the PCI device as we do the vmstate_load,
> >and this recreates all the mappings again.
> >
> >I'm not sure what the fix is, but it sounds like it would
> >speed up the checkpoints usefully if we can avoid the unmap/remap when
> >the mappings are the same.
>
> Interesting, and thanks for your report.
>
> We already know qemu_system_reset() is a time-consuming function and we
> shouldn't call it here, but if we don't, there is a bug which we
> reported before in a previous COLO series; below is a copy of the related
> patch comment:
>
> COLO VMstate: Load VM state into qsb before restore it
>
> We should not destroy the state of the secondary until we have received the
> whole state from the primary, in case the primary fails in the middle of
> sending the state; so here we cache the device state on the secondary
> before restoring it.
>
> Besides, we should call qemu_system_reset() before loading the VM state,
> which ensures the data is consistent.
> Note: if we discard qemu_system_reset(), there will be some odd errors.
> For example, qemu on the slave side crashes and reports:
>
> KVM: entry failed, hardware error 0x7
> EAX=00000000 EBX=0000e000 ECX=00009578 EDX=0000434f
> ESI=0000fc10 EDI=0000434f EBP=00000000 ESP=00001fca
> EIP=00009594 EFL=00010246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0040 00000400 0000ffff 00009300
> CS =f000 000f0000 0000ffff 00009b00
> SS =434f 000434f0 0000ffff 00009300
> DS =434f 000434f0 0000ffff 00009300
> FS =0000 00000000 0000ffff 00009300
> GS =0000 00000000 0000ffff 00009300
> LDT=0000 00000000 0000ffff 00008200
> TR =0000 00000000 0000ffff 00008b00
> GDT=     0002dcc8 00000047
> IDT=     00000000 0000ffff
> CR0=00000010 CR2=ffffffff CR3=00000000 CR4=00000000
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000000
> Code=c0 74 0f 66 b9 78 95 00 00 66 31 d2 66 31 c0 e9 47 e0 fb 90 <f3> 90 fa fc 66 c3 66 53 66 89 c3
> ERROR: invalid runstate transition: 'internal-error' -> 'colo'
>
> The reason is that some of the device state is ignored when saving the
> device state to the slave if the corresponding data is still at its
> initial value, such as 0.
> But the device state on the slave may already have been initialised, so
> after a loop of checkpoints the device state values become inconsistent.
> This will happen when the PVM reboots or the SVM runs ahead of the PVM
> in the startup process.
>
> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
> Signed-off-by: Yang Hongyang <yan...@cn.fujitsu.com>
> Signed-off-by: Gonglei <arei.gong...@huawei.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
>
> As described above, some values of the device state are zero and are
> ignored during migration. That is not a problem for normal migration,
> because on the destination VM the initial values are zero too. But COLO
> does more than one round of migration: a value may change from non-zero
> back to zero, be ignored again in the next checkpoint, and the VMState
> then becomes inconsistent on the SVM.
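For reference, the load sequence that the quoted patch comment describes can be
sketched roughly as below. This is only an illustrative sketch, not the series'
actual code: colo_receive_state_into_qsb() is a hypothetical stand-in for the
receive path, while qemu_bufopen(), qemu_loadvm_state(), qemu_fclose() and
qemu_system_reset(VMRESET_SILENT) follow the QEMU APIs of that era (exact
signatures and buffer-ownership details may differ).

    /* Secondary side of a checkpoint: cache first, then reset, then load. */
    static int colo_incoming_load_checkpoint(QEMUFile *from_primary,
                                             QEMUSizedBuffer *qsb)
    {
        QEMUFile *cached;
        int ret;

        /* 1. Receive the complete device state into the qsb without touching
         *    the running SVM, so a primary failure mid-transfer leaves the
         *    secondary's current state intact (hypothetical helper). */
        ret = colo_receive_state_into_qsb(from_primary, qsb);
        if (ret < 0) {
            return ret;
        }

        /* 2. Reset the machine so that fields whose subsections were skipped
         *    on the wire (their needed() callback saw a zero value on the PVM)
         *    cannot keep stale non-zero values from an earlier checkpoint. */
        qemu_system_reset(VMRESET_SILENT);

        /* 3. Replay the cached state into the freshly reset machine. */
        cached = qemu_bufopen("r", qsb);
        if (!cached) {
            return -1;
        }
        ret = qemu_loadvm_state(cached);
        qemu_fclose(cached);
        return ret;
    }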
Yes, this doesn't really surprise me; a lot of the migration code does weird
things like this, so reset is the safest way.

> The above error is caused directly by a wrong value of 'async_pf_en_msr'.
>
> static const VMStateDescription vmstate_async_pf_msr = {
>     .name = "cpu/async_pf_msr",
>     .version_id = 1,
>     .minimum_version_id = 1,
>     .needed = async_pf_msr_needed,
>     .fields = (VMStateField[]) {
>         VMSTATE_UINT64(env.async_pf_en_msr, X86CPU),
>         VMSTATE_END_OF_LIST()
>     }
> };
>
> static bool async_pf_msr_needed(void *opaque)
> {
>     X86CPU *cpu = opaque;
>
>     return cpu->env.async_pf_en_msr != 0;
> }
>
> Some other VMState registers in CPUX86State have the same problem; we can't
> be sure they won't cause problems if their values are wrong.
> So here we simply call qemu_system_reset() to avoid the inconsistency.
> Besides, compared with the most time-consuming operation (flushing RAM from
> the COLO cache into the SVM), the cost of qemu_system_reset() seems
> acceptable ;)

I've got a patch where I've tried to multithread the flush - it's made it a
little faster, but not as much as I hoped (~20ms down to ~16ms using 4 cores);
a sketch of one possible threading pattern appears at the end of this mail.

> Another choice to fix the problem is to save the VMState ignoring the
> needed() return value, but that method is not so graceful.
>
> diff --git a/migration/vmstate.c b/migration/vmstate.c
> index e5388f0..7d15bba 100644
> --- a/migration/vmstate.c
> +++ b/migration/vmstate.c
> @@ -409,7 +409,7 @@ static void vmstate_subsection_save(QEMUFile *f, const VMStateDescription *vmsd,
>      bool subsection_found = false;
>
>      while (sub && sub->needed) {
> -        if (sub->needed(opaque)) {
> +        if (sub->needed(opaque) || migrate_in_colo_state()) {
>              const VMStateDescription *vmsd = sub->vmsd;
>              uint8_t len;

Maybe, but I suspect this would also find other strange cases in devices.
For example, without the reset I wouldn't really trust devices to load the
new state in; they wouldn't have been tested with all the states they might
have left themselves in previously.

Dave

> Thanks,
> Hailiang
>
> >Dave
> >
> >>I guess reset is normally off anyone's radar since it's outside
> >>the time anyone cares about, but I guess the guys trying
> >>to make qemu start really quickly might be interested.
> >>
> >>Dave
> >>
> >>>Patch status:
> >>>Unreviewed: patch 21,27,28,29,33,38
> >>>Updated: patch 31,34,35,37
> >>>
> >>>TODO:
> >>>1. Checkpoint based on proxy in qemu
> >>>2. The capability of continuous FT
> >>>3. Optimize the VM's downtime during checkpoint
> >>>
> >>>v15:
> >>> - Go on with the shutdown process if an error is encountered while
> >>>   sending the shutdown message to the SVM. (patch 24)
> >>> - Rename qemu_need_skip_netfilter to qemu_netfilter_can_skip and remove
> >>>   some useless comments. (patch 31, Jason)
> >>> - Call object_new_with_props() directly to add the filter in
> >>>   colo_add_buffer_filter. (patch 34, Jason)
> >>> - Re-implement colo_set_filter_status() based on the COLOBufferFilters
> >>>   list. (patch 35)
> >>> - Re-implement colo_flush_filter_packets() based on the COLOBufferFilters
> >>>   list. (patch 37)
> >>>
> >>>v14:
> >>> - Re-implement the network processing based on netfilter (Jason Wang)
> >>> - Rename 'COLOCommand' to 'COLOMessage'. (Markus's suggestion)
> >>> - Split two new patches (patch 27/28) out of patch 29
> >>> - Fix some other comments from Dave and Markus.
> >>>
> >>>v13:
> >>> - Refactor the colo_*_cmd helper functions to use an 'Error **errp'
> >>>   parameter instead of a return value to indicate success or failure.
> >>>   (patch 10)
> >>> - Remove the optional error message for the COLO_EXIT event. (patch 25)
> >>> - Use a semaphore to notify the colo/colo incoming loop that the
> >>>   failover work is finished. (patch 26)
> >>> - Move the COLO shutdown related code to the colo.c file. (patch 28)
> >>> - Fix a memory leak bug in the colo incoming loop. (new patch 31)
> >>> - Re-use some existing helper functions to realize the process of
> >>>   saving/loading ram and device state. (patch 32)
> >>> - Fix some other comments from Dave and Markus.
> >>>
> >>>zhanghailiang (38):
> >>>  configure: Add parameter for configure to enable/disable COLO support
> >>>  migration: Introduce capability 'x-colo' to migration
> >>>  COLO: migrate colo related info to secondary node
> >>>  migration: Integrate COLO checkpoint process into migration
> >>>  migration: Integrate COLO checkpoint process into loadvm
> >>>  COLO/migration: Create a new communication path from destination to source
> >>>  COLO: Implement colo checkpoint protocol
> >>>  COLO: Add a new RunState RUN_STATE_COLO
> >>>  QEMUSizedBuffer: Introduce two help functions for qsb
> >>>  COLO: Save PVM state to secondary side when do checkpoint
> >>>  COLO: Load PVM's dirty pages into SVM's RAM cache temporarily
> >>>  ram/COLO: Record the dirty pages that SVM received
> >>>  COLO: Load VMState into qsb before restore it
> >>>  COLO: Flush PVM's cached RAM into SVM's memory
> >>>  COLO: Add checkpoint-delay parameter for migrate-set-parameters
> >>>  COLO: synchronize PVM's state to SVM periodically
> >>>  COLO failover: Introduce a new command to trigger a failover
> >>>  COLO failover: Introduce state to record failover process
> >>>  COLO: Implement failover work for Primary VM
> >>>  COLO: Implement failover work for Secondary VM
> >>>  qmp event: Add COLO_EXIT event to notify users while exited from COLO
> >>>  COLO failover: Shutdown related socket fd when do failover
> >>>  COLO failover: Don't do failover during loading VM's state
> >>>  COLO: Process shutdown command for VM in COLO state
> >>>  COLO: Update the global runstate after going into colo state
> >>>  savevm: Introduce two helper functions for save/find loadvm_handlers entry
> >>>  migration/savevm: Add new helpers to process the different stages of loadvm
> >>>  migration/savevm: Export two helper functions for savevm process
> >>>  COLO: Separate the process of saving/loading ram and device state
> >>>  COLO: Split qemu_savevm_state_begin out of checkpoint process
> >>>  net/filter: Add a 'status' property for filter object
> >>>  filter-buffer: Accept zero interval
> >>>  net: Add notifier/callback for netdev init
> >>>  COLO/filter: add each netdev a buffer filter
> >>>  COLO: manage the status of buffer filters for PVM
> >>>  filter-buffer: make filter_buffer_flush() public
> >>>  COLO: flush buffered packets in checkpoint process or exit COLO
> >>>  COLO: Add block replication into colo process
> >>>
> >>> configure                     |  11 +
> >>> docs/qmp-events.txt           |  16 +
> >>> hmp-commands.hx               |  15 +
> >>> hmp.c                         |  15 +
> >>> hmp.h                         |   1 +
> >>> include/exec/ram_addr.h       |   1 +
> >>> include/migration/colo.h      |  42 ++
> >>> include/migration/failover.h  |  33 ++
> >>> include/migration/migration.h |  16 +
> >>> include/migration/qemu-file.h |   3 +-
> >>> include/net/filter.h          |   5 +
> >>> include/net/net.h             |   4 +
> >>> include/sysemu/sysemu.h       |   9 +
> >>> migration/Makefile.objs       |   2 +
> >>> migration/colo-comm.c         |  76 ++++
> >>> migration/colo-failover.c     |  83 ++++
> >>> migration/colo.c              | 866 ++++++++++++++++++++++++++++++++++++++++++
> >>> migration/migration.c         | 109 +++++-
> >>> migration/qemu-file-buf.c     |  61 +++
> >>> migration/ram.c               | 175 ++++++++-
> >>> migration/savevm.c            | 114 ++++--
> >>> net/filter-buffer.c           |  14 +-
> >>> net/filter.c                  |  40 ++
> >>> net/net.c                     |  33 ++
> >>> qapi-schema.json              | 104 ++++-
> >>> qapi/event.json               |  15 +
> >>> qemu-options.hx               |   4 +-
> >>> qmp-commands.hx               |  23 +-
> >>> stubs/Makefile.objs           |   1 +
> >>> stubs/migration-colo.c        |  54 +++
> >>> trace-events                  |   8 +
> >>> vl.c                          |  31 +-
> >>> 32 files changed, 1908 insertions(+), 76 deletions(-)
> >>> create mode 100644 include/migration/colo.h
> >>> create mode 100644 include/migration/failover.h
> >>> create mode 100644 migration/colo-comm.c
> >>> create mode 100644 migration/colo-failover.c
> >>> create mode 100644 migration/colo.c
> >>> create mode 100644 stubs/migration-colo.c
> >>>
> >>>--
> >>>1.8.3.1
> >>>
> >>--
> >>Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> >
> >--
> >Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
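On the multithreaded flush mentioned above: the following is only a rough
sketch of one possible threading pattern, not Dave's actual patch. FlushJob,
flush_worker() and colo_flush_ram_cache_mt() are hypothetical stand-ins rather
than QEMU APIs; only plain pthreads and memcpy() are assumed.

    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define FLUSH_THREADS 4

    typedef struct FlushJob {
        uint8_t *cache;              /* COLO RAM cache (source of dirty pages)   */
        uint8_t *guest_ram;          /* SVM RAM (destination)                    */
        const unsigned long *dirty;  /* indices of pages dirtied this checkpoint */
        size_t first, last;          /* half-open range of 'dirty' handled here  */
        size_t page_size;
    } FlushJob;

    /* Copy this worker's share of the dirty pages from the cache into SVM RAM. */
    static void *flush_worker(void *opaque)
    {
        FlushJob *job = opaque;
        size_t i;

        for (i = job->first; i < job->last; i++) {
            size_t off = job->dirty[i] * job->page_size;
            memcpy(job->guest_ram + off, job->cache + off, job->page_size);
        }
        return NULL;
    }

    /* Flush all cached dirty pages into the SVM's RAM with FLUSH_THREADS workers. */
    static void colo_flush_ram_cache_mt(uint8_t *cache, uint8_t *guest_ram,
                                        const unsigned long *dirty, size_t ndirty,
                                        size_t page_size)
    {
        pthread_t threads[FLUSH_THREADS];
        FlushJob jobs[FLUSH_THREADS];
        size_t chunk = (ndirty + FLUSH_THREADS - 1) / FLUSH_THREADS;
        int t;

        for (t = 0; t < FLUSH_THREADS; t++) {
            size_t first = (size_t)t * chunk;
            size_t last = first + chunk;

            jobs[t] = (FlushJob) {
                .cache = cache, .guest_ram = guest_ram, .dirty = dirty,
                .first = first < ndirty ? first : ndirty,
                .last  = last  < ndirty ? last  : ndirty,
                .page_size = page_size,
            };
            pthread_create(&threads[t], NULL, flush_worker, &jobs[t]);
        }
        for (t = 0; t < FLUSH_THREADS; t++) {
            pthread_join(threads[t], NULL);
        }
    }

Splitting the dirty-page list evenly keeps the workers independent, but the
copy is largely memory-bandwidth bound, which may be why extra cores give only
a modest win (as observed above, roughly 20ms down to 16ms with 4 cores).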