* Lai Jiangshan (jiangshan...@gmail.com) wrote:
> 1) What's this
>
> When the migration capability 'bypass-shared-memory' is set, shared
> memory is bypassed during migration.
>
> It is the key feature needed to enable several advanced features in
> qemu, such as qemu-local-migration, qemu-live-update,
> extremely-fast-save-restore, vm-template, vm-fast-live-clone,
> yet-another-post-copy-migration, etc.
>
> The philosophy behind this key feature, and the advanced features that
> build on it, is that part of the memory management is separated out
> from qemu, so that other toolkits such as libvirt, kata-containers
> (https://github.com/kata-containers), runv (https://github.com/hyperhq/runv/)
> or several cooperating qemu commands can directly access it, manage it,
> and provide features on top of it.
>
> 2) Status in the real world
>
> hyperhq (http://hyper.sh http://hypercontainer.io/) introduced the
> feature vm-template (vm-fast-live-clone) to the hyper container several
> years ago, and it works perfectly.
> (see https://github.com/hyperhq/runv/pull/297)
>
> The vm-template feature lets containers (VMs) start in 130ms and saves
> 80M of memory per container (VM), so hyper containers are as fast and
> as dense as normal containers.
>
> The kata-containers project (https://github.com/kata-containers), which
> was launched by hyper, intel and friends and which descended from runv
> (and clear-container), should have this feature enabled. Unfortunately,
> due to the code conflict between runv & cc, the feature was temporarily
> disabled; it is being brought back by the hyper and intel teams.
>
> 3) How to use it and bring up the advanced features
>
> On the current qemu command line, shared memory has to be configured
> via a memory backend object.
>
> a) feature: qemu-local-migration, qemu-live-update
> Put the mem-path on tmpfs and set share=on for it when starting the
> vm. Example:
> -object \
> memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
> -numa node,nodeid=0,cpus=0-7,memdev=mem
>
> When you want to migrate the vm locally (after fixing a security bug
> in the qemu binary, or for any other reason), start a new qemu with the
> same command line plus -incoming, then migrate the vm from the old qemu
> to the new qemu with the migration capability 'bypass-shared-memory'
> set. The migration transfers the device state *ONLY*; the memory stays
> in the original tmpfs-backed file.
>
> b) feature: extremely-fast-save-restore
> The same as above, but the mem-path is on a persistent file system.
>
> c) feature: vm-template, vm-fast-live-clone
> The template vm is started as in a), and paused when the guest reaches
> the template point (for example, when the guest app is ready); then the
> template vm is saved. (The qemu process of the template can be killed
> now, because we only need the memory and the device state files, both
> in tmpfs.)
>
> Then we can launch one or multiple VMs based on the template vm state.
> The new VMs are started without "share=on"; they all share the initial
> memory from the memory file, which saves a lot of memory. All the new
> VMs start from the template point, so the guest app can get to work
> quickly.
>
> A VM booted from a template vm can't become a template again; if you
> need this unusual chained-template feature, you can write a
> cloneable-tmpfs kernel module for it.
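To restate a) as a concrete sequence (a sketch only; 'devstate' is a
hypothetical stream name, and the monitor commands are the same ones
used in the experiment below):

  # source qemu, in the monitor:
  migrate_set_capability bypass-shared-memory on
  migrate "exec:cat > devstate"

  # destination qemu, started with the identical command line
  # (including the share=on memory-backend-file) plus:
  -incoming "exec:cat devstate"

Only the device state goes through the stream; the guest RAM stays in
the shared tmpfs file and is simply re-mapped by the new qemu.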
I've just tried doing something similar with this patch; it's really
interesting. I used LVM snapshotting for the RAM:

  cd /dev/shm
  fallocate -l 20G backingfile
  losetup -f ./backingfile
  pvcreate /dev/loop0
  vgcreate ram /dev/loop0
  lvcreate -L4G -nram1 ram /dev/loop0

  qemu -M pc,accel=kvm -m 4G \
    -object memory-backend-file,id=mem,size=4G,mem-path=/dev/ram/ram1,share=on \
    -numa node,memdev=mem -vnc :0 \
    -drive file=my.qcow2,id=d,cache=none -monitor stdio

boot the VM, and do:

  migrate_set_capability bypass-shared-memory on
  migrate_set_speed 10G
  migrate "exec:cat > migstream1"
  q

then:

  lvcreate -n ramsnap1 -s ram/ram1 -L4G
  qemu -M pc,accel=kvm -m 4G \
    -object memory-backend-file,id=mem,size=4G,mem-path=/dev/ram/ramsnap1,share=on \
    -numa node,memdev=mem -vnc :0 \
    -drive file=my.qcow2,id=d,cache=none -monitor stdio \
    -snapshot -incoming "exec:cat migstream1"

  lvcreate -n ramsnap2 -s ram/ram1 -L4G
  qemu -M pc,accel=kvm -m 4G \
    -object memory-backend-file,id=mem,size=4G,mem-path=/dev/ram/ramsnap2,share=on \
    -numa node,memdev=mem -vnc :1 \
    -drive file=my.qcow2,id=d,cache=none -monitor stdio \
    -snapshot -incoming "exec:cat migstream1"

and I've got two separate instances of qemu restored from that stream.
It seems to work; I wonder if we ever need things like msync() or
similar?

I've not tried creating a 2nd template with this.
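On the msync() question: for a shared file-backed mapping the kernel
already writes dirty pages back to the backing file/device eventually,
so it would probably only matter if we want the backing store to be
consistent at a particular point (e.g. just before taking the LVM
snapshot). A standalone sketch of that (hypothetical, not QEMU code;
path and size borrowed from the example above):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    const size_t len = (size_t)4 << 30;      /* 4G, as in the example */
    int fd = open("/dev/ram/ram1", O_RDWR);  /* hypothetical backing LV */

    if (fd < 0) {
        perror("open");
        return 1;
    }
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* ... writes to the shared mapping happen here ... */
    if (msync(addr, len, MS_SYNC) < 0) {     /* flush dirty pages to the LV */
        perror("msync");
        return 1;
    }
    munmap(addr, len);
    close(fd);
    return 0;
}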
> The libvirt toolkit can't manage vm-template currently; in
> hyperhq/runv we use a qemu wrapper script to do it. I hope someone
> adds a "libvirt managed template" feature to libvirt.
>
> d) feature: yet-another-post-copy-migration
> It is a possible feature; no toolkit can do it well now. Using an nbd
> server/client on the memory file is just about OK but inconvenient. A
> special feature for tmpfs might be needed to fully complete this
> feature. No one needs yet another post-copy migration method, but it
> is possible should someone ever want it.
>
> Cc: Juan Quintela <quint...@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilb...@redhat.com>
> Cc: Eric Blake <ebl...@redhat.com>
> Cc: Markus Armbruster <arm...@redhat.com>
> Cc: Samuel Ortiz <sa...@linux.intel.com>
> Cc: Sebastien Boeuf <sebastien.bo...@intel.com>
> Cc: James O. D. Hunt <james.o.h...@intel.com>
> Cc: Xu Wang <gna...@gmail.com>
> Cc: Peng Tao <bergw...@gmail.com>
> Cc: Xiao Guangrong <xiaoguangr...@tencent.com>
> Cc: Xiao Guangrong <xiaoguangrong.e...@gmail.com>
> Signed-off-by: Lai Jiangshan <jiangshan...@gmail.com>
> ---
>
> Changes in V5:
>   check capability conflict in migrate_caps_check()
>
> Changes in V4:
>   fixes checkpatch.pl errors
>
> Changes in V3:
>   rebased on upstream master
>   update the available version of the capability to v2.13
>
> Changes in V2:
>   rebased on 2.11.1
>
>  migration/migration.c | 22 ++++++++++++++++++++++
>  migration/migration.h |  1 +
>  migration/ram.c       | 27 ++++++++++++++++++---------
>  qapi/migration.json   |  6 +++++-
>  4 files changed, 46 insertions(+), 10 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 52a5092add..110b40f6d4 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -736,6 +736,19 @@ static bool migrate_caps_check(bool *cap_list,
>          return false;
>      }
>
> +    if (cap_list[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY]) {
> +        /* Bypass and postcopy are quite conflicting ways
> +         * to get memory in the destination.  And there
> +         * is no code to discriminate the differences and
> +         * handle the conflicts currently.  It should be possible
> +         * to fix, but it is generally useless when both ways
> +         * are used together.
> +         */
> +        error_setg(errp, "Bypass is not currently compatible "
> +                   "with postcopy");
> +        return false;
> +    }
> +

Good.

>      /* This check is reasonably expensive, so only when it's being
>       * set the first time, also it's only the destination that needs
>       * special support.
> @@ -1509,6 +1522,15 @@ bool migrate_release_ram(void)
>      return s->enabled_capabilities[MIGRATION_CAPABILITY_RELEASE_RAM];
>  }
>
> +bool migrate_bypass_shared_memory(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY];
> +}
> +
>  bool migrate_postcopy_ram(void)
>  {
>      MigrationState *s;
> diff --git a/migration/migration.h b/migration/migration.h
> index 8d2f320c48..cfd2513ef0 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -206,6 +206,7 @@ MigrationState *migrate_get_current(void);
>
>  bool migrate_postcopy(void);
>
> +bool migrate_bypass_shared_memory(void);
>  bool migrate_release_ram(void);
>  bool migrate_postcopy_ram(void);
>  bool migrate_zero_blocks(void);
> diff --git a/migration/ram.c b/migration/ram.c
> index 0e90efa092..bca170c386 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -780,6 +780,11 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
>      unsigned long *bitmap = rb->bmap;
>      unsigned long next;
>
> +    /* when this ramblock is requested bypassing */
> +    if (!bitmap) {
> +        return size;
> +    }
> +
>      if (rs->ram_bulk_stage && start > 0) {
>          next = start + 1;
>      } else {
> @@ -850,7 +855,9 @@ static void migration_bitmap_sync(RAMState *rs)
>      qemu_mutex_lock(&rs->bitmap_mutex);
>      rcu_read_lock();
>      RAMBLOCK_FOREACH(block) {
> -        migration_bitmap_sync_range(rs, block, 0, block->used_length);
> +        if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) {
> +            migration_bitmap_sync_range(rs, block, 0, block->used_length);
> +        }
>      }
>      rcu_read_unlock();
>      qemu_mutex_unlock(&rs->bitmap_mutex);
> @@ -2132,18 +2139,12 @@ static int ram_state_init(RAMState **rsp)
>      qemu_mutex_init(&(*rsp)->src_page_req_mutex);
>      QSIMPLEQ_INIT(&(*rsp)->src_page_requests);
>
> -    /*
> -     * Count the total number of pages used by ram blocks not including any
> -     * gaps due to alignment or unplugs.
> -     */
> -    (*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
> -
>      ram_state_reset(*rsp);
>
>      return 0;
>  }
>
> -static void ram_list_init_bitmaps(void)
> +static void ram_list_init_bitmaps(RAMState *rs)
>  {
>      RAMBlock *block;
>      unsigned long pages;
> @@ -2151,9 +2152,17 @@ static void ram_list_init_bitmaps(void)
>      /* Skip setting bitmap if there is no RAM */
>      if (ram_bytes_total()) {
>          QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
> +            if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
> +                continue;
> +            }
>              pages = block->max_length >> TARGET_PAGE_BITS;
>              block->bmap = bitmap_new(pages);
>              bitmap_set(block->bmap, 0, pages);
> +            /*
> +             * Count the total number of pages used by ram blocks not
> +             * including any gaps due to alignment or unplugs.
> +             */
> +            rs->migration_dirty_pages += pages;
>              if (migrate_postcopy_ram()) {
>                  block->unsentmap = bitmap_new(pages);
>                  bitmap_set(block->unsentmap, 0, pages);

Can you please rework this to combine with Cédric Le Goater's 'discard
non-migratable RAMBlocks' - it's quite similar to what you're trying to
do but for a different reason; if you look at the v2 from April 13, I
think you can just find somewhere to clear the RAM_MIGRATABLE flag.
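(A hypothetical refactoring sketch, not part of this patch and
untested, names are mine: pulling the bypass test into one predicate
would keep migration_bitmap_sync and ram_list_init_bitmaps in sync, and
would give a single obvious place to also check a RAM_MIGRATABLE-style
flag once the two series are combined.)

/* Hypothetical helper: one place to decide whether a RAMBlock
 * participates in migration, instead of open-coding the bypass test
 * at each call site. */
static bool ramblock_needs_migration(RAMBlock *block)
{
    if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
        return false;
    }
    return true;
}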
One thing I noticed: in my tree I've got some code that checks whether
a RAM iteration ever finds no dirty block while migration_dirty_pages
is still non-zero, and with this patch I'm seeing that check trigger:

  ram_find_and_save_block: no page found, yet dirty_pages=480

It doesn't seem to trigger without the patch.

Dave

> @@ -2169,7 +2178,7 @@ static void ram_init_bitmaps(RAMState *rs)
>      qemu_mutex_lock_ramlist();
>      rcu_read_lock();
>
> -    ram_list_init_bitmaps();
> +    ram_list_init_bitmaps(rs);
>      memory_global_dirty_log_start();
>      migration_bitmap_sync(rs);
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 9d0bf82cf4..45326480bd 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -357,13 +357,17 @@
>  # @dirty-bitmaps: If enabled, QEMU will migrate named dirty bitmaps.
>  #                 (since 2.12)
>  #
> +# @bypass-shared-memory: the shared memory region will be bypassed on
> +#     migration. This feature allows the memory region to be reused by
> +#     new qemu(s) or be migrated separately. (since 2.13)
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
>             'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
>             'block', 'return-path', 'pause-before-switchover', 'x-multifd',
> -           'dirty-bitmaps' ] }
> +           'dirty-bitmaps', 'bypass-shared-memory' ] }
>
>  ##
>  # @MigrationCapabilityStatus:
> --
> 2.15.1 (Apple Git-101)
>

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK