After some deeper thinking, I'd like to share more analysis. Saving the vmstate equals snapshotting memory, and in theory the methods for that can be summarized as: 1) take a mirror of the region at the moment the "snapshot" request arrives, letting the kernel CoW that region; 2) take a mirror of it by gradually copying the region out, completing when the copy is in sync with the original region, which is basically similar to migration.
Taking a closer look at each:

1) CoW the memory region.

   Saving: block I/O and CPU, since no duplicating step exists.
   Sacrifice: memory.
   Industry improvement solution: NUMA; price: expensive.
   Implementation: hard, needs quite some work.
   QEMU code maintenance: easy.

   Detail: this method is the closest to the meaning of "snapshot", but it
   carries a hidden requirement: reserved memory. On a server that is
   really in use today, it is unlikely that a large chunk of memory is kept
   in reserve. For example, a server with 4G of RAM will probably run a
   3.5G guest, to get the benefits of easier deployment, hardware
   independence, and whole-machine backup/restore; in that case there is
   not enough memory left to do the CoW snapshot. Take another, more
   likely example: a 4G server running two 1.5G guests; here one guest
   would first have to be migrated away, which is obviously bad. So a much
   better solution is to add memory at the time the snapshot is taken. To
   do that without hardware hot-plug, and economically, it needs NUMA plus
   memory sharing:

       Host1        Host2        Host3
         |            |            |
        mem          mem          mem
         |            |            |
         ----------------------------
                      |
                  shared mem

   Several hosts share a pool of memory for snapshotting: a host takes
   memory from the pool when it starts a snapshot and returns it to the
   cluster manager when the snapshot completes. This is possible on
   expensive architectures, but hard to do on the x86 architecture, which
   labels itself cheap.

   One loosely related question I had: does QEMU support migrating to a
   host device? If not, it should support migrating to a block device of
   fixed size (different from a snapshot, since the two mirrors need to
   stay in sync); when shared memory is present, a guest could then be
   migrated to a RAM block device quickly.

   Implementation detail: it should be done by adding an API to the
   kernel, say mem_snapshot(), with which the kernel CoWs a region and, if
   that logic is added as an optimization, writes the snapshotted pages
   out to the far slower shared memory. fork() can do the CoW part (see
   the sketch appended at the end of this mail), but it brings a lot of
   trouble and would not benefit from the NUMA architecture, because it
   cannot move the snapshotted pages to the slower memory.

2) Gradually copy out and sync the memory region. There are two ways to
   do it:

2.1) Migrate to a block device (migrate to an fd, or migrate to an
   image); a usage example is appended at the end of this mail.

   Saving: memory.
   Sacrifice: CPU, block I/O.
   Industry improvement solution: flash disk; price: cheap.
   Implementation: easy, based on migration.
   QEMU code maintenance: easy.

   Detail: this is the relatively easy case; we just need to make the
   size fixed. And flash disks are available on the x86 architecture.

2.2) Migrate to a stream, and use another process to receive and
   rearrange the data.

   Saving: memory.
   Sacrifice: CPU (very high), block I/O (unless a big buffer is used).
   Industry improvement solution: another host or CPU does the work.
   Implementation: hard, needs a new QEMU tool.
   QEMU code maintenance: hard; the data has to be encoded in QEMU, then
   decoded and rearranged in another process, so every change, and every
   new device added, has to be handled on both sides.

   Detail: it invokes a process to receive the data, or invokes a fake
   QEMU to receive and save it (which needs a lot of memory). Since the
   code would be hard to maintain, personally I think it is worse than
   2.1.

Summary, my suggestions:

1) Support both method 1 and method 2.1, treating 2.1 as an improvement
   of migrate-to-fd, and add a new QMP interface such as
   "vmstate-snapshot" for method 1 to declare it a true snapshot. This
   allows it to work on different architectures.

2) Push an API into Linux for method 1, instead of fork(). I'd like to
   send an RFC to the Linux memory mailing list to get feedback.

--
Best Regards

Wenchao Xia
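P.S. To make the fork() remark in method 1 concrete, below is a minimal
userspace sketch of the CoW approach that a kernel mem_snapshot() would
replace. It is for discussion only: save_vmstate_cow(), the dummy RAM
block, and the output path are all hypothetical names, not existing QEMU
or kernel APIs.

    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/wait.h>

    /*
     * Fork-based CoW snapshot: the child inherits a copy-on-write view
     * of the address space frozen at fork() time and writes it out,
     * while the parent (the VM) keeps running and dirtying pages.  Note
     * the kernel gives us no control over where the CoW copies land,
     * which is exactly why this cannot exploit slower NUMA/shared memory.
     */
    static int save_vmstate_cow(const void *ram, size_t ram_size,
                                const char *path)
    {
        pid_t pid = fork();

        if (pid < 0) {
            return -1;                    /* fork failed */
        }
        if (pid == 0) {
            /* Child: its view of "ram" is frozen by CoW. */
            int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
            if (fd < 0) {
                _exit(1);
            }
            const char *p = ram;
            size_t left = ram_size;
            while (left > 0) {
                ssize_t n = write(fd, p, left);
                if (n <= 0) {
                    _exit(1);
                }
                p += n;
                left -= (size_t)n;
            }
            close(fd);
            _exit(0);
        }
        /* Parent: resume the guest at once; reap the child later. */
        return 0;
    }

    int main(void)
    {
        size_t size = 16 << 20;           /* 16M dummy "guest RAM" */
        void *ram = calloc(1, size);

        if (!ram || save_vmstate_cow(ram, size, "/tmp/vmstate.img") < 0) {
            return 1;
        }
        wait(NULL);                       /* here, just wait for the writer */
        free(ram);
        return 0;
    }

A mem_snapshot() API would do the same freeze without duplicating the
whole process, and could migrate the CoW copies to the shared memory.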
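And for reference on method 2.1: it can already be approximated today
with the existing "exec:" migration transport from the HMP monitor. This
is just a usage sketch, with /tmp/vmstate.img a placeholder path:

    (qemu) stop
    (qemu) migrate "exec:cat > /tmp/vmstate.img"
    (qemu) cont

With the guest stopped this is a single pass; for a live snapshot the
iterative dirty-page passes are simply appended to the stream, so the
output grows beyond the RAM size. Making the size fixed, as suggested in
2.1, would mean writing each page at its fixed offset in the image.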