> This is not reasonable IMHO.
>
> I was okay with sticking a name on a ramblock, but encoding a guest PA
> offset turns this into a supported ABI which I'm not willing to do.
>
> A one line change is one thing, but not a complex new option that
> introduces an ABI only for a proprietary product that's jumping through
> hoops to keep from contributing useful logic to QEMU.
Hi Anthony,

Thanks for getting back to me.

Sticking a name on the ramblock file would suit our product just fine. Indeed, this is what we had agreed upon at the KVM forum. However, I submitted a more complex patch in an attempt to expose a more general and easier-to-use feature; I was trying to make a more useful contribution than the simple patch :-)

Perhaps I can assuage your ABI concern and argue the utility of this patch vs. the one-line version. If you still aren't satisfied, please let me know and I'll resubmit the one-line version.

On ABI:

This patch doesn't add a new ABI. QEMU already has this ABI because of Xen live migration.

When a Xen domain is booted, a new domain is created with an empty physmap, and then QEMU is launched. QEMU creates its ramblocks and, via the memory callbacks (xen_add_to_physmap), populates Xen's physmap using the ramblock sizes & offsets.

On incoming migration, the Xen toolstack creates a new domain, populates its physmap, and copies RAM over from the outgoing side. When QEMU is launched, it populates its Xen memory model (i.e., XenIOState) by reading the domain's existing physmap from xenstore. When QEMU then creates its ramblocks, the callbacks in xen-all.c _ignore_ the new ramblocks because their offsets are already in the physmap.

If the new ramblocks had different sizes & offsets than those from the outgoing QEMU process, QEMU's memory model would be inconsistent with Xen's (i.e., the physmap maintained by the hypervisor and the XenIOState maintained in userspace would disagree). In particular, QEMU would expect memory at a physmap offset that the Xen toolstack never populated during live migration.

On utility:

Just adding ramblock names to the backing file paths makes post-copy migration & cloning possible, but it involves some painful VFS contortions, which I walk through in the detailed example below. The new -mem-path parameters, on the other hand, make post-copy migration & cloning simple by leveraging an existing QMP command, existing filesystems, and existing kernel behavior.

Put another way: the useful logic for memory sharing and post-copy live migration already exists in the kernel and in a myriad of filesystems. A fairly small patch (albeit not a one-liner) lets QEMU take advantage of it.

Peter

Detailed example:

Suppose you have a patched QEMU that adds ramblock names to their backing files, and you want to implement memory sharing via cloning. When clones come up, each of their ramblocks' backing files needs to contain the same data as the corresponding backing file from the parent (and obviously you want those new backing files to somehow share pages and COW).

The basic idea is to save the parent's ramblock files and arrange for the clones to open them. You can see the parent's ramblock files easily enough by looking at its unlinked backing files (e.g., /proc/pid/fd/10 is a symlink to /tmp/qemu_back_mem.pc.ram.WHFZYw (deleted), /proc/pid/fd/11 is a symlink to /tmp/qemu_back_mem.vga.vram.WT1yQW (deleted), etc.); a rough sketch of this enumeration follows below. Unfortunately, since they're all mapped MAP_PRIVATE, these symlinks, when opened, will read back all zeros. So you can either implement your own filesystem that gives you a backdoor to the MAP_PRIVATE pages (fast but complicated), or you can use QEMU's monitor to dump guest RAM (slow but works).

When a clone runs and creates a new backing file using mkstemp, you need to arrange for that backing file to somehow contain the same data as the corresponding file from the parent.
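Here is the enumeration sketch referenced above. To be clear, this is just my illustration, not code from the patch: it assumes the qemu_back_mem.<ramblock>.XXXXXX naming visible in the symlink targets above, and the kernel's " (deleted)" annotation for unlinked files.

/*
 * Illustration only: list a running QEMU's unlinked -mem-path backing
 * files by walking /proc/<pid>/fd and reading the symlink targets.
 */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static void list_backing_files(pid_t pid)
{
    char dir_path[64], link_path[320], target[PATH_MAX];
    struct dirent *de;
    DIR *dir;

    snprintf(dir_path, sizeof(dir_path), "/proc/%d/fd", (int)pid);
    dir = opendir(dir_path);
    if (!dir) {
        perror("opendir");
        return;
    }

    while ((de = readdir(dir)) != NULL) {
        ssize_t len;

        snprintf(link_path, sizeof(link_path), "%s/%s", dir_path, de->d_name);
        len = readlink(link_path, target, sizeof(target) - 1);
        if (len < 0) {
            continue;
        }
        target[len] = '\0';

        /* Deleted backing files look like ".../qemu_back_mem.pc.ram.WHFZYw (deleted)" */
        if (strstr(target, "qemu_back_mem.") && strstr(target, "(deleted)")) {
            printf("fd %s -> %s\n", de->d_name, target);
        }
    }
    closedir(dir);
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <qemu-pid>\n", argv[0]);
        return 1;
    }
    list_backing_files((pid_t)atoi(argv[1]));
    return 0;
}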
There is an obvious heuristic for determining this correspondence: parse the ramblock name out of the child's file name and use the matching file from the parent. Correctness aside (multiple ramblocks can share a name, e.g., e1000.rom, but this is moot because the _important_ ramblocks, i.e., pc.ram and vga.vram, are unique in the emulated system we care about), implementing this heuristic is a pain. To see the file being created, you need to implement a custom filesystem. Moreover, to share memory with another file that's been mapped MAP_PRIVATE, you have to implement your own VMA operations. Oye! (A rough sketch of just the name-parsing step follows.)
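The parsing half of that heuristic, at least, is trivial; the pain is everywhere else. A rough sketch, again my own, assuming the /tmp/qemu_back_mem.<ramblock>.XXXXXX template shown in the example above (the helper name is made up):

/*
 * Illustration only: recover the ramblock name from a backing-file path
 * of the form /tmp/qemu_back_mem.<ramblock>.XXXXXX, where XXXXXX is the
 * suffix filled in by mkstemp.
 */
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns a malloc'd ramblock name, or NULL if the path doesn't match. */
static char *ramblock_name_from_path(const char *path)
{
    const char *prefix = "qemu_back_mem.";
    char *copy = strdup(path);      /* basename() may modify its argument */
    char *base = basename(copy);
    char *name = NULL;
    char *last_dot;

    if (strncmp(base, prefix, strlen(prefix)) == 0) {
        name = strdup(base + strlen(prefix));   /* "pc.ram.WHFZYw" */
        last_dot = strrchr(name, '.');
        if (last_dot) {
            *last_dot = '\0';                   /* "pc.ram"        */
        }
    }
    free(copy);
    return name;
}

int main(void)
{
    char *name = ramblock_name_from_path("/tmp/qemu_back_mem.pc.ram.WHFZYw");
    printf("%s\n", name ? name : "(no match)");  /* prints "pc.ram" */
    free(name);
    return 0;
}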