On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sist...@oracle.com) wrote:
>> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
>>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
>>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>>>>>> Provide the cprsave and cprload commands for live update.  These save
>>>>>> and restore VM state, with minimal guest pause time, so that qemu may
>>>>>> be updated to a new version in between.
>>>>>>
>>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It
>>>>>> supports two modes: restart and reboot.  For restart, cprsave exec's
>>>>>> the qemu binary (or /usr/bin/qemu-exec if it exists) with the same
>>>>>> argv.  qemu restarts in a paused state and waits for the cprload
>>>>>> command.
>>>>>
>>>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>>>> already opens fds passed using this mechanism.
>>>>>
>>>>> I haven't checked but it may be possible to drop some patches by
>>>>> reusing QEMU's monitor file descriptor passing since the code already
>>>>> knows how to open from 'getfd' fds.
>>>>>
>>>>> The reason why using QMP is interesting is because it eliminates the
>>>>> need for execve(2). QEMU may be unable to execute a program due to
>>>>> chroot, seccomp, etc.
>>>>>
>>>>> QMP would enable cprsave/cprload to work both with and without
>>>>> execve(2).
>>>>>
>>>>> One tricky thing with this approach might be startup ordering: how to
>>>>> get fds via the QMP monitor in the new process before processing the
>>>>> entire command-line.
>>>>
>>>> Early on I experimented with a similar approach.  Old qemu passed
>>>> descriptors to an escrow process and exited; new qemu started and
>>>> retrieved the descriptors from escrow.  vfio mostly worked after I
>>>> hacked the kernel to suppress the original-pid owner check.  I suspect
>>>> my recent vfio extensions would smooth the rough edges.
>>>
>>> I wonder about the reason for VFIO's pid limitation, maybe because it
>>> pins pages from the original process?
>>
>> The dma unmap code verifies that the requesting task is the same as the
>> task that mapped the pages.  We could add an ioctl that passes ownership
>> to a new task.  We would also need to fix locked memory accounting, which
>> is associated with the mm of the original task.
>
>>> Is this VFIO pid limitation the main reason why you chose to make QEMU
>>> execve(2) the new binary?
>>
>> That is one.  Plus, re-attaching to named shared memory for pc.ram causes
>> the vfio conflict errors I mentioned in the previous email.  We would
>> need to suppress redundant dma map calls, but allow legitimate dma maps
>> and unmaps in response to the ongoing address space changes and diff
>> callbacks caused by some drivers.  It would be messy and fragile.  In
>> general, it felt like I was working against vfio rather than with it.
>
> OK the weirdness of vfio helps explain a bit about why you're doing it
> this way; can you help separate some differences between restart and
> reboot for me though:
>
> In 'reboot' mode, where the guest must do suspend in its drivers, how
> much of these vfio requirements are needed?  I guess the memfd use
> for the anonymous areas isn't any use for reboot mode.

Correct.  For reboot mode, no special vfio support or fiddling is needed.

> You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
> vfio still care about the currently-anonymous areas?

Yes, for restart mode.  The physical pages behind the anonymous memory
remain pinned and are targets for ongoing DMA.  Post-exec qemu needs a way
to find those same pages.
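Concretely, the vaddr handoff around exec looks roughly like the sketch
below: before exec, invalidate the user virtual addresses while vfio keeps
the IOVA mappings and pinned pages; after exec, register the new addresses
for the same IOVA ranges.  This is only a sketch against the type1 UAPI;
error handling and the per-range bookkeeping the real code needs are
omitted, and the helper names are illustrative.

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Before exec: drop the stale vaddr; pages stay pinned, DMA continues. */
    static int cpr_unmap_vaddr(int container, uint64_t iova, uint64_t size)
    {
        struct vfio_iommu_type1_dma_unmap unmap = {
            .argsz = sizeof(unmap),
            .flags = VFIO_DMA_UNMAP_FLAG_VADDR,
            .iova  = iova,
            .size  = size,
        };
        return ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
    }

    /* After exec: supply the new vaddr for the same IOVA range. */
    static int cpr_remap_vaddr(int container, uint64_t iova, uint64_t size,
                               void *new_vaddr)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_VADDR,
            .vaddr = (uintptr_t)new_vaddr,
            .iova  = iova,
            .size  = size,
        };
        return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    }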
>> Another big reason is a requirement to preserve anonymous memory for
>> legacy qemu updates (via code injection which I briefly mentioned in KVM
>> forum).  If we extend cpr to allow updates without exec, I still need
>> the exec option.
>
> Can you explain what that code injection mechanism is for those of us
> who didn't see that?

Sure.  Here is slide 12 from the talk.  It relies on mmap(MADV_DOEXEC),
which was not accepted upstream.

-----------------------------------------------------------------------------
Legacy Live Update

  * Update legacy qemu process to latest version
    - Inject code into legacy qemu process to perform cprsave: vmsave.so
      . Access qemu data structures and globals
        - eg ram_list, savevm_state, chardevs, vhost_devices
        - dlopen does not resolve them, must get addresses via symbol lookup.
      . Delete some vmstate handlers, register new ones (eg vfio)
      . Call MADV_DOEXEC on guest memory.  Find devices, preserve fd
  * Hot patch a monitor function to dlopen vmsave.so, call entry point
    - write patch to /proc/pid/mem
    - Call the monitor function via monitor socket
  * Send cprload to update qemu
  * vmsave.so has binary dependency on qemu data structures and variables
    - Build vmsave-ver.so per legacy version
      - Indexed by qemu's gcc build-id
-----------------------------------------------------------------------------

- Steve

>>>> However, the main issue is that guest ram must be backed by named
>>>> shared memory, and we would need to add code to support shared memory
>>>> for all the secondary memory objects.  That makes it less interesting
>>>> for us at this time; we care about updating legacy qemu instances with
>>>> anonymous guest memory.
>>>
>>> Thanks for explaining this more in the other sub-thread. The secondary
>>> memory objects you mentioned are relatively small so I don't think
>>> saving them in the traditional way is a problem.
>>>
>>> Two approaches for zero-copy memory migration fit into QEMU's existing
>>> migration infrastructure:
>>>
>>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>>>   etc) so they are not saved into the savevm file. The existing
>>>   --object memory-backend-file syntax can be used.
>>>
>>> - Extending the live migration protocol to detect when file descriptor
>>>   passing is available (i.e. UNIX domain socket migration) and using
>>>   that for memory-backend-* objects that have fds.
>>>
>>> Either of these approaches would handle RAM with existing savevm/migrate
>>> commands.
>>
>> Yes, but the vfio issues would still need to be solved, and we would
>> need new command line options to back existing and future secondary
>> memory objects with named shared memory.
>>
>>> The remaining issue is how to migrate VFIO and other file descriptors
>>> that cannot be reopened by the new process. As mentioned, QEMU already
>>> has file descriptor passing support in the QMP monitor and support for
>>> opening passed file descriptors (see qemu_open_internal(),
>>> monitor_fd_param(), and socket_get_fd()).
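Yes.  At bottom, both the monitor's 'getfd' mechanism and the escrow
experiment I described come down to passing descriptors over a UNIX domain
socket with SCM_RIGHTS.  A minimal sketch of the sending side (hypothetical
helper name; error handling and the receive side omitted):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Hand one fd to a peer over a connected UNIX domain socket. */
    static int send_fd(int sock, int fd)
    {
        char byte = 0;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        union {
            struct cmsghdr hdr;
            char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
            .msg_iov        = &iov,
            .msg_iovlen     = 1,
            .msg_control    = u.buf,
            .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg;

        memset(u.buf, 0, sizeof(u.buf));
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        /* One byte of payload carries the control message. */
        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }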
>>>
>>> The advantage of integrating live update functionality into the existing
>>> savevm/migrate commands is that it will work in more use cases, with
>>> less code duplication/maintenance and better bitrot prevention than the
>>> special-case cprsave command in this patch series.
>>>
>>> Maybe there is a fundamental technical reason why live update needs to
>>> be different from QEMU's existing migration commands but I haven't
>>> figured it out yet.
>>
>> vfio and anonymous memory.
>>
>> Regarding code duplication, I did consider whether to extend the
>> migration syntax and implementation versus creating something new.
>> Those functions handle stuff like bdrv snapshot, aio, and migration
>> which are n/a for the cpr use case, and the cpr functions handle state
>> that is n/a for the migration case.  I judged that handling both in the
>> same functions would be less readable and maintainable.  After feedback
>> during the V1 review, I simplified the cprsave code by calling
>> qemu_save_device_state, as Xen does, thus eliminating any interaction
>> with the migration code.
>>
>> Regarding bit rot, I still need to add a cpr test to the test suite,
>> when the review is more complete and folks agree on the final form of
>> the functionality.
>>
>> I do like the idea of supporting update without exec, but as a future
>> project, and not at the expense of dropping update with exec.
>>
>> - Steve
>>
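P.S. For concreteness, the simplified cprsave flow after the V1 feedback is
roughly the sketch below.  It is not the patch code: vm_stop(),
qemu_save_device_state(), qemu_fclose(), and error_setg() are the real QEMU
interfaces, but cprsave_sketch() and open_save_file() are illustrative
stand-ins, and restart mode would go on to exec after this returns.

    /* Sketch: pause the guest and write device state only (no RAM pages),
     * the same narrow path the Xen xen-save-devices-state command uses. */
    static int cprsave_sketch(const char *file, Error **errp)
    {
        QEMUFile *f;
        int ret;

        vm_stop(RUN_STATE_SAVE_VM);           /* guest pause begins here */

        f = open_save_file(file);             /* illustrative helper */
        if (!f) {
            error_setg(errp, "cprsave: cannot open %s", file);
            return -1;
        }

        ret = qemu_save_device_state(f);      /* device state, not guest RAM */
        qemu_fclose(f);

        if (ret < 0) {
            error_setg(errp, "cprsave: save failed");
            return -1;
        }
        return 0;
    }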