Hi everybody,

Here are the notes from the last Hypervisor Live Update call that happened on Monday, May 19. Thanks to everybody who was involved!

These notes are intended to bring people who could not attend the call up to speed, as well as to keep the conversation going in between meetings.

----->o-----

Pasha started off by presenting material on the LUO v2 design[1]. He noted that while we want to eventually preserve devices, the current proposal only preserves file descriptors. He also commented that future use cases may allow for extensions for preserving containers across kexec.

Pasha went through the state machine for LUO: normal, prepared (VMs still running but devices are serialized), frozen (VM is suspended), and updated (have updated into the next kernel but are not yet in the normal cycle on the other side of the kexec). There are four LUO event stages: prepare (before blackout), freeze (done normally through the reboot syscall), finish (transition from updated state to normal state), and cancel (for prepare or freeze).

The UAPI includes the /dev/liveupdate character device to add participants, send events, and query state, as well as /sys/kernel/liveupdate/state, which describes the current state of the system per the above. The latter can be used by systemd to optimize for boot or live update; live updates always want very fast boot.

Subsystems register with LUO, which can also allow for querying of subsystem data. Each struct liveupdate_subsystem has the callbacks for the event stages, and these are passed a u64 pointer. David Matlack asked if the u64 is normally used to store the address in memory of saved state; Pasha confirmed this would normally be the case but LUO doesn't put any restrictions on its usage. Mike Rapoport noted the lower 12 bits could be used for other purposes if it is used for an address.

Filesystems can also register with LUO to preserve their fds. The struct liveupdate_filesystem includes callbacks for the event stages as well as ->retrieve() and a boolean ->can_preserve(). The latter will allow us to determine if a file can be preserved or not. These callbacks all operate on pointers to struct file.

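For anybody who was not on the call, the state machine described above (normal, prepared, frozen, updated, driven by the prepare, freeze, finish, and cancel events) can be sketched roughly as follows. This is only an illustration of the transitions as summarized in these notes: the enum and function names are made up for this example and are not the identifiers used in the LUO patches, and the assumption that cancel returns to normal from both the prepared and frozen states is mine.

```c
#include <stdbool.h>

/* Illustrative names only; not the identifiers from the LUO series. */
enum luo_state { LUO_NORMAL, LUO_PREPARED, LUO_FROZEN, LUO_UPDATED };
enum luo_event { LUO_EV_PREPARE, LUO_EV_FREEZE, LUO_EV_FINISH, LUO_EV_CANCEL };

/*
 * Apply one event. Returns true and updates *state when the event is
 * valid in the current state, otherwise leaves *state untouched.
 */
bool luo_sketch_handle_event(enum luo_state *state, enum luo_event ev)
{
	switch (ev) {
	case LUO_EV_PREPARE:	/* before blackout: normal -> prepared */
		if (*state != LUO_NORMAL)
			return false;
		*state = LUO_PREPARED;
		return true;
	case LUO_EV_FREEZE:	/* reboot syscall path: prepared -> frozen */
		if (*state != LUO_PREPARED)
			return false;
		*state = LUO_FROZEN;
		return true;
	case LUO_EV_CANCEL:	/* assumption: both states fall back to normal */
		if (*state != LUO_PREPARED && *state != LUO_FROZEN)
			return false;
		*state = LUO_NORMAL;
		return true;
	case LUO_EV_FINISH:	/* next kernel: updated -> normal */
		if (*state != LUO_UPDATED)
			return false;
		*state = LUO_NORMAL;
		return true;
	}
	return false;
}

/*
 * The frozen -> updated step is the kexec itself rather than an event.
 * Assumption for this sketch: a kexec from any other state is treated
 * as a regular boot into the normal state.
 */
enum luo_state luo_sketch_after_kexec(enum luo_state before)
{
	return before == LUO_FROZEN ? LUO_UPDATED : LUO_NORMAL;
}
```

A full live update in this model is prepare, freeze, kexec, finish; any event sent in the wrong state is simply rejected.
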
Pratyush asked about the relationship between KHO and LUO. Pasha noted that KHO provides a state machine and, in RFC v2 of LUO, LUO can drive KHO, which makes the KHO debugfs interface optional. KHO activate will cause LUO to switch to the prepared phase, for example. /dev/liveupdate continues to be the preferred mechanism. Think of KHO as preserving state across kexec whereas LUO provides the state machine.

----->o-----

I asked about the next steps for LUO. Pasha noted v2 was very recent and there would be discussion over the next few weeks. Memfd preservation is currently under development and Chris Li is working on device preservation. Pratyush is working on tests for memfd as well as libluo, which is a userspace library to make interacting with LUO simpler.

Mike had a thought about more tightly coupling KHO and LUO in the kernel tree. Pasha suggested waiting until later to do a cleanup of the code, at least waiting for KHO to land (not necessarily LUO landing). In the future, it may be better to store it under kernel/ instead of drivers/misc/. No additional work is being planned on KHO until it initially lands.

----->o-----

David Matlack brought up the idea of a live update microconference for LPC this year. Once it is submitted and accepted, the microconference will propose its own CFP, per Mike Rapoport, and people can then submit to that. Pasha noted that this could even include people working on boot time optimizations who may not be aware that their work is useful for live update.

----->o-----

I asked about the current status of work to split pmem regions into smaller shards. We briefly chatted about defaulting dax regions per a specification on the kernel command line. Mike had a similar approach in the past for pmem on top of e820 with namespaces but it was not sent upstream. Pasha said the current approach being worked on is that the kernel command line would specify what should be fsdax and what should be devdax.

The big change, however, is to eliminate the first 2MB of data for the superblock. Mike's approach was to move labels to the very end of the device in the last 128KB. Pasha said it was likely best to not have the pmem label at all. Mike said it was needed to resize namespaces on the pmem device itself. Mike asked to be cc'd on the patches when they go upstream.

----->o-----

David Matlack noted that he was almost finished with VFIO selftests for 6.15 and that they would be sent out. These are planned to be used for testing device preservation in an automated way.

Pasha noted there was no good way to create qemu instances in selftests today, so we need infrastructure for KHO and LUO in selftests. This will likely require a significant amount of work. Mike noted that it may be possible to borrow the infrastructure that BPF uses for this.

----->o-----

Next meeting will be on Monday, June 2 at 8am PDT (UTC-7), everybody is welcome: https://meet.google.com/rjn-dmzu-hgq

Topics for the next meeting:

 - discuss current feedback on LUO v2 and its next steps
 - check on status of memfd preservation using LUO
 - check on status of libluo development from Pratyush
 - check on status of sharding dax devices and eliminating the labels in the first 2MB
 - determine timeline for new kernel parameters to specify devdax and fsdax directly on the command line itself without ndctl
 - check on status of VFIO selftests that will be useful for automated testing of device preservation
 - determine timelines for selftest framework for live updates, which could be a significant amount of work
 - update on physical pool allocator that can be used to provide pages for hugetlb, guest_memfd, and memfds
 - later: testing methodology to allow downstream consumers to qualify that live update works from one version to another
 - later: reducing blackout window during live update

Please let me know if you'd like to propose additional topics for discussion, thank you!

[1] https://docs.google.com/presentation/d/1F-lcl4vSGDX72vhcdmlgKTSe8-GAlwqP46G37SDJP0Q/edit?usp=drive_link&resourcekey=0-jrQSQ7Catn-A7EimsR475A
[2] https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=ramdax
[3] https://github.com/groeck/linux-build-test.git
[4] http://kerneltests.org