Hi Andrey,

I am working on the PCI portion of live update and looking at using
KSTATE as an alternative to the FDT. Here is some high-level feedback.
On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin <a...@yandex-team.com> wrote:
>
> Main changes from v1 [1]:
>  - Get rid of abusing crashkernel and implent proper way to pass memory to
>    new kernel
>  - Lots of misc cleanups/refactorings.
>
> kstate (kernel state) is a mechanism to describe internal some part of the
> kernel state, save it into the memory and restore the state after kexec
> in the new kernel.
>
> The end goal here and the main use case for this is to be able to
> update host kernel under VMs with VFIO pass-through devices running
> on that host. Since we are pretty far from that end goal yet, this
> only establishes some basic infrastructure to describe and migrate complex
> in-kernel states.
>
> The idea behind KSTATE resembles QEMU's migration framework [1], which
> solves quite similar problem - migrate state of VM/emulated devices
> across different versions of QEMU.
>
> This is an altenative to Kexec Hand Over (KHO [3]).
>
> So, why not KHO?

KHO does more than just serializing/unserializing: it also has scratch
areas etc. that allow safely performing early allocations without
stepping on the preserved memory. I see KSTATE as an alternative to
libFDT as a way of serializing the preserved memory, not a replacement
for KHO. With that, it would be great to see KSTATE built on top of the
current version of KHO.

The v6 version of KHO uses a recursive FDT object. I can see that
recursive FDT mapping nicely onto a C struct description similar to the
KSTATE field description. However, that will require KSTATE to make
some considerable changes to embrace KHO v6. For example, KSTATE uses
one contiguous stream buffer, while KHO v6 uses many recursive
physical-address object pointers for different objects. Maybe a
KSTATE v3?

> - The main reason is KHO doesn't provide simple and convenient internal
>   API for the drivers/subsystems to preserve internal data.
>   E.g.
> lets consider we have some variable of type 'struct a'
> that needs to be preserved:
>
> struct a {
> 	int i;
> 	unsigned long *p_ulong;
> 	char s[10];
> 	struct page *page;
> };
>
> The KHO-way requires driver/subsystem to have a bunch of code
> dealing with FDT stuff, something like
>
> a_kho_write()
> {
> 	...
> 	fdt_property(fdt, "i", &a.i, sizeof(a.i));
> 	fdt_property(fdt, "ulong", a.p_ulong, sizeof(*a.p_ulong));
> 	fdt_property(fdt, "s", &a.s, sizeof(a.s));
> 	if (err)
> 		...
> }

I can add more pain points of using FDT as the data format for
saving/restoring state. It is not easy to determine up front how much
memory the FDT serialization is going to use. We want to do all the
memory allocation in the KHO PREPARE phase, so that after PREPARE there
are no KHO failures due to failed allocations. The current KHO v6 also
does not handle the case where the recursive FDT grows beyond a 4K
page. That is a feature gap: the PCI subsystem will likely save state
for a list of PCI devices, and that FDT can easily exceed 4K.

FDT also does not save the type of an object buffer, only its size, so
there is an implicit contract about what the object points to. The
KSTATE description table can be extended to be more expressive than
FDT, e.g. to cover optional min/max allowed values.

> a_kho_restore()
> {
> 	...
> 	a.i = fdt_getprop(fdt, offset, "i", &len);
> 	if (!a.i || len != sizeof(a.i))
> 		goto err
> 	*a.p_ulong = fdt_getprop....
> }
>
> Each driver/subsystem has to solve this problem in their own way.
> Also if we use fdt properties for individual fields, that might be
> wastefull in terms of used memory, as these properties use strings
> as keys.

Right, I would need to write a lot of boilerplate code for the
per-property save/restore. I am not worried too much about the memory
usage, though: most string keys are not much longer than 8 bytes, so
the saving from converting them to a binary index is not huge.
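As a side note, one thing a table-driven description buys us that FDT
does not is that the serialized size is a pure function of the table,
so it can be computed in the KHO PREPARE phase before committing any
memory. A rough userspace sketch of the idea (the `field_desc` struct
and `stream_size()` helper here are hypothetical stand-ins, not the
actual KSTATE API):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified field descriptor: just offset and size. */
struct field_desc {
	size_t offset;
	size_t size;	/* size == 0 terminates the table */
};

struct a {
	int i;
	unsigned long *p_ulong;
	char s[10];
};

static const struct field_desc a_fields[] = {
	{ offsetof(struct a, i),       sizeof(int) },
	{ offsetof(struct a, p_ulong), sizeof(unsigned long) }, /* pointee */
	{ offsetof(struct a, s),       sizeof(char [10]) },
	{ 0, 0 },
};

/*
 * Walk the table and sum the field sizes.  With FDT you only learn the
 * serialized size while (or after) emitting properties; here the total
 * is known before a single byte is written.
 */
static size_t stream_size(const struct field_desc *f)
{
	size_t total = 0;

	for (; f->size; f++)
		total += f->size;
	return total;
}
```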
I actually would suggest adding the string version of the field name to
the description table, so that we can dump the state in KSTATE just
like the YAML FDT output, for debugging purposes. Dumping the current
saved state into a human-readable form is a very useful feature of FDT,
and KSTATE could offer the same.

> While with KSTATE solves the same problem in more elegant way, with this:
>
> struct kstate_description a_state = {
> 	.name = "a_struct",
> 	.version_id = 1,
> 	.id = KSTATE_TEST_ID,
> 	.state_list = LIST_HEAD_INIT(test_state.state_list),
> 	.fields = (const struct kstate_field[]) {
> 		KSTATE_BASE_TYPE(i, struct a, int),
> 		KSTATE_BASE_TYPE(s, struct a, char [10]),
> 		KSTATE_POINTER(p_ulong, struct a),
> 		KSTATE_PAGE(page, struct a),
> 		KSTATE_END_OF_LIST()
> 	},
> };
>
> {
> 	static unsigned long ulong;
> 	static struct a a_data = { .p_ulong = &ulong };
>
> 	kstate_register(&test_state, &a_data);
> }
>
> The driver needs only to have a proper 'kstate_description' and call
> kstate_register() to save/restore a_data.
> Basically 'struct kstate_description' provides instructions how to
> save/restore 'struct a'.
> And kstate_register() does all this save/restore stuff under the hood.

It seems KSTATE uses one contiguous stream, and objects have to be
loaded in the order they were saved. For the PCI code, device scanning
and probing might load devices out of the order in which they were
saved (PCI probing is actually the reverse of the save order), so
kstate_register() might impose restrictions on the restore order. PCI
will need to look up the saved device state by PCI device ID, and other
subsystems will likely have the same requirement for their own saved
state. I can see KSTATE being extended to support that.

> - Another bonus point - kstate can preserve migratable memory, which is
>   required to preserve guest memory
>
> So now to the part how this works.
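To make the out-of-order restore concrete, here is a toy sketch of what
a keyed lookup could look like: each saved entry carries the
description ID plus a per-instance key (for PCI, the device address),
so restore no longer depends on save order. Everything here
(`kstate_entry` layout, `kstate_lookup()`) is hypothetical, not the
current KSTATE API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical saved-state entry: description ID plus an instance key,
 * e.g. the PCI device address "0000:01:00.0".
 */
struct kstate_entry {
	int id;
	char key[32];
	const void *data;
};

/*
 * Linear lookup by (id, key).  A probing PCI driver can call this with
 * its own device address, whatever order probe happens in.
 */
static const struct kstate_entry *
kstate_lookup(const struct kstate_entry *tbl, int n, int id,
	      const char *key)
{
	for (int i = 0; i < n; i++)
		if (tbl[i].id == id && !strcmp(tbl[i].key, key))
			return &tbl[i];
	return NULL;
}
```

A real implementation would presumably hash the key rather than scan,
but the point is the contract: restore is keyed, not positional.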
> State of kernel data (usually it's some struct) is described by the
> 'struct kstate_description' containing the array of individual
> fields descpriptions - 'struct kstate_field'. Each field
> has set of bits in ->flags which instructs how to save/restore
> a certain field of the struct. E.g.:
>
> - KS_BASE_TYPE flag tells that field can be just copied by value,
>
> - KS_POINTER means that the struct member is a pointer to the actual
>   data, so it needs to be dereference before saving/restoring data
>   to/from kstate data steam.
>
> - KS_STRUCT - contains another struct, field->ksd must point to
>   another 'struct kstate_dscription'

A field can't have both KS_BASE_TYPE and KS_STRUCT set, right? Some of
these flag combinations do not make sense, so this part might need more
careful planning to keep it simple. Maybe some of the flag bits should
be an enum instead.

> - KS_CUSTOM - Some non-trivial field that requires custom
>   kstate_field->save()/->restore() callbacks to save/restore data.
>
> - KS_ARRAY_OF_POINTER - array of pointers, the size of array determined
>   by the field->count() callback
>
> - KS_ADDRESS - field is a pointer to either vmemmap area (struct page)
>   or linear address. Store offset

I think we want to describe different stream types. The simplest stream
container is just a contiguous buffer with a start address and a size.
A more complex one might be a size plus an array of page pointers,
where the pointers together make up a buffer describing a saved KSTATE
that is larger than 4K and spread across an array of pages. Those pages
don't need to be contiguous. Such a page-array buffer would store a
KSTATE entry described by a separate description table.

> - KS_END - special flag indicating the end of migration stream data.
>
> kstate_register() call accepts kstate_description along with an instance
> of an object and registers it in the global 'states' list.
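A userspace toy of the page-array container I have in mind, to show
that the reader side stays simple even when the pages are not
contiguous (the struct layout and `stream_read()` are hypothetical, and
plain pointers stand in for physical page addresses):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/*
 * Hypothetical page-array stream container: total size plus an array
 * of per-page pointers.  The pages need not be contiguous, so a saved
 * KSTATE larger than 4K does not require a high-order allocation.
 */
struct kstate_stream_pages {
	size_t size;		/* total bytes across all pages */
	size_t nr_pages;
	void **pages;		/* stand-in for physical page pointers */
};

/* Copy 'len' bytes starting at stream offset 'pos' out of the pages. */
static void stream_read(const struct kstate_stream_pages *s,
			size_t pos, void *out, size_t len)
{
	char *dst = out;

	while (len) {
		size_t pg = pos / PAGE_SIZE;
		size_t off = pos % PAGE_SIZE;
		size_t n = PAGE_SIZE - off;

		if (n > len)
			n = len;
		memcpy(dst, (char *)s->pages[pg] + off, n);
		dst += n;
		pos += n;
		len -= n;
	}
}
```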
> During kexec reboot phase we go through the list of 'kstate_description's
> and each instance of kstate_description forms the 'struct kstate_entry'
> which save into the kstate's data stream.
>
> The 'kstate_entry' contains information like ID of kstate_description,
> version of it, size of migration data and the data itself. The ->data is
> formed in accordance to the kstate_field's of the corresponding
> kstate_description.

A version number for the kstate_description might not be enough.
Versions work when there is a linear history, but here different
vendors are likely to add their own extensions to the device state
saving. I suggest instead that we save the old kernel's
kstate_description table (once per description table, as a recursive
object) alongside the object's physical address. The new kernel has its
own new version of the description table; it can compare the old and
new tables to find out which fields need to be upgraded or downgraded,
and use the old kstate_description table to decode the previous
kernel's saved objects. That way is more flexible: it supports adding
one or more features without being tied to which version has which
feature, and it ensures the new kernel can always dump the old KSTATE
into YAML. It might also let us simplify the subsection and deprecation
flags, since the new kernel would not need to carry the history of
changes made to the old description table.

> After the reboot, when the kstate_register() called it parses migration
> stream, finds the appropriate 'kstate_entry' and restores the contents of
> the object in accordance with kstate_description and ->fields.

Again, this restore can happen in a different order than the PCI device
scanning and probing order, and the restoration might not happen in one
single call chain.

Material for v3? I am happy to work with you to get KSTATE working with
the existing KHO effort.

Chris
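P.S. A toy sketch of the description-table-comparison idea above, with
all names hypothetical: if the old kernel's table travels with the
data, the new kernel matches fields by name instead of branching on a
version number.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical self-describing field: the name travels with the state. */
struct saved_field {
	const char *name;	/* NULL name terminates the table */
	size_t size;
};

/*
 * Instead of "if (version < N) ...", the new kernel looks each of its
 * fields up in the old kernel's saved table by name.  A hit means
 * decode using the old size/layout; a miss means the field is new in
 * this kernel and gets a default value.
 */
static const struct saved_field *
find_field(const struct saved_field *old, const char *name)
{
	for (; old->name; old++)
		if (!strcmp(old->name, name))
			return old;
	return NULL;
}
```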