On 2024-07-08 2:51, 周春明(日月) wrote:
> Hi Felix,
>
> While reading about the CRIU support you introduced in https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu, I came across this sentence: "ROCm manages memory in the form of buffer objects (BOs). We are also working on a new memory management API that will be based on virtual address ranges..."
>
> Out of curiosity, what is this "new memory management based on virtual address ranges"? Is there any introduction to it?
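For reference, the virtual-address-range API referred to in the reply below is the KFD SVM interface from the linked kfd_ioctl.h. A minimal sketch of setting an attribute on an address range, where the gpu_id value and the surrounding error handling are placeholders:

    /* Sketch only: manage memory by virtual address range through the KFD
     * SVM ioctl instead of operating on individual BO handles.
     * The gpu_id here is a placeholder for a value from the KFD topology. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/kfd_ioctl.h>

    static int svm_set_preferred_loc(int kfd_fd, uint64_t start_addr,
                                     uint64_t size, uint32_t gpu_id)
    {
            uint64_t storage[(sizeof(struct kfd_ioctl_svm_args) +
                              sizeof(struct kfd_ioctl_svm_attribute) + 7) / 8];
            struct kfd_ioctl_svm_args *args = (struct kfd_ioctl_svm_args *)storage;

            memset(storage, 0, sizeof(storage));
            args->start_addr = start_addr;
            args->size = size;
            args->op = KFD_IOCTL_SVM_OP_SET_ATTR;
            args->nattr = 1;
            args->attrs[0].type = KFD_IOCTL_SVM_ATTR_PREFERRED_LOC;
            args->attrs[0].value = gpu_id;  /* prefer this range on that GPU */

            return ioctl(kfd_fd, AMDKFD_IOC_SVM, args);  /* kfd_fd: open("/dev/kfd") */
    }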
Hi David,

This refers to the SVM API that has been in the upstream driver for a while now: https://elixir.bootlin.com/linux/v6.9.8/source/include/uapi/linux/kfd_ioctl.h#L732

Regards,
  Felix

> Thanks,
> -David
>
> ------------------------------------------------------------------
> From: Felix Kuehling <felix.kuehl...@amd.com>
> Sent: Friday, May 3, 2024, 22:44
> To: Tvrtko Ursulin <tursu...@ursulin.net>; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Dave Airlie <airl...@redhat.com>; Daniel Vetter <dan...@ffwll.ch>; criu <c...@openvz.org>
> Cc: "Errabolu, Ramesh" <ramesh.errab...@amd.com>; "Christian König" <christian.koe...@amd.com>
> Subject: Re: Proposal to add CRIU support to DRM render nodes
>
> On 2024-04-16 10:04, Tvrtko Ursulin wrote:
> > On 01/04/2024 18:58, Felix Kuehling wrote:
> >> On 2024-04-01 12:56, Tvrtko Ursulin wrote:
> >>> On 01/04/2024 17:37, Felix Kuehling wrote:
> >>>> On 2024-04-01 11:09, Tvrtko Ursulin wrote:
> >>>>> On 28/03/2024 20:42, Felix Kuehling wrote:
> >>>>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
> >>>>>>> Hi Felix,
> >>>>>>>
> >>>>>>> I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node.
> >>>>>>>
> >>>>>>> It would probably be good to consider designing things without that dependency, so that checkpointing an application which does not use /dev/kfd is possible, or so that it works if the kernel does not even have the KFD support compiled in.
> >>>>>>
> >>>>>> Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information.
> >>>>>>
> >>>>>>> It could perhaps mean no more than adding some GPU discovery code into CRIU, which should be flexible enough to account for things like re-assigned minor numbers due to driver reload.
> >>>>>>
> >>>>>> Do you mean adding GPU discovery to the core CRIU, or to the plugin? I was thinking this is still part of the plugin.
> >>>>>
> >>>>> Yes, I agree. I was only thinking about adding some DRM device discovery code in a more decoupled fashion from the current plugin, for both the reason discussed above (decoupling a bit from reliance on kfd sysfs), and then also, if/when a new DRM driver might want to implement this, the code could be moved to some common plugin area.
> >>>>>
> >>>>> I am not sure how feasible that would be though. The "gpu id" concept and its matching in the current kernel code and CRIU plugin - is that value tied to the physical GPU instance, or how does it work?
> >>>>
> >>>> The concept of the GPU ID is that it's stable while the system is up, even when devices get added and removed dynamically. It was baked into the API early on, but I don't think we ever fully validated device hot plug. I think the closest we're getting is with our latest MI GPUs and dynamic partition mode change.
> >>>
> >>> Doesn't it read the saved gpu id from the image file while doing restore and try to open the render node to match it? Maybe I am misreading the code..
> >>> But if it does, does it imply that in practice it could be stable across reboots? Or that it is not possible to restore to a different instance of maybe the same GPU model installed in a system?
> >>
> >> Ah, the idea is that when you restore on a different system, you may get different GPU IDs. Or you may checkpoint an app running on GPU 1 but restore it on GPU 2 on the same system. That's why we need to translate GPU IDs in restored applications. User mode still uses the old GPU IDs, but the kernel mode driver translates them to the actual GPU IDs of the GPUs that the process was restored on.
> >
> > I see.. I think. The normal flow is that ppd->user_gpu_id is set during client init, but for restored clients it gets overridden during restore so that further ioctls do not instantly fail.
> >
> > And then in amdgpu_plugin_restore_file, when it is opening the render node, it relies on the kfd topology to have filled in (more or less) the target_gpu_id corresponding to the render node gpu id of the target GPU - the one associated with the new kfd gpu_id?
>
> Yes.
>
> > I am digging into this because I am trying to see if some part of GPU discovery could somehow be decoupled.. to offer to work on at least that part until you start to tackle the main body of the feature. But it looks properly tangled up.
>
> OK. Most of the interesting plugin code should be in amdgpu_plugin_topology.c. It currently has some pretty complicated logic to find a set of devices that matches the topology in the checkpoint, including shader ISA versions, numbers of compute units, memory sizes, firmware versions and IO-Links between GPUs. This was originally done to support P2P with XGMI links. I'm not sure we ever updated it to properly support PCIe P2P.
>
> > Do you have any suggestions for what I could help with? Maybe developing some sort of drm device enumeration library, if you see a way that would be useful in decoupling the device discovery from kfd. We would need to define what sort of information you would need to be queryable from it.
>
> Maybe. I think a lot of device information is available with some amdgpu info-ioctl. It may not cover all the things we're checking in the KFD topology, though.
>
> >>>> This also highlights another aspect of those spatially partitioned GPUs. GPU IDs identify device partitions, not devices. Similarly, each partition has its own render node, and the KFD topology info in sysfs points to the render-minor number corresponding to each GPU ID.
> >>>
> >>> I am not familiar with this. This is not SR-IOV but some other kind of partitioning? Would you have any links where I could read more?
> >>
> >> Right, the bare-metal driver can partition a PF spatially without SRIOV. SRIOV can also use spatial partitioning and expose each partition through its own VF, but that's not useful for bare metal. Spatial partitioning is new in MI300. There is some high-level info in this whitepaper: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf
> >
> > From the outside (userspace) this simply looks like multiple DRM render nodes, or how exactly?
>
> Yes, that's correct. Each partition has its own render node and its own node in the KFD topology.
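As a rough illustration of the kfd-independent discovery discussed above, here is a sketch that walks the render nodes with plain libdrm and filters AMD devices by PCI vendor ID; anything like ISA versions or CU counts would still need amdgpu-specific queries on top, and error handling is minimal:

    /* Sketch: enumerate DRM render nodes without the KFD sysfs topology.
     * Matching on PCI bus info avoids depending on render minor numbers,
     * which can change across driver reloads. */
    #include <stdio.h>
    #include <xf86drm.h>

    #define MAX_DRM_DEVICES 64

    static void list_amd_render_nodes(void)
    {
            drmDevicePtr devices[MAX_DRM_DEVICES];
            int i, count = drmGetDevices2(0, devices, MAX_DRM_DEVICES);

            if (count < 0)
                    return;

            for (i = 0; i < count; i++) {
                    drmDevicePtr d = devices[i];

                    if (!(d->available_nodes & (1 << DRM_NODE_RENDER)))
                            continue;
                    if (d->bustype != DRM_BUS_PCI ||
                        d->deviceinfo.pci->vendor_id != 0x1002)  /* AMD */
                            continue;

                    printf("%s at %04x:%02x:%02x.%d\n",
                           d->nodes[DRM_NODE_RENDER],
                           d->businfo.pci->domain, d->businfo.pci->bus,
                           d->businfo.pci->dev, d->businfo.pci->func);
            }
            drmFreeDevices(devices, count);
    }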
> Regards,
> Felix
>
> > Regards,
> >
> > Tvrtko
> >
> >> Regards,
> >> Felix
> >>
> >>> Regards,
> >>>
> >>> Tvrtko
> >>>
> >>>>>>> Otherwise I am eager to hear more about the design specifics around dma-buf handling, and also to see how to extend this to other DRM related anonymous fds.
> >>>>>>
> >>>>>> I've been pretty far under-water lately. I hope I'll find time to work on this more, but it's probably going to be at least a few weeks.
> >>>>>
> >>>>> Got it.
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> Tvrtko
> >>>>>
> >>>>>> Regards,
> >>>>>> Felix
> >>>>>>
> >>>>>>> Regards,
> >>>>>>>
> >>>>>>> Tvrtko
> >>>>>>>
> >>>>>>> On 15/03/2024 18:36, Tvrtko Ursulin wrote:
> >>>>>>>> On 15/03/2024 02:33, Felix Kuehling wrote:
> >>>>>>>>> On 2024-03-12 5:45, Tvrtko Ursulin wrote:
> >>>>>>>>>> On 11/03/2024 14:48, Tvrtko Ursulin wrote:
> >>>>>>>>>>> Hi Felix,
> >>>>>>>>>>>
> >>>>>>>>>>> On 06/12/2023 21:23, Felix Kuehling wrote:
> >>>>>>>>>>>> Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm applications once they start relying on render nodes for more GPU memory management. In this email I'm providing some background on why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic.
> >>>>>>>>>>>>
> >>>>>>>>>>>> We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific.
> >>>>>>>>>>>>
> >>>>>>>>>>>> After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future.
> >>>>>>>>>>>>
> >>>>>>>>>>>> One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers?
> >>>>>>>>>>>
> >>>>>>>>>>> This sounds like a very interesting feature overall, although I cannot answer the last question here.
> >>>>>>>>>>
> >>>>>>>>>> I forgot to finish this thought.
> >>>>>>>>>> I cannot answer / don't know of any concrete plans, but I think the feature is pretty cool, and if amdgpu gets it working I wouldn't be surprised if other drivers would get interested.
> >>>>>>>>>
> >>>>>>>>> Thanks, that's good to hear!
> >>>>>>>>>
> >>>>>>>>>>> Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkpoint and restore, so I am just mentioning it as a curiosity.
> >>>>>>>>>>>
> >>>>>>>>>>> And there is also another partial conceptual intersection with the (at the moment not yet upstream) i915 online debugger, in the area of discovering and enumerating GPU resources belonging to the client.
> >>>>>>>>>>>
> >>>>>>>>>>> I don't see any immediate design or code sharing opportunities though, I am just mentioning it.
> >>>>>>>>>>>
> >>>>>>>>>>> I did spend some time reading your plugin and kernel implementation out of curiosity and have some comments and questions.
> >>>>>>>>>>>
> >>>>>>>>>>>> With that out of the way, some considerations for a possible DRM CRIU API (either generic or AMDGPU driver specific): The API goes through several phases during checkpoint and restore:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Checkpoint:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU)
> >>>>>>>>>>>> 2. Checkpoint (store object metadata for BOs, queues, etc.)
> >>>>>>>>>>>> 3. Unpause (resumes execution after the checkpoint is complete)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Restore:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. Restore (restore objects, VMAs are not in the right place at this time)
> >>>>>>>>>>>> 2. Resume (final fixups after the VMAs are sorted out, resume execution)
> >>>>>>>>>>>
> >>>>>>>>>>> Btw, does checkpointing guarantee that all relevant activity is idled? For instance, that dma_resv objects are free of fences which would need to be restored for things to continue executing sensibly? Or how is that handled?
> >>>>>>>>>
> >>>>>>>>> In our compute use cases, we suspend user mode queues. This can include CWSR (compute-wave-save-restore), where the state of in-flight waves is stored in memory and can be reloaded and resumed from memory later. We don't use any fences other than "eviction fences", which are signaled after the queues are suspended. And those fences are never handed to user mode. So we don't need to worry about any fence state in the checkpoint.
> >>>>>>>>>
> >>>>>>>>> If we extended this to support the kernel mode command submission APIs, I would expect that we'd wait for all current submissions to complete, and stop new ones from being sent to the HW before taking the checkpoint. When we take the checkpoint in the CRIU plugin, the CPU threads are already frozen and cannot submit any more work. If we wait for all currently pending submissions to drain, I think we don't need to save any fence state because all the fences will have signaled. (I may be missing some intricacies and I'm afraid it may not be that simple in reality, but that's my opening bid. ;)
> >>>>>>>>
> >>>>>>>> It feels feasible to me too, for the normally behaving clients at least.
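For reference, the phase split listed above maps onto the existing KFD CRIU ioctl; a simplified sketch of how a plugin drives the checkpoint side, with the buffer allocation between the steps and most struct fields omitted:

    /* Simplified checkpoint sequence against AMDKFD_IOC_CRIU_OP; the real
     * plugin also sizes and fills args.devices/args.bos/args.priv_data
     * from the counts returned by the first step. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/kfd_ioctl.h>

    static int kfd_checkpoint(int kfd_fd, uint32_t pid)
    {
            struct kfd_ioctl_criu_args args;

            memset(&args, 0, sizeof(args));
            args.pid = pid;

            /* 1. Process-info: stop GPU execution and learn how many
             *    devices/BOs/objects and how much private data to expect. */
            args.op = KFD_CRIU_OP_PROCESS_INFO;
            if (ioctl(kfd_fd, AMDKFD_IOC_CRIU_OP, &args))
                    return -1;

            /* ... allocate checkpoint buffers based on the counts here ... */

            /* 2. Checkpoint: the kernel fills in BO/queue/etc. metadata plus
             *    the opaque private data discussed later in the thread. */
            args.op = KFD_CRIU_OP_CHECKPOINT;
            if (ioctl(kfd_fd, AMDKFD_IOC_CRIU_OP, &args))
                    return -1;

            /* 3. Unpause: resume GPU execution once the dump is written out. */
            args.op = KFD_CRIU_OP_UNPAUSE;
            return ioctl(kfd_fd, AMDKFD_IOC_CRIU_OP, &args);
    }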
> >>>>>>>> Presumably, given that the whole checkpointing is not instant, it would be okay to wait a second or two longer for the in-progress submissions to complete. After which the kernel would need to prune all signalled fences from the respective container objects before checkpointing.
> >>>>>>>>
> >>>>>>>> For the "misbehaving" clients who have perhaps queued up too much work, either still in the scheduler with unsatisfied dependencies, or already submitted to the hardware and/or driver backend, is there a timeout concept in CRIU so it would be possible to say something like "try to checkpoint, but if the kernel hasn't managed it within time period t then give up"?
> >>>>>>>>
> >>>>>>>>>>>> For some more background about our implementation in KFD, you can refer to this whitepaper: https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
> >>>>>>>>>>>>
> >>>>>>>>>>>> Potential objections to a KFD-style CRIU API in DRM render nodes, I'll address each of them in more detail below:
> >>>>>>>>>>>>
> >>>>>>>>>>>> * Opaque information in the checkpoint data that user mode can't interpret or do anything with
> >>>>>>>>>>>> * A second API for creating objects (e.g. BOs) that is separate from the regular BO creation API
> >>>>>>>>>>>> * Kernel mode would need to be involved in restoring BO sharing relationships rather than replaying BO creation, export and import from user mode
> >>>>>>>>>>>>
> >>>>>>>>>>>> # Opaque information in the checkpoint
> >>>>>>>>>>>>
> >>>>>>>>>>>> This comes out of ABI compatibility considerations. Adding any new objects or attributes to the driver/HW state that needs to be checkpointed could potentially break the ABI of the CRIU checkpoint/restore ioctl if the plugin needs to parse that information. Therefore, much of the information in our KFD CRIU ioctl API is opaque. It is written by kernel mode in the checkpoint, it is consumed by kernel mode when restoring the checkpoint, but user mode doesn't care about the contents or binary layout, so there is no user mode ABI to break. This is how we were able to maintain CRIU support when we added the SVM API to KFD without changing the CRIU plugin and without breaking our ABI.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Opaque information may also lend itself to API abstraction, if this becomes a generic DRM API with driver-specific callbacks that fill in HW-specific opaque data.
> >>>>>>>>>>>
> >>>>>>>>>>> This feels sound in principle to me. Fundamentally the state is very hardware specific, and/or driver version specific, so I don't see that anything could be gained in practice by making it much less opaque. (Apart from making things more complicated.)
> >>>>>>>>>>>
> >>>>>>>>>>> I was however unsure of the current split of how you dump buffer objects, with some data in the defined bo structure and some in completely opaque private data. Is there a benefit to that split, or maybe in other words, is there a benefit to having part transparent and part opaque for buffer objects?
> >>>>>>>>>
> >>>>>>>>> Some of the buffer object state is needed by the plugin. E.g. the size and mmap offset are needed to match VMAs with BOs. I'd have to review the plugin in detail to prove that all the fields are, in fact, needed by the plugin, but that was the intent. Anything that the plugin doesn't need to know should be in the opaque data structures.
> >>>>>>>>
> >>>>>>>> Right, got it.
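To make that split concrete, the per-BO record roughly takes the shape sketched below: the few fields the plugin actually consumes are explicit, and everything else rides in the opaque private data. This is an approximate layout only, not copied from the header; see kfd_ioctl.h for the real definition.

    /* Approximate shape of a per-BO checkpoint record; field names and
     * layout are illustrative, the real struct lives in kfd_ioctl.h. */
    #include <linux/types.h>

    struct example_criu_bo_record {
            __u64 addr;             /* user/GPU address of the mapping      */
            __u64 size;             /* needed by the plugin to match VMAs   */
            __u64 offset;           /* mmap offset, also needed for VMAs    */
            __u64 restored_offset;  /* filled in by the kernel on restore   */
            __u32 gpu_id;           /* which (user-visible) GPU ID owns it  */
            __u32 alloc_flags;      /* VRAM vs. GTT vs. doorbell, etc.      */
            __u32 dmabuf_fd;        /* set when the BO is shared            */
            __u32 pad;
    };
    /* Anything HW- or driver-version-specific (residency, pinning, queue
     * state, ...) stays in the opaque priv_data blob instead. */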
> >>>>>>>> Would it make sense to put the opaque data in the same block as the defined one? I mean, instead of separating the two in the binary image, for instance have struct kfd_criu_bo_bucket have a trailing priv_data blob? Maybe it is too late now if the image format is not versioned or something.
> >>>>>>>>
> >>>>>>>>>>> To slightly touch upon the question of whether this could become a generic DRM API: it feels like it would be hard to do from the start. What sounds more feasible is, if/when generic-looking helpers can be spotted while developing the RFC, to structure the code so they can easily be promoted to shared/common at some future moment.
> >>>>>>>>>
> >>>>>>>>> Yes, that's how this usually goes, in my experience. Thanks for confirming.
> >>>>>>>>>
> >>>>>>>>>>>> # Second API for creating objects
> >>>>>>>>>>>>
> >>>>>>>>>>>> Creating BOs and other objects when restoring a checkpoint needs more information than the usual BO alloc and similar APIs provide. For example, we need to restore BOs with the same GEM handles so that user mode can continue using those handles after resuming execution. If BOs are shared through DMABufs without dynamic attachment, we need to restore pinned BOs as pinned. Validation of virtual addresses and handling MMU notifiers must be suspended until the virtual address space is restored. For user mode queues we need to save and restore a lot of queue execution state so that execution can resume cleanly.
> >>>>>>>>>>>
> >>>>>>>>>>> This also sounds justified to me. Restore-time creation of all internal objects is definitely special, and it sounds better to add uapi to create them directly with the correct properties than to add uapi to adjust internal properties after creation. And in either case you would need some new uapi - at the very least to adjust things after creation. At which point you may as well have both in one. Internally the implementation can be split or common, whatever makes sense for a given object type, but new uapi definitely sounds required.
> >>>>>>>>>>>
> >>>>>>>>>>>> # Restoring buffer sharing relationships
> >>>>>>>>>>>>
> >>>>>>>>>>>> Different GEM handles in different render nodes and processes can refer to the same underlying shared memory, either by directly pointing to the same GEM object, or by creating an import attachment that may get its SG tables invalidated and updated dynamically through dynamic attachment callbacks. In the latter case it's obvious who is the exporter and who is the importer. In the first case, either one could be the exporter, and it's not clear who would need to create the BO and who would need to
> >>>>>>>>>>>
> >>>>>>>>>>> To see if I follow the former case correctly:
> >>>>>>>>>>>
> >>>>>>>>>>> This could be two clients A and B, where B has imported a dma-buf shared BO from A and has since closed the dma-buf fd? Which results in a single BO with a reference count of 2 and obj->import_attach unset. History of who created the object is lost.
> >>>>>>>>>
> >>>>>>>>> Yes. In the amdgpu driver this happens when the exporting and importing device are the same.
> >>>>>>>>>
> >>>>>>>>>>> In fact it could even be that two imported objects remain (clients A, B and C) and A, who originally created the BO, has since fully released it. So any kind of "creator" tracking, if added, wouldn't be fully reliable either.
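As an aside, the same-device case described above falls out of an ordinary PRIME round trip; a short sketch with libdrm, using hypothetical handles and with error handling reduced to the essentials:

    /* Sketch: client A exports a GEM handle as a dma-buf fd, client B
     * imports it and then closes the fd. Both handles now reference the
     * same GEM object; nothing records who originally created it. */
    #include <stdint.h>
    #include <unistd.h>
    #include <xf86drm.h>

    static int share_bo(int drm_fd_a, uint32_t handle_a,
                        int drm_fd_b, uint32_t *handle_b)
    {
            int prime_fd = -1;

            if (drmPrimeHandleToFD(drm_fd_a, handle_a, DRM_CLOEXEC, &prime_fd))
                    return -1;
            if (drmPrimeFDToHandle(drm_fd_b, prime_fd, handle_b)) {
                    close(prime_fd);
                    return -1;
            }
            /* After this, the fd is gone; only the two handles remain, and
             * for same-device sharing obj->import_attach is never set. */
            close(prime_fd);
            return 0;
    }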
> >>>>>>>>> That's a good point.
> >>>>>>>>>
> >>>>>>>>>>>> import it when restoring the checkpoint. To further complicate things, multiple processes in a checkpoint get restored concurrently. So there is no guarantee that an exporter has restored a shared BO at the time an importer is trying to restore its import.
> >>>>>>>>>>>>
> >>>>>>>>>>>> A proposal to deal with these problems would be to treat importers and exporters the same. Whoever restores first ends up creating the BO and potentially attaching to it. The other process(es) can find BOs that were already restored by another process by looking them up with a unique ID that could be based on the DMABuf inode number. An alternative would be a two-pass approach that restores BOs in two passes:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. Restore exported BOs
> >>>>>>>>>>>> 2. Restore imports
> >>>>>>>>>>>>
> >>>>>>>>>>>> With some inter-process synchronization in CRIU itself between these two passes. This may require changes in the core CRIU, outside our plugin. Both approaches depend on identifying BOs with some unique ID that could be based on the DMABuf inode number in the checkpoint. However, we would need to identify the processes in the same restore session, possibly based on parent/child process relationships, to create a scope where those IDs are valid during restore.
> >>>>>>>>>>>
> >>>>>>>>>>> If my understanding above is on the right track, then I think this is the only thing which can be done (for all scenarios).
> >>>>>>>>>
> >>>>>>>>> I presented two alternatives. I think you're in favor of the first one, where it doesn't matter who is the importer and exporter. I think the two-pass approach requires that you can identify an exporter. And as you pointed out, the exporter may already have dropped their reference to the BO.
> >>>>>>>>
> >>>>>>>> Yep.
> >>>>>>>>
> >>>>>>>>>>> I also *think* it would be safe. At least at the moment I cannot think what could go wrong. The semantics are that it doesn't really matter who created the object.
> >>>>>>>>>
> >>>>>>>>> I would agree. What matters is that the object is recreated on the correct device, and that all the direct references and import attachments pointing to it are restored.
> >>>>>>>>>
> >>>>>>>>>>>> Finally, we would also need to checkpoint and restore DMABuf file descriptors themselves. These are anonymous file descriptors. The CRIU plugin could probably be taught to recreate them from the original exported BO based on the inode number that could be queried with fstat in the checkpoint. It would need help from the render node CRIU API to find the right BO from the inode, which may be from a different process in the same restore session.
> >>>>>>>>>>>
> >>>>>>>>>>> This part feels like it is breaking the component separation a bit, because even for buffers fully owned by amdgpu, strictly speaking the dma-buf fd is not. At least my understanding from the above is that you propose to attempt to import the fd, from the kernel side, during the checkpoint process? Like:
> >>>>>>>>>>>
> >>>>>>>>>>> Checkpoint:
> >>>>>>>>>>>
> >>>>>>>>>>>   CRIU for each anon fd:
> >>>>>>>>>>>     amdgpu_plugin(fd)
> >>>>>>>>>>>       -> attempt in-kernel dma-buf import (passes fd to kernel via ioctl?)
> >>>>>>>>>>>       -> is it ours?
> >>>>>>>>>>>            (no -> error)
> >>>>>>>>>>>       -> create a record mapping fd number to amdgpu BO
> >>>>>>>>>>>
> >>>>>>>>>>> Restore:
> >>>>>>>>>>>
> >>>>>>>>>>>   for each dma-buf fd record:
> >>>>>>>>>>>     create BO if it does not exist
> >>>>>>>>>>>     export BO to same fd
> >>>>>>>>>>>     close BO handle if not in regular BO handle records
> >>>>>>>>>>>
> >>>>>>>>>>> Or, since you mention lookup by inode, that would need to be dmabuf_plugin so it can look up inodes in the private mount space. However, how it would co-operate with amdgpu_plugin is not clear to me.
> >>>>>>>>>
> >>>>>>>>> The way I think about the ownership is, whichever driver created the underlying BO owns the checkpointing of the dmabuf. You need driver-specific information to link the dmabuf with the driver's BO, and you need the right driver to recreate the BO and the dmabuf fd when restoring the checkpoint.
> >>>>>>>>>
> >>>>>>>>> It gets really interesting if you have an amdgpu plugin and an i915 plugin, and they checkpoint an application that shares BOs between the two devices through DMABufs. E.g. if i915 created a BO and amdgpu imported it, then during restore, i915 needs to restore the dmabuf before the amdgpu import of it can be restored. I think that brings us back to a two-phase approach to restoring the memory sharing relationships. Uff.
> >>>>>>>>
> >>>>>>>> I think this part of the discussion somewhat depends on the previous part about idling. If it is feasible to completely idle and prune, and fail if that is not happening quickly enough, then maybe there wouldn't be too much hierarchical state to save.
> >>>>>>>>
> >>>>>>>> Otherwise my idea was that there is a top-level drm_plugin.so which understands amdgpu fds, i915, syncobj, sync_file, and uses some new uapi to uniquely identify each, associate it with the correct driver, and then internally dispatches to amdgpu|i915|dmabuf|..._plugin.so, building the in-memory representation of their relationships. As long as all objects and their relationships have been recorded, I think everything could then be correctly restored.
> >>>>>>>>
> >>>>>>>> It is possible there is a flaw in my thinking and that something in the CRIU design would make this impossible? I think it would require the top-level drm_plugin.so to hold all state in memory until the whole checkpointing is done, and then verify nothing is incomplete, failing it all if it was. (For instance, one plugin discovered a reference to an object which was not discovered by any other plugin, or things like that.) It may need some further tweaks to CRIU common code.
> >>>>>>>>
> >>>>>>>> Maybe I need to better understand how exactly you mean to query the DRM driver about random anonymous fds. I see it as a problem in the design, possibly even the implementation, but maybe I am missing something which makes it not so. I mean, even with my general idea I don't know how one would determine which driver to query about a particular anonymous inode.
> >>>>>>>>
> >>>>>>>>>> I later also realised that I was maybe increasing the scope for you here. :) You did state the focus is ROCm applications, which possibly don't care about dma-resv, fences, syncobjs etc?
> >>>>>>>>>
> >>>>>>>>> That's my focus for now. But I don't want to engineer a solution that would preclude your use cases in the future.
> >>>>>>>>>
> >>>>>>>>>> But I think the "how to handle dma-bufs" design question is still relevant and interesting.
> >>>>>>>>>> For example, I had this thought that perhaps what would be needed is a CRIU plugin hierarchy.
> >>>>>>>>>>
> >>>>>>>>>> Because fundamentally we would be snapshotting a hierarchy of kernel objects belonging to different drivers (kfd, amdgpu, dma-buf, ...). And if one day someone were to try to handle dma fences and drm syncobjs, the argument for a hierarchical design would be even stronger, I think.
> >>>>>>>>>>
> >>>>>>>>>> Something like a drm_plugin.so could call sub-plugins (amdgpu, dma-buf, sync file, ...) and internally build the representation of the whole state and the relationships between the objects.
> >>>>>>>>>
> >>>>>>>>> Maybe. I guess a structure similar to libdrm makes sense. I'm not sure it's strictly a hierarchy. Maybe more like some common code shared by multiple GPU driver plugins. I think the common checkpoint state is quite limited, and restoring it requires the GPU-specific drivers anyway.
> >>>>>>>>>
> >>>>>>>>> Also, the idea of building a representation of the whole state doesn't work well with the CRIU design, because "the whole state" can include multiple processes that restore themselves concurrently and only synchronize with each other in a few places in the restore process. I feel that if we can work out how to checkpoint and restore shared objects between processes, we can do the same for shared objects between drivers without imposing a strict hierarchy and some omniscient entity that needs to know "the whole state".
> >>>>>>>>
> >>>>>>>> Okay, this continues in the same problem space as above. And you obviously know how CRIU works much better than me.
> >>>>>>>>
> >>>>>>>>>> With that kind of design there probably would be a need to define some common kernel-side api and uapi, so all involved objects can be enumerated with some unique ids etc.
> >>>>>>>>>>
> >>>>>>>>>> Now.. the counter argument.. the more state from different drivers one would want to handle, the bigger this project would get. Whether it would even be feasible is the question, to the point that it may be simpler to just run the workload in a VM via SR-IOV and simply hibernate the whole thin guest. :)
> >>>>>>>>>
> >>>>>>>>> Well, CRIU kind of tries to do that, but with containers instead of VMs. ;)
> >>>>>>>>
> >>>>>>>> It would definitely be useful for hardware and drivers without SR-IOV support, so let's hope it is doable. :)
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>>
> >>>>>>>> Tvrtko
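Purely as an illustration of the "whoever restores first creates it" idea discussed earlier in the thread, using the dma-buf inode from fstat as the unique key; the restore-session helpers here are hypothetical placeholders, not an existing API:

    /* Illustration only: the unique key is the dma-buf inode recorded at
     * checkpoint time (fstat on the exported fd); at restore time the
     * first process to reach a shared BO creates it, everyone else just
     * attaches to the already-restored object. */
    #include <stdint.h>
    #include <sys/stat.h>

    static uint64_t dmabuf_unique_id(int dmabuf_fd)
    {
            struct stat st;

            if (fstat(dmabuf_fd, &st))
                    return 0;
            return st.st_ino;  /* the thread above relies on this being unique per buffer */
    }

    /* Hypothetical restore-session helpers; 0 means "no handle yet". */
    uint32_t restore_session_lookup_bo(uint64_t unique_id);
    uint32_t restore_session_create_bo(uint64_t unique_id, const void *bo_record);

    static uint32_t restore_shared_bo(uint64_t unique_id, const void *bo_record)
    {
            uint32_t handle = restore_session_lookup_bo(unique_id);

            if (!handle)  /* nobody restored it yet: we become the creator */
                    handle = restore_session_create_bo(unique_id, bo_record);
            return handle;
    }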