On 01.07.25 10:03, Zhang, GuoQing (Sam) wrote: > > thaw() is called before writing the hiberation image to swap disk. See > the doc here. > https://github.com/torvalds/linux/blob/v6.14/Documentation/driver-api/pm/devices.rst?plain=1#L552 > > <https://github.com/torvalds/linux/blob/v6.14/Documentation/driver-api/pm/devices.rst?plain=1#L552> > > And amdgpu implemented thaw() callback by calling amdgpu_device_resume(). > https://github.com/torvalds/linux/blob/v6.14/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2572 > > <https://github.com/torvalds/linux/blob/v6.14/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2572> > > This patch is skip amdgpu_amdkfd_resume_process() call in thaw() during > hibernation. it is not skipped in restore() during resume from > hibernation when system boot again. > > > I just found the following kernel doc. Thaw() is intended to resume the > storage device for saving the hibernation image.
Ah, that makes much more sense. > Our GPU is not involved > in it, it is not necessary to resume our GPU in thaw(). > https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588 > > <https://github.com/torvalds/linux/blob/v6.14/Documentation/power/pci.rst?plain=1#L588> > > So another implementation is to remove the amdgpu_device_resume() call > in amdgpu_pmops_thaw(), and skip amdgpu_device_ip_suspend() call in > amdgpu_pci_shutdown()for hibernation. > Initial tests show it's working fine for hibernation successful case. > Should I switch to this implementation? No idea. Alex and the KFD guys need to take a look at that. > But thaw() is also called to restore the GPU when hibernation is aborted > due to some error in hibernation image creation stage. In this case, > amdgpu_device_resume() is needed in thaw(). > > So I need a method to check if hibernation is aborted or not to > conditionally skip amdgpu_device_resume() in thaw(). Currently I don't > know how to do this. Yeah that approach here looks fishy to me, but I don't know how to properly fix it either. @Alex any idea? Regards, Christian. > > > Regards > Sam > > > On 2025/6/30 19:58, Christian König wrote: >> On 30.06.25 12:41, Samuel Zhang wrote: >>> The hibernation successful workflow: >>> - prepare: evict VRAM and swapout GTT BOs >>> - freeze >>> - create the hibernation image in system memory >>> - thaw: swapin and restore BOs >> Why should a thaw happen here in between? >> >>> - complete >>> - write hibernation image to disk >>> - amdgpu_pci_shutdown >>> - goto S5, turn off the system. >>> >>> During prepare stage of hibernation, VRAM and GTT BOs will be swapout to >>> shmem. Then in thaw stage, all BOs will be swapin and restored. >> That's not correct. This is done by the application starting again and not >> during thaw. >> >>> On server with 192GB VRAM * 8 dGPUs and 1.7TB system memory, >>> the swapin and restore BOs takes too long (50 minutes) and it is not >>> necessary since the follow-up stages does not use GPU. >>> >>> This patch is to skip BOs restore during thaw to reduce the hibernation >>> time. >> As far as I can see that doesn't make sense. The KFD processes need to be >> resumed here and that can't be skipped. >> >> Regards, >> Christian. >> >>> Signed-off-by: Samuel Zhang <guoqing.zh...@amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 ++ >>> 2 files changed, 3 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> index a8f4697deb1b..b550d07190a2 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> @@ -5328,7 +5328,7 @@ int amdgpu_device_resume(struct drm_device *dev, bool >>> notify_clients) >>> amdgpu_virt_init_data_exchange(adev); >>> amdgpu_virt_release_full_gpu(adev, true); >>> >>> - if (!adev->in_s0ix && !r && !adev->in_runpm) >>> + if (!adev->in_s0ix && !r && !adev->in_runpm && !adev->in_s4) >>> r = amdgpu_amdkfd_resume_process(adev); >>> } >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>> index 571b70da4562..23b76e8ac2fd 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>> @@ -2734,7 +2734,9 @@ static int amdgpu_pmops_poweroff(struct device *dev) >>> static int amdgpu_pmops_restore(struct device *dev) >>> { >>> struct drm_device *drm_dev = dev_get_drvdata(dev); >>> + struct amdgpu_device *adev = drm_to_adev(drm_dev); >>> >>> + adev->in_s4 = false; >>> return amdgpu_device_resume(drm_dev, true); >>> } >>> >