If a suspend fails, the PM core does not clean up after it; the device is simply left in a bad state. If this happens under memory pressure, the system can hang just from trying to suspend.
For all phases of suspend that return an error code, add an unwind flow that will resume the parts that have failed. If this fails, then reset the GPU during the complete() callback.

v3:
 * rebase on amd-staging-drm-next, this got caught up with [1] which was on my tree too.

Mario Limonciello (2):
  drm/amd: Unwind for failed device suspend
  drm/amd: Reset the GPU if pmops failed

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 80 +++++++++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 11 +++
 2 files changed, 83 insertions(+), 8 deletions(-)

-- 
2.51.1
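
For readers who want the general shape of the flow described above, here is a minimal userspace sketch. It is not the code from these patches. Every name in it (struct fake_dev, phase_suspend(), phase_resume(), dev_suspend(), dev_complete()) is hypothetical, and it assumes that "unwinding" means resuming, in reverse order, the phases that had already been suspended before the failing one. If one of those resumes fails, the sketch only records that a reset is needed and defers the reset to a complete()-style hook.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a device with several suspend phases.
 * None of these names come from amdgpu. */
enum { NUM_PHASES = 3 };

struct fake_dev {
	bool suspended[NUM_PHASES]; /* phases currently suspended */
	int fail_suspend_at;        /* phase whose suspend fails, -1 for none */
	int fail_resume_at;         /* phase whose resume fails, -1 for none */
	bool needs_reset;           /* set when the unwind itself failed */
};

static int phase_suspend(struct fake_dev *d, int i)
{
	if (i == d->fail_suspend_at)
		return -EIO;
	d->suspended[i] = true;
	return 0;
}

static int phase_resume(struct fake_dev *d, int i)
{
	if (i == d->fail_resume_at)
		return -EIO;
	d->suspended[i] = false;
	return 0;
}

/* Suspend every phase in order. On the first error, resume the phases
 * that were already suspended, in reverse order, and return the error. */
static int dev_suspend(struct fake_dev *d)
{
	int i, r = 0;

	for (i = 0; i < NUM_PHASES; i++) {
		r = phase_suspend(d, i);
		if (r)
			goto unwind;
	}
	return 0;

unwind:
	while (--i >= 0) {
		/* A failed resume leaves the device half suspended:
		 * remember that a reset is required later. */
		if (phase_resume(d, i))
			d->needs_reset = true;
	}
	return r;
}

/* Stand-in for a complete() hook: if the unwind failed, simulate a
 * full reset by marking every phase as running again. */
static void dev_complete(struct fake_dev *d)
{
	int i;

	if (!d->needs_reset)
		return;
	for (i = 0; i < NUM_PHASES; i++)
		d->suspended[i] = false;
	d->needs_reset = false;
}
```

With fail_suspend_at = 2 and fail_resume_at = -1, dev_suspend() returns -EIO and leaves every phase running. With fail_resume_at = 0 as well, phase 0 stays suspended and needs_reset is set until dev_complete() runs.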
