[AMD Public Use]

Thanks Dennis. Yes, that's a valid case. Skipping the reset and scheduler resume sounds reasonable to me.
The patch is Reviewed-by: Hawking Zhang <hawking.zh...@amd.com>

Regards,
Hawking

-----Original Message-----
From: Li, Dennis <dennis...@amd.com>
Sent: Thursday, August 20, 2020 16:40
To: Zhang, Hawking <hawking.zh...@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <alexander.deuc...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>; Koenig, Christian <christian.koe...@amd.com>
Subject: RE: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery

[AMD Public Use]

Hi, Hawking,

When a RAS uncorrectable error happens, the RAS interrupt will trigger a GPU recovery. At the same time, if a GFX or compute job times out, the driver will trigger a new one.

Best Regards
Dennis Li

-----Original Message-----
From: Zhang, Hawking <hawking.zh...@amd.com>
Sent: Thursday, August 20, 2020 4:24 PM
To: Li, Dennis <dennis...@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <alexander.deuc...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>; Koenig, Christian <christian.koe...@amd.com>
Cc: Li, Dennis <dennis...@amd.com>
Subject: RE: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery

[AMD Public Use]

Hi Dennis,

Can you elaborate on the case where the driver re-enters GPU recovery in an sGPU system? I'm wondering whether this is a valid case or whether we should prevent it from the beginning.

Regards,
Hawking

-----Original Message-----
From: Dennis Li <dennis...@amd.com>
Sent: Thursday, August 20, 2020 10:21
To: amd-gfx@lists.freedesktop.org; Deucher, Alexander <alexander.deuc...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>; Koenig, Christian <christian.koe...@amd.com>
Cc: Li, Dennis <dennis...@amd.com>
Subject: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery

In a single-GPU system, if the driver re-enters GPU recovery, amdgpu_device_lock_adev will return false, but hive is a null pointer at that point.

Signed-off-by: Dennis Li <dennis...@amd.com>

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 82242e2f5658..81b1d9a1dca0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4371,8 +4371,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		if (!amdgpu_device_lock_adev(tmp_adev)) {
 			DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
 				  job ? job->base.id : -1);
-			mutex_unlock(&hive->hive_lock);
-			return 0;
+			r = 0;
+			goto skip_recovery;
 		}
 
 		/*
@@ -4505,6 +4505,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		amdgpu_device_unlock_adev(tmp_adev);
 	}
 
+skip_recovery:
 	if (hive) {
 		atomic_set(&hive->in_reset, 0);
 		mutex_unlock(&hive->hive_lock);
--
2.17.1
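
For readers following the thread, the control flow the patch moves to can be sketched outside the kernel roughly as below. This is only an illustration under assumptions: every toy_* name and struct is hypothetical and stands in for the real amdgpu types and locking primitives, and the reset work is reduced to a counter. The point it shows is that the early bail-out path jumps to a shared cleanup label where hive is checked for NULL, rather than unlocking hive->hive_lock unconditionally as the removed lines did.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real amdgpu structures. */
struct toy_mutex {
	bool locked;
};

struct toy_hive {
	struct toy_mutex hive_lock;
	int in_reset;
};

struct toy_device {
	bool in_gpu_reset;	/* true while a recovery already owns the device */
	int resets_done;	/* placeholder for the actual reset work */
};

static void toy_mutex_lock(struct toy_mutex *m)
{
	assert(!m->locked);
	m->locked = true;
}

static void toy_mutex_unlock(struct toy_mutex *m)
{
	assert(m->locked);
	m->locked = false;
}

/* Returns false if another recovery is already in progress on dev. */
static bool toy_lock_device(struct toy_device *dev)
{
	if (dev->in_gpu_reset)
		return false;
	dev->in_gpu_reset = true;
	return true;
}

static void toy_unlock_device(struct toy_device *dev)
{
	dev->in_gpu_reset = false;
}

/* hive is NULL in the single-GPU case and non-NULL when a hive exists. */
int toy_gpu_recover(struct toy_device *dev, struct toy_hive *hive)
{
	int r = 0;

	if (hive) {
		toy_mutex_lock(&hive->hive_lock);
		hive->in_reset = 1;
	}

	if (!toy_lock_device(dev)) {
		/*
		 * Re-entry: another recovery is running. The removed lines
		 * unlocked hive->hive_lock right here without checking hive.
		 */
		r = 0;
		goto skip_recovery;
	}

	dev->resets_done++;
	toy_unlock_device(dev);

skip_recovery:
	/* Shared cleanup: only touch the hive when there is one. */
	if (hive) {
		hive->in_reset = 0;
		toy_mutex_unlock(&hive->hive_lock);
	}
	return r;
}
```

Calling toy_gpu_recover(&dev, NULL) while dev.in_gpu_reset is already true models the single-GPU re-entry described in the thread: the function returns 0 without performing a reset and without dereferencing the NULL hive.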