amdgpu: reset kfd during amdgpu reset

Christian König Fri, 26 Jan 2018 10:41:52 -0800

Alternatively, if you really want to add some distinction here, you could add an enum, like a trigger source.

Something like manual, SRIOV host, GPU scheduler, KFD, interrupt, etc. And then use the user-configurable option as a bitmask to enable/disable GPU recovery for each trigger source.
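
A minimal sketch of that idea, for illustration only. The enum, its member names, and both helpers are hypothetical and are not part of the patch or of the driver; the only thing taken from the suggestion above is the list of sources and the use of one bit per source in a user-configurable mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical trigger sources, named after the list above. */
enum reset_trigger_source {
	RESET_TRIGGER_MANUAL,
	RESET_TRIGGER_SRIOV_HOST,
	RESET_TRIGGER_GPU_SCHEDULER,
	RESET_TRIGGER_KFD,
	RESET_TRIGGER_INTERRUPT,
	RESET_TRIGGER_COUNT
};

/* Bit in the user-configurable mask that corresponds to one source. */
static inline uint32_t reset_trigger_bit(enum reset_trigger_source src)
{
	return UINT32_C(1) << src;
}

/* Returns true when recovery is enabled for the given trigger source. */
static inline bool reset_trigger_enabled(uint32_t mask,
					 enum reset_trigger_source src)
{
	return (mask & reset_trigger_bit(src)) != 0;
}
```

With such a mask, a reset request coming from KFD could be allowed while scheduler-triggered recovery stays disabled, which would remove the need for a separate force flag.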


Regards,
Christian.

On 26.01.2018 at 19:37, Liu, Shaoyun wrote:

OK, the wrong hang detection when amdgpu_gpu_recovery is enabled may be another
issue; we can fix it later.

Changed to the 'false' flag as suggested and submitted.

Regards
Shaoyun.liu


-----Original Message-----
From: Koenig, Christian
Sent: Friday, January 26, 2018 1:28 PM
To: Liu, Shaoyun; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset

Sorry, I meant if I use the "false" flag and gpu_recovery is not enabled,
the reset will be ignored.

And exactly that is the intention here. So please use the false flag, apart 
from that the patch looks good to me.
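
The behaviour agreed on here can be modelled roughly as follows. This is a simplified user-space sketch, not the driver code: reset_proceeds() is a hypothetical helper, gpu_recovery stands in for the amdgpu_gpu_recovery module parameter, and treating every non-zero value as "enabled" is an assumption made for brevity. The hang detection mentioned elsewhere in the thread is left out.

```c
#include <stdbool.h>

/*
 * Simplified model of the gate discussed in this thread.
 * gpu_recovery: stand-in for the amdgpu_gpu_recovery module parameter
 *               (0 = disabled; any other value is treated as enabled,
 *               which is an assumption of this sketch).
 * force:        the third argument passed to amdgpu_device_gpu_recover().
 * Returns true when the reset request is carried out, false when ignored.
 */
static bool reset_proceeds(int gpu_recovery, bool force)
{
	if (force)
		return true;	/* a forced reset bypasses the parameter */

	return gpu_recovery != 0;
}
```

So a KFD-triggered reset passed with force set to false is dropped when recovery is disabled, which is the intended result.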

Regards,
Christian.

On 26.01.2018 at 18:56, Liu, Shaoyun wrote:

Sorry, I meant if I use the "false" flag and gpu_recovery is not enabled,
the reset will be ignored.

-----Original Message-----
From: Liu, Shaoyun
Sent: Friday, January 26, 2018 12:54 PM
To: Koenig, Christian; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset

If we did use the force flag and amdgpu_gpu_recovery = 0, the reset will be
ignored. I would kind of like this reset to go through, as it does for SR-IOV.
If we depend on the amdgpu_gpu_recovery parameter, the driver may think the GPU
is hung and trigger a GPU reset when ROCm submits some heavy compute work that
is still running and not actually hung.

Regards
Shaoyun.liu

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com]
Sent: Friday, January 26, 2018 12:41 PM
To: Liu, Shaoyun; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset

On 26.01.2018 at 18:38, Shaoyun Liu wrote:

Change-Id: I222f4bb2c9a91c7a4764e6aa706e7d7f2e6d948d
Signed-off-by: Shaoyun Liu <shaoyun....@amd.com>
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 19 +++++++++++++++++++
    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  6 ++++++
    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 +++++
    3 files changed, 30 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 2d99099..cb1ee26 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -239,6 +239,25 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
        return r;
    }

+void amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
+{
+       if (adev->kfd)
+               kgd2kfd->pre_reset(adev->kfd);
+}
+
+void amdgpu_amdkfd_post_reset(struct amdgpu_device *adev)
+{
+       if (adev->kfd)
+               kgd2kfd->post_reset(adev->kfd);
+}
+
+void amdgpu_amdkfd_gpu_reset(struct kgd_dev *kgd)
+{
+       struct amdgpu_device *adev = (struct amdgpu_device *)kgd;
+
+       amdgpu_device_gpu_recover(adev, NULL, true);

Use false for the force parameter here; apart from that, the set looks good to
me.

Regards,
Christian.

+}
+
    int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum kgd_engine_type engine,
                                uint32_t vmid, uint64_t gpu_addr,
                                uint32_t *ib_cmd, uint32_t ib_len)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 7c36e52..230761f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -155,6 +155,12 @@ int amdgpu_amdkfd_copy_mem_to_mem(struct kgd_dev *kgd, struct kgd_mem *src_mem,
    bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev,
                        u32 vmid);

+int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev);
+
+int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev);
+
+void amdgpu_amdkfd_gpu_reset(struct kgd_dev *kgd);
+
    /* Shared API */
    int map_bo(struct amdgpu_device *rdev, uint64_t va, void *vm,
                struct amdgpu_bo *bo, struct amdgpu_bo_va **bo_va);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 94f837b..61e7d35 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2660,6 +2660,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
        atomic_inc(&adev->gpu_reset_counter);
        adev->in_gpu_reset = 1;

+       /* Block kfd */
+       amdgpu_amdkfd_pre_reset(adev);
+
        /* block TTM */
        resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
        /* store modesetting */
@@ -2765,6 +2768,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
                amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
        } else {
                dev_info(adev->dev, "GPU reset(%d) successed!\n",atomic_read(&adev->gpu_reset_counter));
+               /*unlock kfd after a successfully recovery*/
+               amdgpu_amdkfd_post_reset(adev);
        }

        amdgpu_vf_error_trans_all(adev);
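
The ordering that the amdgpu_device.c hunks introduce can be summarised with a small stand-alone model. It is only a sketch: the fake_* types and model_* functions are hypothetical stand-ins for the driver structures and for the kgd2kfd callbacks, and the reset itself is reduced to a result code passed in by the caller.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the KFD device and struct amdgpu_device. */
struct fake_kfd {
	bool blocked;
};

struct fake_dev {
	struct fake_kfd *kfd;	/* NULL when KFD is not attached */
};

/* Mirrors the NULL check in amdgpu_amdkfd_pre_reset(). */
static void model_pre_reset(struct fake_dev *dev)
{
	if (dev->kfd)
		dev->kfd->blocked = true;
}

/* Mirrors the NULL check in amdgpu_amdkfd_post_reset(). */
static void model_post_reset(struct fake_dev *dev)
{
	if (dev->kfd)
		dev->kfd->blocked = false;
}

/*
 * KFD is blocked before the reset work starts and is released only on
 * the success branch, as in the patch. reset_result is 0 for success.
 */
static int model_gpu_recover(struct fake_dev *dev, int reset_result)
{
	model_pre_reset(dev);

	/* ... block TTM, perform the reset, restore state ... */

	if (reset_result == 0)
		model_post_reset(dev);

	return reset_result;
}
```

One consequence visible in the model is that KFD stays blocked when the reset fails, because the post-reset call sits only in the success branch.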

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


