Re: [PATCHv2 1/2] drm/amd/amdgpu embed hw_fence into amdgpu_job

2021-08-05 Thread Andrey Grodzovsky
On 2021-08-05 4:31 a.m., Jingwen Chen wrote: From: Jack Zhang Why: Previously the hw fence was allocated separately from the job. It caused historical lifetime issues and corner cases. The ideal situation is to use the fence to manage both the job's and the fence's lifetime, and simplify the design of gpu-scheduler.

Re: [PATCHv2 2/2] drm/amd/amdgpu: add tdr support for embeded hw_fence

2021-08-09 Thread Andrey Grodzovsky
On 2021-08-05 4:31 a.m., Jingwen Chen wrote: [Why] After embedding hw_fence into amdgpu_job, we need to add tdr support for this feature. [How] 1. Add a resubmit_flag for resubmit jobs. 2. Clear job fence from RCU and force complete vm flush fences in pre_asic_reset 3. skip dma_fence_get for r

Re: [PATCH v4] drm/amd/amdgpu embed hw_fence into amdgpu_job

2021-08-10 Thread Andrey Grodzovsky
Reviewed-by: Andrey Grodzovsky Andrey On 2021-08-09 11:22 p.m., Jingwen Chen wrote: From: Jack Zhang Why: Previously the hw fence was allocated separately from the job. It caused historical lifetime issues and corner cases. The ideal situation is to take fence to manage both job and fence's lif

Re: [PATCH] Revert "drm/scheduler: Avoid accessing freed bad job."

2021-08-17 Thread Andrey Grodzovsky
On 2021-08-17 12:28 a.m., Jingwen Chen wrote: [Why] for bailing job, this commit will delete it from pending list thus the bailing job will never have a chance to be resubmitted even in advance tdr mode. [How] after embedding hw_fence into amdgpu_job is done, the race condition that this commit tr

Re: [PATCH] drm/amd/amdgpu:flush ttm delayed work before cancel_sync

2021-08-17 Thread Andrey Grodzovsky
Looks reasonable to me. Reviewed-by: Andrey Grodzovsky Andrey On 2021-08-17 5:50 a.m., YuBiao Wang wrote: [Why] In some cases when we unload driver, warning call trace will show up in vram_mgr_fini which claims that LRU is not empty, caused by the ttm bo inside delay deleted queue. [How] We

Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-17 Thread Andrey Grodzovsky
On 2021-08-02 1:16 a.m., Guchun Chen wrote: In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to stop scheduler in s3 test, otherwise, fence related failure will arrive after resume. To fix this and for a better clean up, move drm_sched_fini from fence_hw_fini to fence_sw_fini, as

Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

2021-08-18 Thread Andrey Grodzovsky
On 2021-08-18 10:02 a.m., Alex Deucher wrote: + dri-devel Since scheduler is a shared component, please add dri-devel on all scheduler patches. On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen wrote: [Why] for bailing job, this commit will delete it from pending list thus the bailing job will ne

Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

2021-08-18 Thread Andrey Grodzovsky
On 2021-08-18 10:32 a.m., Daniel Vetter wrote: On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote: On 2021-08-18 10:02 a.m., Alex Deucher wrote: + dri-devel Since scheduler is a shared component, please add dri-devel on all scheduler patches. On Wed, Aug 18, 2021 at 7:21 AM

Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

2021-08-18 Thread Andrey Grodzovsky
On 2021-08-18 10:42 a.m., Daniel Vetter wrote: On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote: On 2021-08-18 10:32 a.m., Daniel Vetter wrote: On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote: On 2021-08-18 10:02 a.m., Alex Deucher wrote: + dri-devel

Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

2021-08-19 Thread Andrey Grodzovsky
On 2021-08-19 5:30 a.m., Daniel Vetter wrote: On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote: On 2021-08-18 10:42 a.m., Daniel Vetter wrote: On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote: On 2021-08-18 10:32 a.m., Daniel Vetter wrote: On Wed, Aug 18

Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

2021-08-20 Thread Andrey Grodzovsky
age- From: Daniel Vetter Sent: Thursday, August 19, 2021 5:31 PM To: Grodzovsky, Andrey Cc: Daniel Vetter ; Alex Deucher ; Chen, JingWen ; Maling list - DRI developers ; amd-gfx list ; Liu, Monk ; Koenig, Christian Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job

Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

2021-08-20 Thread Andrey Grodzovsky
- -Original Message- From: Daniel Vetter Sent: Thursday, August 19, 2021 5:31 PM To: Grodzovsky, Andrey Cc: Daniel Vetter ; Alex Deucher ; Chen, JingWen ; Maling list - DRI developers ; amd-gfx list ; Liu, Monk ; Koenig, Christian Subject: Re: [PATCH v2] Rev

Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-23 Thread Andrey Grodzovsky
r-handle of fence driver fini in s3 test (v2) Please go ahead.  Thanks! Alex On Thu, Aug 19, 2021 at 8:05 AM Mike Lothian wrote: Hi Do I need to open a new bug report for this? Cheers Mike On Wed, 18 Aug 2021 at 06:26, Andrey Grodzovsky wrote: On 2021-08-02 1:1

Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

2021-08-24 Thread Andrey Grodzovsky
f they already * signaled. Thanks -- Monk Liu | Cloud-GPU Core team -- -Original Message- From: Daniel Vetter Sent: Thursday, August 19, 2021 5:31 PM To: Grodzovsky, Andrey Cc: Daniel Vetter ; Alex De

Re: [PATCH] drm/sched: fix the bug of time out calculation

2021-08-24 Thread Andrey Grodzovsky
On 2021-08-24 10:46 a.m., Andrey Grodzovsky wrote: On 2021-08-24 5:51 a.m., Monk Liu wrote: the original logic is wrong in that the timeout will not be retriggered after the previous job signaled, and that leads to the scenario that all jobs in the same scheduler share the same timeout timer

Re: [PATCH] drm/sched: fix the bug of time out calculation

2021-08-24 Thread Andrey Grodzovsky
On 2021-08-24 5:51 a.m., Monk Liu wrote: the original logic is wrong in that the timeout will not be retriggered after the previous job signaled, and that leads to the scenario that all jobs in the same scheduler share the same timeout timer from the very beginning job in this scheduler which is wro

[PATCH 0/4] Various fixes to pass libdrm hotunplug tests

2021-08-24 Thread Andrey Grodzovsky
Bunch of fixes to enable passing hotplug tests I previously added here[1] with latest code. Once accepted I will enable the tests on libdrm side. [1] - https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/172 Andrey Grodzovsky (4): drm/amdgpu: Move flush VCE idle_work during HW fini drm

[PATCH 2/4] drm/ttm: Create pinned list

2021-08-24 Thread Andrey Grodzovsky
This list will be used to capture all non VRAM BOs not on LRU so when device is hot unplugged we can iterate the list and unmap DMA mappings before device is removed. Signed-off-by: Andrey Grodzovsky Suggested-by: Christian König --- drivers/gpu/drm/ttm/ttm_bo.c | 24

[PATCH 1/4] drm/amdgpu: Move flush VCE idle_work during HW fini

2021-08-24 Thread Andrey Grodzovsky
Attempts to powergate after device is removed lead to a crash. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c | 1 - drivers/gpu/drm/amd/amdgpu/vce_v2_0.c | 4 drivers/gpu/drm/amd/amdgpu/vce_v3_0.c | 5 - drivers/gpu/drm/amd/amdgpu/vce_v4_0.c | 2 ++ 4

[PATCH 3/4] drm/amdgpu: drm/amdgpu: Handle IOMMU enabled case

2021-08-24 Thread Andrey Grodzovsky
Handle all DMA IOMMU group related dependencies before the group is removed and we try to access it after free. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c| 50 ++ drivers/gpu/drm/amd

[PATCH 4/4] drm/amdgpu: Add a UAPI flag for hot plug/unplug

2021-08-24 Thread Andrey Grodzovsky
To support libdrm tests. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index 6400259a7c4b..c2fdf67ff551 100644

Re: [PATCH 1/4] drm/amdgpu: Move flush VCE idle_work during HW fini

2021-08-24 Thread Andrey Grodzovsky
issue here too. https://lists.freedesktop.org/archives/amd-gfx/2021-August/067972.html https://lists.freedesktop.org/archives/amd-gfx/2021-August/067967.html BR Evan -Original Message- From: amd-gfx On Behalf Of Andrey Grodzovsky Sent: Wednesday, August 25, 2021 5:01 AM To: d

Re: [PATCH 3/4] drm/amdgpu: drm/amdgpu: Handle IOMMU enabled case

2021-08-25 Thread Andrey Grodzovsky
On 2021-08-25 2:43 a.m., Christian König wrote: On 24.08.21 at 23:01, Andrey Grodzovsky wrote: Handle all DMA IOMMU group related dependencies before the group is removed and we try to access it after free. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

Re: [PATCH] drm/sched: fix the bug of time out calculation(v2)

2021-08-25 Thread Andrey Grodzovsky
On 2021-08-25 8:11 a.m., Christian König wrote: No, this would break that logic here. See drm_sched_start_timeout() can be called multiple times, this is intentional and very important! The logic in queue_delayed_work() makes sure that the timer is only started once and then never again.

Re: [PATCH] drm/sched: fix the bug of time out calculation(v2)

2021-08-25 Thread Andrey Grodzovsky
On 2021-08-25 10:31 p.m., Liu, Monk wrote: [AMD Official Use Only] Hi Andrey I'm not quite sure if I read you correctly Seems to me you can only do it for empty pending list otherwise you risk cancelling a legit new timer that was started by the next job or not restarting timer at all sin

Re: [PATCH] drm/sched: fix the bug of time out calculation(v2)

2021-08-25 Thread Andrey Grodzovsky
On 2021-08-26 12:55 a.m., Liu, Monk wrote: [AMD Official Use Only] But for timer pending case (common case) your mod_delayed_work will effectively do exactly the same if you don't use per job TTLs - you mod it to sched->timeout value which resets the pending timer to again count from 0.
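The timer semantics debated in this thread can be modelled in userspace. The following is a minimal sketch, not kernel code: the `model_dwork` struct and the `model_*` helpers are hypothetical names introduced only for illustration. It assumes the behavior described in the messages above, namely that `queue_delayed_work()` leaves an already pending timer untouched, while `mod_delayed_work()` re-arms a pending timer so that it counts again from the current time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model of a delayed work item (hypothetical, not kernel code). */
struct model_dwork {
    bool pending;          /* timer is armed and has not fired yet */
    unsigned long expires; /* absolute expiry time in arbitrary ticks */
};

/*
 * Models queue_delayed_work(): arms the timer only if the work is idle.
 * Returns true if the timer was armed, false if it was already pending.
 */
static bool model_queue_delayed_work(struct model_dwork *w,
                                     unsigned long now, unsigned long delay)
{
    assert(w != NULL);
    if (w->pending)
        return false; /* already pending: expiry is left untouched */
    w->pending = true;
    w->expires = now + delay;
    return true;
}

/*
 * Models mod_delayed_work(): always (re)arms the timer relative to 'now'.
 * Returns true if a pending timer was modified, false if the work was idle.
 */
static bool model_mod_delayed_work(struct model_dwork *w,
                                   unsigned long now, unsigned long delay)
{
    assert(w != NULL);
    bool was_pending = w->pending;
    w->pending = true;
    w->expires = now + delay;
    return was_pending;
}
```

With a shared `sched->timeout` delay, calling the "mod" variant on every job completion pushes the expiry forward each time, whereas calling the "queue" variant repeatedly keeps the expiry set by the first call.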

Re: [PATCH 3/4] drm/amdgpu: drm/amdgpu: Handle IOMMU enabled case

2021-08-26 Thread Andrey Grodzovsky
Ping Andrey On 2021-08-25 11:36 a.m., Andrey Grodzovsky wrote: On 2021-08-25 2:43 a.m., Christian König wrote: On 24.08.21 at 23:01, Andrey Grodzovsky wrote: Handle all DMA IOMMU group related dependencies before the group is removed and we try to access it after free. Signed-off-by

Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)

2021-08-26 Thread Andrey Grodzovsky
On 2021-08-26 12:55 a.m., Monk Liu wrote: issue: in cleanup_job the cancel_delayed_work will cancel a TO timer even if its corresponding job is still running. fix: do not cancel the timer in cleanup_job, instead do the cancelling only when the heading job is signaled, and if there is a "next"

[PATCH v2 1/4] drm/ttm: Create pinned list

2021-08-26 Thread Andrey Grodzovsky
assigned to them. Signed-off-by: Andrey Grodzovsky Suggested-by: Christian König --- drivers/gpu/drm/ttm/ttm_bo.c | 30 ++ drivers/gpu/drm/ttm/ttm_resource.c | 1 + include/drm/ttm/ttm_resource.h | 1 + 3 files changed, 28 insertions(+), 4 deletions(-) diff

[PATCH v2 3/4] drm/amdgpu: drm/amdgpu: Handle IOMMU enabled case

2021-08-26 Thread Andrey Grodzovsky
Handle all DMA IOMMU group related dependencies before the group is removed and we try to access it after free. v2: Move the actual handling function to TTM Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a

[PATCH v2 0/4] Various fixes to pass libdrm hotunplug tests

2021-08-26 Thread Andrey Grodzovsky
IOMMU handling to TTM layer. Andrey Grodzovsky (4): drm/ttm: Create pinned list drm/ttm: Clear all DMA mappings on demand drm/amdgpu: drm/amdgpu: Handle IOMMU enabled case drm/amdgpu: Add a UAPI flag for hot plug/unplug drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + drivers/gpu/drm

[PATCH v2 4/4] drm/amdgpu: Add a UAPI flag for hot plug/unplug

2021-08-26 Thread Andrey Grodzovsky
To support libdrm tests. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index 6400259a7c4b..c2fdf67ff551 100644

[PATCH v2 2/4] drm/ttm: Clear all DMA mappings on demand

2021-08-26 Thread Andrey Grodzovsky
Used by drivers supporting hot unplug to handle all DMA IOMMU group related dependencies before the group is removed during device removal and we try to access it after free when last device pointer from user space is dropped. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/ttm

Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)

2021-08-26 Thread Andrey Grodzovsky
Attached quick patch for per job TTL calculation to make the next timer expiration more precise. It's on top of the patch in this thread. Let me know if this makes sense. Andrey On 2021-08-26 10:03 a.m., Andrey Grodzovsky wrote: On 2021-08-26 12:55 a.m., Monk Liu wrote: issue: in cleanu

Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)

2021-08-27 Thread Andrey Grodzovsky
eedesktop.org Subject: Re: [PATCH] drm/sched: fix the bug of time out calculation(v3) Attached quick patch for per job TTL calculation to make the next timer expiration more precise. It's on top of the patch in this thread. Let me know if this makes sense. Andrey On 2021-08-26 10:03 a.m., Andrey

Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)

2021-08-27 Thread Andrey Grodzovsky
hristian. On 26.08.21 at 22:14, Andrey Grodzovsky wrote: Attached quick patch for per job TTL calculation to make the next timer expiration more precise. It's on top of the patch in this thread. Let me know if this makes sense. Andrey On 2021-08-26 10:03 a.m., Andrey Grodzovsky wrote: On 2

Re: [PATCH v2 0/4] Various fixes to pass libdrm hotunplug tests

2021-08-27 Thread Andrey Grodzovsky
Ping Andrey On 2021-08-26 1:27 p.m., Andrey Grodzovsky wrote: Bunch of fixes to enable passing hotplug tests I previously added here[1] with latest code. Once accepted I will enable the tests on libdrm side. [1] - https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/172 v2: Dropping VCE

Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)

2021-08-27 Thread Andrey Grodzovsky
serted into the ring buffer, but rather when it starts processing. Starting processing is a bit swampy defined, but just starting the timer when the previous job completes should be fine enough. Christian. On 27.08.21 at 15:57, Andrey Grodzovsky wrote: The TS represents the point in time w

Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)

2021-08-27 Thread Andrey Grodzovsky
It's still better than starting the timer when pushing the job to the ring buffer, because that is completely off. Christian. On 27.08.21 at 20:22, Andrey Grodzovsky wrote: As I mentioned to Monk before - what about cases such as in this test - https://gitlab.freedesktop.org/me

Re: [PATCH] drm/amdgpu: stop scheduler when calling hw_fini

2021-08-27 Thread Andrey Grodzovsky
I don't think it will start/stop twice because amdgpu_fence_driver_hw_fini/init is not called during reset. I am worried about calling drm_sched_start without calling drm_sched_resubmit_job first since that is the place where the jobs are actually restarted. Also calling drm_sched_start with fal

[PATCH v3 0/4] Various fixes to pass libdrm hotunplug tests

2021-08-27 Thread Andrey Grodzovsky
IOMMU handling to TTM layer. v3: Move pinned list to ttm device and a few others. Andrey Grodzovsky (4): drm/ttm: Create pinned list drm/ttm: Clear all DMA mappings on demand drm/amdgpu: drm/amdgpu: Handle IOMMU enabled case drm/amdgpu: Add a UAPI flag for hot plug/unplug drivers/gpu

[PATCH v3 1/4] drm/ttm: Create pinned list

2021-08-27 Thread Andrey Grodzovsky
This list will be used to capture all non VRAM BOs not on LRU so when device is hot unplugged we can iterate the list and unmap DMA mappings before device is removed. v2: Rename function to ttm_bo_move_to_pinned v3: Move the pinned list to ttm device Signed-off-by: Andrey Grodzovsky Suggested

[PATCH v3 4/4] drm/amdgpu: Add a UAPI flag for hot plug/unplug

2021-08-27 Thread Andrey Grodzovsky
To support libdrm tests. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index 6400259a7c4b..c2fdf67ff551 100644

[PATCH v3 2/4] drm/ttm: Clear all DMA mappings on demand

2021-08-27 Thread Andrey Grodzovsky
Switch to ttm_tt_unpopulate Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/ttm/ttm_device.c | 47 include/drm/ttm/ttm_device.h | 1 + 2 files changed, 48 insertions(+) diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c index

[PATCH v3 3/4] drm/amdgpu: drm/amdgpu: Handle IOMMU enabled case

2021-08-27 Thread Andrey Grodzovsky
Handle all DMA IOMMU group related dependencies before the group is removed and we try to access it after free. v2: Move the actual handling function to TTM Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a

Re: [PATCH v3 1/4] drm/ttm: Create pinned list

2021-08-30 Thread Andrey Grodzovsky
On 2021-08-30 4:58 a.m., Christian König wrote: On 27.08.21 at 22:39, Andrey Grodzovsky wrote: This list will be used to capture all non VRAM BOs not on LRU so when device is hot unplugged we can iterate the list and unmap DMA mappings before device is removed. v2: Rename function to

Re: [PATCH v3 1/4] drm/ttm: Create pinned list

2021-08-30 Thread Andrey Grodzovsky
On 2021-08-30 12:51 p.m., Christian König wrote: On 30.08.21 at 16:16, Andrey Grodzovsky wrote: On 2021-08-30 4:58 a.m., Christian König wrote: On 27.08.21 at 22:39, Andrey Grodzovsky wrote: This list will be used to capture all non VRAM BOs not on LRU so when device is hot unplugged we

Re: [PATCH v3 1/4] drm/ttm: Create pinned list

2021-08-30 Thread Andrey Grodzovsky
On 2021-08-30 1:05 p.m., Christian König wrote: On 30.08.21 at 19:02, Andrey Grodzovsky wrote: On 2021-08-30 12:51 p.m., Christian König wrote: On 30.08.21 at 16:16, Andrey Grodzovsky wrote: On 2021-08-30 4:58 a.m., Christian König wrote: On 27.08.21 at 22:39, Andrey

Re: [PATCH] drm/amdgpu: stop scheduler when calling hw_fini (v2)

2021-08-30 Thread Andrey Grodzovsky
empty before suspend. v2: Call drm_sched_resubmit_job before drm_sched_start to restart jobs from the pending list. Suggested-by: Andrey Grodzovsky Suggested-by: Christian König Signed-off-by: Guchun Chen Reviewed-by: Christian König ---   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8

Re: [PATCH] drm/amdgpu: Fix a deadlock if previous GEM object allocation fails

2021-08-30 Thread Andrey Grodzovsky
On 2021-08-30 11:24 p.m., Pan, Xinhui wrote: [AMD Official Use Only] [AMD Official Use Only] Unreserve root BO before return otherwise next allocation got deadlock. Signed-off-by: xinhui pan --- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 11 +-- 1 file changed, 5 insertions(+), 6

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Andrey Grodzovsky
It says patch [2/2] but I can't find patch 1 On 2021-08-31 6:35 a.m., Monk Liu wrote: tested-by: jingwen chen Signed-off-by: Monk Liu Signed-off-by: jingwen chen --- drivers/gpu/drm/scheduler/sched_main.c | 24 1 file changed, 4 insertions(+), 20 deletions(-) di

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Andrey Grodzovsky
On 2021-08-31 10:03 a.m., Daniel Vetter wrote: On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote: It says patch [2/2] but I can't find patch 1 On 2021-08-31 6:35 a.m., Monk Liu wrote: tested-by: jingwen chen Signed-off-by: Monk Liu Signed-off-by: ji

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Andrey Grodzovsky
On 2021-08-31 10:38 a.m., Daniel Vetter wrote: On Tue, Aug 31, 2021 at 10:20:40AM -0400, Andrey Grodzovsky wrote: On 2021-08-31 10:03 a.m., Daniel Vetter wrote: On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote: It says patch [2/2] but I can't find patch 1 On 20

Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

2021-08-31 Thread Andrey Grodzovsky
On 2021-08-31 9:11 a.m., Daniel Vetter wrote: On Thu, Aug 26, 2021 at 11:04:14AM +0200, Daniel Vetter wrote: On Thu, Aug 19, 2021 at 11:25:09AM -0400, Andrey Grodzovsky wrote: On 2021-08-19 5:30 a.m., Daniel Vetter wrote: On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-08-31 Thread Andrey Grodzovsky
On 2021-08-31 12:01 p.m., Luben Tuikov wrote: On 2021-08-31 11:23, Andrey Grodzovsky wrote: On 2021-08-31 10:38 a.m., Daniel Vetter wrote: On Tue, Aug 31, 2021 at 10:20:40AM -0400, Andrey Grodzovsky wrote: On 2021-08-31 10:03 a.m., Daniel Vetter wrote: On Tue, Aug 31, 2021 at 09:53:36AM

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Andrey Grodzovsky
I will answer everything here - On 2021-08-31 9:58 p.m., Liu, Monk wrote: [AMD Official Use Only] In the previous discussion, you guys stated that we should drop the “kthread_should_park” in cleanup_job. @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched) {  

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Andrey Grodzovsky
On 2021-09-01 12:25 a.m., Jingwen Chen wrote: On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote: I will answer everything here - On 2021-08-31 9:58 p.m., Liu, Monk wrote: [AMD Official Use Only] In the previous discussion, you guys stated that we should

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Andrey Grodzovsky
On 2021-09-01 12:40 a.m., Jingwen Chen wrote: On Wed Sep 01, 2021 at 12:28:59AM -0400, Andrey Grodzovsky wrote: On 2021-09-01 12:25 a.m., Jingwen Chen wrote: On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote: I will answer everything here - On 2021-08-31 9:58 p.m., Liu, Monk

Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

2021-09-02 Thread Andrey Grodzovsky
On 2021-09-02 10:28 a.m., Daniel Vetter wrote: On Tue, Aug 31, 2021 at 02:24:52PM -0400, Andrey Grodzovsky wrote: On 2021-08-31 9:11 a.m., Daniel Vetter wrote: On Thu, Aug 26, 2021 at 11:04:14AM +0200, Daniel Vetter wrote: On Thu, Aug 19, 2021 at 11:25:09AM -0400, Andrey Grodzovsky wrote

Re: [PATCH v2] drm/amdgpu: Fix a race of IB test

2021-09-13 Thread Andrey Grodzovsky
Please add a tag V2 in description explaining what was the delta from V1. Other than that looks good to me. Andrey On 2021-09-12 7:48 p.m., xinhui pan wrote: Direct IB submission should be exclusive. So use write lock. Signed-off-by: xinhui pan --- drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.

Re: [PATCH] drm/amdgpu: Put drm_dev_enter/exit outside hot codepath

2021-09-14 Thread Andrey Grodzovsky
On 2021-09-14 9:42 p.m., xinhui pan wrote: We hit soft hang while doing memory pressure test on one numa system. After a quick look, this is because kfd invalidates/validates userptr memory frequently with the process_info lock held. perf top says below, 75.81% [kernel] [k] __srcu_read_unlock Do
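The idea named in the subject line can be illustrated in userspace: take the unplug guard once around a batch of work instead of once per iteration. This is a minimal sketch under stated assumptions. `model_dev_enter()` and `model_dev_exit()` are hypothetical stand-ins for `drm_dev_enter()` and `drm_dev_exit()`; the real functions have different signatures and are backed by SRCU, which is where the `__srcu_read_unlock` cost in the perf output comes from. The per-page loop is likewise an invented workload, not the code touched by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy device: counts how often the unplug guard is taken (hypothetical). */
struct model_dev {
    bool unplugged;
    unsigned long guard_calls;
};

/* Stand-in for drm_dev_enter(): returns false once the device is gone. */
static bool model_dev_enter(struct model_dev *d)
{
    assert(d != NULL);
    d->guard_calls++;
    return !d->unplugged;
}

/* Stand-in for drm_dev_exit(); the real one drops an SRCU read lock. */
static void model_dev_exit(struct model_dev *d)
{
    (void)d;
}

/* Guard inside the hot loop: one enter/exit pair per page. */
static size_t model_update_guard_inside(struct model_dev *d, size_t npages)
{
    size_t done = 0;

    for (size_t i = 0; i < npages; i++) {
        if (!model_dev_enter(d))
            break;
        done++; /* placeholder for the per-page work */
        model_dev_exit(d);
    }
    return done;
}

/* Guard hoisted to the caller: one enter/exit pair for the whole batch. */
static size_t model_update_guard_outside(struct model_dev *d, size_t npages)
{
    size_t done = 0;

    if (!model_dev_enter(d))
        return 0;
    for (size_t i = 0; i < npages; i++)
        done++; /* placeholder for the per-page work */
    model_dev_exit(d);
    return done;
}
```

The trade-off is that the hoisted variant holds the guard for the whole batch, so an unplug waits for the batch to finish rather than for a single page.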

Re: Reply: [PATCH] drm/amdgpu: Put drm_dev_enter/exit outside hot codepath

2021-09-14 Thread Andrey Grodzovsky
I think you missed 'reply all' so bringing  back to public On 2021-09-14 11:40 p.m., Pan, Xinhui wrote: [AMD Official Use Only] perf says it is the lock addl $0x0,-0x4(%rsp) details is below. the contention is huge maybe. Yes - that makes sense to me too as long as the lock here is some

Re: Reply: [PATCH v2] drm/amdgpu: Put drm_dev_enter/exit outside hot codepath

2021-09-15 Thread Andrey Grodzovsky
On 2021-09-15 2:42 a.m., Pan, Xinhui wrote: [AMD Official Use Only] Andrey I hit panic with this plug/unplug test without this patch. Can you please tell which ASIC you are using and which kernel branch and what is the tip commit ? But as we add enter/exit in all its callers. maybe it wo

Re: Reply: [PATCH v2] drm/amdgpu: Put drm_dev_enter/exit outside hot codepath

2021-09-15 Thread Andrey Grodzovsky
On 2021-09-15 9:57 a.m., Christian König wrote: On 15.09.21 at 15:52, Andrey Grodzovsky wrote: On 2021-09-15 2:42 a.m., Pan, Xinhui wrote: [AMD Official Use Only] Andrey I hit panic with this plug/unplug test without this patch. Can you please tell which ASIC you are using and which

Re: [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)

2021-09-15 Thread Andrey Grodzovsky
Pushed Andrey On 2021-09-15 7:45 a.m., Christian König wrote: Yes, I think so as well. Andrey can you push this? Christian. On 15.09.21 at 00:59, Grodzovsky, Andrey wrote: AFAIK this one is independent. Christian, can you confirm ? Andrey --

[PATCH] drm/amdgpu: Fix crash on device remove/driver unload

2021-09-15 Thread Andrey Grodzovsky
spend ee6679aaa61c drm/amdgpu: add missing cleanups for Polaris12 UVD/VCE on suspend Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/uvd_v3_1.c | 24 --- drivers/gpu/drm/amd/amdgpu/uvd_v4_2.c | 24 --- drivers/gpu/drm/amd/amdgpu/uvd_v5_0.c | 24 --- dr

[PATCH] drm/amd/display: Fix crash on device remove/driver unload

2021-09-15 Thread Andrey Grodzovsky
Why: DC core is being released from DM before it's referenced from hpd_rx wq destruction code. How: Move hpd_rx destruction before DC core destruction. Signed-off-by: Andrey Grodzovsky --- .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 24 +-- 1 file changed, 12 inser

Re: [PATCH v2] drm/amdgpu: Put drm_dev_enter/exit outside hot codepath

2021-09-15 Thread Andrey Grodzovsky
I fixed 2 regressions and latest code, applied your patch on top and passed libdrm tests on Vega 10. You can pickup those 2 patches and try too if you have time. In any case - Reviewed-and-tested-by: Andrey Grodzovsky Andrey On 2021-09-15 2:37 a.m., xinhui pan wrote: We hit soft hang while

Re: [PATCH] drm/amdgpu: Fix crash on device remove/driver unload

2021-09-16 Thread Andrey Grodzovsky
On 2021-09-16 4:20 a.m., Lazar, Lijo wrote: A minor comment below. On 9/16/2021 1:11 AM, Andrey Grodzovsky wrote: Crash: BUG: unable to handle page fault for address: 10e1 RIP: 0010:vega10_power_gate_vce+0x26/0x50 [amdgpu] Call Trace: pp_set_powergating_by_smu+0x16a/0x2b0 [amdgpu

Re: [PATCH] drm/amdgpu: Fix crash on device remove/driver unload

2021-09-16 Thread Andrey Grodzovsky
On 2021-09-16 11:51 a.m., Lazar, Lijo wrote: On 9/16/2021 9:15 PM, Andrey Grodzovsky wrote: On 2021-09-16 4:20 a.m., Lazar, Lijo wrote: A minor comment below. On 9/16/2021 1:11 AM, Andrey Grodzovsky wrote: Crash: BUG: unable to handle page fault for address: 10e1 RIP: 0010

[PATCH 1/2] drm/amdgpu: Fix MMIO access page fault

2021-09-17 Thread Andrey Grodzovsky
Add more guards to MMIO access post device unbind/unplug Bug:https://bugs.archlinux.org/task/72092?project=1&order=dateopened&sort=desc&pagenum=1 Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c | 8 ++-- drivers/gpu/drm/amd/amdgpu/vcn

[PATCH 2/2] drm/amdgpu: Fix resume failures when device is gone

2021-09-17 Thread Andrey Grodzovsky
PCIe error recovery to avoid accessing registers. This allows to successfully complete pm resume sequence and finish pci remove. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu

Re: [PATCH 1/2] drm/amdgpu: Fix MMIO access page fault

2021-09-17 Thread Andrey Grodzovsky
wed-by: James Zhu Thanks & Best Regards! James On 2021-09-17 7:30 a.m., Andrey Grodzovsky wrote: Add more guards to MMIO access post device unbind/unplug Bug:https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugs.archlinux.org%2Ftask%2F72092%3Fproject%3D1%26order%3Ddat

Re: [PATCH 2/2] drm/amdgpu: Fix resume failures when device is gone

2021-09-17 Thread Andrey Grodzovsky
Ping Andrey On 2021-09-17 7:30 a.m., Andrey Grodzovsky wrote: Problem: When a device goes into suspend and is unplugged during it, all HW programming during resume fails leading to a bad SW during pci remove handling which follows. Because device is first resumed and only later removed we

Re: [PATCH] drm/amdkfd: fix svm_migrate_fini warning

2021-09-21 Thread Andrey Grodzovsky
In any case, once you converge on a solution please include the relevant ticket in the commit description - https://gitlab.freedesktop.org/drm/amd/-/issues/1718 Andrey On 2021-09-20 10:20 p.m., Felix Kuehling wrote: On 2021-09-20 at 5:55 p.m., Philip Yang wrote: Don't use devm_request_free_me

Re: [PATCH] drm/amdgpu: move amdgpu_virt_release_full_gpu to fini_early stage

2021-09-21 Thread Andrey Grodzovsky
Reviewed-by: Andrey Grodzovsky Andrey On 2021-09-21 9:11 a.m., Chen, Guchun wrote: [Public] Ping... Regards, Guchun -Original Message- From: Chen, Guchun Sent: Saturday, September 18, 2021 2:09 PM To: amd-gfx@lists.freedesktop.org; Koenig, Christian ; Pan, Xinhui ; Deucher

Re: [PATCH v2 1/2] drm/amdkfd: handle svm migrate init error

2021-09-21 Thread Andrey Grodzovsky
Series is Acked-by: Andrey Grodzovsky Andrey On 2021-09-21 2:53 p.m., Philip Yang wrote: If svm migration init failed to create pgmap for device memory, set pgmap type to 0 to disable device SVM support capability. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c

Re: [PATCH] drm/amd/amdgpu: Do irq_fini_hw after ip_fini_early

2021-09-29 Thread Andrey Grodzovsky
Can you test  this change with hotunplug tests in libdrm ? Since the tests are still in disabled mode until latest fixes propagate to drm-next upstream you will need to comment out https://gitlab.freedesktop.org/mesa/drm/-/blob/main/tests/amdgpu/hotunplug_tests.c#L65 I recently fixed a few regres

Re: [PATCH] drm/amdgpu: add missed write lock for pci detected state pci_channel_io_normal

2021-09-30 Thread Andrey Grodzovsky
On 2021-09-30 10:00 p.m., Guchun Chen wrote: When a PCI error state pci_channel_io_normal is detected, it will report PCI_ERS_RESULT_CAN_RECOVER status to PCI driver, and PCI driver will continue the execution of PCI resume callback report_resume by pci_walk_bridge, and the callback will go into

Re: [PATCH] drm/amdgpu: add missed write lock for pci detected state pci_channel_io_normal

2021-10-01 Thread Andrey Grodzovsky
No, scheduler restart and device unlock must take place in amdgpu_pci_resume (see struct pci_error_handlers for the various states of PCI recovery). So just add a flag (probably in amdgpu_device) so we can remember what pci_channel_state_t we came from (unfortunately it's not passed to us in am
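The suggestion in this reply, caching the channel state in the error-detected callback and consulting it in the resume callback, can be sketched in userspace. Everything below is hypothetical and for illustration only: the `model_*` types and functions are not amdgpu or PCI core APIs, and the sketch assumes that the lock is taken and the scheduler stopped only for the frozen state, which is what makes an unconditional unlock on resume unbalanced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's pci_channel_state_t values. */
enum model_channel_state {
    MODEL_IO_NORMAL,
    MODEL_IO_FROZEN,
    MODEL_IO_PERM_FAILURE,
};

/* Toy device state; cached_state plays the role of the suggested flag. */
struct model_adev {
    enum model_channel_state cached_state;
    bool reset_locked;
    bool sched_stopped;
};

/*
 * Error-detected callback: remember which state we came from.
 * Assumption: only the frozen state locks the device and stops the scheduler.
 */
static void model_pci_error_detected(struct model_adev *adev,
                                     enum model_channel_state state)
{
    assert(adev != NULL);
    adev->cached_state = state;
    if (state == MODEL_IO_FROZEN) {
        adev->reset_locked = true;
        adev->sched_stopped = true;
    }
}

/*
 * Resume callback: the channel state is not passed in, so use the cached
 * flag. Returns true only if it restarted the scheduler and unlocked.
 */
static bool model_pci_resume(struct model_adev *adev)
{
    assert(adev != NULL);
    if (adev->cached_state != MODEL_IO_FROZEN)
        return false; /* nothing was locked or stopped, so nothing to undo */
    adev->sched_stopped = false;
    adev->reset_locked = false;
    return true;
}
```

In this model a resume that follows a "normal" error state is a no-op, so the unlock always pairs with a lock taken earlier in the same recovery sequence.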

Re: Lockdep spalt on killing a processes

2021-10-01 Thread Andrey Grodzovsky
From what I see here you are supposed to have an actual deadlock and not only a warning, sched_fence->finished is first signaled from within hw fence done callback (drm_sched_job_done_cb) but then again from within its own callback (drm_sched_entity_kill_jobs_cb) and so looks like same fence object is

Re: Lockdep spalt on killing a processes

2021-10-04 Thread Andrey Grodzovsky
scheduler fence. Daniel is right that this needs an irq_work struct to handle this properly. Christian. On 01.10.21 at 17:10, Andrey Grodzovsky wrote: From what I see here you are supposed to have an actual deadlock and not only a warning, sched_fence->finished is first signaled from within

Re: [PATCH] drm/amdgpu: handle the case of pci_channel_io_frozen only in amdgpu_pci_resume

2021-10-04 Thread Andrey Grodzovsky
only continue the execution in amdgpu_pci_resume when it's pci_channel_io_frozen. Fixes: c9a6b82f45e2("drm/amdgpu: Implement DPC recovery") Suggested-by: Andrey Grodzovsky Signed-off-by: Guchun Chen --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 1 + drivers/gpu/drm/amd/amdgpu/

Re: [PATCH 1/1] drm/amdgpu: recover gart table at resume

2021-10-19 Thread Andrey Grodzovsky
On 2021-10-19 9:22 a.m., Nirmoy Das wrote: Get rid of pin/unpin and evict and swap back gart page table which should make things less likely to break. +Christian Could you guys also clarify what exactly are the stability issues this fixes? Andrey Also remove 2nd call to amdgpu_devic

Re: [PATCH 1/1] drm/amdgpu: recover gart table at resume

2021-10-19 Thread Andrey Grodzovsky
On 2021-10-19 11:54 a.m., Christian König wrote: On 19.10.21 at 17:41, Andrey Grodzovsky wrote: On 2021-10-19 9:22 a.m., Nirmoy Das wrote: Get rid of pin/unpin and evict and swap back gart page table which should make things less likely to break. +Christian Could you guys also clarify

Re: Lockdep spalt on killing a processes

2021-10-20 Thread Andrey Grodzovsky
can and cannot be done there. Andrey On 01.10.21 at 17:10, Andrey Grodzovsky wrote: From what I see here you are supposed to have an actual deadlock and not only a warning: sched_fence->finished is first signaled from within the hw fence done callback (drm_sched_job_done_cb) but then agai

Re: FW: [PATCH 1/3] drm/amdgpu: fix a potential memory leak in amdgpu_device_fini_sw()

2021-10-21 Thread Andrey Grodzovsky
On 2021-10-21 3:19 a.m., Yu, Lang wrote: [AMD Official Use Only] -Original Message- From: Yu, Lang Sent: Thursday, October 21, 2021 3:18 PM To: Grodzovsky, Andrey Cc: Deucher, Alexander ; Koenig, Christian ; Huang, Ray ; Yu, Lang Subject: [PATCH 1/3] drm/amdgpu: fix a potential me

Re: FW: [PATCH 1/3] drm/amdgpu: fix a potential memory leak in amdgpu_device_fini_sw()

2021-10-22 Thread Andrey Grodzovsky
.. ring->adev->rings[ring->idx] = NULL; } Regards, Lang Got it, Looks good to me. Reviewed-by: Andrey Grodzovsky Andrey Fixes: 72c8c97b1522 ("drm/amdgpu: Split amdgpu_device_fini into early and late") Signed-off-by: Lang Yu --- drivers/gpu/drm/amd/amdgp

Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow

2021-10-22 Thread Andrey Grodzovsky
What do you mean by underflow in this case? You mean use after free because of extra dma_fence_put()? On 2021-10-22 4:14 a.m., JingWen Chen wrote: ping On 2021/10/22 11:33 AM, Jingwen Chen wrote: [Why] In advance tdr mode, the real bad job will be resubmitted twice, while in drm_sched_res

Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow

2021-10-25 Thread Andrey Grodzovsky
On 2021-10-24 10:56 p.m., JingWen Chen wrote: On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote: What do you mean by underflow in this case? You mean use after free because of extra dma_fence_put()? yes Then maybe update the description because 'underflow' is very confusing

Re: Lockdep spalt on killing a processes

2021-10-25 Thread Andrey Grodzovsky
Adding back Daniel (somehow he got off the addresses list) and Chris who worked a lot in this area. On 2021-10-21 2:34 a.m., Christian König wrote: On 20.10.21 at 21:32, Andrey Grodzovsky wrote: On 2021-10-04 4:14 a.m., Christian König wrote: The problem is a bit different. The callback

Re: Lockdep spalt on killing a processes

2021-10-25 Thread Andrey Grodzovsky
know if I am still missing some point of yours. Andrey Regards, Christian. On 25.10.21 at 21:10, Andrey Grodzovsky wrote: Adding back Daniel (somehow he got off the addresses list) and Chris who worked a lot in this area. On 2021-10-21 2:34 a.m., Christian König wrote: On 20.10.21 at 21

Re: Lockdep spalt on killing a processes

2021-10-27 Thread Andrey Grodzovsky
On 2021-10-26 6:54 a.m., Christian König wrote: On 26.10.21 at 04:33, Andrey Grodzovsky wrote: On 2021-10-25 3:56 p.m., Christian König wrote: In general I'm all there to get this fixed, but there is one major problem: Drivers don't expect the lock to be dropped. I am probab

Re: Lockdep spalt on killing a processes

2021-10-27 Thread Andrey Grodzovsky
On 2021-10-27 10:34 a.m., Christian König wrote: On 27.10.21 at 16:27, Andrey Grodzovsky wrote: [SNIP] Let me please know if I am still missing some point of yours. Well, I mean we need to be able to handle this for all drivers. For sure, but as I said above in my opinion we need to

Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow

2021-10-27 Thread Andrey Grodzovsky
On 2021-10-25 10:57 p.m., JingWen Chen wrote: On 2021/10/25 11:18 PM, Andrey Grodzovsky wrote: On 2021-10-24 10:56 p.m., JingWen Chen wrote: On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote: What do you mean by underflow in this case? You mean use after free because of extra dma_fence_put

Re: Lockdep spalt on killing a processes

2021-10-27 Thread Andrey Grodzovsky
On 2021-10-27 10:50 a.m., Christian König wrote: On 27.10.21 at 16:47, Andrey Grodzovsky wrote: On 2021-10-27 10:34 a.m., Christian König wrote: On 27.10.21 at 16:27, Andrey Grodzovsky wrote: [SNIP] Let me please know if I am still missing some point of yours. Well, I mean we need

Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow

2021-10-28 Thread Andrey Grodzovsky
On 2021-10-27 10:43 p.m., JingWen Chen wrote: On 2021/10/28 3:43 AM, Andrey Grodzovsky wrote: On 2021-10-25 10:57 p.m., JingWen Chen wrote: On 2021/10/25 11:18 PM, Andrey Grodzovsky wrote: On 2021-10-24 10:56 p.m., JingWen Chen wrote: On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote: What do

Re: Lockdep spalt on killing a processes

2021-10-28 Thread Andrey Grodzovsky
On 2021-10-27 3:58 p.m., Andrey Grodzovsky wrote: On 2021-10-27 10:50 a.m., Christian König wrote: On 27.10.21 at 16:47, Andrey Grodzovsky wrote: On 2021-10-27 10:34 a.m., Christian König wrote: On 27.10.21 at 16:27, Andrey Grodzovsky wrote: [SNIP] Let me please know if I am still

Re: Lockdep spalt on killing a processes

2021-11-01 Thread Andrey Grodzovsky
Pushed to drm-misc-next Andrey On 2021-10-29 3:07 a.m., Christian König wrote: Attached a patch. Give it a try please, I tested it on my side and tried to generate the right conditions to trigger this code path by repeatedly submitting commands while issuing GPU reset to stop the scheduler

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-11-10 Thread Andrey Grodzovsky
On 2021-11-10 5:09 a.m., Christian König wrote: On 10.11.21 at 10:50, Daniel Vetter wrote: On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote: On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter wrote: On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote: I stumbled across this threa
