On 2021-08-05 4:31 a.m., Jingwen Chen wrote:
From: Jack Zhang
Why: Previously the hw fence was allocated separately from the job.
This caused historical lifetime issues and corner cases.
The ideal situation is to use the fence to manage both the job's
and the fence's lifetime, and to simplify the design of the gpu-scheduler.
On 2021-08-05 4:31 a.m., Jingwen Chen wrote:
[Why]
After embedding the hw_fence into amdgpu_job, we need to add TDR support
for this feature.
[How]
1. Add a resubmit_flag for resubmitted jobs.
2. Clear the job fence from RCU and force-complete VM flush fences in
pre_asic_reset.
3. Skip dma_fence_get for r
Reviewed-by: Andrey Grodzovsky
Andrey
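For context, a minimal sketch of the embedded-fence idea under review (structure and names are illustrative, not the actual amdgpu patch): the hw fence lives inside the job, so both share one allocation and the job's memory is only released when the last fence reference is dropped.

#include <linux/dma-fence.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct sketch_job {
	struct dma_fence hw_fence;	/* embedded, no separate allocation */
	/* ... rest of the job state ... */
};

/* would be wired up as the .release hook of the fence's dma_fence_ops */
static void sketch_hw_fence_release(struct dma_fence *f)
{
	struct sketch_job *job = container_of(f, struct sketch_job, hw_fence);

	/* last fence reference gone: job and fence are freed together */
	kfree_rcu(job, hw_fence.rcu);
}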
On 2021-08-09 11:22 p.m., Jingwen Chen wrote:
From: Jack Zhang
Why: Previously the hw fence was allocated separately from the job.
This caused historical lifetime issues and corner cases.
The ideal situation is to use the fence to manage both the job's
and the fence's lif
On 2021-08-17 12:28 a.m., Jingwen Chen wrote:
[Why]
For a bailing job, this commit will delete it from the pending list, thus the
bailing job will never have a chance to be resubmitted, even in advance
TDR mode.
[How]
After embedding the hw_fence into amdgpu_job is done, the race condition that
this commit tr
Looks reasonable to me.
Reviewed-by: Andrey Grodzovsky
Andrey
On 2021-08-17 5:50 a.m., YuBiao Wang wrote:
[Why]
In some cases when we unload the driver, a warning call trace
shows up in vram_mgr_fini, which claims that the LRU is not empty, caused
by the TTM BOs on the delayed-delete queue.
[How]
We
On 2021-08-02 1:16 a.m., Guchun Chen wrote:
In amdgpu_fence_driver_hw_fini there is no need to call drm_sched_fini to stop
the scheduler in the S3 test; otherwise, fence-related failures will show up
after resume. To fix this and for a better cleanup, move drm_sched_fini
from fence_hw_fini to fence_sw_fini, as
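A rough sketch of the resulting split (function bodies heavily simplified, not the exact patch): hw_fini only quiesces the hardware side so jobs can be picked up again after resume, while sw_fini does the final teardown, which is where drm_sched_fini() belongs.

void sketch_fence_driver_hw_fini(struct amdgpu_device *adev)
{
	/* disable fence interrupts / stop the rings here; the scheduler
	 * is left alive so pending jobs survive an S3 cycle */
}

void sketch_fence_driver_sw_fini(struct amdgpu_device *adev)
{
	int i;

	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
		struct amdgpu_ring *ring = adev->rings[i];

		if (!ring || !ring->fence_drv.initialized)
			continue;

		/* only at final teardown is the scheduler destroyed */
		drm_sched_fini(&ring->sched);
	}
}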
On 2021-08-18 10:02 a.m., Alex Deucher wrote:
+ dri-devel
Since scheduler is a shared component, please add dri-devel on all
scheduler patches.
On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen wrote:
[Why]
For a bailing job, this commit will delete it from the pending list, thus the
bailing job will ne
On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
On 2021-08-18 10:02 a.m., Alex Deucher wrote:
+ dri-devel
Since scheduler is a shared component, please add dri-devel on all
scheduler patches.
On Wed, Aug 18, 2021 at 7:21 AM
On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
On 2021-08-18 10:02 a.m., Alex Deucher wrote:
+ dri-devel
On 2021-08-19 5:30 a.m., Daniel Vetter wrote:
On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
On Wed, Aug 18
-Original Message-
From: Daniel Vetter
Sent: Thursday, August 19, 2021 5:31 PM
To: Grodzovsky, Andrey
Cc: Daniel Vetter ; Alex Deucher ; Chen, JingWen
; Maling list - DRI developers ; amd-gfx list
; Liu, Monk ; Koenig, Christian
Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job
r-handle of fence driver
fini in s3 test (v2)
Please go ahead. Thanks!
Alex
On Thu, Aug 19, 2021 at 8:05 AM Mike Lothian
wrote:
Hi
Do I need to open a new bug report for this?
Cheers
Mike
On Wed, 18 Aug 2021 at 06:26, Andrey Grodzovsky
wrote:
On 2021-08-02 1:1
f they already
* signaled.
Thanks
--
Monk Liu | Cloud-GPU Core team
--
-Original Message-
From: Daniel Vetter
Sent: Thursday, August 19, 2021 5:31 PM
To: Grodzovsky, Andrey
Cc: Daniel Vetter ; Alex De
On 2021-08-24 10:46 a.m., Andrey Grodzovsky wrote:
On 2021-08-24 5:51 a.m., Monk Liu wrote:
the original logic is wrong in that the timeout will not be retriggered
after the previous job signaled, and that leads to the scenario where all
jobs in the same scheduler share the same timeout timer
On 2021-08-24 5:51 a.m., Monk Liu wrote:
the original logic is wrong in that the timeout will not be retriggered
after the previous job signaled, and that leads to the scenario where all
jobs in the same scheduler share the same timeout timer from the very
beginning job in this scheduler, which is wro
A bunch of fixes to enable passing the hotplug tests I previously added
here [1] with the latest code.
Once accepted I will enable the tests on libdrm side.
[1] - https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/172
Andrey Grodzovsky (4):
drm/amdgpu: Move flush VCE idle_work during HW fini
drm
This list will be used to capture all non-VRAM BOs not
on the LRU, so when the device is hot unplugged we can iterate
the list and unmap DMA mappings before the device is removed.
Signed-off-by: Andrey Grodzovsky
Suggested-by: Christian König
---
drivers/gpu/drm/ttm/ttm_bo.c | 24
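A rough sketch of how such a list could be consumed on hot unplug (illustrative only; the list placement and linking via the existing bo->lru head are assumptions of this sketch, and later revisions of the series move the list to the TTM device):

#include <drm/ttm/ttm_bo_api.h>
#include <drm/ttm/ttm_device.h>
#include <drm/ttm/ttm_tt.h>

void sketch_ttm_clear_dma_mappings(struct ttm_device *bdev,
				   struct list_head *pinned)
{
	struct ttm_buffer_object *bo;

	spin_lock(&bdev->lru_lock);
	while ((bo = list_first_entry_or_null(pinned,
					      struct ttm_buffer_object, lru))) {
		list_del_init(&bo->lru);
		ttm_bo_get(bo);		/* keep the BO alive while unmapping */
		spin_unlock(&bdev->lru_lock);

		if (bo->ttm)
			ttm_tt_unpopulate(bo->bdev, bo->ttm); /* drops the DMA mappings */

		ttm_bo_put(bo);
		spin_lock(&bdev->lru_lock);
	}
	spin_unlock(&bdev->lru_lock);
}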
Attempts to powergate after the device is removed lead to a crash.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c | 1 -
drivers/gpu/drm/amd/amdgpu/vce_v2_0.c | 4
drivers/gpu/drm/amd/amdgpu/vce_v3_0.c | 5 -
drivers/gpu/drm/amd/amdgpu/vce_v4_0.c | 2 ++
4
Handle all DMA IOMMU group related dependencies before the
group is removed, so we do not try to access it after free.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c| 50 ++
drivers/gpu/drm/amd
To support libdrm tests.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6400259a7c4b..c2fdf67ff551 100644
issue here too.
https://lists.freedesktop.org/archives/amd-gfx/2021-August/067972.html
https://lists.freedesktop.org/archives/amd-gfx/2021-August/067967.html
BR
Evan
-Original Message-
From: amd-gfx On Behalf Of
Andrey Grodzovsky
Sent: Wednesday, August 25, 2021 5:01 AM
To: d
On 2021-08-25 2:43 a.m., Christian König wrote:
On 24.08.21 at 23:01, Andrey Grodzovsky wrote:
Handle all DMA IOMMU group related dependencies before the
group is removed, so we do not try to access it after free.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
On 2021-08-25 8:11 a.m., Christian König wrote:
No, this would break that logic here.
See, drm_sched_start_timeout() can be called multiple times; this is
intentional and very important!
The logic in queue_delayed_work() makes sure that the timer is only
started once and then never again.
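For reference, this is roughly what that helper looks like in this era (a sketch; field names as in drm/scheduler): the delayed work is only queued when it is not already pending, so repeated calls never rearm a running timer.

#include <drm/gpu_scheduler.h>
#include <linux/sched.h>
#include <linux/workqueue.h>

static void sched_start_timeout_sketch(struct drm_gpu_scheduler *sched)
{
	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
	    !list_empty(&sched->pending_list))
		/* no-op (returns false) if work_tdr is already queued */
		schedule_delayed_work(&sched->work_tdr, sched->timeout);
}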
On 2021-08-25 10:31 p.m., Liu, Monk wrote:
[AMD Official Use Only]
Hi Andrey
I'm not quite sure if I read you correctly
Seems to me you can only do it for an empty pending list, otherwise you risk
cancelling a legit new timer that was started by the next job, or not restarting
the timer at all sin
On 2021-08-26 12:55 a.m., Liu, Monk wrote:
[AMD Official Use Only]
But for the timer-pending case (the common case) your mod_delayed_work will effectively
do exactly the same if you don't use per-job TTLs - you mod it to the
sched->timeout value, which resets the pending timer to count from 0 again.
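To make the contrast concrete, a small illustrative sketch (system_wq used for simplicity; headers as in the earlier sketch): queue_delayed_work() leaves an already-pending TDR timer untouched, while mod_delayed_work() rearms it so the countdown restarts from zero.

static void tdr_timer_contrast_sketch(struct drm_gpu_scheduler *sched)
{
	/* no-op when the TDR work is already pending (returns false) */
	bool queued = queue_delayed_work(system_wq, &sched->work_tdr,
					 sched->timeout);

	if (!queued)
		/* forces a rearm: the pending countdown restarts from 0 */
		mod_delayed_work(system_wq, &sched->work_tdr, sched->timeout);
}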
Ping
Andrey
On 2021-08-25 11:36 a.m., Andrey Grodzovsky wrote:
On 2021-08-25 2:43 a.m., Christian König wrote:
On 24.08.21 at 23:01, Andrey Grodzovsky wrote:
Handle all DMA IOMMU group related dependencies before the
group is removed, so we do not try to access it after free.
Signed-off-by
On 2021-08-26 12:55 a.m., Monk Liu wrote:
issue:
in cleanup_job the cancel_delayed_work will cancel a TO timer
even though its corresponding job is still running.
fix:
do not cancel the timer in cleanup_job; instead do the cancelling
only when the heading job is signaled, and if there is a "next"
assigned to them.
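Spelled out as a sketch of drm_sched_get_cleanup_job() with that fix applied (simplified; locking details and timestamp handling trimmed, headers assumed): the TO timer is cancelled only once the head job has really signaled, and is then re-armed for the following job.

static struct drm_sched_job *
get_cleanup_job_sketch(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *job;

	spin_lock(&sched->job_list_lock);
	job = list_first_entry_or_null(&sched->pending_list,
				       struct drm_sched_job, list);
	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
		/* the head job is really done: drop it and its TO timer */
		list_del_init(&job->list);
		cancel_delayed_work(&sched->work_tdr);
		/* re-arm the timer for the next job, if there is one */
		if (!list_empty(&sched->pending_list))
			schedule_delayed_work(&sched->work_tdr, sched->timeout);
	} else {
		job = NULL;
	}
	spin_unlock(&sched->job_list_lock);

	return job;
}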
Signed-off-by: Andrey Grodzovsky
Suggested-by: Christian König
---
drivers/gpu/drm/ttm/ttm_bo.c | 30 ++
drivers/gpu/drm/ttm/ttm_resource.c | 1 +
include/drm/ttm/ttm_resource.h | 1 +
3 files changed, 28 insertions(+), 4 deletions(-)
diff
Handle all DMA IOMMU group related dependencies before the
group is removed, so we do not try to access it after free.
v2:
Move the actual handling function to TTM
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a
IOMMU handling to the TTM layer.
Andrey Grodzovsky (4):
drm/ttm: Create pinned list
drm/ttm: Clear all DMA mappings on demand
drm/amdgpu: drm/amdgpu: Handle IOMMU enabled case
drm/amdgpu: Add a UAPI flag for hot plug/unplug
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +
drivers/gpu/drm
To support libdrm tests.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6400259a7c4b..c2fdf67ff551 100644
Used by drivers supporting hot unplug to handle all
DMA IOMMU group related dependencies before the group
is removed during device removal, so we do not try to access
it after free once the last device pointer from user space
is dropped.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/ttm
Attached a quick patch for per-job TTL calculation to make the next
timer expiration more precise. It's on top of the patch in this thread. Let me
know if this makes sense.
Andrey
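Not the attached patch itself, but a sketch of the general idea (head->submit_ts is a hypothetical field recording jiffies at submission, not an existing drm_sched_job member): arm the timer with only the time the new head job actually has left rather than a full sched->timeout.

#include <drm/gpu_scheduler.h>
#include <linux/jiffies.h>

static void arm_tdr_for_head_sketch(struct drm_gpu_scheduler *sched,
				    struct drm_sched_job *head)
{
	/* head->submit_ts is assumed/hypothetical for this sketch */
	unsigned long deadline = head->submit_ts + sched->timeout;
	unsigned long remaining = time_after(jiffies, deadline) ?
				  0 : deadline - jiffies;

	/* expire after the head job's remaining TTL, not a fresh timeout */
	mod_delayed_work(system_wq, &sched->work_tdr, remaining);
}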
On 2021-08-26 10:03 a.m., Andrey Grodzovsky wrote:
On 2021-08-26 12:55 a.m., Monk Liu wrote:
issue:
in cleanu
eedesktop.org
Subject: Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)
Attached a quick patch for per-job TTL calculation to make the next
timer expiration more precise. It's on top of the patch in this thread. Let me know if this
makes sense.
Andrey
On 2021-08-26 10:03 a.m., Andrey
hristian.
On 26.08.21 at 22:14, Andrey Grodzovsky wrote:
Attached a quick patch for per-job TTL calculation to make the next
timer expiration more precise. It's on top of the patch in this thread. Let me
know if this makes sense.
Andrey
On 2021-08-26 10:03 a.m., Andrey Grodzovsky wrote:
On 2
Ping
Andrey
On 2021-08-26 1:27 p.m., Andrey Grodzovsky wrote:
A bunch of fixes to enable passing the hotplug tests I previously added
here [1] with the latest code.
Once accepted I will enable the tests on libdrm side.
[1] - https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/172
v2:
Dropping VCE
serted into the
ring buffer, but rather when it starts processing.
When processing starts is a bit loosely defined, but just starting the
timer when the previous job completes should be good enough.
Christian.
On 27.08.21 at 15:57, Andrey Grodzovsky wrote:
The TS represents the point in time w
It's still better than starting the timer when pushing the job to the
ring buffer, because that is completely off.
Christian.
On 27.08.21 at 20:22, Andrey Grodzovsky wrote:
As I mentioned to Monk before - what about cases such as in this test
-
https://gitlab.freedesktop.org/me
I don't think it will start/stop twice because
amdgpu_fence_driver_hw_fini/init is not called during reset.
I am worried about calling drm_sched_start without calling
drm_sched_resubmit_job first, since that is
the place where the jobs are actually restarted. Also calling
drm_sched_start with fal
IOMMU handling to the TTM layer.
v3:
Move the pinned list to the ttm device, and a few other changes.
Andrey Grodzovsky (4):
drm/ttm: Create pinned list
drm/ttm: Clear all DMA mappings on demand
drm/amdgpu: drm/amdgpu: Handle IOMMU enabled case
drm/amdgpu: Add a UAPI flag for hot plug/unplug
drivers/gpu
This list will be used to capture all non-VRAM BOs not
on the LRU, so when the device is hot unplugged we can iterate
the list and unmap DMA mappings before the device is removed.
v2: Rename function to ttm_bo_move_to_pinned
v3: Move the pinned list to the ttm device
Signed-off-by: Andrey Grodzovsky
Suggested
To support libdrm tests.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6400259a7c4b..c2fdf67ff551 100644
Switch to ttm_tt_unpopulate
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/ttm/ttm_device.c | 47
include/drm/ttm/ttm_device.h | 1 +
2 files changed, 48 insertions(+)
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index
Handle all DMA IOMMU group related dependencies before the
group is removed, so we do not try to access it after free.
v2:
Move the actual handling function to TTM
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a
On 2021-08-30 4:58 a.m., Christian König wrote:
On 27.08.21 at 22:39, Andrey Grodzovsky wrote:
This list will be used to capture all non-VRAM BOs not
on the LRU, so when the device is hot unplugged we can iterate
the list and unmap DMA mappings before the device is removed.
v2: Rename function to
On 2021-08-30 12:51 p.m., Christian König wrote:
On 30.08.21 at 16:16, Andrey Grodzovsky wrote:
On 2021-08-30 4:58 a.m., Christian König wrote:
On 27.08.21 at 22:39, Andrey Grodzovsky wrote:
This list will be used to capture all non-VRAM BOs not
on the LRU, so when the device is hot unplugged we
On 2021-08-30 1:05 p.m., Christian König wrote:
On 30.08.21 at 19:02, Andrey Grodzovsky wrote:
On 2021-08-30 12:51 p.m., Christian König wrote:
On 30.08.21 at 16:16, Andrey Grodzovsky wrote:
On 2021-08-30 4:58 a.m., Christian König wrote:
On 27.08.21 at 22:39, Andrey
empty before suspend.
v2: Call drm_sched_resubmit_job before drm_sched_start to
restart jobs from the pending list.
Suggested-by: Andrey Grodzovsky
Suggested-by: Christian König
Signed-off-by: Guchun Chen
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8
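For reference, a sketch of the resume-side ordering this v2 describes (loop shape borrowed from typical amdgpu recovery code; error handling omitted): pending jobs are pushed back to the hardware before the scheduler is allowed to run again.

static void resume_schedulers_sketch(struct amdgpu_device *adev)
{
	int i;

	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
		struct amdgpu_ring *ring = adev->rings[i];

		if (!ring || !ring->sched.thread)
			continue;

		/* re-push everything still on the pending list to the HW */
		drm_sched_resubmit_jobs(&ring->sched);
		/* only then let the scheduler run again */
		drm_sched_start(&ring->sched, true);
	}
}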
On 2021-08-30 11:24 p.m., Pan, Xinhui wrote:
[AMD Official Use Only]
Unreserve the root BO before returning, otherwise the next allocation will deadlock.
Signed-off-by: xinhui pan
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 11 +--
1 file changed, 5 insertions(+), 6
It says patch [2/2] but I can't find patch 1.
On 2021-08-31 6:35 a.m., Monk Liu wrote:
tested-by: jingwen chen
Signed-off-by: Monk Liu
Signed-off-by: jingwen chen
---
drivers/gpu/drm/scheduler/sched_main.c | 24
1 file changed, 4 insertions(+), 20 deletions(-)
di
On 2021-08-31 10:03 a.m., Daniel Vetter wrote:
On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote:
It says patch [2/2] but I can't find patch 1.
On 2021-08-31 6:35 a.m., Monk Liu wrote:
tested-by: jingwen chen
Signed-off-by: Monk Liu
Signed-off-by: ji
On 2021-08-31 10:38 a.m., Daniel Vetter wrote:
On Tue, Aug 31, 2021 at 10:20:40AM -0400, Andrey Grodzovsky wrote:
On 2021-08-31 10:03 a.m., Daniel Vetter wrote:
On Tue, Aug 31, 2021 at 09:53:36AM -0400, Andrey Grodzovsky wrote:
It says patch [2/2] but I can't find patch 1.
On 20
On 2021-08-31 9:11 a.m., Daniel Vetter wrote:
On Thu, Aug 26, 2021 at 11:04:14AM +0200, Daniel Vetter wrote:
On Thu, Aug 19, 2021 at 11:25:09AM -0400, Andrey Grodzovsky wrote:
On 2021-08-19 5:30 a.m., Daniel Vetter wrote:
On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote
On 2021-08-31 12:01 p.m., Luben Tuikov wrote:
On 2021-08-31 11:23, Andrey Grodzovsky wrote:
On 2021-08-31 10:38 a.m., Daniel Vetter wrote:
On Tue, Aug 31, 2021 at 10:20:40AM -0400, Andrey Grodzovsky wrote:
On 2021-08-31 10:03 a.m., Daniel Vetter wrote:
On Tue, Aug 31, 2021 at 09:53:36AM
I will answer everything here -
On 2021-08-31 9:58 p.m., Liu, Monk wrote:
[AMD Official Use Only]
In the previous discussion, you guys stated that we should drop the
“kthread_should_park” in cleanup_job.
@@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct
drm_gpu_scheduler *sched)
{
On 2021-09-01 12:25 a.m., Jingwen Chen wrote:
On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote:
I will answer everything here -
On 2021-08-31 9:58 p.m., Liu, Monk wrote:
[AMD Official Use Only]
In the previous discussion, you guys stated that we should
On 2021-09-01 12:40 a.m., Jingwen Chen wrote:
On Wed Sep 01, 2021 at 12:28:59AM -0400, Andrey Grodzovsky wrote:
On 2021-09-01 12:25 a.m., Jingwen Chen wrote:
On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote:
I will answer everything here -
On 2021-08-31 9:58 p.m., Liu, Monk
On 2021-09-02 10:28 a.m., Daniel Vetter wrote:
On Tue, Aug 31, 2021 at 02:24:52PM -0400, Andrey Grodzovsky wrote:
On 2021-08-31 9:11 a.m., Daniel Vetter wrote:
On Thu, Aug 26, 2021 at 11:04:14AM +0200, Daniel Vetter wrote:
On Thu, Aug 19, 2021 at 11:25:09AM -0400, Andrey Grodzovsky wrote
Please add a V2 tag in the description explaining what the delta from V1 was.
Other than that this looks good to me.
Andrey
On 2021-09-12 7:48 p.m., xinhui pan wrote:
Direct IB submission should be exclusive. So use write lock.
Signed-off-by: xinhui pan
---
drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.
On 2021-09-14 9:42 p.m., xinhui pan wrote:
We hit a soft hang while doing a memory pressure test on one NUMA system.
After a quick look, this is because KFD invalidates/validates userptr memory
frequently with the process_info lock held.
perf top says the following:
75.81% [kernel] [k] __srcu_read_unlock
Do
I think you missed 'reply all', so bringing this back to the public list.
On 2021-09-14 11:40 p.m., Pan, Xinhui wrote:
[AMD Official Use Only]
perf says it is the lock addl $0x0,-0x4(%rsp)
details are below. the contention is maybe huge.
Yes - that makes sense to me too, as long as the lock here is some
On 2021-09-15 2:42 a.m., Pan, Xinhui wrote:
[AMD Official Use Only]
Andrey
I hit a panic with this plug/unplug test without this patch.
Can you please tell which ASIC you are using, which kernel branch, and
what the tip commit is?
But as we add enter/exit in all its callers, maybe it wo
On 2021-09-15 9:57 a.m., Christian König wrote:
On 15.09.21 at 15:52, Andrey Grodzovsky wrote:
On 2021-09-15 2:42 a.m., Pan, Xinhui wrote:
[AMD Official Use Only]
Andrey
I hit a panic with this plug/unplug test without this patch.
Can you please tell which ASIC you are using and which
Pushed
Andrey
On 2021-09-15 7:45 a.m., Christian König wrote:
Yes, I think so as well. Andrey can you push this?
Christian.
On 15.09.21 at 00:59, Grodzovsky, Andrey wrote:
AFAIK this one is independent.
Christian, can you confirm ?
Andrey
--
spend
ee6679aaa61c drm/amdgpu: add missing cleanups for Polaris12 UVD/VCE on suspend
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/uvd_v3_1.c | 24 ---
drivers/gpu/drm/amd/amdgpu/uvd_v4_2.c | 24 ---
drivers/gpu/drm/amd/amdgpu/uvd_v5_0.c | 24 ---
dr
Why:
DC core is being released from DM before it is referenced
by the hpd_rx wq destruction code.
How: Move hpd_rx destruction before DC core destruction.
Signed-off-by: Andrey Grodzovsky
---
.../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 24 +--
1 file changed, 12 inser
I fixed 2 regressions on the latest code, applied your patch on top and
passed the libdrm tests
on Vega 10. You can pick up those 2 patches and try them too if you have time.
In any case -
Reviewed-and-tested-by: Andrey Grodzovsky
Andrey
On 2021-09-15 2:37 a.m., xinhui pan wrote:
We hit soft hang while
On 2021-09-16 4:20 a.m., Lazar, Lijo wrote:
A minor comment below.
On 9/16/2021 1:11 AM, Andrey Grodzovsky wrote:
Crash:
BUG: unable to handle page fault for address: 10e1
RIP: 0010:vega10_power_gate_vce+0x26/0x50 [amdgpu]
Call Trace:
pp_set_powergating_by_smu+0x16a/0x2b0 [amdgpu
On 2021-09-16 11:51 a.m., Lazar, Lijo wrote:
On 9/16/2021 9:15 PM, Andrey Grodzovsky wrote:
On 2021-09-16 4:20 a.m., Lazar, Lijo wrote:
A minor comment below.
On 9/16/2021 1:11 AM, Andrey Grodzovsky wrote:
Crash:
BUG: unable to handle page fault for address: 10e1
RIP: 0010
Add more guards to MMIO access post device
unbind/unplug
Bug:https://bugs.archlinux.org/task/72092?project=1&order=dateopened&sort=desc&pagenum=1
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c | 8 ++--
drivers/gpu/drm/amd/amdgpu/vcn
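In sketch form, the guard pattern these patches add around MMIO accesses (the surrounding function is illustrative; drm_dev_enter()/drm_dev_exit() are the real DRM primitives):

#include <drm/drm_drv.h>

static void hw_block_stop_sketch(struct amdgpu_ring *ring)
{
	int idx;

	/* bail out if the underlying DRM device is already unplugged */
	if (!drm_dev_enter(adev_to_drm(ring->adev), &idx))
		return;

	/* ... the actual MMIO programming (register writes) goes here ... */

	drm_dev_exit(idx);
}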
PCIe error recovery to avoid
accessing registers. This allows the pm resume sequence to complete
successfully and the pci remove to finish.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4
1 file changed, 4 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu
Reviewed-by: James Zhu
Thanks & Best Regards!
James
On 2021-09-17 7:30 a.m., Andrey Grodzovsky wrote:
Add more guards to MMIO access post device
unbind/unplug
Bug:https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugs.archlinux.org%2Ftask%2F72092%3Fproject%3D1%26order%3Ddat
Ping
Andrey
On 2021-09-17 7:30 a.m., Andrey Grodzovsky wrote:
Problem:
When the device goes into suspend and is unplugged during it,
then all HW programming during resume fails, leading
to a bad SW state during the pci remove handling which follows.
Because the device is first resumed and only later removed,
we
In any case, once you converge on a solution please include
the relevant ticket in the commit description -
https://gitlab.freedesktop.org/drm/amd/-/issues/1718
Andrey
On 2021-09-20 10:20 p.m., Felix Kuehling wrote:
On 2021-09-20 at 5:55 p.m., Philip Yang wrote:
Don't use devm_request_free_me
Reviewed-by: Andrey Grodzovsky
Andrey
On 2021-09-21 9:11 a.m., Chen, Guchun wrote:
[Public]
Ping...
Regards,
Guchun
-Original Message-
From: Chen, Guchun
Sent: Saturday, September 18, 2021 2:09 PM
To: amd-gfx@lists.freedesktop.org; Koenig, Christian ; Pan, Xinhui
; Deucher
Series is Acked-by: Andrey Grodzovsky
Andrey
On 2021-09-21 2:53 p.m., Philip Yang wrote:
If svm migration init failed to create pgmap for device memory, set
pgmap type to 0 to disable device SVM support capability.
Signed-off-by: Philip Yang
---
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
Can you test this change with the hotunplug tests in libdrm?
Since the tests are still disabled until the latest fixes propagate
to drm-next upstream, you will need to comment out
https://gitlab.freedesktop.org/mesa/drm/-/blob/main/tests/amdgpu/hotunplug_tests.c#L65
I recently fixed a few regres
On 2021-09-30 10:00 p.m., Guchun Chen wrote:
When a PCI error state pci_channel_io_normal is detected, it will
report PCI_ERS_RESULT_CAN_RECOVER status to the PCI driver, and the PCI driver
will continue the execution of the PCI resume callback report_resume via
pci_walk_bridge, and the callback will go into
No, the scheduler restart and device unlock must take place
in amdgpu_pci_resume (see struct pci_error_handlers for the various
states of PCI recovery). So just add a flag (probably in amdgpu_device)
so we can remember what pci_channel_state_t we came from (unfortunately
it's not passed to us in am
From what I see here you are supposed to have an actual deadlock and not only
a warning: sched_fence->finished is first signaled from within the
hw fence done callback (drm_sched_job_done_cb) but then again from
within its own callback (drm_sched_entity_kill_jobs_cb), and so it
looks like the same fence object is
scheduler fence.
Daniel is right that this needs an irq_work struct to handle this
properly.
Christian.
On 01.10.21 at 17:10, Andrey Grodzovsky wrote:
From what I see here you are supposed to have an actual deadlock and not
only a warning: sched_fence->finished is first signaled from within
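A rough sketch of the irq_work idea (struct and helper names here are hypothetical, not the final upstream patch): instead of signalling the scheduler fence directly from the parent fence's callback, the signalling is deferred to irq_work so the fence machinery is not re-entered from within its own callback.

#include <linux/dma-fence.h>
#include <linux/irq_work.h>
#include <linux/slab.h>

struct kill_jobs_work {
	struct irq_work work;
	struct dma_fence *finished;	/* holds a reference */
};

static void kill_jobs_irq_work(struct irq_work *w)
{
	struct kill_jobs_work *kw = container_of(w, struct kill_jobs_work, work);

	dma_fence_signal(kw->finished);	/* now outside the callback path */
	dma_fence_put(kw->finished);
	kfree(kw);
}

/* called from the dma_fence callback instead of signalling directly */
static void defer_finished_signal(struct kill_jobs_work *kw)
{
	init_irq_work(&kw->work, kill_jobs_irq_work);
	irq_work_queue(&kw->work);
}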
only continue the execution in amdgpu_pci_resume
when it's pci_channel_io_frozen.
Fixes: c9a6b82f45e2("drm/amdgpu: Implement DPC recovery")
Suggested-by: Andrey Grodzovsky
Signed-off-by: Guchun Chen
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 1 +
drivers/gpu/drm/amd/amdgpu/
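A sketch of what that suggestion amounts to (the pci_channel_state field in amdgpu_device is an assumption of this sketch; the callbacks mirror the amdgpu PCI error handlers): remember the state seen in error_detected() and only run the heavy resume path for the frozen case.

pci_ers_result_t amdgpu_pci_error_detected_sketch(struct pci_dev *pdev,
						  pci_channel_state_t state)
{
	struct drm_device *dev = pci_get_drvdata(pdev);
	struct amdgpu_device *adev = drm_to_adev(dev);

	adev->pci_channel_state = state;	/* remembered for resume */

	switch (state) {
	case pci_channel_io_normal:
		return PCI_ERS_RESULT_CAN_RECOVER;
	case pci_channel_io_frozen:
		/* schedulers stopped and device locked only on this path */
		return PCI_ERS_RESULT_NEED_RESET;
	default:
		return PCI_ERS_RESULT_DISCONNECT;
	}
}

void amdgpu_pci_resume_sketch(struct pci_dev *pdev)
{
	struct drm_device *dev = pci_get_drvdata(pdev);
	struct amdgpu_device *adev = drm_to_adev(dev);

	/* nothing to undo unless error_detected() took the frozen path */
	if (adev->pci_channel_state != pci_channel_io_frozen)
		return;

	/* ... restart the schedulers and unlock the device here ... */
}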
On 2021-10-19 9:22 a.m., Nirmoy Das wrote:
Get rid of pin/unpin and evict and swap back the GART
page table, which should make things less likely to break.
+Christian
Could you guys also clarify what exactly are the stability issues this
fixes ?
Andrey
Also remove 2nd call to amdgpu_devic
On 2021-10-19 11:54 a.m., Christian König wrote:
On 19.10.21 at 17:41, Andrey Grodzovsky wrote:
On 2021-10-19 9:22 a.m., Nirmoy Das wrote:
Get rid of pin/unpin and evict and swap back the GART
page table, which should make things less likely to break.
+Christian
Could you guys also clarify
can and cannot be done there.
Andrey
On 01.10.21 at 17:10, Andrey Grodzovsky wrote:
From what I see here you are supposed to have an actual deadlock and not
only a warning: sched_fence->finished is first signaled from within the
hw fence done callback (drm_sched_job_done_cb) but then agai
On 2021-10-21 3:19 a.m., Yu, Lang wrote:
[AMD Official Use Only]
-Original Message-
From: Yu, Lang
Sent: Thursday, October 21, 2021 3:18 PM
To: Grodzovsky, Andrey
Cc: Deucher, Alexander ; Koenig, Christian
; Huang, Ray ; Yu, Lang
Subject: [PATCH 1/3] drm/amdgpu: fix a potential me
..
ring->adev->rings[ring->idx] = NULL;
}
Regards,
Lang
Got it, Looks good to me.
Reviewed-by: Andrey Grodzovsky
Andrey
Fixes: 72c8c97b1522 ("drm/amdgpu: Split amdgpu_device_fini into early
and late")
Signed-off-by: Lang Yu
---
drivers/gpu/drm/amd/amdgp
What do you mean by underflow in this case ? You mean use after free
because of extra dma_fence_put() ?
On 2021-10-22 4:14 a.m., JingWen Chen wrote:
ping
On 2021/10/22 AM11:33, Jingwen Chen wrote:
[Why]
In advance tdr mode, the real bad job will be resubmitted twice, while
in drm_sched_res
On 2021-10-24 10:56 p.m., JingWen Chen wrote:
On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote:
What do you mean by underflow in this case ? You mean use after free because of
extra dma_fence_put() ?
yes
Then maybe update the description because 'underflow' is very confusing
Adding back Daniel (somehow he got off the addresses list) and Chris who
worked a lot in this area.
On 2021-10-21 2:34 a.m., Christian König wrote:
On 20.10.21 at 21:32, Andrey Grodzovsky wrote:
On 2021-10-04 4:14 a.m., Christian König wrote:
The problem is a bit different.
The callback
know if I am still missing some point of yours.
Andrey
Regards,
Christian.
On 25.10.21 at 21:10, Andrey Grodzovsky wrote:
Adding back Daniel (somehow he got off the addresses list) and Chris
who worked a lot in this area.
On 2021-10-21 2:34 a.m., Christian König wrote:
On 20.10.21 at 21
On 2021-10-26 6:54 a.m., Christian König wrote:
On 26.10.21 at 04:33, Andrey Grodzovsky wrote:
On 2021-10-25 3:56 p.m., Christian König wrote:
In general I'm all in favor of getting this fixed, but there is one major
problem: drivers don't expect the lock to be dropped.
I am probab
On 2021-10-27 10:34 a.m., Christian König wrote:
On 27.10.21 at 16:27, Andrey Grodzovsky wrote:
[SNIP]
Let me please know if I am still missing some point of yours.
Well, I mean we need to be able to handle this for all drivers.
For sure, but as I said above, in my opinion we need to
On 2021-10-25 10:57 p.m., JingWen Chen wrote:
On 2021/10/25 11:18 PM, Andrey Grodzovsky wrote:
On 2021-10-24 10:56 p.m., JingWen Chen wrote:
On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote:
What do you mean by underflow in this case ? You mean use after free because of
extra dma_fence_put
On 2021-10-27 10:50 a.m., Christian König wrote:
On 27.10.21 at 16:47, Andrey Grodzovsky wrote:
On 2021-10-27 10:34 a.m., Christian König wrote:
On 27.10.21 at 16:27, Andrey Grodzovsky wrote:
[SNIP]
Let me please know if I am still missing some point of yours.
Well, I mean we need
On 2021-10-27 10:43 p.m., JingWen Chen wrote:
On 2021/10/28 3:43 AM, Andrey Grodzovsky wrote:
On 2021-10-25 10:57 p.m., JingWen Chen wrote:
On 2021/10/25 11:18 PM, Andrey Grodzovsky wrote:
On 2021-10-24 10:56 p.m., JingWen Chen wrote:
On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote:
What do
On 2021-10-27 3:58 p.m., Andrey Grodzovsky wrote:
On 2021-10-27 10:50 a.m., Christian König wrote:
On 27.10.21 at 16:47, Andrey Grodzovsky wrote:
On 2021-10-27 10:34 a.m., Christian König wrote:
On 27.10.21 at 16:27, Andrey Grodzovsky wrote:
[SNIP]
Let me please know if I am still
Pushed to drm-misc-next
Andrey
On 2021-10-29 3:07 a.m., Christian König wrote:
Attached a patch. Give it a try please, I tested it on my side and
tried to generate the right conditions to trigger this code path by
repeatedly submitting commands while issuing GPU reset to stop the
scheduler
On 2021-11-10 5:09 a.m., Christian König wrote:
On 10.11.21 at 10:50, Daniel Vetter wrote:
On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote:
On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter wrote:
On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote:
I stumbled across this threa