Thanks for helping with the review and for the good improvement ideas.
Pushed to drm-misc-next.
Andrey
On 2022-09-30 00:12, Luben Tuikov wrote:
From: Andrey Grodzovsky
When many entities are competing for the same run queue
on the same scheduler, we observe unusually long wait
times and some jobs
:
Various cosmetic fixes and minor refactoring of the fifo update function. (Luben)
v4:
Switch drm_sched_rq_select_entity_fifo to in order search (Luben)
v5: Fix up drm_sched_rq_select_entity_fifo loop
Signed-off-by: Andrey Grodzovsky
Tested-by: Li Yunxiang (Teddy)
---
drivers/gpu/drm/sche
Hey, I have problems with my git-send today, so I just attached V5 as a
patch here.
Andrey
On 2022-09-27 19:56, Luben Tuikov wrote:
Inlined:
On 2022-09-22 12:15, Andrey Grodzovsky wrote:
On 2022-09-22 11:03, Luben Tuikov wrote:
The title of this patch has "v3", but "v4"
Ping
Andrey
On 2022-09-22 12:15, Andrey Grodzovsky wrote:
On 2022-09-22 11:03, Luben Tuikov wrote:
The title of this patch has "v3", but "v4" in the title prefix.
If you're using "-v" to git-format-patch, please remove the "v3" from
the t
On 2022-09-22 11:03, Luben Tuikov wrote:
The title of this patch has "v3", but "v4" in the title prefix.
If you're using "-v" to git-format-patch, please remove the "v3" from the title.
Inlined:
On 2022-09-21 14:28, Andrey Grodzovsky wrote:
When
Drop default option in module control parameter.
v3:
Various cosmetic fixes and minor refactoring of the fifo update function. (Luben)
v4:
Switch drm_sched_rq_select_entity_fifo to in order search (Luben)
Signed-off-by: Andrey Grodzovsky
Tested-by: Li Yunxiang (Teddy)
---
drivers/gpu/drm/scheduler/
On 2022-09-19 23:11, Luben Tuikov wrote:
Please run this patch through checkpatch.pl, as it shows
12 warnings with it. Use these command line options:
"--strict --show-types".
Inlined:
On 2022-09-13 16:40, Andrey Grodzovsky wrote:
Given many entities competing for same run queue o
Zhao, Victor
; amd-gfx@lists.freedesktop.org
Cc: Deng, Emily
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
On 2022-09-16 01:18, Christian König wrote:
Am 15.09.22 um 22:37 schrieb Andrey Grodzovsky:
On 2022-09-15 15:26, Christian König wrote:
Am 15.09.22 um 20:29 sc
On 2022-09-16 01:18, Christian König wrote:
Am 15.09.22 um 22:37 schrieb Andrey Grodzovsky:
On 2022-09-15 15:26, Christian König wrote:
Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky:
On 2022-09-15 06:09, Zhao, Victor wrote:
[AMD Official Use Only - General]
Hi Christian,
The test
On 2022-09-15 15:26, Christian König wrote:
Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky:
On 2022-09-15 06:09, Zhao, Victor wrote:
[AMD Official Use Only - General]
Hi Christian,
The test sequence is executing a compute engine hang while running a
lot of containers submitting gfx jobs
Had a typo - see below
On 2022-09-15 14:29, Andrey Grodzovsky wrote:
On 2022-09-15 06:09, Zhao, Victor wrote:
[AMD Official Use Only - General]
Hi Christian,
The test sequence is executing a compute engine hang while running a
lot of containers submitting gfx jobs. We have advanced tdr
On 2022-09-15 06:09, Zhao, Victor wrote:
[AMD Official Use Only - General]
Hi Christian,
The test sequence is executing a compute engine hang while running a lot of
containers submitting gfx jobs. We have advanced tdr mode and mode2 reset
enabled on driver.
When a compute hang job timeout h
le control parameter.
v3:
Various cosmetic fixes and minor refactoring of the fifo update function.
Signed-off-by: Andrey Grodzovsky
Tested-by: Li Yunxiang (Teddy)
---
drivers/gpu/drm/scheduler/sched_entity.c | 26 -
drivers/gpu/drm/scheduler/sched_main.c | 132
I guess, but this is a kind of implicit assumption which is not really
documented and is easily overlooked.
Anyway - for this code it's not directly relevant.
Andrey
On 2022-09-13 03:25, Christian König wrote:
Am 13.09.22 um 04:00 schrieb Andrey Grodzovsky:
[SNIP]
You are right for sche
: Introduce gfx software ring(v3)
On 2022-09-12 12:22, Christian König wrote:
Am 12.09.22 um 17:34 schrieb Andrey Grodzovsky:
On 2022-09-12 09:27, Christian König wrote:
Am 12.09.22 um 15:22 schrieb Andrey Grodzovsky:
On 2022-09-12 06:20, Christian König wrote:
Am 09.09.22 um 18:45 schrieb
On 2022-09-12 12:22, Christian König wrote:
Am 12.09.22 um 17:34 schrieb Andrey Grodzovsky:
On 2022-09-12 09:27, Christian König wrote:
Am 12.09.22 um 15:22 schrieb Andrey Grodzovsky:
On 2022-09-12 06:20, Christian König wrote:
Am 09.09.22 um 18:45 schrieb Andrey Grodzovsky:
On 2022-09
On 2022-09-12 09:27, Christian König wrote:
Am 12.09.22 um 15:22 schrieb Andrey Grodzovsky:
On 2022-09-12 06:20, Christian König wrote:
Am 09.09.22 um 18:45 schrieb Andrey Grodzovsky:
On 2022-09-08 21:50, jiadong@amd.com wrote:
From: "Jiadong.Zhu"
The software ring is
On 2022-09-12 06:20, Christian König wrote:
Am 09.09.22 um 18:45 schrieb Andrey Grodzovsky:
On 2022-09-08 21:50, jiadong@amd.com wrote:
From: "Jiadong.Zhu"
The software ring is created to support priority
context while there is only one hardware queue
for gfx.
Every soft
On 2022-09-08 21:50, jiadong@amd.com wrote:
From: "Jiadong.Zhu"
Trigger MCBP according to the priority of the
software rings and the hw fence signaling
condition.
The muxer records the latest locations from the
software ring, which are used to resubmit packages
in preemption scenarios.
v
Really can't say too much here as I am not really familiar with queue
map/unmap...
Andrey
On 2022-09-08 21:50, jiadong@amd.com wrote:
From: "Jiadong.Zhu"
1. Modify the unmap_queue package on gfx9.
Add trailing fence to track the preemption done.
2. Modify emit_ce_meta emit_de_meta fu
Acked-by: Andrey Grodzovsky
Andrey
On 2022-09-08 21:50, jiadong@amd.com wrote:
From: "Jiadong.Zhu"
Set ring functions with software ring callbacks
on gfx9.
The software ring could be tested by debugfs_test_ib
case.
v2: set sw_ring 2 to enable software ring by default.
v3:
On 2022-09-08 21:50, jiadong@amd.com wrote:
From: "Jiadong.Zhu"
The software ring is created to support priority
context while there is only one hardware queue
for gfx.
Every software ring has its own fence driver and could
be used as an ordinary ring for the gpu_scheduler.
Multiple softwar
Please send everything together because otherwise it's not clear why we
need this.
Andrey
On 2022-09-08 11:09, James Zhu wrote:
Yes, it is for NPI design. I will send out patches for review soon.
Thanks!
James
On 2022-09-08 11:05 a.m., Andrey Grodzovsky wrote:
So this is the real ne
ched_list to track ring which is used in
this ctx in amdgpu_ctx_fini_entity
Best Regards!
James
On 2022-09-08 10:38 a.m., Andrey Grodzovsky wrote:
I guess it's an option, but I don't really see what the added value
is. You saved a few lines in this patch
but added a few lines
re derived from patch [3/4]:
entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
I think there is no special reason to treat single and multiple scheduler
lists differently here.
Best Regards!
James
On 2022-09-08 10:08 a.m., Andrey Grodzovsky wrote:
What's the reason for this entire patch set?
Andrey
On 2022-09-07 16:57, James Zhu wrote:
drm_sched_pick_best returns struct drm_gpu_scheduler ** instead of
struct drm_gpu_scheduler *
Signed-off-by: James Zhu
---
include/drm/gpu_scheduler.h | 2 +-
1 file changed, 1 insertion(+), 1 deleti
Luben, just a ping, whenever you have time.
Andrey
On 2022-09-05 01:57, Christian König wrote:
Am 03.09.22 um 04:48 schrieb Andrey Grodzovsky:
Problem: Given many entities competing for the same rq on
the same scheduler, an unacceptably long wait time for some
jobs stuck waiting in the rq before being
e structure for entities based on the TS of the
oldest job waiting in the job queue of the entity. Improves next
entity extraction to O(1). Entity TS update is
O(log(number of entities in rq))
Drop default option in module control parameter.
Signed-off-by: Andrey Grodzovsky
Tested-by: Li Yunxiang (Teddy)
---
On 2022-08-24 22:29, Luben Tuikov wrote:
Inlined:
On 2022-08-24 12:21, Andrey Grodzovsky wrote:
On 2022-08-23 17:37, Luben Tuikov wrote:
On 2022-08-23 14:57, Andrey Grodzovsky wrote:
On 2022-08-23 14:30, Luben Tuikov wrote:
On 2022-08-23 14:13, Andrey Grodzovsky wrote:
On 2022-08-23 12
On 2022-08-24 22:29, Luben Tuikov wrote:
Inlined:
On 2022-08-24 12:21, Andrey Grodzovsky wrote:
On 2022-08-23 17:37, Luben Tuikov wrote:
On 2022-08-23 14:57, Andrey Grodzovsky wrote:
On 2022-08-23 14:30, Luben Tuikov wrote:
On 2022-08-23 14:13, Andrey Grodzovsky wrote:
On 2022-08-23 12
On 2022-08-23 17:37, Luben Tuikov wrote:
On 2022-08-23 14:57, Andrey Grodzovsky wrote:
On 2022-08-23 14:30, Luben Tuikov wrote:
On 2022-08-23 14:13, Andrey Grodzovsky wrote:
On 2022-08-23 12:58, Luben Tuikov wrote:
Inlined:
On 2022-08-22 16:09, Andrey Grodzovsky wrote:
Problem: Given
On 2022-08-17 10:01, Andrey Grodzovsky wrote:
On 2022-08-17 09:44, Alex Deucher wrote:
On Tue, Aug 16, 2022 at 10:54 PM Chai, Thomas
wrote:
[AMD Official Use Only - General]
Hi Alex:
When removing an amdgpu device, it may be difficult to change the
order of psp_hw_fini calls.
1. The
On 2022-08-24 04:29, Michel Dänzer wrote:
On 2022-08-22 22:09, Andrey Grodzovsky wrote:
Problem: Given many entities competing for the same rq on
the same scheduler, unacceptably long wait times for some
jobs stuck waiting in the rq before being picked up are
observed (seen using GPUVis).
The issue is
On 2022-08-23 14:30, Luben Tuikov wrote:
On 2022-08-23 14:13, Andrey Grodzovsky wrote:
On 2022-08-23 12:58, Luben Tuikov wrote:
Inlined:
On 2022-08-22 16:09, Andrey Grodzovsky wrote:
Poblem: Given many entities competing for same rq on
^Problem
same scheduler an uncceptabliy long wait
On 2022-08-23 12:58, Luben Tuikov wrote:
Inlined:
On 2022-08-22 16:09, Andrey Grodzovsky wrote:
Poblem: Given many entities competing for same rq on
^Problem
same scheduler an uncceptabliy long wait time for some
^unacceptably
jobs waiting stuck in rq before being picked up are
On 2022-08-23 08:15, Christian König wrote:
Am 22.08.22 um 22:09 schrieb Andrey Grodzovsky:
Problem: Given many entities competing for the same rq on
the same scheduler, unacceptably long wait times for some
jobs stuck waiting in the rq before being picked up are
observed (seen using GPUVis).
The issue
job in the long queue.
Fix:
Add FIFO selection policy to entities in RQ; choose the next entity
on the rq in such an order that if a job on one entity arrived
earlier than a job on another entity, the first job will start
executing earlier, regardless of the length of the entity's job
queue.
Signed-off-by: An
amdgpu_pci_remove
function, which makes the gpu device inaccessible for userspace operations.
If the call to psp_hw_fini were moved before drm_dev_unplug, userspace could
still access the gpu device while the psp is being removed, which has unknown issues.
+Andrey Grodzovsky
We should fix the ordering
On 2022-08-12 14:38, Kim, Jonathan wrote:
[Public]
Hi Andrey,
Here's the load/unload stack trace. This is a 2 GPU xGMI system. I put
dbg_xgmi_hive_get/put refcount print post kobj get/put.
It's stuck at 2 on unload. If it's an 8 GPU system, it's stuck at 8.
e.g. of sysfs leak after drive
On 2022-08-11 11:34, Kim, Jonathan wrote:
[Public]
-Original Message-
From: Kuehling, Felix
Sent: August 11, 2022 11:19 AM
To: amd-gfx@lists.freedesktop.org; Kim, Jonathan
Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference
leak
Am 2022-08-11 um 09:42 schrieb
Series is Acked-by: Andrey Grodzovsky
Andrey
On 2022-08-01 00:07, Victor Zhao wrote:
To meet the requirement for multi container usecase which needs
a quicker reset and not causing VRAM lost, adding the Mode2
reset handler for sienna_cichlid.
v2: move skip mode2 flag part separately
v3
On 2022-07-28 06:30, Victor Zhao wrote:
To meet the requirement for multi container usecase which needs
a quicker reset and not causing VRAM lost, adding the Mode2
reset handler for sienna_cichlid.
v2: move skip mode2 flag part separately
Signed-off-by: Victor Zhao
---
drivers/gpu/drm/amd/
On 2022-07-28 06:30, Victor Zhao wrote:
In the multi-container use case, reset time is important, so skip the ring
tests and the cp halt wait during ip suspend for reset, as they are
going to fail and cost more time on reset
v2: add a hang flag to indicate the reset comes from a job timeout,
skip ring t
On 2022-07-27 06:35, Zhao, Victor wrote:
[AMD Official Use Only - General]
Hi Andrey,
The problem with status.hang is that it is set in
amdgpu_device_ip_check_soft_reset, which is not implemented for nv or gfx10.
It would have to be properly implemented there first.
Another option I thought is to mark status.
The stack trace is an expected part of the reset procedure, so that's OK. The
issue you are having is a hang in one of the GPU jobs during resume, which
triggers a GPU reset attempt.
You can open a ticket with this issue here
https://gitlab.freedesktop.org/drm/amd/-/issues, please attach full
dmesg log.
Got it
Acked-by: Andrey Grodzovsky
Andrey
On 2022-07-26 06:01, Zhao, Victor wrote:
[AMD Official Use Only - General]
Hi Andrey,
For slow tests I mean the slow hang tests run by the quark tool.
An example here:
hang_vm_gfx_dispatch_slow.lua - This script runs on a graphics engine using
compute
On 2022-07-26 05:40, Zhao, Victor wrote:
[AMD Official Use Only - General]
Hi Andrey,
Reply inline.
Thanks,
Victor
-Original Message-
From: Grodzovsky, Andrey
Sent: Tuesday, July 26, 2022 5:18 AM
To: Zhao, Victor ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Deng, Emi
On 2022-07-22 03:34, Victor Zhao wrote:
For some hangs caused by slow tests, the engine cannot be stopped, which
may cause a resume failure after reset. In this case, force halt the
engine by reverting context addresses
Can you maybe explain a bit more what exactly you mean by slow test and
why engine ca
Acked-by: Andrey Grodzovsky
Andrey
On 2022-07-22 03:34, Victor Zhao wrote:
Save and restore gfxhub regs as they will be reset during mode 2
Signed-off-by: Victor Zhao
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gfxhub.h| 2 +
drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 26
On 2022-07-22 03:34, Victor Zhao wrote:
In the multi-container use case, reset time is important, so skip the ring
tests and the cp halt wait during ip suspend for reset, as they are
going to fail and cost more time on reset
Why are they failing in this case? Skipping ring tests is not the best
idea
On 2022-07-25 13:37, Christian König wrote:
Hi Victor,
Am 25.07.22 um 12:45 schrieb Zhao, Victor:
[AMD Official Use Only - General]
Hi @Grodzovsky, Andrey,
Please help review the series, thanks a lot.
Hi @Koenig, Christian,
I thought a module parameter would be exposed to a common user, th
On 2022-07-22 03:33, Victor Zhao wrote:
To meet the requirement for multi container usecase which needs
a quicker reset and not causing VRAM lost, adding the Mode2
reset handler for sienna_cichlid. Adding a AMDGPU_SKIP_MODE2_RESET
flag so driver can fallback to default reset method when mode2
r
Reviewed-by: Andrey Grodzovsky
Andrey
On 2022-07-19 06:39, Andrey Strachuk wrote:
Local variable 'rq' is initialized with the address
of a field of drm_sched_job, so it does not make
sense to compare 'rq' with NULL.
Found by Linux Verification Center (linuxtesting.org) with S
amdgpu_job
v4:
add tdr sequence support for this feature. Add a job_run_counter to
indicate whether this job is a resubmit job.
v5
add missing handling in amdgpu_fence_enable_signaling
Signed-off-by: Jingwen Chen
Signed-off-by: Jack Zhang
Reviewed-by: Andrey Gr
Acked-by: Andrey Grodzovsky
Andrey
On 2022-07-14 06:39, Christian König wrote:
Allows submitting jobs as a gang which needs to run on multiple engines at the
same time.
All members of the gang get the same implicit, explicit and VM dependencies, so
no gang member will start running until
Reviewed-by: Andrey Grodzovsky
Andrey
On 2022-07-14 06:38, Christian König wrote:
Move setting the job resources into amdgpu_job.c
Signed-off-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 21 ++---
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 17
Found the new use case from the 5/10 of reordering CS ioctl.
Reviewed-by: Andrey Grodzovsky
Andrey
On 2022-07-14 12:26, Christian König wrote:
We need this for limiting codecs like AV1 to the first instance for VCN3.
Essentially the idea is that we first initialize the job with entity,
id
ff-by: Christian König
CC: Andrey Grodzovsky
CC: dri-de...@lists.freedesktop.org
---
drivers/gpu/drm/scheduler/sched_main.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c
b/drivers/gpu/drm/scheduler/sched_main.c
index 68317d3
On 2022-07-14 06:39, Christian König wrote:
Allows submitting jobs as a gang which needs to run on multiple
engines at the same time.
Basic idea is that we have a global gang submit fence representing when the
gang leader is finally pushed to run on the hardware last.
Jobs submitted as gang are
On 2022-07-13 13:33, Christian König wrote:
Am 13.07.22 um 19:13 schrieb Andrey Grodzovsky:
This is a follow-up cleanup to [1]. See below the refcount balancing
for calling amdgpu_job_submit_direct after this cleanup, as far
as I calculated.
amdgpu_fence_emit
dma_fence_init 1
g_test_ib
dma_fence_put(fence) 0
[1] -
https://patchwork.kernel.org/project/dri-devel/cover/20220624180955.485440-1-andrey.grodzov...@amd.com/
Signed-off-by: Andrey Grodzovsky
Suggested-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +--
drivers/gpu/drm/a
e EOP interrupt.
Fix:
Before accessing fence array in GPU disable EOP interrupt and flush
all pending interrupt handlers for amdgpu device's interrupt line.
v2: Switch from irq_get/put to full enable/disable_irq for amdgpu
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgp
patch we resumed setting s_fence->parent to NULL
in drm_sched_stop; switch to directly checking if job->hw_fence is
signaled to short-circuit the reset if already signaled.
Signed-off-by: Andrey Grodzovsky
Tested-by: Yiqing Yao
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 ++
drivers/
This function should drop the fence refcount when it extracts the
fence from the fence array, just as it's done in amdgpu_fence_process.
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
1 file changed, 3 insertions(
ext patch).
[1] -
https://lore.kernel.org/all/731b7ff1-3cc9-e314-df2a-7c51b76d4...@amd.com/t/#r00c728fcc069b1276642c325bfa9d82bf8fa21a3
Signed-off-by: Andrey Grodzovsky
Tested-by: Yiqing Yao
---
drivers/gpu/drm/scheduler/sched_main.c | 13 ++---
1 file changed, 10 insertions(+), 3
file/d/1yEoeW6OQC9WnwmzFW6NBLhFP_jD0xcHm/view?usp=sharing
Andrey Grodzovsky (4):
drm/amdgpu: Add put fence in amdgpu_fence_driver_clear_job_fences
drm/amdgpu: Prevent race between late signaled fences and GPU reset.
drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer'
drm/amdgpu: Follow up
On 2022-06-22 11:04, Christian König wrote:
Am 22.06.22 um 17:01 schrieb Andrey Grodzovsky:
On 2022-06-22 05:00, Christian König wrote:
Am 21.06.22 um 21:34 schrieb Andrey Grodzovsky:
On 2022-06-21 03:19, Christian König wrote:
Am 21.06.22 um 00:02 schrieb Andrey Grodzovsky:
Problem:
In
On 2022-06-23 01:52, Christian König wrote:
Am 22.06.22 um 19:19 schrieb Andrey Grodzovsky:
On 2022-06-22 03:17, Christian König wrote:
Am 21.06.22 um 22:00 schrieb Andrey Grodzovsky:
On 2022-06-21 03:28, Christian König wrote:
Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
Align
Just a ping
Andrey
On 2022-06-21 15:45, Andrey Grodzovsky wrote:
On 2022-06-21 03:25, Christian König wrote:
Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
Problem:
After we start handling timed out jobs we assume their fences won't be
signaled, but we cannot be sure, and sometimes they
On 2022-06-22 03:17, Christian König wrote:
Am 21.06.22 um 22:00 schrieb Andrey Grodzovsky:
On 2022-06-21 03:28, Christian König wrote:
Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
Align refcount behaviour for amdgpu_job embedded HW fence with
classic pointer style HW fences by
On 2022-06-22 05:00, Christian König wrote:
Am 21.06.22 um 21:34 schrieb Andrey Grodzovsky:
On 2022-06-21 03:19, Christian König wrote:
Am 21.06.22 um 00:02 schrieb Andrey Grodzovsky:
Problem:
In amdgpu_job_submit_direct - The refcount should drop by 2
but it drops only by 1
21:47, VURDIGERENATARAJ, CHANDAN wrote:
Hi,
Is this a preventive fix, or did you find errors/oops/hangs?
If you found errors/oops/hangs, can you please share the details?
BR,
Chandan V N
On 2022-06-21 03:25, Christian König wrote:
Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
Problem:
Aft
On 2022-06-21 03:28, Christian König wrote:
Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
Align refcount behaviour for amdgpu_job embedded HW fence with
classic pointer style HW fences by increasing refcount each
time emit is called so amdgpu code doesn't need to make workarounds
On 2022-06-21 03:25, Christian König wrote:
Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
Problem:
After we start handling timed out jobs we assume their fences won't be
signaled, but we cannot be sure, and sometimes they fire late. We need
to prevent concurrent accesses to the fence array
On 2022-06-21 03:19, Christian König wrote:
Am 21.06.22 um 00:02 schrieb Andrey Grodzovsky:
Problem:
In amdgpu_job_submit_direct - The refcount should drop by 2
but it drops only by 1.
amdgpu_ib_sched->emit -> refcount 1 from first fence init
dma_fence_get -> refcount 2
dma_
patch we resumed setting s_fence->parent to NULL
in drm_sched_stop; switch to directly checking if job->hw_fence is
signaled to short-circuit the reset if already signaled.
Signed-off-by: Andrey Grodzovsky
Tested-by: Yiqing Yao
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 ++
drivers/
ext patch).
[1] -
https://lore.kernel.org/all/731b7ff1-3cc9-e314-df2a-7c51b76d4...@amd.com/t/#r00c728fcc069b1276642c325bfa9d82bf8fa21a3
Signed-off-by: Andrey Grodzovsky
Tested-by: Yiqing Yao
---
drivers/gpu/drm/scheduler/sched_main.c | 16 +---
1 file changed, 13 insertions(+), 3
e EOP interrupt.
Fix:
Before accessing fence array in GPU disable EOP interrupt and flush
all pending interrupt handlers for amdgpu device's interrupt line.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4
drivers/gpu/drm/amd/amdgpu/amdgpu_fen
This function should drop the fence refcount when it extracts the
fence from the fence array, just as it's done in amdgpu_fence_process.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/dr
Problem:
In amdgpu_job_submit_direct - The refcount should drop by 2
but it drops only by 1.
amdgpu_ib_sched->emit -> refcount 1 from first fence init
dma_fence_get -> refcount 2
dma_fence_put -> refcount 1
Fix:
Add put for external_hw_fence in amdgpu_job_free/free_cb
Signed-of
/1yEoeW6OQC9WnwmzFW6NBLhFP_jD0xcHm/view?usp=sharing
Andrey Grodzovsky (5):
drm/amdgpu: Fix possible refcount leak for release of
external_hw_fence
drm/amdgpu: Add put fence in amdgpu_fence_driver_clear_job_fences
drm/amdgpu: Prevent race between late signaled fences and GPU reset.
drm/sched: Partial revert
On 2022-06-06 03:43, Yiqing Yao wrote:
[why]
A gfx job may be processed but not finished when a reset begins from a
compute job timeout. drm_sched_resubmit_jobs_ext in sched_main
assumes the submitted job is unsignaled and always puts the parent fence.
Resubmission for that job causes an underflow. This fix is done
+ Monk
On 2022-05-30 03:52, Christian König wrote:
Am 25.05.22 um 21:04 schrieb Andrey Grodzovsky:
We need to have a work_struct to cancel this reset if another is
already in progress.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++
drivers/gpu/drm
We removed the wrapper that was queueing the recover function
into the reset domain queue, which was using this name.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
We skip reset requests if another one is already in progress.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 27 ++
1 file changed, 27 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu
We need to have a work_struct to cancel this reset if another is
already in progress.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 31
We need to have a work_struct to cancel this reset if another is
already in progress.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 +--
2 files changed, 19 insertions(+), 2 deletions(-)
diff
Will be read by executors of the async reset, like debugfs.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 --
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 +
3 files changed, 6 insertions(+), 2 deletions
Save the extra useless work schedule.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 --
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 31207f7eec02
This reverts commit 6417250d3f894e66a68ba1cd93676143f2376a6f.
amdgpu needs this function in order to prematurely stop pending
reset works when another reset work is already in progress.
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Lai Jiangshan
Reviewed-by: Christian König
---
include/linux
was in v1 [1] to explicitly
stop each reset request from each reset source
per request submitter.
v3: Switch back to work_struct from delayed_work (Christian)
[1] -
https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzov...@amd.com/
Andrey Grodzovsky (7):
Revert "work
On 2022-05-20 03:52, Tejun Heo wrote:
On Fri, May 20, 2022 at 08:22:39AM +0200, Christian König wrote:
Am 20.05.22 um 02:47 schrieb Lai Jiangshan:
On Thu, May 19, 2022 at 11:04 PM Andrey Grodzovsky
wrote:
See this patch-set
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F
this only
for delayed_work and not for work_struct.
Andrey
On 2022-05-19 10:52, Lai Jiangshan wrote:
On Thu, May 19, 2022 at 9:57 PM Andrey Grodzovsky
wrote:
This reverts commit 6417250d3f894e66a68ba1cd93676143f2376a6f
and exports the function.
We need this function in the amdgpu driver to fix
This reverts commit 6417250d3f894e66a68ba1cd93676143f2376a6f
and exports the function.
We need this function in the amdgpu driver to fix a bug.
Signed-off-by: Andrey Grodzovsky
---
include/linux/workqueue.h | 1 +
kernel/workqueue.c| 9 +
2 files changed, 10 insertions(+)
diff
On 2022-05-19 03:58, Christian König wrote:
Am 18.05.22 um 16:24 schrieb Andrey Grodzovsky:
On 2022-05-18 02:07, Christian König wrote:
Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
Problem:
During hive reset caused by command timing out on a ring
extra resets are generated by
On 2022-05-18 02:07, Christian König wrote:
Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
Problem:
During a hive reset caused by a command timing out on a ring,
extra resets are generated, triggered by KFD which is
unable to access registers on the resetting ASIC.
Fix: Rework GPU reset to
We removed the wrapper that was queueing the recover function
into the reset domain queue, which was using this name.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
We skip reset requests if another one is already in progress.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 27 ++
1 file changed, 27 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu
We need to have a delayed work to cancel this reset if another is
already in progress.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 31
We need to have a delayed work to cancel this reset if another is
already in progress.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 +--
2 files changed, 19 insertions(+), 2 deletions(-)
diff