[PATCH 1/2] drm/amdgpu/gfx10: implement queue reset via MMIO

2025-01-08 Thread jesse.zh...@amd.com
implement gfx10 kcq reset via mmio. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 121 ++--- 1 file changed, 88 insertions(+), 33 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index 88393c

[PATCH 2/2] drm/amdgpu/gfx10: implement gfx queue reset via MMIO

2025-01-08 Thread jesse.zh...@amd.com
implement gfx10 kgq reset via mmio. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 98 ++ 1 file changed, 70 insertions(+), 28 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index 89409c

[PATCH v3] drm/amdgpu: Fix the looply call svm_range_restore_pages

2025-01-08 Thread Emily Deng
Because the pt is freed with a delay, the bo that was meant to be freed gets reused, which causes an unexpected page fault and then a call to svm_range_restore_pages. Details below: 1. It wants to free the pt in the following code, but the pt is not freed immediately; schedule_work(&vm->pt_free_work) is used instead; [ 92.276838] Call Tra

Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset

2025-01-08 Thread Lazar, Lijo
On 1/9/2025 1:31 AM, Jonathan Kim wrote: > Per queue reset should be bypassed when gpu recovery is disabled > with module parameter. > > Signed-off-by: Jonathan Kim > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/driver

RE: [PATCH v3] drm/amdkfd: Fix partial migrate issue

2025-01-08 Thread Deng, Emily
[AMD Official Use Only - AMD Internal Distribution Only] From: Yang, Philip Sent: Thursday, January 9, 2025 6:05 AM To: Deng, Emily ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH v3] drm/amdkfd: Fix partial migrate issue On 2025-01-08 08:19, Emily Deng wrote: For partial migrate from r

Re: [PATCH 1/5] drm/amdgpu/gfx: add ring helpers for setting workload profile

2025-01-08 Thread Lazar, Lijo
On 1/9/2025 4:26 AM, Alex Deucher wrote: > Add helpers to switch the workload profile dynamically when > commands are submitted. This allows us to switch to > the FULLSCREEN3D or COMPUTE profile when work is submitted. > Add a delayed work handler to delay switching out of the > selected profil

[PATCH] drm/amd/display: mark static functions noinline_for_stack

2025-01-08 Thread Tzung-Bi Shih
When compiling allmodconfig (CONFIG_WERROR=y) with clang-19, see the following errors: .../display/dc/dml2/display_mode_core.c:6268:13: warning: stack frame size (3128) exceeds limit (3072) in 'dml_prefetch_check' [-Wframe-larger-than] .../display/dc/dml2/dml21/src/dml2_core/dml2_core_dcn4_calcs.

Re:

2025-01-08 Thread Gerry Liu
> On Jan 9, 2025, at 00:33, Mario Limonciello wrote: > > On 1/8/2025 07:59, Jiang Liu wrote: >> Subject: [RFC PATCH 00/13] Enhance device state machine to better support >> suspend/resume > > I'm not sure how this happened, but your subject didn't end up in the subject > of the thread on patch 0 so the t

RE: [PATCH 1/5] drm/amdgpu/gfx: add ring helpers for setting workload profile

2025-01-08 Thread Feng, Kenneth
[AMD Official Use Only - AMD Internal Distribution Only] -Original Message- From: Deucher, Alexander Sent: Thursday, January 9, 2025 6:56 AM To: amd-gfx@lists.freedesktop.org Cc: Pillai, Aurabindo ; Feng, Kenneth ; Deucher, Alexander Subject: [PATCH 1/5] drm/amdgpu/gfx: add ring helpers

Re: [v3 4/6] drm/amdgpu: enhance error handling in function amdgpu_pci_probe()

2025-01-08 Thread Gerry Liu
> On Jan 9, 2025, at 00:08, Mario Limonciello wrote: > > On 1/8/2025 02:56, Jiang Liu wrote: >> Enhance error handling in function amdgpu_pci_probe() to avoid >> possible resource leakage. >> Signed-off-by: Jiang Liu >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +--- >> 1 file changed, 9

Re: [v3 3/6] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms()

2025-01-08 Thread Gerry Liu
> On Jan 8, 2025, at 17:54, Lazar, Lijo wrote: > > > > On 1/8/2025 2:26 PM, Jiang Liu wrote: >> If some GPU device failed to probe, `rmmod amdgpu` will trigger a use >> after free bug related to amdgpu_driver_release_kms() as: >> 2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, >>

[PATCH 5/7] drm/amdgpu: enable VCN/JPEG CGPG for GC IP version 11.5.3

2025-01-08 Thread Tim Huang
Enable VCN/JPEG CGPG for ASIC with GFX version 11.5.3. Signed-off-by: Saleemkhan Jamadar Signed-off-by: Tim Huang Reviewed-by: Yifan Zhang --- drivers/gpu/drm/amd/amdgpu/soc21.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/soc21.c b

[PATCH 6/7] drm/amdgpu: add support for SMU IP version 14.0.5

2025-01-08 Thread Tim Huang
This initializes SMU IP version 14.0.5. Signed-off-by: Tim Huang Reviewed-by: Yifan Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 1 + drivers/gpu/drm/amd/amdgpu/soc21.c | 1 + drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 1 + drivers/gpu/drm/amd/pm/swsmu/smu14/smu_

[PATCH 3/7] drm/amdgpu: add support for NBIO IP version 7.11.2

2025-01-08 Thread Tim Huang
This initializes NBIO IP version 7.11.2. Signed-off-by: Tim Huang Reviewed-by: Yifan Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 1 + drivers/gpu/drm/amd/amdgpu/soc21.c| 1 + 2 files changed, 2 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c

[PATCH 7/7] drm/amdgpu: add support for PSP IP version 14.0.5

2025-01-08 Thread Tim Huang
This initializes PSP IP version 14.0.5. Signed-off-by: Tim Huang Reviewed-by: Yifan Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 drivers/gpu/drm/amd/amdgpu/psp_v14_0.c| 10 ++ 3 files changed, 15 inserti

[PATCH 4/7] drm/amdgpu: add support for MMHUB IP version 3.3.2

2025-01-08 Thread Tim Huang
This initializes MMHUB IP version 3.3.2. Signed-off-by: Tim Huang Reviewed-by: Yifan Zhang --- drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 1 + drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c | 1 + 2 files changed, 2 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/a

[PATCH 1/7] drm/amdgpu: add support for GC IP version 11.5.3

2025-01-08 Thread Tim Huang
This initializes GC IP version 11.5.3. Signed-off-by: Tim Huang Reviewed-by: Yifan Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 6 + drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 1 + drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c| 12 +- drivers/gpu/drm/amd/amdgpu/

[PATCH 2/7] drm/amdgpu: add support for SDMA IP version 6.1.3

2025-01-08 Thread Tim Huang
This initializes SDMA IP version 6.1.3. Signed-off-by: Tim Huang Reviewed-by: Yifan Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 1 + drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c| 1 + drivers/gpu/drm/amd/amdkfd/kfd_device.c | 2 ++ 3 files changed, 4 insertions(+) diff --

Re: [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()

2025-01-08 Thread Gerry Liu
> On Jan 9, 2025, at 00:04, Mario Limonciello wrote: > > On 1/8/2025 02:56, Jiang Liu wrote: >> Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific >> drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used >> to do error recovery. >> Signed-off-by: Jiang Liu >> --- >> driv

RE: [PATCH] drm/amdgpu/smu13: update powersave optimizations

2025-01-08 Thread Feng, Kenneth
[AMD Official Use Only - AMD Internal Distribution Only] Reviewed-by: Kenneth Feng kenneth.f...@amd.com -Original Message- From: amd-gfx On Behalf Of Alex Deucher Sent: Thursday, January 9, 2025 4:26 AM To: amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander Subject: [PATCH] drm/amdgp

Re: [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()

2025-01-08 Thread Gerry Liu
> On Jan 8, 2025, at 17:31, Lazar, Lijo wrote: > > > > On 1/8/2025 2:26 PM, Jiang Liu wrote: >> Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific >> drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used >> to do error recovery. >> >> Signed-off-by: Jiang Liu >> --- >>

Re: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue

2025-01-08 Thread Philip Yang
On 2025-01-07 22:08, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] Hi Philip, It still has the deadlock, maybe the best way is

[PATCH 2/5] drm/amdgpu: add dynamic workload profile switching for gfx10

2025-01-08 Thread Alex Deucher
Enable dynamic workload profile switching for gfx10. Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 26 ++ 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10

[PATCH v4] drm/amdkfd: Have kfd driver use same PASID values from graphic driver

2025-01-08 Thread Xiaogang . Chen
From: Xiaogang Chen Current kfd driver has its own PASID value for a kfd process and uses it to locate vm at interrupt handler or mapping between kfd process and vm. That design is not working when a physical gpu device has multiple spatial partitions, ex: adev in CPX mode. This patch has kfd dri

[PATCH 1/5] drm/amdgpu/gfx: add ring helpers for setting workload profile

2025-01-08 Thread Alex Deucher
Add helpers to switch the workload profile dynamically when commands are submitted. This allows us to switch to the FULLSCREEN3D or COMPUTE profile when work is submitted. Add a delayed work handler to delay switching out of the selected profile if additional work comes in. This works the same as
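The pattern this patch describes (select a profile when work is submitted, then use a delayed handler so the profile is not dropped while more work keeps arriving) can be sketched in plain userspace C. This is only a simplified model under stated assumptions: `profile_state`, `begin_use`, `end_use`, `idle_work` and `IDLE_DELAY` are hypothetical names, and time is passed in explicitly instead of using the kernel's delayed work machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical delay (in arbitrary ticks) before dropping the profile. */
#define IDLE_DELAY 100

struct profile_state {
    bool boosted;            /* true while the non-default profile is selected */
    unsigned int in_flight;  /* submissions that have not completed yet */
    uint64_t idle_deadline;  /* earliest time the profile may be dropped */
};

/* Called when work is submitted: select the profile immediately. */
void begin_use(struct profile_state *s)
{
    s->in_flight++;
    s->boosted = true;
}

/* Called when work completes: push the idle deadline out, do not drop yet. */
void end_use(struct profile_state *s, uint64_t now)
{
    assert(s->in_flight > 0);
    s->in_flight--;
    s->idle_deadline = now + IDLE_DELAY;
}

/* Stand-in for the delayed work handler: drop the profile only when idle. */
void idle_work(struct profile_state *s, uint64_t now)
{
    if (s->in_flight == 0 && now >= s->idle_deadline)
        s->boosted = false;
}
```

In this model a second submission that arrives before the deadline simply moves the deadline later, so the profile stays selected across a burst of jobs.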

[PATCH 5/5] drm/amdgpu/swsmu: set workload profile to bootup default

2025-01-08 Thread Alex Deucher
Now that we can select a workload profile dynamically when we submit work, it's best to default to the bootup default workload profile. Defaulting to other profiles prevents some power management features from kicking in during idle periods. Once all jobs have finished, the workload profile will

[PATCH 4/5] drm/amdgpu: add dynamic workload profile switching for gfx12

2025-01-08 Thread Alex Deucher
Enable dynamic workload profile switching for gfx12. Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 26 ++ 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12

[PATCH 3/5] drm/amdgpu: add dynamic workload profile switching for gfx11

2025-01-08 Thread Alex Deucher
Enable dynamic workload profile switching for gfx11. Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 26 ++ 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11

Re: [PATCH v3] drm/amdkfd: Fix partial migrate issue

2025-01-08 Thread Philip Yang
On 2025-01-08 08:19, Emily Deng wrote: For a partial migration from RAM to VRAM, migrate->cpages is not equal to migrate->npages, so migrate->npages should be used to check all the pages to be migrated, whether or not they can be copied. It is only necessary to set the pages that could be m

Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset

2025-01-08 Thread Alex Deucher
On Wed, Jan 8, 2025 at 3:27 PM Jonathan Kim wrote: > > Per queue reset should be bypassed when gpu recovery is disabled > with module parameter. > > Signed-off-by: Jonathan Kim Maybe add a fixes tag? With that, Reviewed-by: Alex Deucher > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9

[PATCH] drm/amdgpu/smu13: update powersave optimizations

2025-01-08 Thread Alex Deucher
Only apply when compute profile is selected. This is the only supported configuration. Selecting other profiles can lead to performance degradations. Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletio

[PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset

2025-01-08 Thread Jonathan Kim
Per queue reset should be bypassed when gpu recovery is disabled with module parameter. Signed-off-by: Jonathan Kim --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c b/drivers/gpu/drm

Re: [PATCH] drm/amdgpu: Fix shift type in amdgpu_debugfs_sdma_sched_mask_set()

2025-01-08 Thread Mario Limonciello
On 1/8/2025 03:41, Dan Carpenter wrote: The "mask" and "val" variables are type u64. The problem is that the BIT() macros are type unsigned long which is just 32 bits on 32bit systems. It's unlikely that people will be using this driver on 32bit kernels and even if they did we only use the lowe

[PATCH v3] drm/amdkfd: Uninitialized and Unused variables

2025-01-08 Thread Andrew Martin
This patch initialized key variables and removed unused ones. Signed-off-by: Andrew Martin --- .../gpu/drm/amd/amdkfd/cik_event_interrupt.c | 2 +- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 24 +-- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 2 +- .../drm/amd/amdkfd/kfd_devi

Re: [PATCH v2] drm/amdgpu/gfx10: Enable cleaner shader for GFX10.3.2/10.3.4/10.3.5 GPUs

2025-01-08 Thread Alex Deucher
On Sat, Jan 4, 2025 at 3:52 AM Srinivasan Shanmugam wrote: > > Enable the cleaner shader for GFX10.3.2/10.3.4/10.3.5 GPUs to provide > data isolation between GPU workloads. The cleaner shader is responsible > for clearing the Local Data Store (LDS), Vector General Purpose > Registers (VGPRs), and

Re: [RFC PATCH 03/13] drm/amdgpu: add a flag to track ras debugfs creation status

2025-01-08 Thread Mario Limonciello
On 1/8/2025 07:59, Jiang Liu wrote: Add a flag to track ras debugfs creation status, to avoid possible incorrect reference count management for ras block object in function amdgpu_ras_aca_is_supported(). Rather than taking a marker position, why not just check for obj->fs_data.debugfs_name to

Re: [PATCH] drm/amdgpu: Fix Circular Locking Dependency in AMDGPU GFX Isolation

2025-01-08 Thread Alex Deucher
On Sat, Jan 4, 2025 at 1:02 AM Srinivasan Shanmugam wrote: > > This commit addresses a circular locking dependency issue within the GFX > isolation mechanism. The problem was identified by a warning indicating > a potential deadlock due to inconsistent lock acquisition order. > > - The `amdgpu_gfx

Re: [RFC PATCH 10/13] drm/admgpu: make device state machine work in stack like way

2025-01-08 Thread Mario Limonciello
On 1/8/2025 08:00, Jiang Liu wrote: Make the device state machine work in a stack-like way to better support suspend/resume with the following changes: 1. amdgpu_driver_load_kms() amdgpu_device_init() amdgpu_device_ip_early_init() ip_blocks[i].early_init()

Re: [RFC PATCH 09/13] drm/amdgpu: make IP block state machine works in stack like way

2025-01-08 Thread Mario Limonciello
On 1/8/2025 08:00, Jiang Liu wrote: There are some mismatch between IP block state machine and its associated status flags, especially about the meaning of `status.late_initialized`. So let's make the state machine and associated status flas work in stack-like s/flas/flag/ way as below: Callb

[PATCH] Mark debug KFD module params as unsafe

2025-01-08 Thread Kent Russell
Mark options only meant to be used for debugging as unsafe so that the kernel is tainted when they are used. Signed-off-by: Kent Russell --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dr

Re:

2025-01-08 Thread Mario Limonciello
On 1/8/2025 07:59, Jiang Liu wrote: Subject: [RFC PATCH 00/13] Enhance device state machine to better support suspend/resume I'm not sure how this happened, but your subject didn't end up in the subject of the thread on patch 0 so the thread just looks like an unsubjected thread. Recentl

Re: [v3 5/6] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()

2025-01-08 Thread Chen, Xiaogang
On 1/8/2025 3:16 AM, Christian König wrote: On 08.01.25 at 09:56, Jiang Liu wrote: Function detects initialization status by checking sched->ops, Where is that done? Inside the scheduler or inside amdgpu? Inside amdgpu set ring->sched.ops to null if ring's scheduler init fail since we use

Re: [v3 4/6] drm/amdgpu: enhance error handling in function amdgpu_pci_probe()

2025-01-08 Thread Mario Limonciello
On 1/8/2025 02:56, Jiang Liu wrote: Enhance error handling in function amdgpu_pci_probe() to avoid possible resource leakage. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/

Re: [PATCH 1/3] drm/amdgpu: add stucture to populate doorbell info

2025-01-08 Thread Sharma, Shashank
Hello Saleem, On 06/01/2025 17:45, Saleemkhan Jamadar wrote: This structure is basically used to populate the doorbell information that is required to be mapped. Signed-off-by: Saleemkhan Jamadar --- drivers/gpu/drm/amd/include/amdgpu_userqueue.h | 7 +++ 1 file changed, 7 insertions(+

Re: [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()

2025-01-08 Thread Mario Limonciello
On 1/8/2025 02:56, Jiang Liu wrote: Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used to do error recovery. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 76 +-

Re: [PATCH 3/3] drm/amdgpu: add db size and offset range for VCN and VPE

2025-01-08 Thread Sharma, Shashank
On 06/01/2025 17:45, Saleemkhan Jamadar wrote: VCN and VPE have different offset ranges; update the doorbell offset range for each respectively. The doorbell size for VCN and VPE is 32 bits. Signed-off-by: Saleemkhan Jamadar --- drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c | 15 +++ 1 file chan

Re: [RFC PATCH 01/13] amdgpu: wrong array index to get ip block for PSP

2025-01-08 Thread Alex Deucher
Applied this one. Thanks! Alex On Wed, Jan 8, 2025 at 9:00 AM Jiang Liu wrote: > > The adev->ip_blocks array is not indexed by AMD_IP_BLOCK_TYPE_xxx, > instead we should call amdgpu_device_ip_get_ip_block() to get the > corresponding IP block object. > > Signed-off-by: Jiang Liu > --- > driver

Re: [PATCH 2/3] drm/amdgpu: map doorbell for the requested userq

2025-01-08 Thread Sharma, Shashank
On 06/01/2025 17:45, Saleemkhan Jamadar wrote: Made changes to the doorbell mapping func more generic, by taking parameters that vary based on IPs and/or usecase into db_info structure. Signed-off-by: Saleemkhan Jamadar ] This line above is garbage, please make sure that you pass checkpatch.

Re: [RFC PATCH] amd/ttm: test fence->ops->signaled before use

2025-01-08 Thread James Zhu
My Q or A is inline. Thanks! James Zhu On 2025-01-08 04:18, Christian König wrote: On 07.01.25 at 21:01, James Zhu wrote: this original test condition is unclear. No that is completely unnecessary. The point is that with fence->ops->signaled provided the fence should make progress even without

[PATCH] drm/amdgpu: Mark debug KFD module params as unsafe

2025-01-08 Thread Kent Russell
Mark options only meant to be used for debugging as unsafe so that the kernel is tainted when they are used. Signed-off-by: Kent Russell --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dr

Re: [PATCH] drm/amdkfd: Fix partial migrate issue

2025-01-08 Thread Philip Yang
On 2025-01-07 19:31, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] From: Yan

Re:

2025-01-08 Thread Christian König
On 08.01.25 at 14:59, Jiang Liu wrote: Subject: [RFC PATCH 00/13] Enhance device state machine to better support suspend/resume Recently we were testing suspend/resume functionality with AMD GPUs, we have encountered several resource tracking related bugs, such as double buffer free, use after

[PATCH] drm/amdgpu: Fix shift type in amdgpu_debugfs_sdma_sched_mask_set()

2025-01-08 Thread Dan Carpenter
The "mask" and "val" variables are type u64. The problem is that the BIT() macros are type unsigned long which is just 32 bits on 32bit systems. It's unlikely that people will be using this driver on 32bit kernels and even if they did we only use the lower AMDGPU_MAX_SDMA_INSTANCES (16) bits. So
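The width problem described in this message can be illustrated outside the kernel. The sketch below is a userspace model, not the driver code: `BIT_NARROW` stands in for `BIT()` on a 32-bit kernel (where `unsigned long` is 32 bits wide) and `BIT_WIDE` stands in for `BIT_ULL()`; both macro names and the two helper functions are hypothetical. It shows one way a 32-bit bit macro can misbehave when combined with a 64-bit mask: the inverted 32-bit value is zero-extended, so the AND also clears the upper 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for BIT() on a 32-bit kernel: the result is only 32 bits wide. */
#define BIT_NARROW(n) ((uint32_t)1 << (n))
/* Stand-in for BIT_ULL(): the result is always 64 bits wide. */
#define BIT_WIDE(n)   ((uint64_t)1 << (n))

/* Clear bit n with the 32-bit macro; ~BIT_NARROW(n) is zero-extended to 64 bits. */
uint64_t clear_bit_narrow(uint64_t mask, unsigned int n)
{
    assert(n < 32); /* shifting a 32-bit value by 32 or more is undefined */
    return mask & ~BIT_NARROW(n);
}

/* Clear bit n with the 64-bit macro; the upper half of the mask is preserved. */
uint64_t clear_bit_wide(uint64_t mask, unsigned int n)
{
    assert(n < 64);
    return mask & ~BIT_WIDE(n);
}
```

With a mask of `0x100000003`, clearing bit 0 through the narrow macro yields `0x2` (bit 32 is lost), while the wide macro yields `0x100000002`.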

Re: [PATCH] drm/amdgpu: always sync the GFX pipe on ctx switch

2025-01-08 Thread Alex Deucher
On Wed, Jan 8, 2025 at 4:52 AM Christian König wrote: > > That is needed to enforce isolation between contexts. > > Signed-off-by: Christian König Reviewed-by: Alex Deucher > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --gi

[RFC PATCH 12/13] drm/amdgpu/nbio: improve the way to manage irq reference count

2025-01-08 Thread Jiang Liu
Refactor nbio related code to improve how the irq reference count is managed. Originally amdgpu_irq_get() is called from ip_blocks[].late_init and amdgpu_irq_put() is called from ip_blocks[].hw_fini. This asymmetric design may cause issues under certain conditions. So 1) introduce amdgpu_nbio_ras_early_

[RFC PATCH 02/13] drm/admgpu: add helper functions to track status for ras manager

2025-01-08 Thread Jiang Liu
Add helper functions to track status for ras manager and ip blocks. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 38 + drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 38 + drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 10 +++ 3

[RFC PATCH 07/13] drm/amdgpu: enhance amdgpu_ras_pre_fini() to better support SR

2025-01-08 Thread Jiang Liu
Enhance amdgpu_ras_pre_fini() to better support suspend/resume by: 1) fix possible resource leakage. amdgpu_release_ras_context() only kfree(con) but doesn't release resources associated with the con object. 2) call amdgpu_ras_pre_fini() in amdgpu_device_suspend() to undo what has been don

[no subject]

2025-01-08 Thread Jiang Liu
Subject: [RFC PATCH 00/13] Enhance device state machine to better support suspend/resume Recently we were testing suspend/resume functionality with AMD GPUs, we have encountered several resource tracking related bugs, such as double buffer free, use after free and unbalanced irq reference count.

[RFC PATCH 06/13] drm/amdgpu: enhance amdgpu_ras_block_late_fini()

2025-01-08 Thread Jiang Liu
Enhance amdgpu_ras_block_late_fini() to revert what has been done by amdgpu_ras_block_late_init(), and fix a possible resource leakage in function amdgpu_ras_block_late_init(). Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 16 ++-- 1 file changed, 10 insertio

[RFC PATCH 08/13] drm/admgpu: rename amdgpu_ras_pre_fini() to amdgpu_ras_early_fini()

2025-01-08 Thread Jiang Liu
Rename amdgpu_ras_pre_fini() to amdgpu_ras_early_fini(), to keep same style with other code. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +++--- drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c| 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c | 2 +- drivers/gpu/dr

[RFC PATCH 03/13] drm/amdgpu: add a flag to track ras debugfs creation status

2025-01-08 Thread Jiang Liu
Add a flag to track ras debugfs creation status, to avoid possible incorrect reference count management for ras block object in function amdgpu_ras_aca_is_supported(). Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++

[RFC PATCH 09/13] drm/amdgpu: make IP block state machine works in stack like way

2025-01-08 Thread Jiang Liu
There are some mismatch between IP block state machine and its associated status flags, especially about the meaning of `status.late_initialized`. So let's make the state machine and associated status flas work in stack-like way as below: CallbackStatus early_init: valid = true sw_init:

[RFC PATCH 05/13] drm/amdgpu: introduce a flag to track refcount held for features

2025-01-08 Thread Jiang Liu
Currently we track the refcount on ras block object for features by checking `if (obj && amdgpu_ras_is_feature_enabled(adev, head))`, which is a little unreliable. So introduce a dedicated flag to track the reference count. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2

[RFC PATCH 13/13] drm/amdgpu/asic: make ip block operations symmetric by .early_fini()

2025-01-08 Thread Jiang Liu
Make ip block operations for asic symmetric by making use of the .early_fini() hook, which will undo work done by the .late_init() hook. 1) introduce xxx_common_early_fini() for nv/soc15/soc21/soc24. 2) move `enable_doorbell_selfring_aperture(adev, false)` from .hw_init() into .early_fini(). 3

[RFC PATCH 11/13] drm/amdgpu/sdma: improve the way to manage irq reference count

2025-01-08 Thread Jiang Liu
Refactor sdma related code to improve how the irq reference count is managed. Originally amdgpu_irq_get() is called from ip_blocks[].late_init and amdgpu_irq_put() is called from ip_blocks[].hw_fini. This asymmetric design may cause issues under certain conditions. So 1) introduce amdgpu_sdma_ras_early_
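The symmetry argument can be sketched with a small reference-counted interrupt source. This is a userspace model under assumptions, not the amdgpu implementation: `irq_src`, `irq_get`, `irq_put`, `late_init` and `early_fini` are hypothetical stand-ins, and the "warning" on an unbalanced put is modeled as a `false` return value.

```c
#include <assert.h>
#include <stdbool.h>

struct irq_src {
    unsigned int refcount; /* number of outstanding gets */
    bool hw_enabled;       /* true while at least one user holds a reference */
};

/* Take a reference; the first reference enables the interrupt. */
void irq_get(struct irq_src *s)
{
    assert(s);
    if (s->refcount++ == 0)
        s->hw_enabled = true;
}

/* Drop a reference; returns false for a put without a matching get. */
bool irq_put(struct irq_src *s)
{
    assert(s);
    if (s->refcount == 0)
        return false;
    if (--s->refcount == 0)
        s->hw_enabled = false;
    return true;
}

/* Paired hooks: whatever late_init takes, early_fini gives back. */
void late_init(struct irq_src *s)  { irq_get(s); }
bool early_fini(struct irq_src *s) { return irq_put(s); }
```

When the get and the put live in hooks that are always run as a pair, a teardown that happens before the matching init never reaches the unbalanced put.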

[RFC PATCH 01/13] amdgpu: wrong array index to get ip block for PSP

2025-01-08 Thread Jiang Liu
The adev->ip_blocks array is not indexed by AMD_IP_BLOCK_TYPE_xxx, instead we should call amdgpu_device_ip_get_ip_block() to get the corresponding IP block object. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) d
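The class of bug described here (using a type enum as an index into an array that is stored in registration order) can be shown with a small model. The names below (`block_type`, `ip_block`, `device`, `find_block`) are hypothetical and only mirror the idea of a lookup helper; they are not the amdgpu definitions.

```c
#include <assert.h>
#include <stddef.h>

enum block_type { BLOCK_COMMON, BLOCK_GMC, BLOCK_IH, BLOCK_PSP, BLOCK_GFX };

struct ip_block {
    enum block_type type;
    const char *name;
};

struct device {
    struct ip_block blocks[8]; /* filled in registration order, not by type */
    size_t num_blocks;
};

/* Search by type instead of indexing the array with the type value. */
const struct ip_block *find_block(const struct device *dev, enum block_type type)
{
    assert(dev != NULL);
    for (size_t i = 0; i < dev->num_blocks; i++) {
        if (dev->blocks[i].type == type)
            return &dev->blocks[i];
    }
    return NULL; /* this device has no block of that type */
}
```

If a device registers only COMMON, GMC and PSP, the PSP block sits at index 2, while `blocks[BLOCK_PSP]` would read index 3, which is past the registered entries.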

[RFC PATCH 10/13] drm/admgpu: make device state machine work in stack like way

2025-01-08 Thread Jiang Liu
Make the device state machine work in a stack-like way to better support suspend/resume with the following changes: 1. amdgpu_driver_load_kms() amdgpu_device_init() amdgpu_device_ip_early_init() ip_blocks[i].early_init() ip_blocks[i].

[RFC PATCH 04/13] drm/amdgpu: free all resources on error recovery path of amdgpu_ras_init()

2025-01-08 Thread Jiang Liu
Free all allocated resources on error recovery path in function amdgpu_ras_init(). Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 19 ++- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/

[PATCH v3] drm/amdkfd: Fix partial migrate issue

2025-01-08 Thread Emily Deng
For a partial migration from RAM to VRAM, migrate->cpages is not equal to migrate->npages, so migrate->npages should be used to check all the pages to be migrated, whether or not they can be copied. It is only necessary to set the pages that could be migrated in migrate->dst[i]; otherwise migrate_vma_pages will migrate the wro
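The indexing rule in this description can be sketched as a loop over all `npages` entries that fills the destination slot only for pages flagged as migratable. This is a simplified userspace model, not the kfd code: `PFN_FLAG_MIGRATE`, `fill_dst` and `new_pfn_base` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PFN_FLAG_MIGRATE 0x1u /* hypothetical per-page "can be migrated" flag */

/*
 * Walk all npages source entries and fill dst[] only for pages that
 * carry the migrate flag. Returns how many destination entries were set.
 */
size_t fill_dst(const uint64_t *src, uint64_t *dst, size_t npages,
                uint64_t new_pfn_base)
{
    size_t set = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < npages; i++) {
        if (!(src[i] & PFN_FLAG_MIGRATE)) {
            dst[i] = 0;            /* page cannot be migrated: leave slot empty */
            continue;
        }
        dst[i] = new_pfn_base + i; /* slot i stays aligned with src[i] */
        set++;
    }
    return set;
}
```

Iterating over the full page count and keeping `dst[i]` aligned with `src[i]` is what prevents a destination from being paired with the wrong source page when only some pages are migratable.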

Re: [PATCH v2] drm/amdkfd: Move gfx12 trap handler to separate file

2025-01-08 Thread Lancelot SIX
On 06/01/2025 19:25, Jay Cornwall wrote: gfx12 derivatives will have substantially different trap handler implementations from gfx10/gfx11. Add a separate source file for gfx12+ and remove unneeded conditional code. No functional change. v2: Revert copyright date to 2018, minor comment fixes

Re: [v3 5/6] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()

2025-01-08 Thread Lazar, Lijo
On 1/8/2025 2:46 PM, Christian König wrote: > On 08.01.25 at 09:56, Jiang Liu wrote: >> Function detects initialization status by checking sched->ops, > > Where is that done? Inside the scheduler or inside amdgpu? Down below inside amdgpu_fence_driver_sw_fini(). I think sched.ready is repurpo

Re: [v3 3/6] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms()

2025-01-08 Thread Lazar, Lijo
On 1/8/2025 2:26 PM, Jiang Liu wrote: > If some GPU device failed to probe, `rmmod amdgpu` will trigger a use > after free bug related to amdgpu_driver_release_kms() as: > 2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, > address: > 2024-12-26 16:17:45

Re: [v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put()

2025-01-08 Thread Gerry Liu
> On Jan 8, 2025, at 18:02, Lazar, Lijo wrote: > > > > On 1/8/2025 2:26 PM, Jiang Liu wrote: >> If error happens before amdgpu_fence_driver_hw_init() gets called during >> device probe, it will trigger a false warning in amdgpu_irq_put() as >> below: >> [ 1209.300996] [ cut here ]

Re: [v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put()

2025-01-08 Thread Lazar, Lijo
On 1/8/2025 2:26 PM, Jiang Liu wrote: > If error happens before amdgpu_fence_driver_hw_init() gets called during > device probe, it will trigger a false warning in amdgpu_irq_put() as > below: > [ 1209.300996] [ cut here ] > [ 1209.301061] WARNING: CPU: 48 PID: 293 at >

Re: [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()

2025-01-08 Thread Lazar, Lijo
On 1/8/2025 2:26 PM, Jiang Liu wrote: > Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific > drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used > to do error recovery. > > Signed-off-by: Jiang Liu > --- > drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 76 ++

[PATCH] drm/amdgpu: always sync the GFX pipe on ctx switch

2025-01-08 Thread Christian König
That is needed to enforce isolation between contexts. Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c index d751995dc

Re: [RFC PATCH] amd/ttm: test fence->ops->signaled before use

2025-01-08 Thread Christian König
On 07.01.25 at 21:01, James Zhu wrote: this original test condition is unclear. No that is completely unnecessary. The point is that with fence->ops->signaled provided the fence should make progress even without enabling signaling. Why would you want to add this? Regards, Christian. Si

Re: [v3 5/6] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()

2025-01-08 Thread Christian König
On 08.01.25 at 09:56, Jiang Liu wrote: Function detects initialization status by checking sched->ops, Where is that done? Inside the scheduler or inside amdgpu? Regards, Christian. so set sched->ops to non-NULL just before return in function amdgpu_fence_driver_sw_fini() and amdgpu_device

Re: [PATCH] drm/amd: Fix random crashes due to bad kfree

2025-01-08 Thread Chris Bainbridge
#regzbot introduced: c6a837088bed ^

Re: [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()

2025-01-08 Thread Gerry Liu
> On Jan 8, 2025, at 06:53, Chen, Xiaogang wrote: > > > > On 1/4/2025 8:45 PM, Jiang Liu wrote: >> Fix possible resource leakage on error recovery path in function >> kgd2kfd_device_init(). >> >> Signed-off-by: Jiang Liu >> >> --- >> drivers/gpu/drm/amd/amdkfd/kfd_devi

Re: [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()

2025-01-08 Thread Gerry Liu
> On Jan 8, 2025, at 06:55, Chen, Xiaogang wrote: > > > > On 1/4/2025 8:45 PM, Jiang Liu wrote: >> If some GPU device failed to probe, `rmmod amdgpu` will trigger a use >> after free bug related to amdgpu_driver_release_kms() as: >> 2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference,

[RESEND PATCH] drm/amdgpu: add tracepoint while dump mca bank

2025-01-08 Thread Ruidong Tian
RAS errors are typically exposed to user-space programs using tracepoints, allowing tools like rasdaemon to decode and post-process them. AMDGPU might also follow this, offering the following benefits: 1. It can proactively notify users of RAS events, eliminating the need to monitor /dev/kmsg. 2

Re: [v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put()

2025-01-08 Thread Christian König
On 08.01.25 at 09:56, Jiang Liu wrote: If error happens before amdgpu_fence_driver_hw_init() gets called during device probe, it will trigger a false warning in amdgpu_irq_put() as below: [ 1209.300996] [ cut here ] [ 1209.301061] WARNING: CPU: 48 PID: 293 at /tmp/amd.Rc

[v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put()

2025-01-08 Thread Jiang Liu
If error happens before amdgpu_fence_driver_hw_init() gets called during device probe, it will trigger a false warning in amdgpu_irq_put() as below: [ 1209.300996] [ cut here ] [ 1209.301061] WARNING: CPU: 48 PID: 293 at /tmp/amd.Rc9jFrl7/amd/amdgpu/amdgpu_irq.c:633 amdgpu_

[v3 4/6] drm/amdgpu: enhance error handling in function amdgpu_pci_probe()

2025-01-08 Thread Jiang Liu
Enhance error handling in function amdgpu_pci_probe() to avoid possible resource leakage. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/d

[v3 5/6] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()

2025-01-08 Thread Jiang Liu
Function detects initialization status by checking sched->ops, so set sched->ops to non-NULL just before return in function amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers() to avoid possible invalid memory access on error recover path. Signed-off-by: Jiang Liu --- drivers/gpu/dr

[v3 3/6] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms()

2025-01-08 Thread Jiang Liu
If some GPU device failed to probe, `rmmod amdgpu` will trigger a use after free bug related to amdgpu_driver_release_kms() as: 2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in kerne

[v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()

2025-01-08 Thread Jiang Liu
Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used to do error recovery. Signed-off-by: Jiang Liu --- drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 76 + drivers/gpu/drm/amd/amdxcp/amdgpu_

[PATCH v3 0/6] Fix several bugs in error handling during device probe

2025-01-08 Thread Jiang Liu
This patchset tries to fix several memory leaks/invalid memory accesses on the error handling path during GPU driver loading/unloading. They apply to: https://gitlab.freedesktop.org/agd5f/linux.git amd-staging-drm-next v3: 1) drop first patch of v2 2) rework the 0003/0004 patches of v2 according

[v3 1/6] drm/amdgpu: clear adev->in_suspend flag when fails to suspend

2025-01-08 Thread Jiang Liu
Clear the adev->in_suspend flag when suspend fails, otherwise it will cause too many warnings like: [ 1802.212027] [ cut here ] [ 1802.212028] WARNING: CPU: 97 PID: 11282 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:452 amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu] [ 1802.2121