[PATCH] drm/amdgpu: update SDMA sysfs reset mask in late_init

2025-02-24 Thread jesse.zhang
From: "jesse.zh...@amd.com" - Added the `sdma_v4_4_2_update_reset_mask` function to update the reset mask. - Update the sysfs reset mask in the `late_init` stage to ensure that the SMU initialization and capability setup are completed before checking the SDMA reset capability. - For IP versio

[PATCH] drm/amdgpu: update SDMA reset mask in late_init

2025-02-21 Thread jesse.zhang
From: "jesse.zh...@amd.com" - Added the `sdma_v4_4_2_update_reset_mask` function to update the reset mask. - Update the sysfs reset mask in the `late_init` stage to ensure that the SMU initialization and capability setup are completed before checking the SDMA reset capability. - For IP versio

[PATCH v3 1/2] drm/amd/pm: add support for checking SDMA reset capability

2025-02-20 Thread jesse.zhang
From: "jesse.zh...@amd.com" This patch introduces a new function to check if the SMU supports resetting the SDMA engine. This capability check ensures that the driver does not attempt to reset the SDMA engine on hardware that does not support it. The following changes are included: - New funct

[PATCH v3 2/2] drm/amdgpu: Initialize SDMA sysfs reset mask in late_init

2025-02-20 Thread jesse.zhang
From: "jesse.zh...@amd.com" - Introduce a new function `sdma_v4_4_2_init_sysfs_reset_mask` to initialize the sysfs reset mask for SDMA. - Move the initialization of the sysfs reset mask to the `late_init` stage to ensure that the SMU initialization and capability setup are completed befor

[PATCH V2 2/2] drm/amdgpu: Enable per-queue reset support

2025-02-20 Thread jesse.zhang
From: "jesse.zh...@amd.com" - Modify the `sdma_v4_4_2_sw_init` function to conditionally enable per-queue reset support. - For IP versions 9.4.3 and 9.4.4, enable per-queue reset if the MEC firmware version is at least 0xb0 and PMFW supports queue reset. - Add a TODO comment for future support
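
The gating condition described in this snippet can be sketched as a single predicate. This is only an illustration, not the patch's code: the struct `sdma_reset_caps`, its field names, and the helper `sdma_per_queue_reset_supported` are hypothetical stand-ins, and only the thresholds (IP versions 9.4.3 and 9.4.4, MEC firmware version at least 0xb0, PMFW queue reset support) come from the snippet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the inputs named in the snippet.
 * None of these field names are taken from the amdgpu driver. */
struct sdma_reset_caps {
    uint32_t ip_major;
    uint32_t ip_minor;
    uint32_t ip_rev;
    uint32_t mec_fw_version;
    bool pmfw_supports_queue_reset;
};

/* Returns true only when all three conditions from the snippet hold:
 * IP version 9.4.3 or 9.4.4, MEC firmware >= 0xb0, and PMFW support. */
static bool sdma_per_queue_reset_supported(const struct sdma_reset_caps *c)
{
    bool ip_match = c->ip_major == 9 && c->ip_minor == 4 &&
                    (c->ip_rev == 3 || c->ip_rev == 4);

    return ip_match &&
           c->mec_fw_version >= 0xb0 &&
           c->pmfw_supports_queue_reset;
}
```

Keeping the check in one predicate means a later IP version only needs a new case in `ip_match`, which fits the snippet's mention of a TODO for future support.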

[PATCH v2 1/2] drm/amd/pm: add support for checking SDMA reset capability

2025-02-20 Thread jesse.zhang
From: "jesse.zh...@amd.com" This patch introduces a new function to check if the SMU supports resetting the SDMA engine. This capability check ensures that the driver does not attempt to reset the SDMA engine on hardware that does not support it. The following changes are included: - New funct

[PATCH v3 2/2] drm/amdgpu: Optimize VM invalidation engine allocation and synchronize GPU TLB flush

2025-02-19 Thread jesse.zhang
From: "jesse.zh...@amd.com" - Modify the VM invalidation engine allocation logic to handle SDMA page rings. SDMA page rings now share the VM invalidation engine with SDMA gfx rings instead of allocating a separate engine. This change ensures efficient resource management and avoids the is

[PATCH v3 1/2] drm/amd/amdgpu: Increase max rings to enable SDMA page ring

2025-02-19 Thread jesse.zhang
From: "jesse.zh...@amd.com" Increase the maximum number of rings supported by the AMDGPU driver from 132 to 148. This change is necessary to enable support for the SDMA page ring. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 2 +- 1 file changed, 1 insertion(+), 1

[PATCH V2 2/2] drm/amdgpu: Optimize VM invalidation engine allocation and synchronize GPU TLB flush

2025-02-19 Thread jesse.zhang
From: "jesse.zh...@amd.com" - Modify the VM invalidation engine allocation logic to handle SDMA page rings. SDMA page rings now share the VM invalidation engine with SDMA gfx rings instead of allocating a separate engine. This change ensures efficient resource management and avoids the is

[PATCH V2 1/2] drm/amd/amdgpu: Increase max rings to enable SDMA page ring

2025-02-19 Thread jesse.zhang
From: "jesse.zh...@amd.com" Increase the maximum number of rings supported by the AMDGPU driver from 132 to 148. This change is necessary to enable support for the SDMA page ring. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 2 +- 1 file changed, 1 insertion(+), 1

[PATCH 1/2] drm/amd/amdgpu: Increase max rings to enable SDMA page ring

2025-02-18 Thread jesse.zhang
From: "jesse.zh...@amd.com" Increase the maximum number of rings supported by the AMDGPU driver from 132 to 148. This change is necessary to enable support for the SDMA page ring. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 2 +- 1 file changed, 1 insertion(+), 1

[PATCH 2/2] drm/amdgpu: Optimize VM invalidation engine allocation and synchronize GPU TLB flush

2025-02-18 Thread jesse.zhang
From: "jesse.zh...@amd.com" - Modify the VM invalidation engine allocation logic to handle SDMA page rings. SDMA page rings now share the VM invalidation engine with SDMA gfx rings instead of allocating a separate engine. This change ensures efficient resource management and avoids the is

[PATCH 2/2] drm/amdgpu: Enable per-queue reset support

2025-02-13 Thread jesse.zhang
From: "jesse.zh...@amd.com" This patch updates the SDMA v4.4.2 software initialization to enable per-queue reset support when the MEC firmware version is 0xb0 or higher and the PMFW supports SDMA reset. The following changes are included: - Added a condition to check if the MEC firmware version

[PATCH 1/2] drm/amd/pm: add support for checking SDMA reset capability

2025-02-13 Thread jesse.zhang
From: "jesse.zh...@amd.com" This patch introduces a new function to check if the SMU supports resetting the SDMA engine. This capability check ensures that the driver does not attempt to reset the SDMA engine on hardware that does not support it. The following changes are included: - New funct

[PATCH V7 8/9] drm/amdgpu: Add reset function pointer for SDMA v4.4.2 page ring

2025-02-12 Thread jesse.zhang
From: "jesse.zh...@amd.com" This patch adds a reset function pointer to the SDMA v4.4.2 page ring functionality. The new function pointer `reset` is set to `sdma_v4_4_2_reset_queue`, which is responsible for resetting the SDMA queue. Changes: - Add `reset` function pointer to `sdma_v4_4_2_page_r

[PATCH v7 9/9] drm/amdgpu: Update SDMA scheduler mask handling to include page queue

2025-02-12 Thread jesse.zhang
From: "jesse.zh...@amd.com" This patch updates the SDMA scheduler mask handling to include the page queue if it exists. The scheduler mask is calculated based on the number of SDMA instances and the presence of the page queue. The mask is updated to reflect the state of both the SDMA gfx ring and

[PATCH V7 7/9] drm/amdgpu: Improve SDMA reset logic with guilty queue tracking

2025-02-12 Thread jesse.zhang
From: "jesse.zh...@amd.com" This patch includes the remaining improvements to the SDMA reset logic: - Added `gfx_guilty` and `page_guilty` flags to track guilty queues. - Updated the reset and resume functions to handle the guilty state. - Cached the `rptr` before reset. v2: 1.replace the cal

[PATCH v7 6/9] drm/amdgpu/sdma: Introduce is_guilty callbacks for sdma GFX and PAGE rings

2025-02-12 Thread jesse.zhang
From: "jesse.zh...@amd.com" This patch introduces the `is_guilty` callbacks for the GFX and PAGE rings. These callbacks check if a ring is guilty of causing a timeout or error. Suggested-by: Alex Deucher Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 30

[PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty

2025-02-12 Thread jesse.zhang
From: "jesse.zh...@amd.com" This patch updates the `amdgpu_job_timedout` function to check if the ring is actually guilty of causing the timeout. If not, it skips error handling and fence completion. Suggested-by: Alex Deucher Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_j
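
The control flow this snippet describes (skip error handling when the ring is not guilty) might look roughly like the sketch below. All types and names here (`struct ring`, `struct ring_funcs`, `on_job_timeout`, `demo_is_guilty`) are hypothetical simplifications rather than the driver's definitions, and treating a ring without an `is_guilty` callback as guilty is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

struct ring;

/* Hypothetical, reduced version of a per-ring callback table. */
struct ring_funcs {
    bool (*is_guilty)(struct ring *ring);
};

struct ring {
    const struct ring_funcs *funcs;
    bool hung; /* stand-in for whatever state the real callback inspects */
};

enum timeout_action {
    TIMEOUT_SKIP,   /* ring is innocent: no error handling, no fence completion */
    TIMEOUT_HANDLE  /* ring is guilty (or cannot tell): run normal recovery */
};

/* Decide what a timeout handler should do for this ring. */
static enum timeout_action on_job_timeout(struct ring *ring)
{
    /* Assumption: rings without the callback keep the old behavior. */
    if (ring->funcs->is_guilty && !ring->funcs->is_guilty(ring))
        return TIMEOUT_SKIP;

    return TIMEOUT_HANDLE;
}

/* Example callback: reports the ring as guilty when it is marked hung. */
static bool demo_is_guilty(struct ring *ring)
{
    return ring->hung;
}
```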

[PATCH V7 3/9] drm/amdgpu: Add common lock and reset caller parameter for SDMA reset synchronization

2025-02-12 Thread jesse.zhang
From: "jesse.zh...@amd.com" This commit introduces a caller parameter to the amdgpu_sdma_reset_instance function to differentiate between reset requests originating from the KGD and KFD. This change ensures proper synchronization between KGD and KFD during SDMA resets. If the caller is KFD, th

[PATCH V7 4/9] drm/amdgpu: Introduce cached_rptr and is_guilty callback in amdgpu_ring

2025-02-12 Thread jesse.zhang
From: "jesse.zh...@amd.com" This patch introduces the following changes: - Add `cached_rptr` to the `amdgpu_ring` structure to store the read pointer before a reset. - Add `is_guilty` callback to the `amdgpu_ring_funcs` structure to check if a ring is guilty of causing a timeout. Suggested-by:

[PATCH v7 2/9] drm/amdgpu/sdma: Refactor SDMA reset functionality and add callback support

2025-02-12 Thread jesse.zhang
From: "jesse.zh...@amd.com" This patch refactors the SDMA reset functionality in the `sdma_v4_4_2` driver to improve modularity and support shared usage between AMDGPU and KFD. The changes include: 1. **Refactored SDMA Reset Logic**: - Split the `sdma_v4_4_2_reset_queue` function into two sep

[PATCH v7 1/9] drm/amdgpu/kfd: Add shared SDMA reset functionality with callback support

2025-02-12 Thread jesse.zhang
From: "jesse.zh...@amd.com" This patch introduces shared SDMA reset functionality between AMDGPU and KFD. The implementation includes the following key changes: 1. Added `amdgpu_sdma_reset_queue`: - Resets a specific SDMA queue by instance ID. - Invokes registered pre-reset and post-reset

[PATCH 4/4] drm/amdgpu: Improve SDMA reset logic with guilty queue tracking

2025-02-07 Thread jesse.zhang
From: "jesse.zh...@amd.com" This commit introduces several improvements to the SDMA reset logic: 1. Added `cached_rptr` to the `amdgpu_ring` structure to store the read pointer before a reset, ensuring proper state restoration after reset. 2. Introduced `gfx_guilty` and `page_guilty` flags i

[PATCH 3/4] drm/amdgpu: Add common lock and reset caller parameter for SDMA reset synchronization

2025-02-07 Thread jesse.zhang
From: "jesse.zh...@amd.com" This commit introduces a caller parameter to the amdgpu_sdma_reset_instance function to differentiate between reset requests originating from the KGD and KFD. This change ensures proper synchronization between KGD and KFD during SDMA resets. If the caller is KFD, th

[PATCH 2/4] drm/amdgpu/sdma: Refactor SDMA reset functionality and add callback support

2025-02-07 Thread jesse.zhang
From: "jesse.zh...@amd.com" This patch refactors the SDMA reset functionality in the `sdma_v4_4_2` driver to improve modularity and support shared usage between AMDGPU and KFD. The changes include: 1. **Refactored SDMA Reset Logic**: - Split the `sdma_v4_4_2_reset_queue` function into two sep

[PATCH 1/4] drm/amdgpu/kfd: Add shared SDMA reset functionality with callback support

2025-02-07 Thread jesse.zhang
From: "jesse.zh...@amd.com" This patch introduces shared SDMA reset functionality between AMDGPU and KFD. The implementation includes the following key changes: 1. Added `amdgpu_sdma_reset_queue`: - Resets a specific SDMA queue by instance ID. - Invokes registered pre-reset and post-reset

[PATCH 2/2] drm/amdgpu: add the ring id schedule module parameter for amdgpu

2024-10-10 Thread jesse.zhang
From: "jesse.zh...@amd.com" Added ring id schedule to switch scheduling policy when cs submits. Schedule the ring by setting the ring id. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 9 +++-- drivers/gpu/drm/amd/amd

[PATCH 1/2] drm/sched: adding a new scheduling policy

2024-10-10 Thread jesse.zhang
From: "jesse.zh...@amd.com" Added ring ID scheduling. In some cases, userspace needs to run a job on a specific ring instead of selecting the best ring based on the ring score. For example, the user wants to run a bad job on a specific ring to check whether the ring can recover from a queu

[PATCH] drm/amdkfd: Fix resource leak in kriu rsetore queue

2024-09-05 Thread jesse.zhang
From: "jesse.zh...@amd.com" To avoid memory leaks, release q_extra_data when exiting the restore queue. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager

[PATCH 4/4] drm/amdgpu: Using uninitialized value *size when calling amdgpu_vce_cs_reloc

2024-04-23 Thread jesse.zhang
From: Jesse Zhang Initialize the size before calling amdgpu_vce_cs_reloc, such as case 0x0301. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/d

[PATCH 3/4] drm/amdgpu: Using uninitialized value new_state.jpeg when calling adev->vcn.pause_dpg_mode

2024-04-23 Thread jesse.zhang
From: Jesse Zhang Initialize new_state.jpeg before it is used. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c index 677eb141554e..

[PATCH 2/4] Initialize the last_jump_jiffies in atom_exec_context before it used

2024-04-23 Thread jesse.zhang
From: Jesse Zhang The parameter "last_jump_jiffies" should be initialized before being used in the function atom_op_jump. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/atom.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/amdgpu/atom.c b/drivers/gpu/drm/a

[PATCH 1/4] drm/amdgpu: add check before free wb entry

2024-04-23 Thread jesse.zhang
From: Jesse Zhang Check that the ring is not a MES queue before freeing the wb entry. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 3 ++- drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 3 ++- drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 3 ++- 3 files changed, 6 insertions(+), 3 deletions(-)

[PATCH V2] drm/ttm: remove unused paramter

2024-03-31 Thread jesse.zhang
From: Jesse Zhang Remove the unused parameter in the functions ttm_bo_bounce_temp_buffer and ttm_bo_add_move_fence. V2: rebase the patch on top of drm-misc-next (Christian) Signed-off-by: Jesse Zhang Reviewed-by: Christian König --- drivers/gpu/drm/ttm/ttm_bo.c | 8 +++- 1 file changed,

[PATCH] drm/amdgpu : remove unused code

2024-03-04 Thread jesse.zhang
From: Jesse Zhang Remove the unused function - amdgpu_vm_pt_is_root_clean and remove the impossible condition v1: entries == 0 is not possible any more, so this condition could probably be removed (Felix) Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h| 2 - driv

[PATCH V2] drm/amdkfd: fix shift out of bounds about gpu debug

2024-03-03 Thread jesse.zhang
From: Jesse Zhang [ 3810.410040] UBSAN: shift-out-of-bounds in drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_int_process_v10.c:345:5 [ 3810.410044] shift exponent 4294967295 is too large for 64-bit type 'long long unsigned int' [ 3810.410047] CPU: 6 PID: 331 Comm: kworker/6:1H Not tainted 6.5.0+ #50
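
The shift exponent in this UBSAN report, 4294967295, is what -1 becomes when stored in a 32-bit unsigned value, and shifting a 64-bit operand by 64 or more bits is undefined behavior in C. The snippet is cut off before the actual fix, so the sketch below is not the patch's change; it is a generic guard, and the helper name `safe_bit64` is hypothetical.

```c
#include <stdint.h>

/* Return a mask with only bit `bit` set, or 0 when `bit` is outside the
 * range a 64-bit shift can represent (for example an invalid index that
 * arrived as (uint32_t)-1). */
static uint64_t safe_bit64(uint32_t bit)
{
    if (bit >= 64)
        return 0;

    return UINT64_C(1) << bit;
}
```

Returning 0 for an out-of-range index makes the invalid case a no-op when the result is OR-ed into a mask; a caller that needs to report the error would check the index before calling instead.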

[PATCH V2] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute" for Raven

2024-02-28 Thread jesse.zhang
From: "Jesse.Zhang" Fix the issue: "amdgpu: Failed to create process VM object". [Why] When amdgpu is initialized, seq64 does the mapping and updates the BO mapping in the VM page table. But when clifo runs, it also initializes a vm for a process device through the function kfd_process_devic