Re: [PATCH] drm/amdkfd: Differentiate logging message for driver oversubscription

2024-10-29 Thread Mukul Joshi
On 10/28/2024 5:40 PM, Xiaogang.Chen wrote: > From: Xiaogang Chen > > To allow user better understand the cause triggering runlist oversubscription. > No function change. > > Signed-off-by: Xiaogang Chen xiaogang.c...@amd.com > --- > .../gpu/drm/amd/amdkfd/kfd_packet_manager.c | 55 +++

[PATCHv3 2/2] drm/amdkfd: Fix CU occupancy for GFX 9.4.3

2024-09-23 Thread Mukul Joshi
Make CU occupancy calculations work on GFX 9.4.3 by updating the logic to handle multiple XCCs correctly. Signed-off-by: Mukul Joshi Reviewed-by: Harish Kasiviswanathan (v2) --- v1->v2: - Break into 2 patches, one for the generic change and the other for GFX v9.4.3. - Incorporate Haris

[PATCHv3 1/2] drm/amdkfd: Update logic for CU occupancy calculations

2024-09-23 Thread Mukul Joshi
ching doorbell offset of the queue with valid wave counts against the process's queues, Signed-off-by: Mukul Joshi Reviewed-by: Harish Kasiviswanathan --- v1->v2: - Break into 2 patches, one for the generic change and the other for GFX v9.4.3. - Incorporate Harish's comments. v2->v3: - U

[PATCHv2 1/2] drm/amdkfd: Update logic for CU occupancy calculations

2024-09-20 Thread Mukul Joshi
ching doorbell offset of the queue with valid wave counts against the process's queues, Signed-off-by: Mukul Joshi --- v1->v2: - Break into 2 patches, one for the generic change and the other for GFX v9.4.3. - Incorporate Harish's comments. .../gpu/drm/amd/amdgpu/amdgpu_am

[PATCH 2/2] drm/amdkfd: Fix CU occupancy for GFX 9.4.3

2024-09-20 Thread Mukul Joshi
Make CU occupancy calculations work on GFX 9.4.3 by updating the logic to handle multiple XCCs correctly. Signed-off-by: Mukul Joshi --- v1->v2: - Break into 2 patches, one for the generic change and the other for GFX v9.4.3. - Incorporate Harish's comments. drivers/gpu/drm/am

[PATCH] drm/amdkfd: Fix CU occupancy calculations for GFX 9.4.3

2024-09-19 Thread Mukul Joshi
with CP updating the VMID-PASID mapping. Signed-off-by: Mukul Joshi --- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 92 --- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.h | 5 +- .../drm/amd/amdkfd/kfd_device_queue_manager.c | 20 .../drm/amd/amdkfd/kfd_device_queue_mana

[PATCHv3 3/3] drm/amdkfd: Update BadOpcode Interrupt handling with MES

2024-08-16 Thread Mukul Joshi
thereby causing a GPU reset. Signed-off-by: Mukul Joshi Acked-by: Harish Kasiviswanathan Acked-by: Alex Deucher --- v1->v2: - No change. v2->v3: - No change. .../drm/amd/amdkfd/kfd_device_queue_manager.c | 51 +++ .../gpu/drm/amd/amdkfd/kfd_int_process_v11.c | 9 ++-- d

[PATCHv3 2/3] drm/amdkfd: Update queue unmap after VM fault with MES

2024-08-16 Thread Mukul Joshi
MEC FW expects MES to unmap all queues when a VM fault is observed on a queue and then resumed once the affected process is terminated. Use the MES Suspend and Resume APIs to achieve this. Signed-off-by: Mukul Joshi Acked-by: Alex Deucher --- v1->v2: - Add MES FW version check. - Separate

[PATCHv3 1/3] drm/amdgpu: Implement MES Suspend and Resume APIs for GFX11

2024-08-16 Thread Mukul Joshi
Add implementation for MES Suspend and Resume APIs to unmap/map all queues for GFX11. Support for GFX12 will be added when the corresponding firmware support is in place. Signed-off-by: Mukul Joshi Reviewed-by: Alex Deucher --- v1->v2: - Add MES FW version check. - Update amdgpu_mes_susp

[PATCHv2 3/3] drm/amdkfd: Update BadOpcode Interrupt handling with MES

2024-08-14 Thread Mukul Joshi
thereby causing a GPU reset. Signed-off-by: Mukul Joshi Acked-by: Harish Kasiviswanathan --- v1->v2: - No change. .../drm/amd/amdkfd/kfd_device_queue_manager.c | 51 +++ .../gpu/drm/amd/amdkfd/kfd_int_process_v11.c | 9 ++-- drivers/gpu/drm/amd/amdkfd/kfd_priv.h |

[PATCHv2 2/3] drm/amdkfd: Update queue unmap after VM fault with MES

2024-08-14 Thread Mukul Joshi
MEC FW expects MES to unmap all queues when a VM fault is observed on a queue and then resumed once the affected process is terminated. Use the MES Suspend and Resume APIs to achieve this. Signed-off-by: Mukul Joshi --- v1->v2: - Add MES FW version check. - Separate out the kfd_dqm_evict_pa

[PATCHv2 1/3] drm/amdgpu: Implement MES Suspend and Resume APIs for GFX11

2024-08-14 Thread Mukul Joshi
Add implementation for MES Suspend and Resume APIs to unmap/map all queues for GFX11. Support for GFX12 will be added when the corresponding firmware support is in place. Signed-off-by: Mukul Joshi --- v1->v2: - Add MES FW version check. - Update amdgpu_mes_suspend/amdgpu_mes_resume handl

[PATCH 3/3] drm/amdkfd: Update BadOpcode Interrupt handling with MES

2024-08-13 Thread Mukul Joshi
thereby causing a GPU reset. Signed-off-by: Mukul Joshi --- .../drm/amd/amdkfd/kfd_device_queue_manager.c | 51 +++ .../gpu/drm/amd/amdkfd/kfd_int_process_v11.c | 9 ++-- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 + 3 files changed, 58 insertions(+), 3 deletions(-) diff

[PATCH 2/3] drm/amdkfd: Update queue unmap after VM fault with MES

2024-08-13 Thread Mukul Joshi
MEC FW expects MES to unmap all queues when a VM fault is observed on a queue and then resumed once the affected process is terminated. Use the MES Suspend and Resume APIs to achieve this. Signed-off-by: Mukul Joshi --- .../drm/amd/amdkfd/kfd_device_queue_manager.c | 75 ++- 1

[PATCH 1/3] drm/amdgpu: Implement MES Suspend and Resume APIs for GFX11

2024-08-13 Thread Mukul Joshi
Add implementation for MES Suspend and Resume APIs to unmap/map all queues for GFX11. Support for GFX12 will be added when the corresponding firmware support is in place. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 32 -- 1 file changed, 30

[PATCH] Revert "drm/amdgpu: Add missing locking for MES API calls"

2024-06-14 Thread Mukul Joshi
] [ 61.604621] gfx_v11_0_hw_fini+0xda/0x100 [amdgpu] [ 61.604814] gfx_v11_0_suspend+0xe/0x20 [amdgpu] [ 61.605008] amdgpu_device_ip_suspend_phase2+0x135/0x1d0 [amdgpu] [ 61.605175] amdgpu_device_suspend+0xec/0x180 [amdgpu] Signed-off-by: Mukul Joshi Reviewed-by: Alex Deucher --- drivers/gpu

[PATCH] drm/ttm: Add cgroup memory accounting for GTT memory

2024-06-06 Thread Mukul Joshi
Make sure we do not overflow the memory limits set for a cgroup when doing GTT memory allocations. Suggested-by: Philip Yang Signed-off-by: Mukul Joshi --- drivers/gpu/drm/ttm/ttm_pool.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b

[PATCH] drm/amdgpu: Add missing locking for MES API calls

2024-06-06 Thread Mukul Joshi
Add missing locking at a few places when calling MES APIs to ensure exclusive access to MES queue. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c b/drivers/gpu/drm

[PATCH] drm/amdkfd: Fix CU Masking for GFX 9.4.3

2024-05-09 Thread Mukul Joshi
We are incorrectly passing the first XCC's MQD when updating CU masks for other XCCs in the partition. Fix this by passing the MQD for the XCC currently being updated with CU mask to update_cu_mask function. Fixes: fc6efed2c728 ("drm/amdkfd: Update CU masking for GFX 9.4.3") Signe
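The bug described above can be illustrated with a minimal user-space sketch. Everything here is hypothetical (the struct `mqd_sketch`, both function names, and the one-mask-per-XCC layout are assumptions for illustration, not the driver's real types or signatures). It only shows the shape of the fix: each iteration passes the MQD of the XCC being updated rather than the first XCC's MQD.

```c
#include <stdint.h>

/* Hypothetical stand-in for a per-XCC memory queue descriptor. */
struct mqd_sketch {
    uint32_t cu_mask;
};

/* Hypothetical helper: writes a CU mask into one MQD. */
void sketch_update_cu_mask(struct mqd_sketch *m, uint32_t mask)
{
    m->cu_mask = mask;
}

/* Update every XCC in the partition with its own mask. */
void sketch_update_all_xccs(struct mqd_sketch *mqds, const uint32_t *masks,
                            unsigned int num_xcc)
{
    for (unsigned int i = 0; i < num_xcc; i++) {
        /* Pass the MQD for XCC i; passing &mqds[0] here would be the bug. */
        sketch_update_cu_mask(&mqds[i], masks[i]);
    }
}
```

With the buggy form, only the first MQD would ever change and it would end up holding the last mask.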

[PATCH] drm/amdgpu: Fix VRAM memory accounting

2024-04-23 Thread Mukul Joshi
Subtract the VRAM pinned memory when checking for available memory in amdgpu_amdkfd_reserve_mem_limit function since that memory is not available for use. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff
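As a rough sketch of the accounting idea, assuming pinned memory is tracked separately from other reservations (the snippet does not say how the driver tracks it): the struct, its fields, and both function names below are hypothetical and are not taken from amdgpu_amdkfd_reserve_mem_limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-GPU bookkeeping; field names are illustrative only. */
struct vram_stats {
    uint64_t total;   /* total VRAM, in bytes */
    uint64_t used;    /* VRAM already reserved */
    uint64_t pinned;  /* VRAM pinned and therefore not available for use */
};

/* Bytes that can still be reserved; saturates at zero. */
uint64_t vram_available(const struct vram_stats *s)
{
    uint64_t unavailable = s->used + s->pinned;

    return unavailable >= s->total ? 0 : s->total - unavailable;
}

/* True when a request of `size` bytes fits in the remaining VRAM. */
bool vram_can_reserve(const struct vram_stats *s, uint64_t size)
{
    return size <= vram_available(s);
}
```

Leaving `pinned` out of the subtraction would overstate the available memory by exactly the pinned amount.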

[PATCH] drm/amdkfd: Add VRAM accounting for SVM migration

2024-04-19 Thread Mukul Joshi
Do VRAM accounting when doing migrations to vram to make sure there is enough available VRAM and migrating to VRAM doesn't evict other possible non-unified memory BOs. If migrating to VRAM fails, driver can fall back to using system memory seamlessly. Signed-off-by: Mukul Joshi --- driver

[PATCH] drm/amdgpu: Fix leak when GPU memory allocation fails

2024-04-18 Thread Mukul Joshi
Free the sync object if the memory allocation fails for any reason. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu
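The general pattern behind this kind of fix, sketched in user-space C. All names are hypothetical (`sync_obj`, `mem_obj`, `mem_obj_create`, the injectable allocator, and the `live_sync_objs` counter exist only for this illustration), and the real patch is a one-line change whose exact form is not visible in the snippet.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the driver's real types. */
struct sync_obj { int placeholder; };
struct mem_obj  { struct sync_obj *sync; void *backing; };

typedef void *(*alloc_fn)(size_t);

/* Number of sync objects currently allocated, so a leak is observable. */
int live_sync_objs = 0;

/* Allocator that always fails, used to exercise the error path. */
void *alloc_always_fails(size_t size) { (void)size; return NULL; }

/* Returns 0 on success, -1 on failure; nothing stays allocated on failure. */
int mem_obj_create(struct mem_obj **out, size_t size, alloc_fn alloc_backing)
{
    struct mem_obj *mem = calloc(1, sizeof(*mem));

    if (!mem)
        return -1;

    mem->sync = calloc(1, sizeof(*mem->sync));
    if (!mem->sync)
        goto err_free_mem;
    live_sync_objs++;

    mem->backing = alloc_backing(size);
    if (!mem->backing)
        goto err_free_sync;   /* without this label the sync object leaks */

    *out = mem;
    return 0;

err_free_sync:
    free(mem->sync);
    live_sync_objs--;
err_free_mem:
    free(mem);
    return -1;
}
```

Unwinding in reverse order of allocation keeps each error label responsible for exactly one resource.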

[PATCH] drm/amdkfd: Cleanup workqueue during module unload

2024-03-20 Thread Mukul Joshi
Destroy the high priority workqueue that handles interrupts during KFD node cleanup. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_interrupt.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_interrupt.c b/drivers/gpu/drm/amd/amdkfd

[PATCH] drm/amdkfd: Check cgroup when returning DMABuf info

2024-03-15 Thread Mukul Joshi
Check cgroup permissions when returning DMA-buf info and based on cgroup check return the id of the GPU that has access to the BO. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd

[PATCH 1/2] drm/amdkfd: Rename read_doorbell_id in MQD functions

2024-03-14 Thread Mukul Joshi
Rename read_doorbell_id function to a more meaningful name, implying what it is used for. No functional change. Suggested-by: Jay Cornwall Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 2 +- drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h | 2

[PATCH 2/2] drm/amdkfd: Check preemption status on all XCDs

2024-03-14 Thread Mukul Joshi
return a bool instead of uint32_t and pass the MQD manager as an argument. Suggested-by: Jay Cornwall Signed-off-by: Mukul Joshi --- .../drm/amd/amdkfd/kfd_device_queue_manager.c | 3 +-- drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.c | 18 + drivers/gpu/drm/amd/amdkfd

[PATCH] drm/amdgpu: Handle duplicate BOs during process restore

2024-03-08 Thread Mukul Joshi
In certain situations, some apps can import a BO multiple times (through IPC for example). To restore such processes successfully, we need to tell drm to ignore duplicate BOs. While at it, also add additional logging to prevent silent failures when process restore fails. Signed-off-by: Mukul

[PATCH] drm/amdkfd: Use correct drm device for cgroup permission check

2024-01-26 Thread Mukul Joshi
On GFX 9.4.3, for a given KFD node, fetch the correct drm device from XCP manager when checking for cgroup permissions. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd

[PATCH] drm/amdgpu: Fix module unload hang with RAS enabled

2024-01-23 Thread Mukul Joshi
mdgpu: Prepare for asynchronous processing of umc page retirement") Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index a3

[PATCH] drm/amdkfd: Use common function for IP version check

2023-11-22 Thread Mukul Joshi
KFD_GC_VERSION was recently updated to use a new function for IP version checks. As a result, use KFD_GC_VERSION as the common function for all IP version checks in KFD. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion

[PATCHv2 2/2] drm/amdkfd: Update cache info for GFX 9.4.3

2023-10-27 Thread Mukul Joshi
Update cache info reporting based on compute and memory partitioning modes. Signed-off-by: Mukul Joshi --- v1->v2: - Separate into a separate patch. - Simplify the if condition to reduce indentation and make it logically more clear. drivers/gpu/drm/amd/amdkfd/kfd_topology.c |

[PATCHv2 1/2] drm/amdkfd: Populate cache info for GFX 9.4.3

2023-10-27 Thread Mukul Joshi
GFX 9.4.3 uses a new version of the GC info table which contains the cache info. This patch adds a new function to populate the cache info from IP discovery for GFX 9.4.3. Signed-off-by: Mukul Joshi --- v1->v2: - Separate out the original patch into 2 patches. drivers/gpu/drm/amd/amd

[PATCH] drm/amdkfd: Update cache reporting for GFX 9.4.3

2023-10-26 Thread Mukul Joshi
GFX 9.4.3 uses a new version of the GC info table in IP discovery. This patch adds a new function to parse and fill the cache information based on the new table. Also, update cache reporting based on compute and memory partitioning modes. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd

[PATCHv2] drm/amdgpu: Fix typo in IP discovery parsing

2023-10-26 Thread Mukul Joshi
Fix a typo in parsing of the GC info table header when reading the IP discovery table. Fixes: ecb70926eb86 ("drm/amdgpu: add type conversion for gc info") Signed-off-by: Mukul Joshi --- v1->v2: - Add the Fixes tag. drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 2 +- 1 fi

[PATCH] drm/amdgpu: Fix typo in IP discovery parsing

2023-10-26 Thread Mukul Joshi
Fix a typo in parsing of the GC info table header when reading the IP discovery table. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c b/drivers/gpu

[PATCH 4/4] drm/amdgpu: Rename KGD_MAX_QUEUES to AMDGPU_MAX_QUEUES

2023-09-06 Thread Mukul Joshi
Rename KGD_MAX_QUEUES to AMDGPU_MAX_QUEUES to conform with the naming convention followed in amdgpu_gfx.h. No functional change. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 4 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 4 ++-- drivers

[PATCHv2 1/4] drm/amdgpu: Store CU info from all XCCs for GFX v9.4.3

2023-09-06 Thread Mukul Joshi
Currently, we store CU info only for a single XCC assuming that it is the same for all XCCs. However, that may not be true. As a result, store CU info for all XCCs. This info is later used for CU masking. Signed-off-by: Mukul Joshi --- v1->v2: - Incorporate Felix's review comments. dri

[PATCHv2 3/4] drm/amdkfd: Update CU masking for GFX 9.4.3

2023-09-06 Thread Mukul Joshi
The CU mask passed from user-space will change based on different spatial partitioning mode. As a result, update CU masking code for GFX9.4.3 to work for all partitioning modes. Signed-off-by: Mukul Joshi --- v1->v2: - Incorporate Felix's review comments. drivers/gpu/drm/am

[PATCHv2 2/4] drm/amdkfd: Update cache info reporting for GFX v9.4.3

2023-09-06 Thread Mukul Joshi
Update cache info reporting in sysfs to report the correct number of CUs and associated cache information based on different spatial partitioning modes. Signed-off-by: Mukul Joshi --- v1->v2: - Revert the change in kfd_crat.c - Add a comment to not change value of CRAT_SIBLINGMAP_SIZE. driv

[PATCHv3] drm/amdkfd: Fix unaligned 64-bit doorbell warning

2023-09-06 Thread Mukul Joshi
] amdgpu_pci_probe+0x197/0x400 [amdgpu] Fixes: cfeaeb3c0ce7 ("drm/amdgpu: use doorbell mgr for kfd kernel doorbells") Signed-off-by: Mukul Joshi --- v1->v2: - Update the logic to make it work with both 32 bit and 64 bit doorbells. - Add the Fixes tag v2->v3: - Revert to the original

[PATCHv2] drm/amdkfd: Fix unaligned 64-bit doorbell warning

2023-08-30 Thread Mukul Joshi
[amdgpu] [ +0.000545] amdgpu_pci_probe+0x197/0x400 [amdgpu] Fixes: cfeaeb3c0ce7 ("drm/amdgpu: use doorbell mgr for kfd kernel doorbells") Signed-off-by: Mukul Joshi --- v1->v2: - Update the logic to make it work with both 32 bit and 64 bit doorbells. - Add the Fixes tag. drivers/gpu/d

[PATCH] drm/amdkfd: Fix unaligned 64-bit doorbell warning

2023-08-29 Thread Mukul Joshi
[amdgpu] [ +0.000545] amdgpu_pci_probe+0x197/0x400 [amdgpu] Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c index

[PATCH] drm/amdkfd: Fix reg offset for setting CWSR grace period

2023-08-29 Thread Mukul Joshi
parameter. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c| 3 +-- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.h| 3 +-- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 6 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.h | 3

[PATCH] drm/amdkfd: Update CWSR grace period for GFX9.4.3

2023-07-10 Thread Mukul Joshi
For GFX9.4.3, set up a reduced default CWSR grace period equal to 1000 cycles instead of 64000 cycles. Signed-off-by: Mukul Joshi Reviewed-by: Felix Kuehling --- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 2 +- .../drm/amd/amdkfd/kfd_device_queue_manager.c | 22 ++- 2

[PATCHv2] drm/amdkfd: Use KIQ to unmap HIQ

2023-06-29 Thread Mukul Joshi
Currently, we unmap HIQ by directly writing to HQD registers. This doesn't work for GFX9.4.3. Instead, use KIQ to unmap HIQ, similar to how we use KIQ to map HIQ. Using KIQ to unmap HIQ works for all GFX series post GFXv9. Signed-off-by: Mukul Joshi --- v1->v2: - Use kiq_unmap_queues

[PATCH 2/2] drm/amdgpu: Correctly setup TMR region size for GFX9.4.3

2023-06-22 Thread Mukul Joshi
A faulty check was causing TMR region size to be set up incorrectly for GFX9.4.3. Remove the check and set up TMR region size as 280MB for GFX9.4.3. Fixes: b6780d70db5e ("drm/amdgpu: bypass bios dependent operations") Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/amdgpu

[PATCH 1/2] drm/amdkfd: Update interrupt handling for GFX 9.4.3

2023-06-22 Thread Mukul Joshi
process drain interrupt. Signed-off-by: Mukul Joshi Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 43 ++- .../gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 29 + drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 + drivers/gpu/drm/amd

[PATCHv4] drm/amdgpu: Update invalid PTE flag setting

2023-06-19 Thread Mukul Joshi
invalid PTE settings, one for TF enabled, the other for TF disabled. The setting with TF disabled, doesn't work with TF enabled. Signed-off-by: Mukul Joshi --- v1->v2: - Update handling according to Christian's feedback. v2->v3: - Remove ASIC specific callback (Felix). v3->v4:

[PATCHv2] drm/amdkfd: Enable GWS on GFX9.4.3

2023-06-16 Thread Mukul Joshi
Enable GWS capable queue creation for forward progress guarantee on GFX 9.4.3. Signed-off-by: Mukul Joshi --- v1->v2: - Update the condition for setting pqn->q->gws for GFX 9.4.3. drivers/gpu/drm/amd/amdkfd/kfd_device.c | 1 + .../amd/amdkfd/kfd_process_queue_manager.

[PATCH] drm/amdkfd: Use KIQ to unmap HIQ

2023-06-16 Thread Mukul Joshi
Currently, we unmap HIQ by directly writing to HQD registers. This doesn't work for GFX9.4.3. Instead, use KIQ to unmap HIQ, similar to how we use KIQ to map HIQ. Using KIQ to unmap HIQ works for all GFX series post GFXv9. Signed-off-by: Mukul Joshi --- .../drm/amd/amdgpu/amdgpu_amdkfd_gc_

[PATCH] drm/amdkfd: Enable GWS on GFX9.4.3

2023-06-16 Thread Mukul Joshi
Enable GWS capable queue creation for forward progress guarantee on GFX 9.4.3. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 1 + .../amd/amdkfd/kfd_process_queue_manager.c| 31 --- 2 files changed, 20 insertions(+), 12 deletions(-) diff

[PATCHv3] drm/amdgpu: Update invalid PTE flag setting

2023-06-13 Thread Mukul Joshi
invalid PTE settings, one for TF enabled, the other for TF disabled. The setting with TF disabled, doesn't work with TF enabled. Signed-off-by: Mukul Joshi --- v1->v2: - Update handling according to Christian's feedback. v2->v3: - Remove ASIC specific callback (Felix). drivers/gp

[PATCH] drm/amdkfd: Remove DUMMY_VRAM_SIZE

2023-06-12 Thread Mukul Joshi
Remove DUMMY_VRAM_SIZE as it is not needed and can result in reporting incorrect memory size. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 5 - 1 file changed, 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c b/drivers/gpu/drm/amd/amdkfd

[PATCHv2] drm/amdgpu: Update invalid PTE flag setting

2023-06-12 Thread Mukul Joshi
invalid PTE settings, one for TF enabled, the other for TF disabled. The setting with TF disabled, doesn't work with TF enabled. Signed-off-by: Mukul Joshi --- v1->v2: - Update handling according to Christian's feedback. drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 7 +++ driver

[PATCH] drm/amdkfd: Fix reserved SDMA queues handling

2023-06-07 Thread Mukul Joshi
9.4.3") Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 ++--- .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 10 +- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 +- 3 files changed, 12 insertions(+), 13 deletions(-) di

[PATCH] drm/amdgpu: Raname DRM schedulers in amdgpu TTM

2023-06-07 Thread Mukul Joshi
Rename mman.entity to mman.high_pr to make the distinction clearer that this is a high priority scheduler. Similarly, rename the recently added mman.delayed to mman.low_pr to make it clear it is a low priority scheduler. No functional change in this patch. Signed-off-by: Mukul Joshi --- drivers

[PATCH] drm/amdkfd: Set event interrupt class for GFX 9.4.3

2023-05-23 Thread Mukul Joshi
Fix the warning during driver load because the event interrupt class is not set for GFX9.4.3. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd

[PATCH] drm/amdgpu: Add a low priority scheduler for VRAM clearing

2023-05-17 Thread Mukul Joshi
Add a low priority DRM scheduler for VRAM clearing instead of using the existing high priority scheduler. Use the high priority scheduler for migrations and evictions. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 4 +-- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c

[PATCHv2 2/3] drm/amdgpu: Set GTT size equal to TTM mem limit

2023-04-26 Thread Mukul Joshi
Use the helper function in TTM to get TTM mem limit and set GTT size to be equal to TTM mem limit. Signed-off-by: Mukul Joshi Reviewed-by: Christian König --- v1->v2: - Remove AMDGPU_DEFAULT_GTT_SIZE_MB as well as it is unused. drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 - drivers/gpu/

[PATCH 3/3] drm/amdkfd: Update KFD TTM mem limit

2023-04-25 Thread Mukul Joshi
Use the helper function in TTM to get TTM memory limit and set KFD's internal mem limit. This ensures that KFD's TTM mem limit and actual TTM mem limit are exactly the same. Signed-off-by: Mukul Joshi Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 3 ++

[PATCH 2/3] drm/amdgpu: Set GTT size equal to TTM mem limit

2023-04-25 Thread Mukul Joshi
Use the helper function in TTM to get TTM mem limit and set GTT size to be equal to TTM mem limit. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 25 ++--- 1 file changed, 6 insertions(+), 19 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu

[PATCH 1/3] drm/ttm: Helper function to get TTM mem limit

2023-04-25 Thread Mukul Joshi
Add a helper function to get TTM memory limit. This is needed by KFD to set its own internal memory limits. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/ttm/ttm_tt.c | 6 ++ include/drm/ttm/ttm_tt.h | 2 +- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm

[PATCH] drm/amdgpu: Update invalid PTE flag setting

2023-04-04 Thread Mukul Joshi
Update the invalid PTE flag setting to ensure, in addition to transitioning the retry fault to a no-retry fault, it also causes the wavefront to enter the trap handler. With the current setting, it only transitions to a no-retry fault. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu

[PATCHv2] drm/amdgpu: Enable IH retry CAM on GFX9

2023-01-19 Thread Mukul Joshi
This patch enables the IH retry CAM on GFX9 series cards. This retry filter is used to prevent sending lots of retry interrupts in a short span of time and overflowing the IH ring buffer. This will also help reduce CPU interrupt workload. Signed-off-by: Mukul Joshi --- v1: - Reviewed by Felix

[PATCH] drm/amdkfd: Fix kernel warning during topology setup

2022-12-20 Thread Mukul Joshi
tirqs last disabled at (59649): [] irq_exit_rcu+0xd7/0x130 [ +0.004203] ---[ end trace ]--- Fixes: 0f28cca87e9a ("drm/amdkfd: Extend KFD device topology to surface peer-to-peer links") Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 2 +- 1 file

[PATCH 2/2] drm/amdgpu: Rework retry fault removal

2022-12-12 Thread Mukul Joshi
sw filter. This helps in avoiding stale faults being added back into the filter and preventing legitimate faults from being handled. Suggested-by: Felix Kuehling Signed-off-by: Mukul Joshi Reviewed-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 36

[PATCH 1/2] drm/amdgpu: Enable IH retry CAM on GFX9

2022-12-12 Thread Mukul Joshi
This patch enables the IH retry CAM on GFX9 series cards. This retry filter is used to prevent sending lots of retry interrupts in a short span of time and overflowing the IH ring buffer. This will also help reduce CPU interrupt workload. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd

[PATCH] drm/amdgpu: Update PTE flags with TF enabled

2022-09-13 Thread Mukul Joshi
to translate a retry fault into a no-retry fault, doesn't work with TF enabled. As a result, update invalid PTE flags settings which works for both TF enabled and disabled case. Fixes: 2abf2573b1c69 ("drm/amdgpu: Enable translate_further to extend UTCL2 reach") Signed-off-

[PATCH] drm/amdgpu: Fix page table setup on Arcturus

2022-08-22 Thread Mukul Joshi
When translate_further is enabled, page table depth needs to be updated. This was missing on Arcturus MMHUB init. This was causing address translations to fail for SDMA user-mode queues. Fixes: 2abf2573b1c69 ("drm/amdgpu: Enable translate_further to extend UTCL2 reach") Signed-off-by: M

[PATCHv2] drm/amdgpu: Fix interrupt handling on ih_soft ring

2022-08-15 Thread Mukul Joshi
There are no backing hardware registers for ih_soft ring. As a result, don't try to access hardware registers for read and write pointers when processing interrupts on the IH soft ring. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/navi10_ih.c | 7 ++- drivers/gpu/drm/amd/a
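A minimal sketch of the idea, assuming a ring descriptor that records whether it is software-only. The struct, its fields, and `ih_sketch_get_wptr` are hypothetical, and the real ring structure and register accessors in the driver differ. The point is only that the software ring takes its write pointer from memory and never dereferences a register.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interrupt ring descriptor; not the driver's real layout. */
struct ih_ring_sketch {
    bool is_soft;                          /* true: no hardware registers back this ring */
    uint32_t wptr;                         /* software copy of the write pointer */
    const volatile uint32_t *hw_wptr_reg;  /* NULL for the software ring */
};

/* Return the current write pointer without touching hardware for soft rings. */
uint32_t ih_sketch_get_wptr(const struct ih_ring_sketch *ih)
{
    if (ih->is_soft)
        return ih->wptr;       /* no register exists, use the in-memory value */

    return *ih->hw_wptr_reg;   /* hardware-backed ring: read the register */
}
```

The same early return would apply on the read-pointer update path.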

[PATCH] drm/amdgpu: Fix interrupt handling on ih_soft ring

2022-08-12 Thread Mukul Joshi
There are no backing hardware registers for ih_soft ring. As a result, don't try to access hardware registers for read and write pointers when processing interrupts on the IH soft ring. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/vega20_ih.c | 7 ++- 1 file chang

[PATCH 2/2] drm/amdkfd: Fix circular lock dependency warning

2022-04-22 Thread Mukul Joshi
by ensuring pm.mutex is not held while holding the topology lock. For this, kfd_local_mem_info is moved into the KFD dev struct and filled during device init. This cached value can then be used instead of querying the value again and again. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/k

[PATCH 1/2] drm/amdkfd: Fix updating IO links during device removal

2022-04-22 Thread Mukul Joshi
: 9be62cbcc62f ("drm/amdkfd: Cleanup IO links during KFD device removal") Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/

[PATCHv2] drm/amdkfd: Cleanup IO links during KFD device removal

2022-04-11 Thread Mukul Joshi
generation_count to let user-mode know that topology has changed due to device removal. CC: Shuotao Xu Signed-off-by: Mukul Joshi Reviewed-by: Shuotao Xu --- v1->v2: - Remove comments from inside kfd_topology_update_io_links() and add them as kernel-doc comments. drivers/gpu/drm/amd/amd

[PATCH] drm/amdkfd: Cleanup IO links during KFD device removal

2022-04-07 Thread Mukul Joshi
generation_count to let user-mode know that topology has changed due to device removal. CC: Shuotao Xu Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 4 +- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 + drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 79

[PATCHv2 3/3] drm/amdkfd: Consolidate MQD manager functions

2022-02-07 Thread Mukul Joshi
A few MQD manager functions are duplicated for all versions of MQD manager. Remove this duplication by moving the common functions into kfd_mqd_manager.c file. Signed-off-by: Mukul Joshi --- v1->v2: - Add "kfd_" prefix to functions moved to kfd_mqd_manager.c. - Also, suffix "

[PATCHv2 2/3] drm/amdkfd: Remove unused old debugger implementation

2022-02-07 Thread Mukul Joshi
Cleanup the kfd code by removing the unused old debugger implementation. Only a small piece of resetting wavefronts is kept and is moved to kfd_device_queue_manager.c Signed-off-by: Mukul Joshi --- v1->v2: - Rename AMDKFD_IOC_DBG_* to AMDKFD_IOC_DBG_*_DEPRECATED. - Cleanup address_watch_disa

[PATCHv2 1/3] drm/amdkfd: Fix TLB flushing in KFD SVM with no HWS

2022-02-07 Thread Mukul Joshi
With no HWS, TLB flushing will not work in SVM code. Fix this by calling kfd_flush_tlb() which works for both HWS and no HWS case. Signed-off-by: Mukul Joshi Reviewed-by: Philip Yang --- v1->v2: - Don't pass adev to svm_range_map_to_gpu(). drivers/gpu/drm/amd/amdkfd/kfd_sv

[PATCH 2/3] drm/amdkfd: Remove unused old debugger implementation

2022-02-04 Thread Mukul Joshi
Cleanup the kfd code by removing the unused old debugger implementation. Only a small piece of resetting wavefronts is kept and is moved to kfd_device_queue_manager.c Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/Makefile | 2 - drivers/gpu/drm/amd/amdkfd/kfd_chardev.c

[PATCH 3/3] drm/amdkfd: Consolidate MQD manager functions

2022-02-04 Thread Mukul Joshi
A few MQD manager functions are duplicated for all versions of MQD manager. Remove this duplication by moving the common functions into kfd_mqd_manager.c file. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.c | 63 + drivers/gpu/drm/amd/amdkfd

[PATCH 1/3] drm/amdkfd: Fix TLB flushing in KFD SVM with no HWS

2022-02-04 Thread Mukul Joshi
With no HWS, TLB flushing will not work in SVM code. Fix this by calling kfd_flush_tlb() which works for both HWS and no HWS case. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 16 ++-- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/drivers

[PATCH 2/2] drm/amdgpu: Fix RAS page retirement with mode2 reset on Aldebaran

2021-10-11 Thread Mukul Joshi
occurred on a GPU that supports MCE notifier based page retirement. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 24 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu

[PATCH 1/2] drm/amdgpu: Enable RAS error injection after mode2 reset on Aldebaran

2021-10-11 Thread Mukul Joshi
Add the missing call to re-enable RAS error injections on the Aldebaran mode2 reset code path. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/aldebaran.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c b/drivers/gpu/drm/amd/amdgpu

[PATCHv4 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS

2021-09-23 Thread Mukul Joshi
On Aldebaran, GPU driver will handle bad page retirement for GPU memory even though UMC is host managed. As a result, register a bad page retirement handler on the mce notifier chain to retire bad pages on Aldebaran. Signed-off-by: Mukul Joshi --- v1->v2: - Use smca_get_bank_type() to determ

[PATCHv3 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS

2021-09-22 Thread Mukul Joshi
On Aldebaran, GPU driver will handle bad page retirement even though UMC is host managed. As a result, register a bad page retirement handler on the mce notifier chain to retire bad pages on Aldebaran. Signed-off-by: Mukul Joshi --- v1->v2: - Use smca_get_bank_type() to determine MCA b

[PATCHv2 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS

2021-09-12 Thread Mukul Joshi
def CONFIG_X86_MCE_AMD. - Use MCE_PRIORITY_UC instead of MCE_PRIO_ACCEL as we are only handling uncorrectable errors. - Use macros to determine UMC instance and channel instance where the uncorrectable error occurred. - Update the headline. Signed-off-by: Mukul Joshi Link: https://lore.kernel.

[PATCHv2 1/2] x86/MCE/AMD: Export smca_get_bank_type symbol

2021-09-12 Thread Mukul Joshi
-by: Mukul Joshi --- arch/x86/include/asm/mce.h| 2 +- arch/x86/kernel/cpu/mce/amd.c | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index fc3d36f1f9d0..d90d3ccb583a 100644 --- a/arch/x86/include/asm/mce.h +++ b/a

[PATCH] drm/amdkfd: CWSR with sw scheduler on Aldebaran and Arcturus

2021-08-20 Thread Mukul Joshi
Program trap handler settings to enable CWSR with software scheduler on Aldebaran and Arcturus. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c | 3 ++- drivers/gpu/drm/amd/amdgpu

[PATCH] drm/amdkfd: CWSR with software scheduler

2021-08-09 Thread Mukul Joshi
This patch adds support to program trap handler settings when loading driver with software scheduler (sched_policy=2). Signed-off-by: Mukul Joshi Suggested-by: Jay Cornwall --- .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c| 31 + .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10_3.c

[PATCH] drm/amdgpu: Fix channel_index table layout for Aldebaran

2021-07-29 Thread Mukul Joshi
Fix the channel_index table layout to fetch the correct channel_index when calculating physical address from normalized address during page retirement. Also, fix the number of UMC instances and number of channels within each UMC instance for Aldebaran. Signed-off-by: Mukul Joshi --- drivers/gpu

[PATCH] drm/amdgpu: Conditionally reset SDMA RAS error counts

2021-06-29 Thread Mukul Joshi
Reset SDMA RAS error counts during init only if persistent EDC harvesting is not supported. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm

[PATCH] drm/amdgpu: Correctly clear GCEA error status

2021-05-25 Thread Mukul Joshi
While clearing GCEA error status, do not clear the bits set by RAS TA. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c b/drivers/gpu/drm/amd/amdgpu

[PATCH] drm/amdgpu: Query correct register for DF hashing on Aldebaran

2021-05-18 Thread Mukul Joshi
For Aldebaran, driver needs to query DramMegaBaseAddress to check if DF hashing is enabled. Signed-off-by: Mukul Joshi Acked-by: Alex Deucher Reviewed-by: Harish Kasiviswanathan --- drivers/gpu/drm/amd/amdgpu/df_v3_6.c| 9 + drivers/gpu/drm/amd/include/asic_reg/df

[PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-11 Thread Mukul Joshi
On Aldebaran, GPU driver will handle bad page retirement even though UMC is host managed. As a result, register a bad page retirement handler on the mce notifier chain to retire bad pages on Aldebaran. Signed-off-by: Mukul Joshi Reviewed-by: John Clements Acked-by: Felix Kuehling --- drivers

[PATCH] drm/amdgpu: Enable TCP channel hashing for Aldebaran

2021-05-06 Thread Mukul Joshi
Enable TCP channel hashing to match DF hash settings for Aldebaran. Signed-off-by: Mukul Joshi Signed-off-by: Oak Zeng Reviewed-by: Joseph Greathouse --- drivers/gpu/drm/amd/amdgpu/df_v3_6.c| 17 +++-- drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 3 ++- .../amd

[PATCH v2] drm/amdgpu: Enable SDMA utilization for Arcturus

2020-09-11 Thread Mukul Joshi
SDMA utilization calculations are enabled/disabled by writing to SDMAx_PUB_DUMMY_REG2 register. Currently, enable this only for Arcturus. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu

[PATCH] drm/amdgpu: Enable SDMA utilization for Arcturus

2020-09-11 Thread Mukul Joshi
SDMA utilization calculations are enabled/disabled by writing to SDMAx_PUB_DUMMY_REG2 register. Currently, enable this only for Arcturus. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/drivers/gpu/drm/amd

[PATCH] drm/amdkfd: Move process doorbell allocation into kfd device

2020-09-01 Thread Mukul Joshi
manage. In a system with mix of such devices, KFD would need to request process doorbell space based on the type of device, either from amdgpu or from its own doorbell space. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 30 +-- drivers/gpu/drm/amd/amdkfd

[PATCH] include/uapi/linux: Fix indentation in kfd_smi_event enum

2020-08-28 Thread Mukul Joshi
Replace spaces with tabs to fix indentation in kfd_smi_event enum. Signed-off-by: Mukul Joshi --- include/uapi/linux/kfd_ioctl.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h index 8b7368bfbd84

[PATCH v3] drm/amdkfd: Add GPU reset SMI event

2020-08-28 Thread Mukul Joshi
Add support for reporting GPU reset events through SMI. KFD would report both pre and post GPU reset events. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 5 +++ drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 ++ drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 35
