On 10/28/2024 5:40 PM, Xiaogang.Chen wrote:
> From: Xiaogang Chen
>
> Help users better understand what triggers runlist oversubscription.
> No functional change.
>
> Signed-off-by: Xiaogang Chen xiaogang.c...@amd.com
> ---
> .../gpu/drm/amd/amdkfd/kfd_packet_manager.c | 55 +++
Make CU occupancy calculations work on GFX 9.4.3 by
updating the logic to handle multiple XCCs correctly.
Signed-off-by: Mukul Joshi
Reviewed-by: Harish Kasiviswanathan (v2)
---
v1->v2:
- Break into 2 patches, one for the generic change
and the other for GFX v9.4.3.
- Incorporate Haris
ching doorbell offset of the queue
with valid wave counts against the process's queues,
Signed-off-by: Mukul Joshi
Reviewed-by: Harish Kasiviswanathan
---
v1->v2:
- Break into 2 patches, one for the generic change
and the other for GFX v9.4.3.
- Incorporate Harish's comments.
v2->v3:
- U
ching doorbell offset of the queue
with valid wave counts against the process's queues,
Signed-off-by: Mukul Joshi
---
v1->v2:
- Break into 2 patches, one for the generic change
and the other for GFX v9.4.3.
- Incorporate Harish's comments.
.../gpu/drm/amd/amdgpu/amdgpu_am
Make CU occupancy calculations work on GFX 9.4.3 by
updating the logic to handle multiple XCCs correctly.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Break into 2 patches, one for the generic change
and the other for GFX v9.4.3.
- Incorporate Harish's comments.
drivers/gpu/drm/am
with CP updating the VMID-PASID mapping.
Signed-off-by: Mukul Joshi
---
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 92 ---
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.h | 5 +-
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 20
.../drm/amd/amdkfd/kfd_device_queue_mana
thereby causing a GPU reset.
Signed-off-by: Mukul Joshi
Acked-by: Harish Kasiviswanathan
Acked-by: Alex Deucher
---
v1->v2:
- No change.
v2->v3:
- No change.
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 51 +++
.../gpu/drm/amd/amdkfd/kfd_int_process_v11.c | 9 ++--
d
MEC FW expects MES to unmap all queues when a VM fault is observed
on a queue and then resumed once the affected process is terminated.
Use the MES Suspend and Resume APIs to achieve this.
Signed-off-by: Mukul Joshi
Acked-by: Alex Deucher
---
v1->v2:
- Add MES FW version check.
- Separate
Add implementation for MES Suspend and Resume APIs to unmap/map
all queues for GFX11. Support for GFX12 will be added when the
corresponding firmware support is in place.
Signed-off-by: Mukul Joshi
Reviewed-by: Alex Deucher
---
v1->v2:
- Add MES FW version check.
- Update amdgpu_mes_susp
thereby causing a GPU reset.
Signed-off-by: Mukul Joshi
Acked-by: Harish Kasiviswanathan
---
v1->v2:
- No change.
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 51 +++
.../gpu/drm/amd/amdkfd/kfd_int_process_v11.c | 9 ++--
drivers/gpu/drm/amd/amdkfd/kfd_priv.h |
MEC FW expects MES to unmap all queues when a VM fault is observed
on a queue and then resumed once the affected process is terminated.
Use the MES Suspend and Resume APIs to achieve this.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Add MES FW version check.
- Separate out the kfd_dqm_evict_pa
Add implementation for MES Suspend and Resume APIs to unmap/map
all queues for GFX11. Support for GFX12 will be added when the
corresponding firmware support is in place.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Add MES FW version check.
- Update amdgpu_mes_suspend/amdgpu_mes_resume handl
thereby causing a GPU reset.
Signed-off-by: Mukul Joshi
---
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 51 +++
.../gpu/drm/amd/amdkfd/kfd_int_process_v11.c | 9 ++--
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 +
3 files changed, 58 insertions(+), 3 deletions(-)
diff
MEC FW expects MES to unmap all queues when a VM fault is observed
on a queue and then resumed once the affected process is terminated.
Use the MES Suspend and Resume APIs to achieve this.
Signed-off-by: Mukul Joshi
---
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 75 ++-
1
Add implementation for MES Suspend and Resume APIs to unmap/map
all queues for GFX11. Support for GFX12 will be added when the
corresponding firmware support is in place.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 32 --
1 file changed, 30
]
[ 61.604621] gfx_v11_0_hw_fini+0xda/0x100 [amdgpu]
[ 61.604814] gfx_v11_0_suspend+0xe/0x20 [amdgpu]
[ 61.605008] amdgpu_device_ip_suspend_phase2+0x135/0x1d0 [amdgpu]
[ 61.605175] amdgpu_device_suspend+0xec/0x180 [amdgpu]
Signed-off-by: Mukul Joshi
Reviewed-by: Alex Deucher
---
drivers/gpu
Make sure we do not overflow the memory limits set for a cgroup when doing
GTT memory allocations.
Suggested-by: Philip Yang
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/ttm/ttm_pool.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b
Add missing locking at a few places when calling MES APIs to ensure
exclusive access to MES queue.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 12
1 file changed, 12 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
b/drivers/gpu/drm
We are incorrectly passing the first XCC's MQD when
updating CU masks for other XCCs in the partition. Fix
this by passing the MQD for the XCC currently being
updated with CU mask to update_cu_mask function.
Fixes: fc6efed2c728 ("drm/amdkfd: Update CU masking for GFX 9.4.3")
Signe
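The per-XCC fix described in the CU mask patch above can be illustrated with a minimal sketch. The names mqd_sketch, update_cu_mask_sketch and update_partition_cu_masks are hypothetical stand-ins, not the driver's own structures or functions; the only point shown is that each iteration passes the MQD of the XCC being updated rather than the first XCC's MQD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified MQD: one instance per XCC in the partition. */
struct mqd_sketch {
    uint32_t cu_mask;
};

/* Stand-in for the real update_cu_mask(); it only records the mask. */
static void update_cu_mask_sketch(struct mqd_sketch *mqd, uint32_t mask)
{
    assert(mqd != NULL);
    mqd->cu_mask = mask;
}

/*
 * Program the CU mask of every XCC. The bug class described above
 * corresponds to passing &mqds[0] on every iteration; the fix is to
 * pass the MQD that belongs to the XCC currently being updated.
 */
static void update_partition_cu_masks(struct mqd_sketch *mqds,
                                      const uint32_t *masks,
                                      size_t num_xcc)
{
    for (size_t xcc = 0; xcc < num_xcc; xcc++)
        update_cu_mask_sketch(&mqds[xcc], masks[xcc]);
}
```

With the buggy form, only the first MQD would ever change and it would end up holding the last XCC's mask.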
Subtract the VRAM pinned memory when checking for available memory
in amdgpu_amdkfd_reserve_mem_limit function since that memory is not
available for use.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff
Do VRAM accounting when doing migrations to vram to make sure
there is enough available VRAM and migrating to VRAM doesn't evict
other possible non-unified memory BOs. If migrating to VRAM fails,
driver can fall back to using system memory seamlessly.
Signed-off-by: Mukul Joshi
---
driver
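The accounting-with-fallback behaviour described in the migration patch above can be sketched as follows. The vram_budget type and the function names are hypothetical; the sketch only assumes that a migration reserves VRAM up front and, if the reservation fails, the data stays in system memory instead of evicting other buffers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Where a range ends up after a migration attempt. */
enum placement_sketch { PLACE_SYSTEM = 0, PLACE_VRAM = 1 };

/* Hypothetical VRAM budget, in bytes. Invariant: used <= limit. */
struct vram_budget {
    uint64_t limit;
    uint64_t used;
};

/* Account for `size` bytes; fail without side effects if it does not fit. */
static bool vram_budget_reserve(struct vram_budget *b, uint64_t size)
{
    assert(b->used <= b->limit);
    if (size > b->limit - b->used)
        return false;
    b->used += size;
    return true;
}

/* Migrate to VRAM only if the budget allows it; otherwise fall back. */
static enum placement_sketch place_migration(struct vram_budget *b,
                                             uint64_t size)
{
    return vram_budget_reserve(b, size) ? PLACE_VRAM : PLACE_SYSTEM;
}
```

A failed reservation leaves the budget untouched, so later, smaller migrations can still succeed.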
Free the sync object if the memory allocation fails for any
reason.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
b/drivers/gpu/drm/amd/amdgpu
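The leak fixed by the sync-object patch above follows a common error-path pattern, sketched below in userspace C. Everything here (sync_sketch, mem_sketch, the live_sync_objects counter, the simulate_failure flag) is hypothetical and exists only to make the cleanup observable; it is not the code in amdgpu_amdkfd_gpuvm.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct sync_sketch { int fence_count; };

struct mem_sketch {
    struct sync_sketch *sync;
    void *backing;
};

/* Counts sync objects that are currently allocated (leak detector). */
static int live_sync_objects;

static struct sync_sketch *sync_create(void)
{
    struct sync_sketch *s = calloc(1, sizeof(*s));
    if (s)
        live_sync_objects++;
    return s;
}

static void sync_free(struct sync_sketch *s)
{
    if (!s)
        return;
    live_sync_objects--;
    free(s);
}

/* Allocate a memory object; unwind everything if a later step fails. */
static struct mem_sketch *mem_alloc_sketch(size_t size, bool simulate_failure)
{
    assert(size > 0);

    struct mem_sketch *mem = calloc(1, sizeof(*mem));
    if (!mem)
        return NULL;

    mem->sync = sync_create();
    if (!mem->sync)
        goto err_free_mem;

    mem->backing = simulate_failure ? NULL : malloc(size);
    if (!mem->backing)
        goto err_free_sync;   /* the step that was missing: free the sync */

    return mem;

err_free_sync:
    sync_free(mem->sync);
err_free_mem:
    free(mem);
    return NULL;
}

static void mem_free_sketch(struct mem_sketch *mem)
{
    if (!mem)
        return;
    free(mem->backing);
    sync_free(mem->sync);
    free(mem);
}
```

The labels unwind in reverse order of acquisition, so each failure point releases exactly what was set up before it.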
Destroy the high priority workqueue that handles interrupts
during KFD node cleanup.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_interrupt.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_interrupt.c
b/drivers/gpu/drm/amd/amdkfd
Check cgroup permissions when returning DMA-buf info and
based on cgroup check return the id of the GPU that has
access to the BO.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd
Rename read_doorbell_id function to a more meaningful name,
implying what it is used for. No functional change.
Suggested-by: Jay Cornwall
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h | 2
return a bool instead of uint32_t and pass the MQD manager
as an argument.
Suggested-by: Jay Cornwall
Signed-off-by: Mukul Joshi
---
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 3 +--
drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.c | 18 +
drivers/gpu/drm/amd/amdkfd
In certain situations, some apps can import a BO multiple times
(through IPC for example). To restore such processes successfully,
we need to tell drm to ignore duplicate BOs.
While at it, also add additional logging to prevent silent failures
when process restore fails.
Signed-off-by: Mukul
On GFX 9.4.3, for a given KFD node, fetch the correct drm device from
XCP manager when checking for cgroup permissions.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 9 +++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd
mdgpu: Prepare for asynchronous processing of umc
page retirement")
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4
1 file changed, 4 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index a3
KFD_GC_VERSION was recently updated to use a new function
for IP version checks. As a result, use KFD_GC_VERSION as
the common function for all IP version checks in KFD.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion
Update cache info reporting based on compute and
memory partitioning modes.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Separate into a separate patch.
- Simplify the if condition to reduce indentation and make it
logically more clear.
drivers/gpu/drm/amd/amdkfd/kfd_topology.c |
GFX 9.4.3 uses a new version of the GC info table which
contains the cache info. This patch adds a new function
to populate the cache info from IP discovery for GFX 9.4.3.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Separate out the original patch into 2 patches.
drivers/gpu/drm/amd/amd
GFX 9.4.3 uses a new version of the GC info table in IP
discovery. This patch adds a new function to parse and
fill the cache information based on the new table. Also,
update cache reporting based on compute and memory
partitioning modes.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd
Fix a typo in parsing of the GC info table header when
reading the IP discovery table.
Fixes: ecb70926eb86 ("drm/amdgpu: add type conversion for gc info")
Signed-off-by: Mukul Joshi
---
v1->v2:
- Add the Fixes tag.
drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 2 +-
1 fi
Fix a typo in parsing of the GC info table header when
reading the IP discovery table.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
b/drivers/gpu
Rename KGD_MAX_QUEUES to AMDGPU_MAX_QUEUES to conform with
the naming convention followed in amdgpu_gfx.h. No functional
change.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 4 ++--
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 4 ++--
drivers
Currently, we store CU info only for a single XCC assuming
that it is the same for all XCCs. However, that may not be
true. As a result, store CU info for all XCCs. This info is
later used for CU masking.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Incorporate Felix's review comments.
dri
The CU mask passed from user-space will change based on
different spatial partitioning mode. As a result, update
CU masking code for GFX9.4.3 to work for all partitioning
modes.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Incorporate Felix's review comments.
drivers/gpu/drm/am
Update cache info reporting in sysfs to report the correct
number of CUs and associated cache information based on
different spatial partitioning modes.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Revert the change in kfd_crat.c
- Add a comment to not change value of CRAT_SIBLINGMAP_SIZE.
driv
] amdgpu_pci_probe+0x197/0x400 [amdgpu]
Fixes: cfeaeb3c0ce7 ("drm/amdgpu: use doorbell mgr for kfd kernel doorbells")
Signed-off-by: Mukul Joshi
---
v1->v2:
- Update the logic to make it work with both 32-bit and
64-bit doorbells.
- Add the Fixes tag
v2->v3:
- Revert to the original
[amdgpu]
[ +0.000545] amdgpu_pci_probe+0x197/0x400 [amdgpu]
Fixes: cfeaeb3c0ce7 ("drm/amdgpu: use doorbell mgr for kfd kernel doorbells")
Signed-off-by: Mukul Joshi
---
v1->v2:
- Update the logic to make it work with both 32-bit and
64-bit doorbells.
- Add the Fixes tag.
drivers/gpu/d
[amdgpu]
[ +0.000545] amdgpu_pci_probe+0x197/0x400 [amdgpu]
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
index
parameter.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c| 3 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.h| 3 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 6 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.h | 3
For GFX9.4.3, setup a reduced default CWSR grace period equal to
1000 cycles instead of 64000 cycles.
Signed-off-by: Mukul Joshi
Reviewed-by: Felix Kuehling
---
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 2 +-
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 22 ++-
2
Currently, we unmap HIQ by directly writing to HQD
registers. This doesn't work for GFX9.4.3. Instead,
use KIQ to unmap HIQ, similar to how we use KIQ to
map HIQ. Using KIQ to unmap HIQ works for all GFX
series post GFXv9.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Use kiq_unmap_queues
A faulty check was causing TMR region size to be setup incorrectly
for GFX9.4.3. Remove the check and setup TMR region size as 280MB
for GFX9.4.3.
Fixes: b6780d70db5e ("drm/amdgpu: bypass bios dependent operations")
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/amdgpu
process drain interrupt.
Signed-off-by: Mukul Joshi
Reviewed-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 43 ++-
.../gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 29 +
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 +
drivers/gpu/drm/amd
invalid PTE settings, one for
TF enabled, the other for TF disabled. The setting for
TF disabled doesn't work with TF enabled.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Update handling according to Christian's feedback.
v2->v3:
- Remove ASIC specific callback (Felix).
v3->v4:
Enable GWS capable queue creation for forward
progress guarantee on GFX 9.4.3.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Update the condition for setting pqn->q->gws
for GFX 9.4.3.
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 1 +
.../amd/amdkfd/kfd_process_queue_manager.
Currently, we unmap HIQ by directly writing to HQD
registers. This doesn't work for GFX9.4.3. Instead,
use KIQ to unmap HIQ, similar to how we use KIQ to
map HIQ. Using KIQ to unmap HIQ works for all GFX
series post GFXv9.
Signed-off-by: Mukul Joshi
---
.../drm/amd/amdgpu/amdgpu_amdkfd_gc_
Enable GWS capable queue creation for forward
progress guarantee on GFX 9.4.3.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 1 +
.../amd/amdkfd/kfd_process_queue_manager.c| 31 ---
2 files changed, 20 insertions(+), 12 deletions(-)
diff
invalid PTE settings, one for
TF enabled, the other for TF disabled. The setting for
TF disabled doesn't work with TF enabled.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Update handling according to Christian's feedback.
v2->v3:
- Remove ASIC specific callback (Felix).
drivers/gp
Remove DUMMY_VRAM_SIZE as it is not needed and can result
in reporting incorrect memory size.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 5 -
1 file changed, 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
b/drivers/gpu/drm/amd/amdkfd
invalid PTE settings, one for
TF enabled, the other for TF disabled. The setting for
TF disabled doesn't work with TF enabled.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Update handling according to Christian's feedback.
drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 7 +++
driver
9.4.3")
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 ++---
.../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 10 +-
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 +-
3 files changed, 12 insertions(+), 13 deletions(-)
di
Rename mman.entity to mman.high_pr to make the distinction
clearer that this is a high priority scheduler. Similarly,
rename the recently added mman.delayed to mman.low_pr to
make it clear it is a low priority scheduler.
No functional change in this patch.
Signed-off-by: Mukul Joshi
---
drivers
Fix the warning during driver load because the event
interrupt class is not set for GFX9.4.3.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
b/drivers/gpu/drm/amd/amdkfd
Add a low priority DRM scheduler for VRAM clearing instead of using
the existing high priority scheduler. Use the high priority scheduler
for migrations and evictions.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 4 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
Use the helper function in TTM to get TTM mem limit and
set GTT size to be equal to TTM mem limit.
Signed-off-by: Mukul Joshi
Reviewed-by: Christian König
---
v1->v2:
- Remove AMDGPU_DEFAULT_GTT_SIZE_MB as well as it is
unused.
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 -
drivers/gpu/
Use the helper function in TTM to get TTM memory
limit and set KFD's internal mem limit. This ensures
that KFD's TTM mem limit and actual TTM mem limit are
exactly the same.
Signed-off-by: Mukul Joshi
Reviewed-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 3 ++
Use the helper function in TTM to get TTM mem limit and
set GTT size to be equal to TTM mem limit.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 25 ++---
1 file changed, 6 insertions(+), 19 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu
Add a helper function to get TTM memory limit. This is
needed by KFD to set its own internal memory limits.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/ttm/ttm_tt.c | 6 ++
include/drm/ttm/ttm_tt.h | 2 +-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm
Update the invalid PTE flag setting to ensure, in addition
to transitioning the retry fault to a no-retry fault, it
also causes the wavefront to enter the trap handler. With the
current setting, it only transitions to a no-retry fault.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu
This patch enables the IH retry CAM on GFX9 series cards. This
retry filter is used to prevent sending lots of retry interrupts
in a short span of time and overflowing the IH ring buffer. This
will also help reduce CPU interrupt workload.
Signed-off-by: Mukul Joshi
---
v1:
- Reviewed by Felix
tirqs last disabled at (59649): []
irq_exit_rcu+0xd7/0x130
[ +0.004203] ---[ end trace ]---
Fixes: 0f28cca87e9a ("drm/amdkfd: Extend KFD device topology to surface
peer-to-peer links")
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 2 +-
1 file
sw filter.
This helps in avoiding stale faults being added back into the
filter and preventing legitimate faults from being handled.
Suggested-by: Felix Kuehling
Signed-off-by: Mukul Joshi
Reviewed-by: Philip Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 36
This patch enables the IH retry CAM on GFX9 series cards. This
retry filter is used to prevent sending lots of retry interrupts
in a short span of time and overflowing the IH ring buffer. This
will also help reduce CPU interrupt workload.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd
to translate a retry fault
into a no-retry fault, doesn't work with TF enabled. As a result,
update the invalid PTE flag settings so that they work for both the
TF enabled and TF disabled cases.
Fixes: 2abf2573b1c69 ("drm/amdgpu: Enable translate_further to extend UTCL2
reach")
Signed-off-
When translate_further is enabled, page table depth needs to
be updated. This was missing on Arcturus MMHUB init. This was
causing address translations to fail for SDMA user-mode queues.
Fixes: 2abf2573b1c69 ("drm/amdgpu: Enable translate_further to extend UTCL2
reach")
Signed-off-by: M
There are no backing hardware registers for ih_soft ring.
As a result, don't try to access hardware registers for read
and write pointers when processing interrupts on the IH soft
ring.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/navi10_ih.c | 7 ++-
drivers/gpu/drm/amd/a
There are no backing hardware registers for ih_soft ring.
As a result, don't try to access hardware registers for read
and write pointers when processing interrupts on the IH soft
ring.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/vega20_ih.c | 7 ++-
1 file chang
by ensuring pm.mutex is not
held while holding the topology lock. For this, kfd_local_mem_info
is moved into the KFD dev struct and filled during device init.
This cached value can then be used instead of querying the value
again and again.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/k
: 9be62cbcc62f ("drm/amdkfd: Cleanup IO links during KFD device removal")
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 10 +-
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
b/drivers/gpu/drm/
generation_count to let user-mode know that topology has
changed due to device removal.
CC: Shuotao Xu
Signed-off-by: Mukul Joshi
Reviewed-by: Shuotao Xu
---
v1->v2:
- Remove comments from inside kfd_topology_update_io_links()
and add them as kernel-doc comments.
drivers/gpu/drm/amd/amd
generation_count to let user-mode know that topology has
changed due to device removal.
CC: Shuotao Xu
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 4 +-
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 +
drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 79
A few MQD manager functions are duplicated for all versions of
MQD manager. Remove this duplication by moving the common
functions into kfd_mqd_manager.c file.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Add "kfd_" prefix to functions moved to kfd_mqd_manager.c.
- Also, suffix "
Cleanup the kfd code by removing the unused old debugger
implementation.
Only a small piece of resetting wavefronts is kept and
is moved to kfd_device_queue_manager.c
Signed-off-by: Mukul Joshi
---
v1->v2:
- Rename AMDKFD_IOC_DBG_* to AMDKFD_IOC_DBG_*_DEPRECATED.
- Cleanup address_watch_disa
With no HWS, TLB flushing will not work in SVM code.
Fix this by calling kfd_flush_tlb(), which works in both
the HWS and no-HWS cases.
Signed-off-by: Mukul Joshi
Reviewed-by: Philip Yang
---
v1->v2:
- Don't pass adev to svm_range_map_to_gpu().
drivers/gpu/drm/amd/amdkfd/kfd_sv
Cleanup the kfd code by removing the unused old debugger
implementation.
Only a small piece of resetting wavefronts is kept and
is moved to kfd_device_queue_manager.c
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/Makefile | 2 -
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
A few MQD manager functions are duplicated for all versions of
MQD manager. Remove this duplication by moving the common
functions into kfd_mqd_manager.c file.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.c | 63 +
drivers/gpu/drm/amd/amdkfd
With no HWS, TLB flushing will not work in SVM code.
Fix this by calling kfd_flush_tlb(), which works in both
the HWS and no-HWS cases.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 16 ++--
1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/drivers
occurred on a GPU that supports MCE notifier based page retirement.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 24
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu
Add the missing call to re-enable RAS error injections on the Aldebaran
mode2 reset code path.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/aldebaran.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
b/drivers/gpu/drm/amd/amdgpu
On Aldebaran, GPU driver will handle bad page retirement
for GPU memory even though UMC is host managed. As a result,
register a bad page retirement handler on the mce notifier
chain to retire bad pages on Aldebaran.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Use smca_get_bank_type() to determ
On Aldebaran, GPU driver will handle bad page retirement
even though UMC is host managed. As a result, register a
bad page retirement handler on the mce notifier chain to
retire bad pages on Aldebaran.
Signed-off-by: Mukul Joshi
---
v1->v2:
- Use smca_get_bank_type() to determine MCA b
def CONFIG_X86_MCE_AMD.
- Use MCE_PRIORITY_UC instead of MCE_PRIO_ACCEL as we are
only handling uncorrectable errors.
- Use macros to determine UMC instance and channel instance
where the uncorrectable error occurred.
- Update the headline.
Signed-off-by: Mukul Joshi
Link: https://lore.kernel.
-by: Mukul Joshi
---
arch/x86/include/asm/mce.h| 2 +-
arch/x86/kernel/cpu/mce/amd.c | 3 ++-
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index fc3d36f1f9d0..d90d3ccb583a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/a
Program trap handler settings to enable CWSR with software scheduler
on Aldebaran and Arcturus.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c | 3 ++-
drivers/gpu/drm/amd/amdgpu
This patch adds support to program trap handler settings
when loading driver with software scheduler (sched_policy=2).
Signed-off-by: Mukul Joshi
Suggested-by: Jay Cornwall
---
.../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c| 31 +
.../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10_3.c
Fix the channel_index table layout to fetch the correct
channel_index when calculating physical address from
normalized address during page retirement.
Also, fix the number of UMC instances and number of channels
within each UMC instance for Aldebaran.
Signed-off-by: Mukul Joshi
---
drivers/gpu
Reset SDMA RAS error counts during init only if persistent
EDC harvesting is not supported.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 7 +--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
b/drivers/gpu/drm
While clearing GCEA error status, do not clear the bits
set by RAS TA.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c | 10 +++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
b/drivers/gpu/drm/amd/amdgpu
For Aldebaran, driver needs to query DramMegaBaseAddress to
check if DF hashing is enabled.
Signed-off-by: Mukul Joshi
Acked-by: Alex Deucher
Reviewed-by: Harish Kasiviswanathan
---
drivers/gpu/drm/amd/amdgpu/df_v3_6.c| 9 +
drivers/gpu/drm/amd/include/asic_reg/df
On Aldebaran, GPU driver will handle bad page retirement
even though UMC is host managed. As a result, register a
bad page retirement handler on the mce notifier chain to
retire bad pages on Aldebaran.
Signed-off-by: Mukul Joshi
Reviewed-by: John Clements
Acked-by: Felix Kuehling
---
drivers
Enable TCP channel hashing to match DF hash settings for Aldebaran.
Signed-off-by: Mukul Joshi
Signed-off-by: Oak Zeng
Reviewed-by: Joseph Greathouse
---
drivers/gpu/drm/amd/amdgpu/df_v3_6.c| 17 +++--
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 3 ++-
.../amd
SDMA utilization calculations are enabled/disabled by
writing to SDMAx_PUB_DUMMY_REG2 register. Currently,
enable this only for Arcturus.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 9 +
1 file changed, 9 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu
SDMA utilization calculations are enabled/disabled by
writing to SDMAx_PUB_DUMMY_REG2 register. Currently,
enable this only for Arcturus.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 10 ++
1 file changed, 10 insertions(+)
diff --git a/drivers/gpu/drm/amd
manage. In a system with mix of such devices, KFD would need
to request process doorbell space based on the type of device,
either from amdgpu or from its own doorbell space.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 30 +--
drivers/gpu/drm/amd/amdkfd
Replace spaces with Tabs to fix indentation in kfd_smi_event
enum.
Signed-off-by: Mukul Joshi
---
include/uapi/linux/kfd_ioctl.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index 8b7368bfbd84
Add support for reporting GPU reset events through SMI. KFD
would report both pre and post GPU reset events.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 5 +++
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 ++
drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 35