from:"Belanger, David"

RE: [PATCH] drm/amdgpu: Enable SA software trap.

2022-09-22 Thread Belanger, David

[AMD Official Use Only - General]



> -Original Message-
> From: Kuehling, Felix 
> Sent: Thursday, September 22, 2022 1:14 PM
> To: Belanger, David ; amd-
> g...@lists.freedesktop.org
> Cc: Cornwall, Jay 
> Subject: Re: [PATCH] drm/amdgpu: Enable SA software trap.
> 
> Am 2022-09-22 um 12:17 schrieb David Belanger:
> > Enables support for software trap for MES >= 4.
> > Adapted from implementation from Jay Cornwall.
> >
> > v2: Add IP version check in conditions.
> >
> > Signed-off-by: Jay Cornwall 
> > Signed-off-by: David Belanger 
> > Reviewed-by: Felix Kuehling 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/mes_v11_0.c|   6 +-
> >   .../gpu/drm/amd/amdkfd/cwsr_trap_handler.h| 771 +-
> >   .../amd/amdkfd/cwsr_trap_handler_gfx10.asm|  21 +
> >   .../gpu/drm/amd/amdkfd/kfd_int_process_v11.c  |  26 +-
> >   4 files changed, 437 insertions(+), 387 deletions(-)
> [snip]
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v11.c
> > b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v11.c
> > index a6fcbeeb7428..4e03d19e9333 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v11.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v11.c
> > @@ -358,13 +358,35 @@ static void event_interrupt_wq_v11(struct
> kfd_dev *dev,
> > break;
> > case SQ_INTERRUPT_WORD_ENCODING_ERROR:
> > print_sq_intr_info_error(context_id0,
> context_id1);
> > +   sq_int_priv = REG_GET_FIELD(context_id0,
> > +
>   SQ_INTERRUPT_WORD_WAVE_CTXID0, PRIV);
> > sq_int_errtype =
> REG_GET_FIELD(context_id0,
> >
>   SQ_INTERRUPT_WORD_ERROR_CTXID0, TYPE);
> > -   if (sq_int_errtype !=
> SQ_INTERRUPT_ERROR_TYPE_ILLEGAL_INST &&
> > -   sq_int_errtype !=
> SQ_INTERRUPT_ERROR_TYPE_MEMVIOL) {
> > +
> > +   switch (sq_int_errtype) {
> > +   case SQ_INTERRUPT_ERROR_TYPE_EDC_FUE:
> > +   case SQ_INTERRUPT_ERROR_TYPE_EDC_FED:
> >
>   event_interrupt_poison_consumption_v11(
> > dev, pasid,
> source_id);
> > return;
> > +   case
> SQ_INTERRUPT_ERROR_TYPE_ILLEGAL_INST:
> > +   /*if (!(((adev->mes.sched_version &
> AMDGPU_MES_VERSION_MASK) >= 4) &&
> > + (adev-
> >ip_versions[GC_HWIP][0] >= IP_VERSION(11, 0, 0)) &&
> > + (adev-
> >ip_versions[GC_HWIP][0] <= IP_VERSION(11, 0, 3)))
> > +   && sq_int_priv)
> > +
>   kfd_set_dbg_ev_from_interrupt(dev, pasid, -1,
> > +
>   KFD_EC_MASK(EC_QUEUE_WAVE_ILLEGAL_INSTRUCTION),
> > +   NULL, 0);*/
> > +   return;
> > +   case
> SQ_INTERRUPT_ERROR_TYPE_MEMVIOL:
> > +   /*if (!(((adev->mes.sched_version &
> AMDGPU_MES_VERSION_MASK) >= 4) &&
> > + (adev-
> >ip_versions[GC_HWIP][0] >= IP_VERSION(11, 0, 0)) &&
> > + (adev-
> >ip_versions[GC_HWIP][0] <= IP_VERSION(11, 0, 3)))
> > +   && sq_int_priv)
> > +
>   kfd_set_dbg_ev_from_interrupt(dev, pasid, -1,
> > +
>   KFD_EC_MASK(EC_QUEUE_WAVE_MEMORY_VIOLATION),
> > +   NULL, 0);*/
> 
> Which branch is this for? kfd_set_dbg_ev_from_interrupt shouldn't exist on
> the upstream branch yet. That code is still under review for upstream.
> 

My understanding is that it is for branch amd-staging-drm-next to make its way 
upstream.
The code that calls that function is commented out.  There are other 
pre-existing instances in that file in amd-staging-drm-next branch that are 
commented out also with that function.
Please advise if I should remove it from the patch for now or keep it as 
commented out.

Thanks,
David B.

> Regards,
>    Felix
> 
> 
> > +   return;
> > }
> > break;
> > default:

RE: [PATCH] drm/amdgpu: Enable SA software trap.

2022-09-22 Thread Belanger, David

[Public]



> -Original Message-
> From: Sider, Graham 
> Sent: Thursday, September 22, 2022 1:56 PM
> To: Belanger, David ; amd-
> g...@lists.freedesktop.org
> Cc: Cornwall, Jay ; Kuehling, Felix
> ; Belanger, David 
> Subject: RE: [PATCH] drm/amdgpu: Enable SA software trap.
> 
> [Public]
> 
> > -Original Message-
> > From: amd-gfx  On Behalf Of
> > David Belanger
> > Sent: Thursday, September 22, 2022 12:17 PM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Cornwall, Jay ; Kuehling, Felix
> > ; Belanger, David 
> > Subject: [PATCH] drm/amdgpu: Enable SA software trap.
> >
> > Caution: This message originated from an External Source. Use proper
> > caution when opening attachments, clicking links, or responding.
> >
> >
> > Enables support for software trap for MES >= 4.
> > Adapted from implementation from Jay Cornwall.
> >
> > v2: Add IP version check in conditions.
> >
> > Signed-off-by: Jay Cornwall 
> > Signed-off-by: David Belanger 
> > Reviewed-by: Felix Kuehling 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c|   6 +-
> >  .../gpu/drm/amd/amdkfd/cwsr_trap_handler.h| 771 +-
> >  .../amd/amdkfd/cwsr_trap_handler_gfx10.asm|  21 +
> >  .../gpu/drm/amd/amdkfd/kfd_int_process_v11.c  |  26 +-
> >  4 files changed, 437 insertions(+), 387 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> > b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> > index b64cd46a159a..cbc506b958b1 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> > @@ -185,7 +185,11 @@ static int mes_v11_0_add_hw_queue(struct
> > amdgpu_mes *mes,
> > mes_add_queue_pkt.trap_handler_addr = input->tba_addr;
> > mes_add_queue_pkt.tma_addr = input->tma_addr;
> > mes_add_queue_pkt.is_kfd_process = input->is_kfd_process;
> > -   mes_add_queue_pkt.trap_en = 1;
> > +
> > +   if (!(((adev->mes.sched_version & AMDGPU_MES_VERSION_MASK) >=
> > 4) &&
> > + (adev->ip_versions[GC_HWIP][0] >= IP_VERSION(11, 0, 0)) &&
> > + (adev->ip_versions[GC_HWIP][0] <= IP_VERSION(11, 0, 3
> > +   mes_add_queue_pkt.trap_en = 1;
> 
> I think the value for trap_en here is backwards. It should be set to 0 for 
> this
> condition and default to 1 otherwise.
> 
> Best,
> Graham

Note that the condition is reversed with the "!" operator.

David B.

> 
> >
> > return mes_v11_0_submit_pkt_and_poll_completion(mes,
> > &mes_add_queue_pkt, sizeof(mes_add_queue_pkt),
> > diff --git a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
> > b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
> > index 60a81649cf12..c7118843db05 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
> > +++ b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
> > @@ -742,7 +742,7 @@ static const uint32_t cwsr_trap_nv1x_hex[] = {
> > 0xbf88fffe, 0x877aff7f,
> > 0x0400, 0x8f7a857a,
> > 0x886d7a6d, 0xb97b02dc,
> > -   0x8f7b997b, 0xb97a2a05,
> > +   0x8f7b997b, 0xb97a3a05,
> > 0x807a817a, 0xbf0d997b,
> > 0xbf850002, 0x8f7a897a,
> > 0xbf820001, 0x8f7a8a7a,
> > @@ -819,7 +819,7 @@ static const uint32_t cwsr_trap_nv1x_hex[] = {
> > 0xbefe037c, 0xbefc0370,
> > 0xf4611c7a, 0xf800,
> > 0x80708470, 0xbefc037e,
> > -   0xb9702a05, 0x80708170,
> > +   0xb9703a05, 0x80708170,
> > 0xbf0d9973, 0xbf850002,
> > 0x8f708970, 0xbf820001,
> > 0x8f708a70, 0xb97a1e06,
> > @@ -1069,7 +1069,7 @@ static const uint32_t cwsr_trap_nv1x_hex[] = {
> > 0xb9f9f816, 0x876f7bff,
> > 0xf800, 0x906f8b6f,
> > 0xb9efa2c3, 0xb9f3f801,
> > -   0xb96e2a05, 0x806e816e,
> > +   0xb96e3a05, 0x806e816e,
> > 0xbf0d9972, 0xbf850002,
> > 0x8f6e896e, 0xbf820001,
> > 0x8f6e8a6e, 0xb96f1e06,
> > @@ -2114,7 +2114,7 @@ static const uint32_t cwsr_trap_gfx10_hex[] = {
> > 0x007a, 0x7e000280,
> > 0xbefe037a, 0xbeff037b,
> > 0xb97b02dc, 0x8f7b997b,
> > -   0xb97a2a05, 0x807a817a,
> > +   0xb97a3a05, 0x807a817a,
> > 0xbf0d997b, 0xbf850002,
> > 0x8f7a897a, 0xbf820001,
> > 0x8f7a8a7a, 0xb97b1e06,
> > @@ -2157,7 +2157,7 @@ static const uint32_t c

RE: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module exit.

2023-03-07 Thread Belanger, David

[AMD Official Use Only - General]


The test case is a python program that will load the driver, do some 
operations, then unload the driver.
When the driver exists, there is still the python process space around holding 
on the address space.
When the python process space exits, the mmu_notifier gets called but the 
driver has already been unloaded.

The goal of the fix is to address case where there could be outstanding address 
space / worker threads for process
cleanup that needs to be cleared/completed at exit time.

Regards,
David B.

> -Original Message-
> From: Koenig, Christian 
> Sent: Tuesday, March 7, 2023 2:05 AM
> To: Belanger, David ; amd-
> g...@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module
> exit.
> 
> Am 06.03.23 um 22:58 schrieb David Belanger:
> > Handle case when module is unloaded (kfd_exit) before a process space
> > (mm_struct) is released.
> 
> Well that should never ever happen in the first place. It sounds like we are
> missing grabbing module references.
> 
> Regards,
> Christian.
> 
> >
> > Signed-off-by: David Belanger 
> > ---
> >   drivers/gpu/drm/amd/amdkfd/kfd_module.c  |  4 ++
> >   drivers/gpu/drm/amd/amdkfd/kfd_process.c | 57
> 
> >   2 files changed, 61 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> > b/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> > index 09b966dc3768..8ef4bd9e4f7d 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> > @@ -26,6 +26,9 @@
> >   #include "kfd_priv.h"
> >   #include "amdgpu_amdkfd.h"
> >
> > +void kfd_cleanup_processes(void);
> > +
> > +
> >   static int kfd_init(void)
> >   {
> > int err;
> > @@ -77,6 +80,7 @@ static int kfd_init(void)
> >
> >   static void kfd_exit(void)
> >   {
> > +   kfd_cleanup_processes();
> > kfd_debugfs_fini();
> > kfd_process_destroy_wq();
> > kfd_procfs_shutdown();
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > index ebabe92f7edb..b5b28a32639d 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > @@ -1181,6 +1181,17 @@ static void kfd_process_notifier_release(struct
> mmu_notifier *mn,
> > return;
> >
> > mutex_lock(&kfd_processes_mutex);
> > +   /*
> > +* Do early return if p is not in the table.
> > +*
> > +* This could potentially happen if this function is called concurrently
> > +* by mmu_notifier and by kfd_cleanup_pocesses.
> > +*
> > +*/
> > +   if (!hash_hashed(&p->kfd_processes)) {
> > +   mutex_unlock(&kfd_processes_mutex);
> > +   return;
> > +   }
> > hash_del_rcu(&p->kfd_processes);
> > mutex_unlock(&kfd_processes_mutex);
> > synchronize_srcu(&kfd_processes_srcu);
> > @@ -1200,6 +1211,52 @@ static const struct mmu_notifier_ops
> kfd_process_mmu_notifier_ops = {
> > .free_notifier = kfd_process_free_notifier,
> >   };
> >
> > +
> > +void kfd_cleanup_processes(void)
> > +{
> > +   struct kfd_process *p;
> > +   unsigned int temp;
> > +
> > +   /*
> > +* Iterate over remaining processes in table, calling notifier release
> > +* to free up notifier and process resources.
> > +*
> > +* This code handles the case when driver is unloaded before all
> mm_struct
> > +* are released.
> > +*/
> > +   int idx = srcu_read_lock(&kfd_processes_srcu);
> > +
> > +   hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
> > +   if (p) {
> > +   /*
> > +* Obtain a reference on p to avoid a late
> mmu_notifier release
> > +* call triggering freeing the process.
> > +*/
> > +
> > +   kref_get(&p->ref);
> > +
> > +   srcu_read_unlock(&kfd_processes_srcu, idx);
> > +
> > +   kfd_process_notifier_release(&p->mmu_notifier, p-
> >mm);
> > +
> > +   kfd_unref_process(p);
> > +
> > +   idx = srcu_read_lock(&kfd_processes_srcu);
> > +   }
> > +   }
> > +   srcu_read_unlock(&kfd_processes_srcu, idx);
> > +
> > +   /*
> > +* Must be called after all mmu_notifier_put are done and before
> > +* kfd_process_wq is released.
> > +*
> > +* Ensures that all outstanding free_notifier gets called, triggering 
> > the
> release
> > +* of the process.
> > +*/
> > +   mmu_notifier_synchronize();
> > +}
> > +
> > +
> >   static int kfd_process_init_cwsr_apu(struct kfd_process *p, struct file
> *filep)
> >   {
> > unsigned long  offset;

RE: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module exit.

2023-03-07 Thread Belanger, David

[AMD Official Use Only - General]

Thanks Felix for your feedback.
I will work on applying your suggested approach and uploaded a second version 
when it is ready.

David B.

> -Original Message-
> From: Kuehling, Felix 
> Sent: Monday, March 6, 2023 6:46 PM
> To: amd-gfx@lists.freedesktop.org; Belanger, David
> 
> Subject: Re: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module
> exit.
> 
> 
> Am 2023-03-06 um 16:58 schrieb David Belanger:
> > Handle case when module is unloaded (kfd_exit) before a process space
> > (mm_struct) is released.
> >
> > Signed-off-by: David Belanger 
> > ---
> >   drivers/gpu/drm/amd/amdkfd/kfd_module.c  |  4 ++
> >   drivers/gpu/drm/amd/amdkfd/kfd_process.c | 57
> 
> >   2 files changed, 61 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> > b/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> > index 09b966dc3768..8ef4bd9e4f7d 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> > @@ -26,6 +26,9 @@
> >   #include "kfd_priv.h"
> >   #include "amdgpu_amdkfd.h"
> >
> > +void kfd_cleanup_processes(void);
> > +
> > +
> >   static int kfd_init(void)
> >   {
> > int err;
> > @@ -77,6 +80,7 @@ static int kfd_init(void)
> >
> >   static void kfd_exit(void)
> >   {
> > +   kfd_cleanup_processes();
> > kfd_debugfs_fini();
> > kfd_process_destroy_wq();
> > kfd_procfs_shutdown();
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > index ebabe92f7edb..b5b28a32639d 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> > @@ -1181,6 +1181,17 @@ static void kfd_process_notifier_release(struct
> mmu_notifier *mn,
> > return;
> >
> > mutex_lock(&kfd_processes_mutex);
> > +   /*
> > +* Do early return if p is not in the table.
> > +*
> > +* This could potentially happen if this function is called concurrently
> > +* by mmu_notifier and by kfd_cleanup_pocesses.
> > +*
> > +*/
> > +   if (!hash_hashed(&p->kfd_processes)) {
> > +   mutex_unlock(&kfd_processes_mutex);
> > +   return;
> > +   }
> > hash_del_rcu(&p->kfd_processes);
> > mutex_unlock(&kfd_processes_mutex);
> > synchronize_srcu(&kfd_processes_srcu);
> > @@ -1200,6 +1211,52 @@ static const struct mmu_notifier_ops
> kfd_process_mmu_notifier_ops = {
> > .free_notifier = kfd_process_free_notifier,
> >   };
> >
> > +
> > +void kfd_cleanup_processes(void)
> > +{
> > +   struct kfd_process *p;
> > +   unsigned int temp;
> > +
> > +   /*
> > +* Iterate over remaining processes in table, calling notifier release
> > +* to free up notifier and process resources.
> > +*
> > +* This code handles the case when driver is unloaded before all
> mm_struct
> > +* are released.
> > +*/
> > +   int idx = srcu_read_lock(&kfd_processes_srcu);
> > +
> > +   hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
> > +   if (p) {
> > +   /*
> > +* Obtain a reference on p to avoid a late
> mmu_notifier release
> > +* call triggering freeing the process.
> > +*/
> > +
> > +   kref_get(&p->ref);
> > +
> > +   srcu_read_unlock(&kfd_processes_srcu, idx);
> 
> I don't think it's valid to drop the lock in the middle of the loop. You need 
> to
> hold the lock throughout the loop to protect the consistency of the data
> structure. I guess you're doing this because you got a deadlock from
> synchronize_srcu in kfd_process_notifier_release.
> 
> 
> > +
> > +   kfd_process_notifier_release(&p->mmu_notifier, p-
> >mm);
> 
> This calls hash_del_rcu to remove the process from the hash table. To make
> this safe, you need to hold the kfd_processes_mutex.
> 
> Since this is outside the RCU read lock, the entry in the hlist can be freed,
> which can cause problems when the hash_for_each_rcu loop tries to find the
> next entry in the hlist.
> 
> 
> > +
> > +   kfd_unref_process(p);
> 
> This schedules a worker that can free the process at any time, which also
> frees the

RE: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module exit.

2023-03-08 Thread Belanger, David

[AMD Official Use Only - General]



> -Original Message-
> From: Christian König 
> Sent: Wednesday, March 8, 2023 4:08 AM
> To: Belanger, David ; Koenig, Christian
> ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module
> exit.
> 
> Caution: This message originated from an External Source. Use proper
> caution when opening attachments, clicking links, or responding.
> 
> 
> Am 07.03.23 um 16:28 schrieb Belanger, David:
> > [AMD Official Use Only - General]
> >
> >
> > The test case is a python program that will load the driver, do some
> operations, then unload the driver.
> 
> What do you mean with unloading the driver? Removing the module? Or
> destroying the device?
> 

The python program calls a shell script that does "modprobe amdgpu".
Calls some SMI operation to get some events.
Then it calls a shell scripts that does "modprobe -r amdgpu".
Then it exits.

There will be ref on the kfd_process that will remain, which will be released 
only when mmu_notifier ops->release is called.
This does not get called until the python process ends.

The test program is definitively not the typical use a general user would do.

> > When the driver exists, there is still the python process space around
> holding on the address space.
> > When the python process space exits, the mmu_notifier gets called but the
> driver has already been unloaded.
> >
> > The goal of the fix is to address case where there could be
> > outstanding address space / worker threads for process cleanup that needs
> to be cleared/completed at exit time.
> 
> Yeah and when the module is unloaded this is a completely futile effort.
> 
> The general upstream approach is to take references on the struct device and
> module and prevent unloading as long as those references exists.
> 
> The device might be non-functional any more (because for example of hot
> plug), but the driver should never be unloaded before the python program
> exits.

Thank you for your feedback, I will investigate that approach.

> 
> Regards,
> Christian.
> 
> >
> > Regards,
> > David B.
> >
> >> -Original Message-
> >> From: Koenig, Christian 
> >> Sent: Tuesday, March 7, 2023 2:05 AM
> >> To: Belanger, David ; amd-
> >> g...@lists.freedesktop.org
> >> Subject: Re: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module
> >> exit.
> >>
> >> Am 06.03.23 um 22:58 schrieb David Belanger:
> >>> Handle case when module is unloaded (kfd_exit) before a process
> >>> space
> >>> (mm_struct) is released.
> >> Well that should never ever happen in the first place. It sounds like
> >> we are missing grabbing module references.
> >>
> >> Regards,
> >> Christian.
> >>
> >>> Signed-off-by: David Belanger 
> >>> ---
> >>>drivers/gpu/drm/amd/amdkfd/kfd_module.c  |  4 ++
> >>>drivers/gpu/drm/amd/amdkfd/kfd_process.c | 57
> >> 
> >>>2 files changed, 61 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> >>> b/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> >>> index 09b966dc3768..8ef4bd9e4f7d 100644
> >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_module.c
> >>> @@ -26,6 +26,9 @@
> >>>#include "kfd_priv.h"
> >>>#include "amdgpu_amdkfd.h"
> >>>
> >>> +void kfd_cleanup_processes(void);
> >>> +
> >>> +
> >>>static int kfd_init(void)
> >>>{
> >>> int err;
> >>> @@ -77,6 +80,7 @@ static int kfd_init(void)
> >>>
> >>>static void kfd_exit(void)
> >>>{
> >>> +   kfd_cleanup_processes();
> >>> kfd_debugfs_fini();
> >>> kfd_process_destroy_wq();
> >>> kfd_procfs_shutdown();
> >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >>> index ebabe92f7edb..b5b28a32639d 100644
> >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >>> @@ -1181,6 +1181,17 @@ static void
> >>> kfd_process_notifier_release(struct
> >> mmu_notifier *mn,
> >>> return;
> >>>
> >>> mutex_lock(&kfd_processes_mutex);
> >>&g

RE: [PATCH] drm/amdkfd: remove gfx 12 trap handler page size cap

2024-11-05 Thread Belanger, David

[Public]

Instead of removing the check entirely, I think it should be changed to:
BUILD_BUG_ON(sizeof(cwsr_trap_gfx12_hex) > KFD_CWSR_TMA_OFFSET);
Like the pre-GFX11 cases.   A cwsr trap handler too large will overwrite the 
TMA area.

Regards,
David B.

> -Original Message-
> From: Kim, Jonathan 
> Sent: Tuesday, November 5, 2024 12:44 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Belanger, David ; Kim, Jonathan
> 
> Subject: [PATCH] drm/amdkfd: remove gfx 12 trap handler page size cap
>
> GFX 12 does not require a page size cap for the trap handler because it does 
> not
> require a CWSR work around like GFX 11 did.
>
> Signed-off-by: Jonathan Kim 
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 348925254bff..4705ebb07ba0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -534,7 +534,6 @@ static void kfd_cwsr_init(struct kfd_dev *kfd)
>   kfd->cwsr_isa = cwsr_trap_gfx11_hex;
>   kfd->cwsr_isa_size = sizeof(cwsr_trap_gfx11_hex);
>   } else {
> - BUILD_BUG_ON(sizeof(cwsr_trap_gfx12_hex) >
> PAGE_SIZE);
>   kfd->cwsr_isa = cwsr_trap_gfx12_hex;
>   kfd->cwsr_isa_size = sizeof(cwsr_trap_gfx12_hex);
>   }
> --
> 2.34.1

RE: [PATCH] drm/amdkfd: remove gfx 12 trap handler page size cap

2024-11-05 Thread Belanger, David

[Public]

Reviewed-by: David Belanger 

> -Original Message-
> From: Kim, Jonathan 
> Sent: Tuesday, November 5, 2024 1:46 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Belanger, David ; Kim, Jonathan
> 
> Subject: [PATCH] drm/amdkfd: remove gfx 12 trap handler page size cap
>
> GFX 12 does not require a page size cap for the trap handler because it does 
> not
> require a CWSR work around like GFX 11 did.
>
> v2: set default cap
>
> Signed-off-by: Jonathan Kim 
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 348925254bff..956198da7859 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -534,7 +534,8 @@ static void kfd_cwsr_init(struct kfd_dev *kfd)
>   kfd->cwsr_isa = cwsr_trap_gfx11_hex;
>   kfd->cwsr_isa_size = sizeof(cwsr_trap_gfx11_hex);
>   } else {
> - BUILD_BUG_ON(sizeof(cwsr_trap_gfx12_hex) >
> PAGE_SIZE);
> + BUILD_BUG_ON(sizeof(cwsr_trap_gfx12_hex)
> +  > KFD_CWSR_TMA_OFFSET);
>   kfd->cwsr_isa = cwsr_trap_gfx12_hex;
>   kfd->cwsr_isa_size = sizeof(cwsr_trap_gfx12_hex);
>   }
> --
> 2.34.1

RE: [PATCH] drm/amdkfd: add MEC version that supports no PCIe atomics for GFX12

2024-11-28 Thread Belanger, David

[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: David Belanger 

> -Original Message-
> From: amd-gfx  On Behalf Of Sreekant
> Somasekharan
> Sent: Thursday, November 28, 2024 1:20 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Somasekharan, Sreekant 
> Subject: [PATCH] drm/amdkfd: add MEC version that supports no PCIe atomics for
> GFX12
>
> Add MEC version from which alternate support for no PCIe atomics is provided 
> so
> that device is not skipped during KFD device init in GFX1200/GFX1201.
>
> Signed-off-by: Sreekant Somasekharan 
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 956198da7859..891ce1d0dbbe 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -237,6 +237,7 @@ static void kfd_device_info_init(struct kfd_dev *kfd,
>   kfd->device_info.no_atomic_fw_version = kfd->adev-
> >gfx.rs64_enable ? 509 : 0;
>   } else {
>   kfd->device_info.needs_pci_atomics = true;
> + kfd->device_info.no_atomic_fw_version = 2090;
>   }
>   } else {
>   kfd->device_info.doorbell_size = 4;
> --
> 2.34.1

Re: [PATCH 1/2] drm/amdkfd: hard-code cahceline for gc11

2024-12-05 Thread Belanger, David




Please fix typo in title line:
cahceline -> cacheline

And I suggest also changing:
gc11 -> GFX11 (for consistency with typical name used in comments)

With comment fixes, I approve the commit:

Reviewed-by: David Belanger 


On 11/28/2024 11:17 AM, Harish Kasiviswanathan wrote:

This information is not available in ip discovery table.

Signed-off-by: Harish Kasiviswanathan 
---
  drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 22 --
  1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
index 723f1220e1cc..3ca95f54601e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
@@ -1423,6 +1423,7 @@ int kfd_parse_crat_table(void *crat_image, struct 
list_head *device_list,
  
  
  static int kfd_fill_gpu_cache_info_from_gfx_config(struct kfd_dev *kdev,

+  bool cache_line_size_missing,
   struct kfd_gpu_cache_info 
*pcache_info)
  {
struct amdgpu_device *adev = kdev->adev;
@@ -1437,6 +1438,8 @@ static int kfd_fill_gpu_cache_info_from_gfx_config(struct 
kfd_dev *kdev,
CRAT_CACHE_FLAGS_SIMD_CACHE);
pcache_info[i].num_cu_shared = 
adev->gfx.config.gc_num_tcp_per_wpg / 2;
pcache_info[i].cache_line_size = 
adev->gfx.config.gc_tcp_cache_line_size;
+   if (cache_line_size_missing && !pcache_info[i].cache_line_size)
+   pcache_info[i].cache_line_size = 128;
i++;
}
/* Scalar L1 Instruction Cache per SQC */
@@ -1449,6 +1452,8 @@ static int kfd_fill_gpu_cache_info_from_gfx_config(struct 
kfd_dev *kdev,
CRAT_CACHE_FLAGS_SIMD_CACHE);
pcache_info[i].num_cu_shared = 
adev->gfx.config.gc_num_sqc_per_wgp * 2;
pcache_info[i].cache_line_size = 
adev->gfx.config.gc_instruction_cache_line_size;
+   if (cache_line_size_missing && !pcache_info[i].cache_line_size)
+   pcache_info[i].cache_line_size = 128;
i++;
}
/* Scalar L1 Data Cache per SQC */
@@ -1460,6 +1465,8 @@ static int kfd_fill_gpu_cache_info_from_gfx_config(struct 
kfd_dev *kdev,
CRAT_CACHE_FLAGS_SIMD_CACHE);
pcache_info[i].num_cu_shared = 
adev->gfx.config.gc_num_sqc_per_wgp * 2;
pcache_info[i].cache_line_size = 
adev->gfx.config.gc_scalar_data_cache_line_size;
+   if (cache_line_size_missing && !pcache_info[i].cache_line_size)
+   pcache_info[i].cache_line_size = 64;
i++;
}
/* GL1 Data Cache per SA */
@@ -1472,7 +1479,8 @@ static int kfd_fill_gpu_cache_info_from_gfx_config(struct 
kfd_dev *kdev,
CRAT_CACHE_FLAGS_DATA_CACHE |
CRAT_CACHE_FLAGS_SIMD_CACHE);
pcache_info[i].num_cu_shared = adev->gfx.config.max_cu_per_sh;
-   pcache_info[i].cache_line_size = 0;
+   if (cache_line_size_missing)
+   pcache_info[i].cache_line_size = 128;
i++;
}
/* L2 Data Cache per GPU (Total Tex Cache) */
@@ -1484,6 +1492,8 @@ static int kfd_fill_gpu_cache_info_from_gfx_config(struct 
kfd_dev *kdev,
CRAT_CACHE_FLAGS_SIMD_CACHE);
pcache_info[i].num_cu_shared = adev->gfx.config.max_cu_per_sh;
pcache_info[i].cache_line_size = 
adev->gfx.config.gc_tcc_cache_line_size;
+   if (cache_line_size_missing && !pcache_info[i].cache_line_size)
+   pcache_info[i].cache_line_size = 128;
i++;
}
/* L3 Data Cache per GPU */
@@ -1563,6 +1573,7 @@ static int 
kfd_fill_gpu_cache_info_from_gfx_config_v2(struct kfd_dev *kdev,
  int kfd_get_gpu_cache_info(struct kfd_node *kdev, struct kfd_gpu_cache_info 
**pcache_info)
  {
int num_of_cache_types = 0;
+   bool cache_line_size_missing = false;
  
  	switch (kdev->adev->asic_type) {

case CHIP_KAVERI:
@@ -1686,10 +1697,17 @@ int kfd_get_gpu_cache_info(struct kfd_node *kdev, 
struct kfd_gpu_cache_info **pc
case IP_VERSION(11, 5, 0):
case IP_VERSION(11, 5, 1):
case IP_VERSION(11, 5, 2):
+   /* Cacheline size not available in IP discovery for 
gc11.
+* kfd_fill_gpu_cache_info_from_gfx_config to hard code 
it
+*/
+   cache_line_size_missing = true;
+   fallthrough;
case IP_VERSION(12, 0, 0):
case IP_VERSION(12, 0, 1):
num_of_cache_types =
-   
kfd_fill_gpu_cache_info_from_gfx

Re: [PATCH 2/2] drm/amdkfd: hard-code MALL Cacheline size for gfx11, gfx12

2024-12-05 Thread Belanger, David




Reviewed-by: David Belanger 


On 11/28/2024 11:17 AM, Harish Kasiviswanathan wrote:

Signed-off-by: Harish Kasiviswanathan 
---
  drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
index 3ca95f54601e..e0faec4682f3 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
@@ -1504,7 +1504,7 @@ static int kfd_fill_gpu_cache_info_from_gfx_config(struct 
kfd_dev *kdev,
CRAT_CACHE_FLAGS_DATA_CACHE |
CRAT_CACHE_FLAGS_SIMD_CACHE);
pcache_info[i].num_cu_shared = adev->gfx.config.max_cu_per_sh;
-   pcache_info[i].cache_line_size = 0;
+   pcache_info[i].cache_line_size = 64;
i++;
}
return i;

Re: [PATCH] drm/amdkfd: limit sdma queue reset caps flagging for gfx9

2025-03-27 Thread Belanger, David

Reviewed-by: David Belanger 

On 3/27/2025 11:53 AM, Jonathan Kim wrote:
> ASICs post GFX 9 are being flagged as SDMA per queue reset supported
> in the KGD but KFD and scheduler FW currently have no support.
> Limit SDMA queue reset capabilities to GFX 9.
> 
> Signed-off-by: Jonathan Kim 
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> index e477d7509646..993eef5a4983 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
> @@ -1983,9 +1983,6 @@ static void kfd_topology_set_capabilities(struct 
> kfd_topology_device *dev)
>   if (kfd_dbg_has_ttmps_always_setup(dev->gpu))
>   dev->node_props.debug_prop |= 
> HSA_DBG_DISPATCH_INFO_ALWAYS_VALID;
>  
> - if (dev->gpu->adev->sdma.supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE)
> - dev->node_props.capability2 |= 
> HSA_CAP2_PER_SDMA_QUEUE_RESET_SUPPORTED;
> -
>   if (KFD_GC_VERSION(dev->gpu) < IP_VERSION(10, 0, 0)) {
>   if (KFD_GC_VERSION(dev->gpu) == IP_VERSION(9, 4, 3) ||
>   KFD_GC_VERSION(dev->gpu) == IP_VERSION(9, 4, 4))
> @@ -2002,6 +1999,9 @@ static void kfd_topology_set_capabilities(struct 
> kfd_topology_device *dev)
>   
> HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED;
>  
>   dev->node_props.capability |= HSA_CAP_PER_QUEUE_RESET_SUPPORTED;
> +
> + if (dev->gpu->adev->sdma.supported_reset & 
> AMDGPU_RESET_TYPE_PER_QUEUE)
> + dev->node_props.capability2 |= 
> HSA_CAP2_PER_SDMA_QUEUE_RESET_SUPPORTED;
>   } else {
>   dev->node_props.debug_prop |= 
> HSA_DBG_WATCH_ADDR_MASK_LO_BIT_GFX10 |
>   HSA_DBG_WATCH_ADDR_MASK_HI_BIT;

RE: [PATCH] drm/amdgpu: Enable SA software trap.

RE: [PATCH] drm/amdgpu: Enable SA software trap.

RE: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module exit.

RE: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module exit.

RE: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module exit.

RE: [PATCH] drm/amdkfd: remove gfx 12 trap handler page size cap

RE: [PATCH] drm/amdkfd: remove gfx 12 trap handler page size cap

RE: [PATCH] drm/amdkfd: add MEC version that supports no PCIe atomics for GFX12

Re: [PATCH 1/2] drm/amdkfd: hard-code cahceline for gc11

Re: [PATCH 2/2] drm/amdkfd: hard-code MALL Cacheline size for gfx11, gfx12

Re: [PATCH] drm/amdkfd: limit sdma queue reset caps flagging for gfx9

11 matches

Site Navigation

Mail list logo

Footer information