[Public]
> From: jesse.zh...@amd.com
> Sent: Wednesday, December 18, 2024 6:26 PM
> Subject: [PATCH] drm/amdkfd: fixed page fault when enable MES shader debugger
>
> Initialize the process context address before setting the shader debugger.
I think it would make sense to pull this into it's own
[Public]
> From: Koenig, Christian
> Sent: Thursday, December 19, 2024 5:07
> Am 16.12.24 um 18:49 schrieb Yunxiang Li:
> > Before, every time fdinfo is queried we try to lock all the BOs in the
> > VM and calculate memory usage from scratch. This works okay if the
> > fdinfo is rarely read and t
[Public]
> From: Tobias Klausmann
> Sent: Wednesday, December 18, 2024 10:54
> Hi!
>
> I have been hitting kernel messages from AMDGPU since v6.13-rc2, for
> example:
>
> [Wed Dec 18 15:56:24 2024] gmc_v11_0_process_interrupt: 10 callbacks
> suppressed [Wed Dec 18 15:56:24 2024] amdgpu :03:00
[Public]
+Felix
> From: Tobias Klausmann
> Sent: Wednesday, December 18, 2024 10:54
>
> I have been hitting kernel messages from AMDGPU since v6.13-rc2, for
> example:
>
> [Wed Dec 18 15:56:24 2024] gmc_v11_0_process_interrupt: 10 callbacks
> suppressed [Wed Dec 18 15:56:24 2024] amdgpu :03:
[Public]
> From: Koenig, Christian
> Sent: Thursday, December 12, 2024 4:25
> Am 11.12.24 um 17:14 schrieb Li, Yunxiang (Teddy):
> > [Public]
> >
> >> From: Koenig, Christian
> >> Sent: Wednesday, December 11, 2024 10:03 Am 11.12.24 um 15:02 schrieb
&
[Public]
> From: Koenig, Christian
> Sent: Wednesday, December 11, 2024 10:03
> Am 11.12.24 um 15:02 schrieb Li, Yunxiang (Teddy):
> > [Public]
> >
> >> From: Koenig, Christian
> >> Sent: Wednesday, December 11, 2024 3:16 Am 10.12.24 um 18:59 schrieb
>
[Public]
> From: Koenig, Christian
> Sent: Wednesday, December 11, 2024 3:16
> Am 10.12.24 um 18:59 schrieb Yunxiang Li:
> > Tracking the state of a GEM object for shared stats is quite difficult
> > since the handle_count is managed behind driver's back. So instead
> > considers GEM object share
[Public]
> From: Tvrtko Ursulin
> Sent: Tuesday, December 3, 2024 10:16
> On 03/12/2024 15:03, Li, Yunxiang (Teddy) wrote:
> > [Public]
> >
> >> From: Tvrtko Ursulin
> >> Sent: Tuesday, December 3, 2024 9:42
> >> On 28/11/2024 18:54, Yunxiang Li w
[Public]
> From: Tvrtko Ursulin
> Sent: Tuesday, December 3, 2024 9:42
> On 28/11/2024 18:54, Yunxiang Li wrote:
> > Before, every time fdinfo is queried we try to lock all the BOs in the
> > VM and calculate memory usage from scratch. This works okay if the
> > fdinfo is rarely read and the VMs
[Public]
> From: Tvrtko Ursulin
> Sent: Monday, November 11, 2024 5:30
> On 10/11/2024 15:41, Yunxiang Li wrote:
> > Make drm-active- optional just like drm-resident- and drm-purgeable-.
>
> As Jani has already commented the commit message needs some work.
>
> > Signed-off-by: Yunxiang Li
> > CC
[Public]
Indeed I have, thanks for the reminder
Teddy
[Public]
> From: Tvrtko Ursulin
> Sent: Monday, November 18, 2024 9:38
> On 16/11/2024 04:44, Yunxiang Li wrote:
> > Define how to handle buffers with multiple possible placement so we
> > don't get incompatible implementations. Callout the resident
> > requirement for drm-purgeable- explicitly.
[AMD Official Use Only - AMD Internal Distribution Only]
> From: Tvrtko Ursulin
> Sent: Wednesday, November 13, 2024 12:31
> On 13/11/2024 17:01, Li, Yunxiang (Teddy) wrote:
> > [Public]
> >
> >> From: Koenig, Christian
> >> Sent: Wednesday, November 13,
[Public]
> From: Koenig, Christian
> Sent: Wednesday, November 13, 2024 9:22
> Am 13.11.24 um 14:53 schrieb Li, Yunxiang (Teddy):
> > [Public]
> >
> >> From: Koenig, Christian
> >> Sent: Wednesday, November 13, 2024 6:39 Am 13.11.24 um 11:25 schrieb
>
[Public]
> From: Koenig, Christian
> Sent: Wednesday, November 13, 2024 3:49
> Am 12.11.24 um 19:16 schrieb Li, Yunxiang (Teddy):
> >> From: Christian König
> >> Sent: Tuesday, November 12, 2024 5:54
> >> Am 10.11.24 um 16:41 schrieb Yunxiang Li:
> >&g
[Public]
> From: Koenig, Christian
> Sent: Wednesday, November 13, 2024 6:39
> Am 13.11.24 um 11:25 schrieb Tvrtko Ursulin:
> > On 13/11/2024 08:49, Christian König wrote:
> >> Am 12.11.24 um 19:16 schrieb Li, Yunxiang (Teddy):
> >>> [SNIP]
> &g
[Public]
> From: Christian König
> Sent: Tuesday, November 12, 2024 5:54
> Am 10.11.24 um 16:41 schrieb Yunxiang Li:
> > @@ -310,6 +311,92 @@ static void amdgpu_vm_bo_reset_state_machine(struct
> amdgpu_vm *vm)
> > spin_unlock(&vm->status_lock);
> > }
> >
> > +/**
> > + * amdgpu_vm_update_s
[Public]
> From: Tvrtko Ursulin
> Sent: Thursday, November 7, 2024 5:48
> On 31/10/2024 13:48, Li, Yunxiang (Teddy) wrote:
> > [Public]
> >
> >> From: Christian König
> >> Sent: Thursday, October 31, 2024 8:54 Am 25.10.24 um 19:41 schrieb
> >>
[AMD Official Use Only - AMD Internal Distribution Only]
> From: Tvrtko Ursulin
> Sent: Thursday, November 7, 2024 5:41
> On 25/10/2024 18:41, Yunxiang Li wrote:
> > Add a helper to check if the memory stats is zero, this will be used
> > to check for memory accounting errors.
> >
> > Signed-off-
[Public]
> From: Christian König
> Sent: Thursday, October 31, 2024 8:54
> Am 25.10.24 um 19:41 schrieb Yunxiang Li:
> > Before, every time fdinfo is queried we try to lock all the BOs in the
> > VM and calculate memory usage from scratch. This works okay if the
> > fdinfo is rarely read and the
[AMD Official Use Only - AMD Internal Distribution Only]
Yeah it looks like I missed the whole active/purgeable thing as well...
Teddy
wrote:
> >>> Am 23.10.24 um 09:38 schrieb Tvrtko Ursulin:
> >>>>
> >>>> On 22/10/2024 17:24, Christian König wrote:
> >>>>> Am 22.10.24 um 17:17 schrieb Li, Yunxiang (Teddy):
> >>>>>> [Public]
> >>>>>&
[AMD Official Use Only - AMD Internal Distribution Only]
It sounds like it makes the most sense to ignore the BOs that have no placement
or private placements then, it would simplify the code too.
Teddy
[Public]
I suppose we could add a field like amd-memory-private: to cover the private
placements. When would a BO not have a placement, is it when it is being moved?
Since we are tracking the state changes, I wonder if such situations can be
avoided now so whenever we call these stat update fun
[Public]
> >
> > +static uint32_t fold_memtype(uint32_t memtype) {
>
> In general please add prefixes to even static functions, e.g. amdgpu_vm_ or
> amdgpu_bo_.
>
> > + /* Squash private placements into 'cpu' to keep the legacy userspace
> > view.
> */
> > + switch (mem_type) {
> > + case T
[AMD Official Use Only - AMD Internal Distribution Only]
Yes, my understanding is that autosuspend happens after a timeout, so it would
only go into suspend if the pdd is not used for a while after it's created.
Teddy
[AMD Official Use Only - AMD Internal Distribution Only]
Oops, lemme fix that real quick...
> -Original Message-
> From: Joshi, Mukul
> Sent: Thursday, October 10, 2024 11:46
> To: Li, Yunxiang (Teddy) ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Koenig, Ch
[Public]
> Ok that looks extremely ugly. Please just add a separate function and call
> that
> from the TTM move function.
Should I still remove the adev argument? It is never used and causes a few call
sites having to find an adev unnecessarily.
> Please either drop that or compare each memor
[Public]
I'm not happy with the new helper names, and not quite satisfied with the
function arguments either. But it's what I got right now, would appreciate it
if anyone got better ideas.
Teddy
[Public]
> > + dma_resv_lock(bo->tbo.base.resv, NULL);
>
> Why do you grab the BO lock to update the stats? That doesn't seem to make
> any sense.
>
> > + update_stats = !(bo->flags & AMDGPU_GEM_WAS_EXPORTED);
> > + if (update_stats)
> > + amdgpu_bo_add_memory(bo, &stats);
> > +
[AMD Official Use Only - AMD Internal Distribution Only]
The trouble with taking the read side lock in the MES helper functions is that
we use a lot of them during reset under the write lock. So either we need to
duplicate the helper functions or we will get inconsistencies where a random
subse
[AMD Official Use Only - AMD Internal Distribution Only]
> It's not clear to me what's the requirement for reset domain locking around
> MES calls. We have a lot more of them in kfd_device_queue_manager.c
> (mostly calling adev->mes.funcs->...
> directly). Do they all need to be wrapped individual
[AMD Official Use Only - AMD Internal Distribution Only]
> One thing I could see going wrong is, that down_read_trylock(&dqm->dev-
> >adev->reset_domain->sem) will not fail immediately when the reset is
> scheduled. So there may be multipe attempts at HW access that detect an
> error or time out,
[AMD Official Use Only - AMD Internal Distribution Only]
> Yeah, I know. That's one of the reason I've pointed out on the patch adding
> that that this behavior is actually completely broken.
>
> If you run into issues with the MES because of this then please suggest a
> revert of that patch.
I t
[Public]
> The problem is that we don't force complete the non scheduler rings, e.g. MES,
> KIQ etc...
>
> Try to remove this check here from the loop in
> amdgpu_device_pre_asic_reset():
>
> if (!amdgpu_ring_sched_ready(ring))
> continue;
Ah, I see. Thou
[AMD Official Use Only - AMD Internal Distribution Only]
> I don't think trying to add some reset handling here makes sense in the first
> place.
> Part of the reset/recovery procedure is to signal all fence and that includes
> the one we are waiting for here.
> So this wait should return immedi
[Public]
> It's perfectly possible that the reset has already started before we enter
> the function.
Yeah, this could and does happen, but it just means we are back to the old
behavior. I guess I could use "can I take the read side lock?" to test if the
function is called outside of reset or
[AMD Official Use Only - AMD Internal Distribution Only]
Yes, the two places are 1. In debugfs and 2. In MI100's en/disable_debug_trap,
and evidently someone is testing the debugfs interface because there's a bug
fix for a race condition of it.
Teddy
[AMD Official Use Only - AMD Internal Distribution Only]
> If that is true you could in theory lower the locked area of the existing
> lock, but adding a new one is strict no-go from my side.
I'll try this, right now I see two places where this would be problematic
because they are trying to su
[Public]
> > Here is taking a different lock than the reset_domain->sem. It is a
> > seperate reset_domain->gpu_sem that is only locked when we will actuall do
> > reset, it is not taken in the skip_hw_reset path.
>
> Exactly that is what you should *not* do. Please don't add any new lock to
>
[AMD Official Use Only - AMD Internal Distribution Only]
> > +void amdgpu_lock_hw_access(struct amdgpu_device *adev); void
> > +amdgpu_unlock_hw_access(struct amdgpu_device *adev); int
> > +amdgpu_begin_hw_access(struct amdgpu_device *adev); void
> > +amdgpu_end_hw_access(struct amdgpu_device *ade
[Public]
Hi Christ,
I got R-b from the SRIOV team for the rest of the patches, can you help review
this last one? I think the concerns from the previous thread are all addressed
https://patchwork.freedesktop.org/patch/590678/?series=132727
Regards,
Teddy
[Public]
> Why remove this?
Oops it's a copy-paste error from the previous revision
> Need to call amdgpu_virt_release_full_gpu(adev, true) before retry, and the
> same as below.
I thought we talked about if we call amdgpu_virt_{reset,request_full}_gpu again
we don't need to release full gpu, I
[Public]
> Looks like that is handled by the scheduler work item now as well. See
> function gfx_v9_0_fault() for an example.
Cool so it is blocked by drm_sched_stop also. I think that covers everything.
[Public]
> We have the KFD, FLR, the per engine one in the scheduler and IIRC one more
> for the CP (illegal operation and register write).
>
> I'm not sure about the CP one, but all others should be handled correctly
> with the V2 patch as far as I can see.
Where can I find the CP one? Nothing
[Public]
> We can't do this technically as there are cases where we skip full device
> reset (even then amdgpu_in_reset will return true). The better thing to do is
> to move amdgpu_device_stop_pending_resets() later in
> gpu_recover()- if a device has undergone full reset, then cancel all pendi
[AMD Official Use Only - General]
> This patch results in a large number of compile errors if CONFIG_DEBUG_FS=n.
> Reverting it fixes the problem.
>
> This is an architecture independent problem.
>
> Guenter
Oops, seem to be because at amdgpu_dm.c:8328 the } should be inside the #endif
not outs
47 matches
Mail list logo