Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-04 Thread Timur Kristóf
Hi Felix, On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote: > That's the worst-case scenario where you're debugging HW or FW > issues. > Those should be pretty rare post-bringup. But are there hangs caused > by > user mode driver or application bugs that are easier to debug and > probabl

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Christian König
On 03.05.23 at 21:14, André Almeida wrote: On 03/05/2023 14:43, Timur Kristóf wrote: Hi Felix, On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote: That's the worst-case scenario where you're debugging HW or FW issues. Those should be pretty rare post-bringup. But are there hangs cause

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Marek Olšák
On Wed, May 3, 2023, 14:53 André Almeida wrote: > On 03/05/2023 14:08, Marek Olšák wrote: > > GPU hangs are pretty common post-bringup. They are not common per user, > > but if we gather all hangs from all users, we can have lots and lots of > > them. > > > > GPU hangs are indeed not very debu

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread André Almeida
On 03/05/2023 14:43, Timur Kristóf wrote: Hi Felix, On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote: That's the worst-case scenario where you're debugging HW or FW issues. Those should be pretty rare post-bringup. But are there hangs caused by user mode driver or application bugs tha

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread André Almeida
On 03/05/2023 14:08, Marek Olšák wrote: GPU hangs are pretty common post-bringup. They are not common per user, but if we gather all hangs from all users, we can have lots and lots of them. GPU hangs are indeed not very debuggable. There are however some things we can do: - Identify the h

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Marek Olšák
WRITE_DATA with ENGINE=PFP will execute the packet on the frontend engine, while ENGINE=ME will execute the packet on the backend engine. Marek On Wed, May 3, 2023 at 1:08 PM Marek Olšák wrote: > GPU hangs are pretty common post-bringup. They are not common per user, > but if we gather all hang

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Marek Olšák
GPU hangs are pretty common post-bringup. They are not common per user, but if we gather all hangs from all users, we can have lots and lots of them. GPU hangs are indeed not very debuggable. There are however some things we can do: - Identify the hanging IB by its VA (the kernel should know it) -

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Christian König
On 03.05.23 at 17:08, Felix Kuehling wrote: On 2023-05-03 at 03:59, Christian König wrote: On 02.05.23 at 20:41, Alex Deucher wrote: On Tue, May 2, 2023 at 11:22 AM Timur Kristóf wrote: [SNIP] In my opinion, the correct solution to those problems would be if the kernel could give userspac

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Felix Kuehling
On 2023-05-03 at 03:59, Christian König wrote: On 02.05.23 at 20:41, Alex Deucher wrote: On Tue, May 2, 2023 at 11:22 AM Timur Kristóf wrote: [SNIP] In my opinion, the correct solution to those problems would be if the kernel could give userspace the necessary information about a GPU hang b

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Christian König
On 02.05.23 at 20:41, Alex Deucher wrote: On Tue, May 2, 2023 at 11:22 AM Timur Kristóf wrote: [SNIP] In my opinion, the correct solution to those problems would be if the kernel could give userspace the necessary information about a GPU hang before a GPU reset. The fundamental problem he

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Timur Kristóf
Hi, On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote: > > > > Christian König wrote (date: 2023 > > May 2, Tue 9:59): > > > > > On 02.05.23 at 03:26, André Almeida wrote: > > > > On 01/05/2023 16:24, Alex Deucher wrote: > > > >> On Mon, May 1, 2023 at 2:58 PM André Almeid

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Timur Kristóf
On Tue, 2023-05-02 at 09:45 -0400, Alex Deucher wrote: > On Tue, May 2, 2023 at 9:35 AM Timur Kristóf > wrote: > > > > Hi, > > > > On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote: > > > > > > > > Christian König wrote (date: > > > > 2023 > > > > May 2, Tue 9:59): > > > > > >

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Timur Kristóf
Hi Christian, Christian König wrote (date: 2023 May 2, Tue 9:59): > On 02.05.23 at 03:26, André Almeida wrote: > > On 01/05/2023 16:24, Alex Deucher wrote: > >> On Mon, May 1, 2023 at 2:58 PM André Almeida > >> wrote: > >>> > >>> I know that devcoredump is also used for this kind of

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-02 Thread Alex Deucher
On Tue, May 2, 2023 at 11:22 AM Timur Kristóf wrote: > > On Tue, 2023-05-02 at 09:45 -0400, Alex Deucher wrote: > > On Tue, May 2, 2023 at 9:35 AM Timur Kristóf > > wrote: > > > > > > Hi, > > > > > > On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote: > > > > > > > > > > Christian König ez

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-02 Thread Alex Deucher
On Tue, May 2, 2023 at 9:35 AM Timur Kristóf wrote: > > Hi, > > On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote: > > > > > > Christian König wrote (date: 2023 > > > May 2, Tue 9:59): > > > > > > > On 02.05.23 at 03:26, André Almeida wrote: > > > > > On 01/05/2023 16:24, Alex De

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-02 Thread Christian König
Hi Timur, On 02.05.23 at 11:12, Timur Kristóf wrote: Hi Christian, Christian König wrote (date: 2023 May 2, Tue 9:59): On 02.05.23 at 03:26, André Almeida wrote: > On 01/05/2023 16:24, Alex Deucher wrote: >> On Mon, May 1, 2023 at 2:58 PM André Almeida >> w

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-02 Thread Bas Nieuwenhuizen
On Tue, May 2, 2023 at 11:12 AM Timur Kristóf wrote: > > Hi Christian, > > Christian König wrote (date: 2023 May 2, > Tue 9:59): >> >> On 02.05.23 at 03:26, André Almeida wrote: >> > On 01/05/2023 16:24, Alex Deucher wrote: >> >> On Mon, May 1, 2023 at 2:58 PM André Almeida >> >> wr

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-02 Thread Christian König
On 02.05.23 at 03:26, André Almeida wrote: On 01/05/2023 16:24, Alex Deucher wrote: On Mon, May 1, 2023 at 2:58 PM André Almeida wrote: I know that devcoredump is also used for this kind of information, but I believe that using an IOCTL is better for interfacing Mesa + Linux rather than

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-02 Thread Christian König
Well, first of all, don't expose the VMID to userspace. The UMD doesn't know (and shouldn't know) which VMID is used for a submission since this is dynamically assigned and can change at any time. For debugging there is an interface to use a reserved VMID for your debugged process which allows

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-01 Thread André Almeida
On 01/05/2023 16:24, Alex Deucher wrote: On Mon, May 1, 2023 at 2:58 PM André Almeida wrote: I know that devcoredump is also used for this kind of information, but I believe that using an IOCTL is better for interfacing Mesa + Linux rather than parsing a file whose contents are subjected

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-01 Thread Alex Deucher
On Mon, May 1, 2023 at 2:58 PM André Almeida wrote: > > Currently UMD hasn't much information on what went wrong during a GPU reset. > To > help with that, this patch proposes a new IOCTL that can be used to query > information about the resources that caused the hang. If we went with the IOCTL,

[RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-01 Thread André Almeida
Currently UMD hasn't much information on what went wrong during a GPU reset. To help with that, this patch proposes a new IOCTL that can be used to query information about the resources that caused the hang. The goal of this RFC is to gather feedback about this interface. The mesa part can be foun