Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-06-04 Thread Christopher Snowhill
On Mon Jun 2, 2025 at 3:25 AM PDT, Philipp Reisner wrote: > Hi Christopher, > > Thanks for following up. The bug still annoys me from time to time. > It triggered last on May 8, May 12, and May 18. > The crash on May 18 was already with the 6.14.5 kernel. > >> Could this sleep wake issue also be ca

Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-06-02 Thread Philipp Reisner
Hi Christopher, Thanks for following up. The bug still annoys me from time to time. It triggered last on May 8, May 12, and May 18. The crash on May 18 was already with the 6.14.5 kernel. > Could this sleep wake issue also be caused by a similar thing to the > panics and SMU hangs I was experienc

Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-05-28 Thread Christopher Snowhill
On Mon Jan 13, 2025 at 1:55 AM PST, Christian König wrote: > On 13.01.25 at 09:43, Philipp Stanner wrote: >> [SNIP] The handling of NULL values is half-baked. In my opinion, you should define if drm_sched_pick_best() may put a NULL into rq. If your answer is yes, it might

Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-01-13 Thread Christian König
On 13.01.25 at 09:43, Philipp Stanner wrote: [SNIP] The handling of NULL values is half-baked. In my opinion, you should define if drm_sched_pick_best() may put a NULL into rq. If your answer is yes, it might put a NULL there; then, there should be a BUG_ON(!entity->rq) after the invocation of

Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-01-13 Thread Philipp Stanner
+cc Danilo +cc myself On Wed, 2025-01-08 at 09:19 +0100, Christian König wrote: > On 07.01.25 at 16:21, Philipp Reisner wrote: > > [...] > > > > The OOPS happens because the rq member of entity is NULL in > > > > drm_sched_job_arm() after the call to > > > > drm_sched_entity_select_rq(). > > > >

Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-01-13 Thread Christian König
On 10.01.25 at 16:10, Alex Deucher wrote: On Fri, Jan 10, 2025 at 9:48 AM Christian König wrote: On 10.01.25 at 15:32, Philipp Reisner wrote: [...] Take a look at those messages right before the crash: Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready, skipping Jän 10

Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-01-10 Thread Alex Deucher
On Fri, Jan 10, 2025 at 9:48 AM Christian König wrote: > > On 10.01.25 at 15:32, Philipp Reisner wrote: > > [...] > >> Take a look at those messages right before the crash: > >> > >> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready, > >> skipping > >> Jän 10 07:58:14 ryzen9

Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-01-10 Thread Christian König
On 10.01.25 at 15:32, Philipp Reisner wrote: [...] Take a look at those messages right before the crash: Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready, skipping Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready, skipping That is basically a 100% c

Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-01-10 Thread Philipp Reisner
[...] > Take a look at those messages right before the crash: > > Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready, > skipping > Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready, > skipping > > That is basically a 100% certain confirm that an application

Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-01-10 Thread Christian König
On 10.01.25 at 08:37, Philipp Reisner wrote: [...] Could this be due to amdgpu setting sched->ready when the rings are finished initializing from long ago rather than when the scheduler has been armed? Yes and that is absolutely intentional. Either the driver is not done with its resume yet,

Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-01-09 Thread Philipp Reisner
[...] > > Could this be due to amdgpu setting sched->ready when the rings are > > finished initializing from long ago rather than when the scheduler has > > been armed? > > Yes and that is absolutely intentional. > > Either the driver is not done with its resume yet, or it has already > started it

Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-01-08 Thread Christian König
On 08.01.25 at 15:26, Alex Deucher wrote: On Tue, Jan 7, 2025 at 9:09 AM Christian König wrote: On 07.01.25 at 15:02, Philipp Reisner wrote: The following OOPS plagues me on about every 10th suspend and resume: [160640.791304] BUG: kernel NULL pointer dereference, address: 0008

Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-01-08 Thread Alex Deucher
On Tue, Jan 7, 2025 at 9:09 AM Christian König wrote: > > On 07.01.25 at 15:02, Philipp Reisner wrote: > > The following OOPS plagues me on about every 10th suspend and resume: > > > > [160640.791304] BUG: kernel NULL pointer dereference, address: > > 0008 > > [160640.791309] #PF: su

Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-01-08 Thread Christian König
On 07.01.25 at 16:21, Philipp Reisner wrote: [...] The OOPS happens because the rq member of entity is NULL in drm_sched_job_arm() after the call to drm_sched_entity_select_rq(). In drm_sched_entity_select_rq(), the code considers that drm_sched_pick_best() might return a NULL value. When NULL
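
The failure mode described in this message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every `model_*` name below is hypothetical, the structures are simplified stand-ins rather than the kernel's `drm_sched_entity` or `drm_gpu_scheduler`, and returning `-ENODEV` is just one possible way to handle a NULL run queue. It is not the patch under discussion and not the behaviour the maintainers settled on.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for illustration; NOT the kernel structures. */
struct model_rq {
    int id;
};

struct model_sched {
    bool ready;            /* loosely mirrors the idea of sched->ready */
    unsigned int score;    /* lower score means less loaded */
    struct model_rq rq;
};

struct model_entity {
    struct model_rq *rq;   /* becomes NULL when no scheduler is ready */
};

/* Return the least-loaded ready scheduler, or NULL if none is ready. */
struct model_sched *model_pick_best(struct model_sched *list, size_t n)
{
    struct model_sched *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (!list[i].ready)
            continue;      /* corresponds to "is not ready, skipping" */
        if (!best || list[i].score < best->score)
            best = &list[i];
    }
    return best;
}

/* Assign the run queue of the picked scheduler, or NULL if there is none. */
void model_select_rq(struct model_entity *e, struct model_sched *list, size_t n)
{
    assert(e != NULL);

    struct model_sched *s = model_pick_best(list, n);
    e->rq = s ? &s->rq : NULL;
}

/*
 * Hypothetical caller: check e->rq before using it instead of
 * dereferencing it unconditionally. Returns 0 on success and
 * -ENODEV when no run queue could be selected.
 */
int model_job_arm(struct model_entity *e, struct model_sched *list,
                  size_t n, int *rq_id_out)
{
    model_select_rq(e, list, n);
    if (!e->rq)
        return -ENODEV;

    *rq_id_out = e->rq->id;
    return 0;
}
```

In this model, calling `model_job_arm()` while every scheduler has `ready == false` returns `-ENODEV` and leaves `e->rq` NULL; without the check, the read of `e->rq->id` would be the NULL dereference.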

[PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-01-07 Thread Philipp Reisner
The following OOPS plagues me on about every 10th suspend and resume: [160640.791304] BUG: kernel NULL pointer dereference, address: 0008 [160640.791309] #PF: supervisor read access in kernel mode [160640.791311] #PF: error_code(0x) - not-present page [160640.791313] PGD 0 P4D 0 [1

Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-01-07 Thread Philipp Reisner
[...] > > The OOPS happens because the rq member of entity is NULL in > > drm_sched_job_arm() after the call to drm_sched_entity_select_rq(). > > > > In drm_sched_entity_select_rq(), the code considers that > > drm_sched_pick_best() might return a NULL value. When NULL, it assigns > > NULL to entit

Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

2025-01-07 Thread Christian König
On 07.01.25 at 15:02, Philipp Reisner wrote: The following OOPS plagues me on about every 10th suspend and resume: [160640.791304] BUG: kernel NULL pointer dereference, address: 0008 [160640.791309] #PF: supervisor read access in kernel mode [160640.791311] #PF: error_code(0x) -