Re: [PATCH v4] drm/amdgpu: guard atom_context in devcoredump VBIOS dump

Christian König Tue, 24 Mar 2026 05:38:30 -0700

On 3/24/26 11:28, Lazar, Lijo wrote:
> 
> 
> On 24-Mar-26 3:43 PM, Pierre-Eric Pelloux-Prayer wrote:
>>
>>
>> Le 24/03/2026 à 10:42, Christian König a écrit :
>>> On 3/24/26 02:26, Jesse.Zhang wrote:
>>>> During GPU reset coredump generation, amdgpu_devcoredump_fw_info() 
>>>> unconditionally
>>>> dereferences adev->mode_info.atom_context to print VBIOS fields. On 
>>>> reset/teardown
>>>> paths this pointer can be NULL, causing a kernel page fault from the 
>>>> deferred
>>>> coredump workqueue.
>>>>
>>>> Fix by checking ctx before printing VBIOS fields:
>>>>
>>>> if ctx is valid, print full VBIOS information as before;
>>>> This prevents NULL-dereference crashes while preserving coredump output.
>>>>
>>>> Observed page fault log:
>>>> [  667.933329] RIP: 0010:amdgpu_devcoredump_format+0x780/0xc00 [amdgpu]
>>>> [  667.941517] amdgpu 0002:01:00.0: Dumping IP State
>>>> [  667.949660] Code: 8d 57 74 48 c7 c6 01 65 9f c2 48 8d 7d 98 e8 97 96 7a 
>>>> ff 49 8d 97 b4 00 00 00 48 c7 c6 18 65 9f c2 48 8d 7d 98 e8 80 96 7a ff 
>>>> <41> 8b 97 f4 00 00 00 48 c7 c6 2f 65 9f c2 48 8d 7d 98 e8 69 96 7a
>>>> [  667.949666] RSP: 0018:ffffc9002302bd50 EFLAGS: 00010246
>>>> [  667.949673] RAX: 0000000000000000 RBX: ffff888110600000 RCX: 
>>>> 0000000000000000
>>>> [  667.949676] RDX: 000000000000a9b5 RSI: 0000000000000405 RDI: 
>>>> 000000000000a999
>>>> [  667.949680] RBP: ffffc9002302be00 R08: ffffffffc09c3084 R09: 
>>>> ffffffffc09c3085
>>>> [  667.949684] R10: 0000000000000000 R11: 0000000000000004 R12: 
>>>> 00000000000048e0
>>>> [  667.993908] amdgpu 0002:01:00.0: Dumping IP State Completed
>>>> [  667.994229] R13: 0000000000000025 R14: 000000000000000c R15: 
>>>> 0000000000000000
>>>> [  667.994233] FS:  0000000000000000(0000) GS:ffff88c44c2c9000(0000) 
>>>> knlGS:0000000000000000
>>>> [  668.000076] amdgpu 0002:01:00.0: [drm] AMDGPU device coredump file has 
>>>> been created
>>>> [  668.008025] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [  668.008030] CR2: 00000000000000f4 CR3: 000000011195f001 CR4: 
>>>> 0000000000770ef0
>>>> [  668.008035] PKRU: 55555554
>>>> [  668.008040] Call Trace:
>>>> [  668.008045]  <TASK>
>>>> [  668.016010] amdgpu 0002:01:00.0: [drm] Check your /sys/class/drm/ 
>>>> card16/device/devcoredump/data
>>>> [  668.023967]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>> [  668.023988]  ? __pfx___drm_printfn_coredump+0x10/0x10 [drm]
>>>> [  668.031950] amdgpu 0003:01:00.0: Dumping IP State
>>>> [  668.038159]  ? __pfx___drm_puts_coredump+0x10/0x10 [drm]
>>>> [  668.083017] amdgpu 0003:01:00.0: Dumping IP State Completed
>>>> [  668.083824]  amdgpu_devcoredump_deferred_work+0x26/0xc0 [amdgpu]
>>>> [  668.086163] amdgpu 0003:01:00.0: [drm] AMDGPU device coredump file has 
>>>> been created
>>>> [  668.095863]  process_scheduled_works+0xa6/0x420
>>>> [  668.095880]  worker_thread+0x12a/0x270
>>>> [  668.101223] amdgpu 0003:01:00.0: [drm] Check your /sys/class/drm/ 
>>>> card24/device/devcoredump/data
>>>> [  668.107441]  kthread+0x10d/0x230
>>>> [  668.107451]  ? __pfx_worker_thread+0x10/0x10
>>>> [  668.107458]  ? __pfx_kthread+0x10/0x10
>>>> [  668.112709] amdgpu 0000:01:00.0: ring vcn_unified_1 timeout, signaled 
>>>> seq=9, emitted seq=10
>>>> [  668.118630]  ret_from_fork+0x17c/0x1f0
>>>> [  668.118640]  ? __pfx_kthread+0x10/0x10
>>>> [  668.118647]  ret_from_fork_asm+0x1a/0x30
>>>>
>>>> v4: fix the race concern without introducing VBIOS snapshot state.
>>>>
>>>> Suggested-by: Lijo Lazar <[email protected]>
>>>> Signed-off-by: Jesse Zhang <[email protected]>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 16 ++++++++++------
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c       |  4 ++++
>>>>   2 files changed, 14 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/ 
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
>>>> index bbb5afd67b49..5aa46480f05f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
>>>> @@ -192,12 +192,16 @@ static void amdgpu_devcoredump_fw_info(struct 
>>>> amdgpu_device *adev,
>>>>       drm_printf(p, "VPE feature version: %u, fw version: 0x%08x\n",
>>>>              adev->vpe.feature_version, adev->vpe.fw_version);
>>>> -    drm_printf(p, "\nVBIOS Information\n");
>>>> -    drm_printf(p, "vbios name       : %s\n", ctx->name);
>>>> -    drm_printf(p, "vbios pn         : %s\n", ctx->vbios_pn);
>>>> -    drm_printf(p, "vbios version    : %d\n", ctx->version);
>>>> -    drm_printf(p, "vbios ver_str    : %s\n", ctx->vbios_ver_str);
>>>> -    drm_printf(p, "vbios date       : %s\n", ctx->date);
>>>> +    if (adev->bios) {
>>>> +        drm_printf(p, "\nVBIOS Information\n");
>>>> +        drm_printf(p, "vbios name       : %s\n", ctx->name);
>>>> +        drm_printf(p, "vbios pn         : %s\n", ctx->vbios_pn);
>>>> +        drm_printf(p, "vbios version    : %d\n", ctx->version);
>>>> +        drm_printf(p, "vbios ver_str    : %s\n", ctx->vbios_ver_str);
>>>> +        drm_printf(p, "vbios date       : %s\n", ctx->date);
>>>> +    }else {
>>>> +        drm_printf(p, "\nVBIOS Information: NA\n");
>>>> +    }
>>>>   }
>>>>   static ssize_t
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/ 
>>>> gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index fbe553c38583..69f4549e6271 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -4260,6 +4260,10 @@ void amdgpu_device_fini_sw(struct amdgpu_device 
>>>> *adev)
>>>>       /* free i2c buses */
>>>>       amdgpu_i2c_fini(adev);
>>>> +#ifdef CONFIG_DEV_COREDUMP
>>>> +    /* Make sure deferred coredump formatting is done before tearing down 
>>>> VBIOS/ATOM. */
>>>> +    flush_work(&adev->coredump_work);
>>>> +#endif
>>>
>>> Looks correct to me of hand, but I'm not very familiar with this part of 
>>> the code.
>>>
>>> @Pierre-Eric and @Sunil can you take a look as well? You two have done more 
>>> with devcoredump then me.
>>
>> The worker thread doesn't access the HW but it still reads some things from 
>> the adev pointer.
>> Ideally, anything the worker needs should be copied to amdgpu_coredump_info 
>> from amdgpu_coredump.
>> Then the worker would only ever access its own state.
>>
> 
> Hi Pierre,
> 
> amdgpu_devcoredump_deferred_work -> amdgpu_devcoredump_format() -> 
> ip_block->version->funcs->print_ip_state(ip_block, &p);
> 
> This accesses hw state.
> 
> A newer version of the patch is posted which moves flush work inside fini hw 
> befor IP tear down.


That sounds reasonable to me.

I mean we don't need to copy static information into amdgpu_coredump_info (e.g. 
like BIOS info), but things like print_ip_state() clearly needs to execute 
outside of the worker since the dumped HW state is destroyed by the GPU reset.

Regards,
Christian. 

> 
> Thanks,
> Lijo
> 
>> That being said, flushing the worker is fine for now.
>>
>> Pierre-Eric
>>
>>>
>>> Thanks,
>>> Christian.
>>>
>>>
>>>>       if (adev->bios) {
>>>>           if (amdgpu_emu_mode != 1)
>>>>               amdgpu_atombios_fini(adev);
>

Re: [PATCH v4] drm/amdgpu: guard atom_context in devcoredump VBIOS dump

Reply via email to