https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=267028
--- Comment #307 from Mark Millard <marklmi26-f...@yahoo.com> ---
(In reply to Mark Millard from comment #306)

Going backwards through part of the list node allocations (before each node is filled in, but showing the container and modname addresses that are to be assigned in each case) . . .

(kgdb) print modlist_newmod_hist[modlist_newmod_hist_pos]
$7 = {modAddr = 0xfffff8000471eac0, containerAddr = 0xfffff800038caa80, modnameAddr = 0xffffffff82ea6025 "amdgpu_raven_vcn_bin_fw", version = 1}
(kgdb) print modlist_newmod_hist[modlist_newmod_hist_pos-1]
$8 = {modAddr = 0xfffff8000471e900, containerAddr = 0xfffff800038cac00, modnameAddr = 0xffffffff82e62026 "amdgpu_raven_mec2_bin_fw", version = 1}
(kgdb) print modlist_newmod_hist[modlist_newmod_hist_pos-2]
$9 = {modAddr = 0xfffff800046581c0, containerAddr = 0xfffff8000464a600, modnameAddr = 0xffffffff82e1e010 "amdgpu_raven_mec_bin_fw", version = 1}
(kgdb) print modlist_newmod_hist[modlist_newmod_hist_pos-3]
$10 = {modAddr = 0xfffff80004574040, containerAddr = 0xfffff800038c9000, modnameAddr = 0xffffffff82e12009 "amdgpu_raven_rlc_bin_fw", version = 1}
(kgdb) print modlist_newmod_hist[modlist_newmod_hist_pos-4]
$11 = {modAddr = 0xfffff80004574100, containerAddr = 0xfffff800038c9300, modnameAddr = 0xffffffff829f6010 "amdgpu_raven_ce_bin_fw", version = 1}
(kgdb) print modlist_newmod_hist[modlist_newmod_hist_pos-5]
$12 = {modAddr = 0xfffff800036f00c0, containerAddr = 0xfffff80004ad6c00, modnameAddr = 0xffffffff829ef000 "amdgpu_raven_me_bin_fw", version = 1}
(kgdb) print modlist_newmod_hist[modlist_newmod_hist_pos-6]
$13 = {modAddr = 0xfffff8000471e980, containerAddr = 0xfffff800038c9480, modnameAddr = 0xffffffff829e7025 "amdgpu_raven_pfp_bin_fw", version = 1}

Going backwards through that part of the list later, after the failure:

(kgdb) print *(modlist_t)0xfffff8000471eac0
$24 = {link = {tqe_next = 0x0, tqe_prev = 0xfffff8000471e900}, container = 0xfffff800038caa80, name = 0xffffffff82ea6025 "amdgpu_raven_vcn_bin_fw", version = 1}
(kgdb) print *(modlist_t)0xfffff8000471e900
$25 = {link = {tqe_next = 0xfffff8000471eac0, tqe_prev = 0xfffff800046581c0}, container = 0xfffff800038cac00, name = 0xffffffff82e62026 "amdgpu_raven_mec2_bin_fw", version = 1}
. . .
(kgdb) print *(modlist_t)0xfffff800046581c0
$27 = {link = {tqe_next = 0xfffff8000471e900, tqe_prev = 0xfffff80004574040}, container = 0xfffff8000464a600, name = 0xffffffff82e1e010 "amdgpu_raven_mec_bin_fw", version = 1}
(kgdb) print *(modlist_t)0xfffff80004574040
$28 = {link = {tqe_next = 0xfffff800046581c0, tqe_prev = 0xfffff80004574100}, container = 0xfffff800038c9000, name = 0xffffffff82e12009 "amdgpu_raven_rlc_bin_fw", version = 1}
(kgdb) print *(modlist_t)0xfffff80004574100
$29 = {link = {tqe_next = 0xfffff80004574040, tqe_prev = 0xfffff800036f00c0}, container = 0xfffff800038c9300, name = 0xffffffff829f6010 "amdgpu_raven_ce_bin_fw", version = 1}
(kgdb) print *(modlist_t)0xfffff800036f00c0
$30 = {link = {tqe_next = 0xfffff80000000007, tqe_prev = 0xfffff8000471e980}, container = 0xfffff80004ad6c00, name = 0xffffffff829ef000 "amdgpu_raven_me_bin_fw", version = 1}

NOTE THE BAD tqe_next == 0xfffff80000000007 ABOVE.

(kgdb) print *(modlist_t)0xfffff8000471e980
$31 = {link = {tqe_next = 0xfffff800036f00c0, tqe_prev = 0xfffff800036f0100}, container = 0xfffff800038c9480, name = 0xffffffff829e7025 "amdgpu_raven_pfp_bin_fw", version = 1}

So: all the nodes are there, but just one ends up with the odd tqe_next == 0xfffff80000000007 corruption. No allocation ever returned 0xfffff80000000007 (none was recorded, and I had set things up to panic just after any allocation returning such a value).
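For reference, the allocation-history instrumentation whose state is dumped above can be sketched roughly as follows. This is a hypothetical reconstruction: only the array and field names (modlist_newmod_hist, modlist_newmod_hist_pos, modAddr, containerAddr, modnameAddr, version) come from the kgdb output; HIST_SLOTS and record_newmod() are assumptions for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical sketch of the debug instrumentation: a fixed-size
 * history array recording, for each modlist node allocation, the node
 * address plus the container and modname addresses that are about to
 * be assigned into it.
 */
#define HIST_SLOTS 256	/* assumed size, not from the actual patch */

struct newmod_rec {
	void       *modAddr;	   /* node address the allocator returned */
	void       *containerAddr; /* container about to be assigned */
	const char *modnameAddr;   /* module name about to be assigned */
	int         version;
};

static struct newmod_rec modlist_newmod_hist[HIST_SLOTS];
static size_t modlist_newmod_hist_pos;

static void
record_newmod(void *mod, void *container, const char *name, int ver)
{
	/* Advance first, so _pos always indexes the most recent record. */
	modlist_newmod_hist_pos = (modlist_newmod_hist_pos + 1) % HIST_SLOTS;
	modlist_newmod_hist[modlist_newmod_hist_pos] =
	    (struct newmod_rec){ mod, container, name, ver };
}
```

Walking backwards from modlist_newmod_hist_pos, as in the kgdb session above, then yields the records most-recent-first.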
Something replaced the intended:

*(modlist_t)0xfffff800036f00c0.link.tqe_next == 0xfffff80004574100

with:

*(modlist_t)0xfffff800036f00c0.link.tqe_next == 0xfffff80000000007

The scans of the list were okay as of setting up each of (listed in execution order, not backwards list order):

"amdgpu_raven_ce_bin_fw"
"amdgpu_raven_rlc_bin_fw"
"amdgpu_raven_mec_bin_fw"
"amdgpu_raven_mec2_bin_fw"
"amdgpu_raven_vcn_bin_fw"

But as of (the first after "amdgpu_raven_vcn_bin_fw"):

"acpi_wmi"

the list had the corrupted link.tqe_next associated with "amdgpu_raven_me_bin_fw". This suggests the corruption happened at/after the generation of:

drmn0: successfully loaded firmware image 'amdgpu/raven_vcn.bin'

during the generation of the sequence:

<6>[drm] Found VCN firmware Version ENC: 1.13 DEC: 2 VEP: 0 Revision: 4
drmn0: Will use PSP to load VCN firmware
<6>[drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
drmn0: RAS: optional ras ta ucode is not available
drmn0: RAP: optional rap ta ucode is not available
<6>[drm] kiq ring mec 2 pipe 1 q 0
<6>[drm] DM_PPLIB: values for F clock
<6>[drm] DM_PPLIB: 400000 in kHz, 3649 in mV
<6>[drm] DM_PPLIB: 933000 in kHz, 4074 in mV
<6>[drm] DM_PPLIB: 1200000 in kHz, 4399 in mV
<6>[drm] DM_PPLIB: 1333000 in kHz, 4399 in mV
<6>[drm] DM_PPLIB: values for DCF clock
<6>[drm] DM_PPLIB: 300000 in kHz, 3649 in mV
<6>[drm] DM_PPLIB: 600000 in kHz, 4074 in mV
<6>[drm] DM_PPLIB: 626000 in kHz, 4250 in mV
<6>[drm] DM_PPLIB: 654000 in kHz, 4399 in mV
<6>[drm] Display Core initialized with v3.2.104!
lkpi_iic0: <LinuxKPI I2C> on drmn0
iicbus0: <Philips I2C bus> on lkpi_iic0
iic0: <I2C generic I/O> on iicbus0
lkpi_iic1: <LinuxKPI I2C> on drmn0
iicbus1: <Philips I2C bus> on lkpi_iic1
iic1: <I2C generic I/O> on iicbus1
<6>[drm] VCN decode and encode initialized successfully(under SPG Mode).
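The kind of per-insertion scan described above can be sketched with the standard <sys/queue.h> TAILQ macros, which is what the kernel's modlist uses. This is only an illustrative consistency check under a simplified struct layout, not the actual instrumentation; note that a wild tqe_next (such as 0xfffff80000000007) cannot be followed safely by any forward walk, which is why the kgdb session above printed the nodes via their recorded allocation addresses instead.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Simplified stand-in for the kernel's modlist node. */
struct modnode {
	TAILQ_ENTRY(modnode) link;
	const char *name;
};
TAILQ_HEAD(modhead, modnode);

/*
 * Verify the TAILQ back-link invariant: each node's tqe_prev must
 * point at the previous node's tqe_next field (or at the head's
 * tqh_first for the first node).  This catches a smashed tqe_prev
 * directly; it checks the expected back-pointer before stepping, so
 * it never dereferences a node it has not yet validated.
 */
static int
modlist_scan_ok(struct modhead *head)
{
	struct modnode **prevp = &head->tqh_first;
	struct modnode *m;

	for (m = head->tqh_first; m != NULL; m = m->link.tqe_next) {
		if (m->link.tqe_prev != prevp)
			return (0);	/* linkage corrupted here */
		prevp = &m->link.tqe_next;
	}
	return (1);
}
```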
drmn0: SE 1, SH per SE 1, CU per SH 11, active_cu_number 8
<6>[drm] fb mappable at 0x60BCA000
<6>[drm] vram apper at 0x60000000
<6>[drm] size 8294400
<6>[drm] fb depth is 24
<6>[drm] pitch is 7680
VT: Replacing driver "vga" with new "fb".
start FB_INFO:
type=11 height=1080 width=1920 depth=32
pbase=0x60bca000 vbase=0xfffff80060bca000
name=drmn0 flags=0x0 stride=7680 bpp=32
end FB_INFO
drmn0: ring gfx uses VM inv eng 0 on hub 0
drmn0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
drmn0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
drmn0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
drmn0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
drmn0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
drmn0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
drmn0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
drmn0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
drmn0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
drmn0: ring sdma0 uses VM inv eng 0 on hub 1
drmn0: ring vcn_dec uses VM inv eng 1 on hub 1
drmn0: ring vcn_enc0 uses VM inv eng 4 on hub 1
drmn0: ring vcn_enc1 uses VM inv eng 5 on hub 1
drmn0: ring jpeg_dec uses VM inv eng 6 on hub 1
vgapci0: child drmn0 requested pci_get_powerstate
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
<6>[drm] Initialized amdgpu 3.40.0 20150101 for drmn0 on minor 0

Or during the very early stages of setting up:

acpi_wmi.ko

The mismatch was detected during the first modlist_lookup of the found_modules list for the setup of acpi_wmi.ko. The "during" text above seems to happen during activity from the likes of:

/wrkdirs/usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_7/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c

(given that the raven firmware is in use as well?).

--
You are receiving this mail because:
You are the assignee for the bug.
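For context, modlist_lookup() in sys/kern/kern_linker.c is (roughly paraphrased, not quoted verbatim) a linear TAILQ_FOREACH walk of the found_modules list, so a wild tqe_next like the one above gets dereferenced as soon as the walk steps past the corrupted node. A simplified, self-contained sketch:

```c
#include <stddef.h>
#include <string.h>
#include <sys/queue.h>

/* Simplified modlist node, matching the fields in the kgdb dumps. */
typedef struct modlist {
	TAILQ_ENTRY(modlist) link;
	struct linker_file  *container;
	const char          *name;
	int                  version;
} *modlist_t;

TAILQ_HEAD(modlisthead, modlist);
static struct modlisthead found_modules =
    TAILQ_HEAD_INITIALIZER(found_modules);

/*
 * Rough paraphrase of the kernel's lookup: walk found_modules front
 * to back comparing names (and version, if a nonzero one is asked
 * for).  With tqe_next smashed to 0xfffff80000000007, the walk loads
 * that value as the next node and faults on the strcmp() dereference.
 */
static modlist_t
modlist_lookup(const char *name, int ver)
{
	modlist_t mod;

	TAILQ_FOREACH(mod, &found_modules, link) {
		if (strcmp(mod->name, name) == 0 &&
		    (ver == 0 || mod->version == ver))
			return (mod);
	}
	return (NULL);
}
```

That would place the detected fault exactly where reported: the first modlist_lookup for acpi_wmi.ko, walking past the "amdgpu_raven_me_bin_fw" node.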