On 2024/8/22 22:05, Mario Limonciello wrote:
> On 7/23/2024 04:42, Lu Yao wrote:
>> [Why]
>> When running kdump test on a machine with R7340 card, a hang is caused due
>> to the failure of 'amdgpu_device_ip_init()', error message as follows:
>>
>> '[drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block
>> <si_dpm> failed -22'
>> '[drm:uvd_v3_1_hw_init [amdgpu]] *ERROR* amdgpu: UVD Firmware validate
>> fail (-22).'
>> '[drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block
>> <uvd_v3_1> failed -22'
>> 'amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed'
>> 'amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init'
>>
>> This is because the caputrue kernel does not power off when it starts,
>
> Presumably you mean:
> s/caputrue/capture/
Oh, you're right. It's a mistake.
>
>> cause hardware status does not reset.
>>
>> [How]
>> Add 'is_kdump_kernel()' judgment.
>> For 'si_dpm' block, use disable and then enable.
>> For 'uvd_v3_1' block, skip loading during the initialization phase.
>>
>> Signed-off-by: Lu Yao <ya...@kylinos.cn>
>> ---
>> During test, I first modified the 'amdgpu_device_ip_hw_init_phase*', make
>> it does not end directly when a block hw_init failed.
>>
>> After analysis, 'si_dpm' block failed at 'si_dpm_enable()->
>> amdgpu_si_is_smc_running()', calling 'si_dpm_disable()' before can resolve.
>> 'uvd_v3_1' block failed at 'uvd_v3_1_hw_init()->uvd_v3_1_fw_validate()',
>> read mmUVD_FW_STATUS value is 0x27220102, I didn't find out why. But for
>> caputrue kernel, UVD is not required. Therefore, don't added this block.
>
> Hmm, a few thoughs.
>
> 1) Although you used this for the R7340, these concepts you're identifying
> probably make sense on most AMD GPUs. SUch checks might be better to uplevel
> to earlier in IP discovery code.
>
> 2) I'd actually argue we don't want to have the kdump capture kernel do ANY
> hardware init. You're going to lose hardware state which "could" be valuable
> information for debugging a problem that caused a panic.
>
So, maybe we should skip all the ip_block hw_init functions in the kdump kernel?
> That being said, I'm not really sure what framebuffer can drive the display
> across a kexec if you don't load amdgpu. What actually happens if you
> blacklist amdgpu in the capture kernel?
>
> What happens with your patch in place?
>
> At least for me I'd like to see a kernel log from both cases.
>
After adding 'initcall_blacklist=amdgpu_init' to KDUMP_CMDLINE_APPEND, the
kernel logs are as follows:

[ 4.085602][ 0] nvme0n1: p1 p2 p3 p4 p5 p6
[ 4.157927][ 0] [drm] radeon kernel modesetting enabled.
[ 4.163383][ 0] radeon 0000:01:00.0: SI support disabled by module param
[ 5.387012][ 0] initcall amdgpu_init blacklisted
[ 6.613733][ 0] initcall amdgpu_init blacklisted
[ 7.859320][ 0] mtsnd build info: e3fc429
[ 8.687512][ 0] EXT4-fs (nvme0n1p3): orphan cleanup on readonly fs
[ 8.694035][ 0] EXT4-fs (nvme0n1p3): mounted filesystem 75c1e96b-cef8-4ed3-86ea-45010c7b859c ro with ordered data mode. Quota mode: none.
[ 9.309862][ 0] device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
[ 9.325236][ 0] device-mapper: uevent: version 1.0.3
[ 9.330946][ 0] systemd[1]: Starting modprobe@fuse.service - Load Kernel Module fuse...
[ 9.341512][ 0] device-mapper: ioctl: 4.48.0-ioctl (2023-03-01) initialised: dm-de...@redhat.com
[ 9.380944][ 0] fuse: init (API version 7.39)
[ 9.390196][ 0] loop: module loaded
[ 9.486957][ 0] lp: driver loaded but no devices found
[ 9.494904][ 0] EXT4-fs (nvme0n1p3): re-mounted 75c1e96b-cef8-4ed3-86ea-45010c7b859c r/w. Quota mode: none.
[ 9.505931][ 0] systemd[1]: Starting systemd-udev-trigger.service - Coldplug All udev Devices...
[ 9.518899][ 0] ppdev: user-space parallel port driver
[ 9.524908][ 0] systemd[1]: Started systemd-journald.service - Journal Service.
[ 9.574209][ 0] systemd-journald[350]: Received client request to flush runtime journal.
[ 10.118484][ 0] snd_hda_intel 0000:00:1f.3: Unknown capability 0
[ 11.590124][ 0] hdaudio hdaudioC0D2: Unable to configure, disabling
[ 23.892640][ 0] reboot: Restarting system

With my patch in place:

[ 4.074629][ 0] nvme0n1: p1 p2 p3 p4 p5 p6
[ 4.146956][ 0] [drm] radeon kernel modesetting enabled.
[ 4.152409][ 0] radeon 0000:01:00.0: SI support disabled by module param
[ 5.379207][ 0] [drm] amdgpu kernel modesetting enabled.
[ 5.384909][ 0] amdgpu: Virtual CRAT table created for CPU
[ 5.390514][ 0] amdgpu: Topology: Add CPU node
[ 5.395225][ 0] [drm] initializing kernel modesetting (OLAND 0x1002:0x6611 0x1642:0x1869 0x87).
[ 5.404040][ 0] [drm] register mmio base: 0xA1600000
[ 5.409118][ 0] [drm] register mmio size: 262144
[ 5.413864][ 0] [drm] add ip block number 0 <si_common>
[ 5.419207][ 0] [drm] add ip block number 1 <gmc_v6_0>
[ 5.424448][ 0] [drm] add ip block number 2 <si_ih>
[ 5.429427][ 0] [drm] add ip block number 3 <gfx_v6_0>
[ 5.434668][ 0] [drm] add ip block number 4 <si_dma>
[ 5.439733][ 0] [drm] add ip block number 5 <si_dpm>
[ 5.444803][ 0] [drm] add ip block number 6 <dce_v6_0>
[ 5.450051][ 0] amdgpu 0000:01:00.0: amdgpu: Fetched VBIOS from VFCT
[ 5.456517][ 0] amdgpu: ATOM BIOS: 113-RADEONI6910-B03-BT
[ 5.462023][ 0] kfd kfd: amdgpu: OLAND not supported in kfd
[ 5.467857][ 0] amdgpu 0000:01:00.0: vgaarb: deactivate vga console
[ 5.474239][ 0] amdgpu 0000:01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 5.482781][ 0] amdgpu 0000:01:00.0: amdgpu: PCIE atomic ops is not supported
[ 5.490242][ 0] [drm] PCIE gen 3 link speeds already enabled
[ 5.496017][ 0] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[ 5.504778][ 0] amdgpu 0000:01:00.0: amdgpu: VRAM: 1024M 0x000000F400000000 - 0x000000F43FFFFFFF (1024M used)
[ 5.514812][ 0] amdgpu 0000:01:00.0: amdgpu: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
[ 5.523710][ 0] [drm] Detected VRAM RAM=1024M, BAR=1024M
[ 5.529133][ 0] [drm] RAM width 32bits GDDR5
[ 5.533532][ 0] [drm] amdgpu: 1024M of VRAM memory ready
[ 5.538963][ 0] [drm] amdgpu: 225M of GTT memory ready.
[ 5.544293][ 0] [drm] GART: num cpu pages 262144, num gpu pages 262144
[ 5.550950][ 0] amdgpu 0000:01:00.0: amdgpu: PCIE GART of 1024M enabled (table at 0x000000F400E00000).
[ 5.560859][ 0] [drm] Internal thermal controller with fan control
[ 5.567163][ 0] [drm] amdgpu: dpm initialized
[ 5.571642][ 0] [drm] AMDGPU Display Connectors
[ 5.576278][ 0] [drm] Connector 0:
[ 5.579782][ 0] [drm] HDMI-A-1
[ 5.583108][ 0] [drm] HPD2
[ 5.586088][ 0] [drm] DDC: 0x1950 0x1950 0x1951 0x1951 0x1952 0x1952 0x1953 0x1953
[ 5.593937][ 0] [drm] Encoders:
[ 5.597353][ 0] [drm] DFP1: INTERNAL_UNIPHY
[ 5.601985][ 0] [drm] Connector 1:
[ 5.605488][ 0] [drm] VGA-1
[ 5.608553][ 0] [drm] DDC: 0x194c 0x194c 0x194d 0x194d 0x194e 0x194e 0x194f 0x194f
[ 5.616400][ 0] [drm] Encoders:
[ 5.619807][ 0] [drm] CRT1: INTERNAL_KLDSCP_DAC1
[ 5.985857][ 0] amdgpu 0000:01:00.0: amdgpu: SE 1, SH per SE 1, CU per SH 6, active_cu_number 6
[ 6.346743][ 0] [drm] Initialized amdgpu 3.54.0 20150101 for 0000:01:00.0 on minor 0
[ 6.433683][ 0] fbcon: amdgpudrmfb (fb0) is primary device
[ 6.439260][ 0] Console: switching to colour frame buffer device 240x67
[ 6.454578][ 0] amdgpu 0000:01:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[ 6.816426][ 0] mtsnd build info: e3fc429
[ 7.827506][ 0] EXT4-fs (nvme0n1p3): orphan cleanup on readonly fs
[ 7.834021][ 0] EXT4-fs (nvme0n1p3): mounted filesystem 75c1e96b-cef8-4ed3-86ea-45010c7b859c ro with ordered data mode. Quota mode: none.
[ 8.502847][ 0] device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
[ 8.517899][ 0] systemd[1]: Starting modprobe@fuse.service - Load Kernel Module fuse...
[ 8.526044][ 0] device-mapper: uevent: version 1.0.3
[ 8.531923][ 0] systemd[1]: Starting modprobe@loop.service - Load Kernel Module loop...
[ 8.545910][ 0] systemd[1]: systemd-fsck-root.service - File System Check on Root Device was skipped because of an unmet condition check (ConditionPathExists=!/run/initramfs/fsck-root).
[ 8.564367][ 0] fuse: init (API version 7.39)
[ 8.568872][ 0] device-mapper: ioctl: 4.48.0-ioctl (2023-03-01) initialised: dm-de...@redhat.com
[ 8.581889][ 0] systemd[1]: Starting systemd-journald.service - Journal Service...
[ 8.591857][ 0] loop: module loaded
[ 8.639020][ 0] lp: driver loaded but no devices found
[ 8.662288][ 0] systemd[1]: systemd-tpm2-setup-early.service - TPM2 SRK Setup (Early) was skipped because of an unmet condition check (ConditionSecurity=measured-uki).
[ 8.685851][ 0] ppdev: user-space parallel port driver
[ 8.697866][ 0] EXT4-fs (nvme0n1p3): re-mounted 75c1e96b-cef8-4ed3-86ea-45010c7b859c r/w. Quota mode: none.
[ 9.362160][ 0] snd_hda_intel 0000:00:1f.3: Unknown capability 0
[ 9.716497][ 0] hdaudio hdaudioC0D2: Unable to configure, disabling
[ 20.101499][ 0] reboot: Restarting system

Compared with the blacklist method, the amdgpu driver can complete
initialization with the patch applied. From external observation, more of the
startup animation is shown (of course, this is meaningless, because the system
restarts immediately).
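
For anyone following the thread without the diff at hand, the [How] part of
the patch can be modelled roughly as below. This is only a standalone sketch,
not the amdgpu code: every function and variable here is a hypothetical
stand-in (the `_stub` and `_sketch` names do not exist in the kernel), and the
"stale SMC state" flag is an assumption used to mimic the -22 failure seen at
'si_dpm_enable()->amdgpu_si_is_smc_running()'.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's is_kdump_kernel(): a plain flag. */
static bool in_kdump_kernel;

/* Assumed model of SMC state left running by the crashed kernel. */
static bool smc_running;

static bool is_kdump_kernel_stub(void)
{
    return in_kdump_kernel;
}

/* Models si_dpm_disable(): stops the SMC so it can be enabled again. */
static void dpm_disable_stub(void)
{
    smc_running = false;
}

/* Models si_dpm_enable(): returns -22 (EINVAL) if the SMC is already running. */
static int dpm_enable_stub(void)
{
    if (smc_running)
        return -22;
    smc_running = true;
    return 0;
}

/* si_dpm idea from the patch: in a capture kernel, disable before enabling. */
static int dpm_hw_init_sketch(void)
{
    if (is_kdump_kernel_stub())
        dpm_disable_stub();
    return dpm_enable_stub();
}

/* uvd_v3_1 idea from the patch: do not add the block in a capture kernel. */
static bool should_add_uvd_block_sketch(void)
{
    return !is_kdump_kernel_stub();
}
```

With stale state and no kdump check, the enable step fails in this model; with
the check, the disable call clears the state first and the UVD block is left
out. Whether hardware init should run at all in the capture kernel is still
the open question raised in the review above.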