Hello Michael,
On 26/06/24 14:57, Michael Ellerman wrote:
> Nicholas Piggin <npig...@gmail.com> writes:
>> kexec on pseries disables AIL (reloc_on_exc), required for scv
>> instruction support, before other CPUs have been shut down. This means
>> they can execute scv instructions after AIL is disabled, which causes an
>> interrupt at an unexpected entry location that crashes the kernel.
>>
>> Change the kexec sequence to disable AIL after other CPUs have been
>> brought down.
>>
>> As a refresher, the real-mode scv interrupt vector is 0x17000, and the
>> fixed-location head code probably couldn't easily deal with implementing
>> such high addresses so it was just decided not to support that interrupt
>> at all.
>>
>> Reported-by: Sourabh Jain <sourabhj...@linux.ibm.com>
>
> Was this reported publicly? I don't remember it.

No, I didn't report this issue publicly.

While debugging a kexec issue, git bisect pointed to the commit mentioned
in the patch description, so I contacted Nick directly.

On `kexec -e` with --smt=off, the first kernel hits an exception when
wake_offline_cpus() -> add_cpu() is called to bring up the offline CPUs.

Console log:
[ 68.824514] restraintd[899]: * Parsing recipe
[ 68.825546] restraintd[899]: * Running recipe
[ 68.825591] restraintd[899]: ** Continuing task: 20291 [/mnt/tests/distribution/reservesys]
[ 68.834095] restraintd[899]: ** Preparing metadata
[ 68.872927] restraintd[899]: ** Refreshing peer role hostnames: Retries 0
[ 68.911107] restraintd[899]: ** Updating env vars
[ 68.911737] restraintd[899]: *** Current Time: Tue May 21 09:09:42 2024 Localwatchdog at: * Disabled! *
[ 68.922803] restraintd[899]: ** Running task: 20291 [/distribution/reservesys]
[ 78.027943] Removing IBM Power 842 compression device
[ 78.093777] XFS (sda2): Block device removal (0x20) detected at xfs_fs_shutdown+0x34/0x50 [xfs] (fs/xfs/xfs_super.c:1179). Shutting down filesystem.
[ 78.093894] XFS (sda2): Please unmount the filesystem and rectify the problem(s)
[ 83.450854] dm-0: writeback error on inode 17086756, offset 569344, sector 11026136
[ 83.450910] dm-0: writeback error on inode 36421601, offset 0, sector 20772504
[ 84.021819] dm-0: writeback error on inode 36382045, offset 0, sector 20772536
[ 84.094348] dm-0: writeback error on inode 18703102, offset 0, sector 11021000
[ 84.601228] dm-0: writeback error on inode 51268015, offset 0, sector 27663152
[ 84.601468] dm-0: writeback error on inode 58225471, offset 0, sector 34636080
[ 85.370996] kexec_core: Starting new kernel
[ 85.391013] kexec: Waking offline cpu 1.
[ 85.391038] ------------[ cut here ]------------
[ 85.391042] kernel BUG at arch/powerpc/kernel/exceptions-64s.S:501!
[ 85.391047] Oops: Exception in kernel mode, sig: 5 [#1]
[ 85.391051] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[ 85.391056] Modules linked in: bonding tls rfkill pseries_rng vmx_crypto drm fuse drm_panel_orientation_quirks xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
[ 85.391086] CPU: 0 PID: 565 Comm: systemd-journal Kdump: loaded Not tainted 6.9.0+ #1
[ 85.391092] Hardware name: IBM,9008-22L POWER9 (raw) 0x4e0202 0xf000005 of:IBM,FW950.A0 (VL950_144) hv:phyp pSeries
[ 85.391096] NIP: c0000000000089a4 LR: 000000000001703c CTR: c000000000008980
[ 85.391101] REGS: c00000000f76fd60 TRAP: 0700 Not tainted (6.9.0+)
[ 85.391106] MSR: 8000000000021031 <SF,ME,IR,DR,LE> CR: 240022d4 XER: 00000000
[ 85.391116] CFAR: c00000000000899c IRQMASK: 0
[ 85.391116] GPR00: 0000000000000003 00007fffc4f783a0 00007fff9f0a7200 0000010014331bb8
[ 85.391116] GPR04: 00007fffc4f7b078 000000000000c4f6 00007fffc4f7b1d0 00000100143469a0
[ 85.391116] GPR08: 00007fff9f489268 00000000440022d4 00007fffc4f78670 00000000000ac588
[ 85.391116] GPR12: 8000000000009003 c000000002f50000 0000000000000000 0000000000000000
[ 85.391116] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 85.391116] GPR20: 0000000000000000 0000000000000000 0000000127117b48 00000001271185b8
[ 85.391116] GPR24: 0000000127117b90 00007fffc4f7b070 0000010014331540 00007fffc4f7b078
[ 85.391116] GPR28: 0000000000000000 00007fffc4f78f80 000000000000c4f6 0000010014331ba0
[ 85.391173] NIP [c0000000000089a4] data_access_common_virt+0x14/0x220
[ 85.391181] LR [000000000001703c] 0x1703c
[ 85.391186] Call Trace:
[ 85.391189] Code: 48024df9 48000000 60000000 e94d0020 694a0002 7d400164 60000000 718a4000 7c2a0b78 3821fd30 41c20008 e82d0910 <0981fd30> f9210160 f9610130 f9810138
[ 85.391208] ---[ end trace 0000000000000000 ]---
[ 85.394302] pstore: backend (nvram) writing error (-1)
[ 85.394306]
[ 86.394309] Kernel panic - not syncing: Fatal exception
[ 86.399970] Rebooting in 10 seconds..
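
As a rough cross-check of the log above against the patch description (an
illustrative sketch only, not part of the kernel or of the patch): the LR
value in the oops can be compared with the real-mode scv vector address
0x17000 that Nick mentions. The helper name and the regular expression are
my own assumptions; only the two addresses come from this thread. An LR just
past 0x17000 would be consistent with execution having entered at the
real-mode scv vector, though it does not prove it.

```python
import re

# Real-mode scv interrupt vector, as stated in the patch description.
SCV_REAL_MODE_VECTOR = 0x17000


def lr_offset_from_scv_vector(oops_line):
    """Return LR's byte offset from the real-mode scv vector.

    Hypothetical helper: expects a line of the form 'LR [<hex>] ...'
    and returns None if no such field is present.
    """
    match = re.search(r"\bLR\s*\[([0-9a-fA-F]+)\]", oops_line)
    if match is None:
        return None
    return int(match.group(1), 16) - SCV_REAL_MODE_VECTOR


# The LR line copied from the console log above.
line = "[ 85.391181] LR [000000000001703c] 0x1703c"
print(hex(lr_offset_from_scv_vector(line)))  # prints 0x3c
```

So LR sits 0x3c bytes past 0x17000 in this trace.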
Thanks,
Sourabh Jain