Hello Michael,

On 26/06/24 14:57, Michael Ellerman wrote:
> Nicholas Piggin <npig...@gmail.com> writes:
>> kexec on pseries disables AIL (reloc_on_exc), required for scv
>> instruction support, before other CPUs have been shut down. This means
>> they can execute scv instructions after AIL is disabled, which causes an
>> interrupt at an unexpected entry location that crashes the kernel.
>>
>> Change the kexec sequence to disable AIL after other CPUs have been
>> brought down.
>>
>> As a refresher, the real-mode scv interrupt vector is 0x17000, and the
>> fixed-location head code probably couldn't easily deal with implementing
>> such high addresses so it was just decided not to support that interrupt
>> at all.
>>
>> Reported-by: Sourabh Jain <sourabhj...@linux.ibm.com>
> Was this reported publicly? I don't remember it.

No, I didn't report this issue publicly.

While debugging a kexec issue, the git bisect pointed to the commit mentioned
in the patch description. So, I contacted Nick directly.

With `kexec -e` and --smt=off, the first kernel hits an exception when wake_offline_cpus() -> add_cpu() is called
to bring up offline CPUs.

Console log:

[   68.824514] restraintd[899]: * Parsing recipe
[   68.825546] restraintd[899]: * Running recipe
[   68.825591] restraintd[899]: ** Continuing task: 20291 [/mnt/tests/distribution/reservesys]
[   68.834095] restraintd[899]: ** Preparing metadata
[   68.872927] restraintd[899]: ** Refreshing peer role hostnames: Retries 0
[   68.911107] restraintd[899]: ** Updating env vars
[   68.911737] restraintd[899]: *** Current Time: Tue May 21 09:09:42 2024  Localwatchdog at:  * Disabled! *
[   68.922803] restraintd[899]: ** Running task: 20291 [/distribution/reservesys]
[   78.027943] Removing IBM Power 842 compression device
[   78.093777] XFS (sda2): Block device removal (0x20) detected at xfs_fs_shutdown+0x34/0x50 [xfs] (fs/xfs/xfs_super.c:1179). Shutting down filesystem.
[   78.093894] XFS (sda2): Please unmount the filesystem and rectify the problem(s)
[   83.450854] dm-0: writeback error on inode 17086756, offset 569344, sector 11026136
[   83.450910] dm-0: writeback error on inode 36421601, offset 0, sector 20772504
[   84.021819] dm-0: writeback error on inode 36382045, offset 0, sector 20772536
[   84.094348] dm-0: writeback error on inode 18703102, offset 0, sector 11021000
[   84.601228] dm-0: writeback error on inode 51268015, offset 0, sector 27663152
[   84.601468] dm-0: writeback error on inode 58225471, offset 0, sector 34636080
[   85.370996] kexec_core: Starting new kernel
[   85.391013] kexec: Waking offline cpu 1.
[   85.391038] ------------[ cut here ]------------
[   85.391042] kernel BUG at arch/powerpc/kernel/exceptions-64s.S:501!
[   85.391047] Oops: Exception in kernel mode, sig: 5 [#1]
[   85.391051] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[   85.391056] Modules linked in: bonding tls rfkill pseries_rng vmx_crypto drm fuse drm_panel_orientation_quirks xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
[   85.391086] CPU: 0 PID: 565 Comm: systemd-journal Kdump: loaded Not tainted 6.9.0+ #1
[   85.391092] Hardware name: IBM,9008-22L POWER9 (raw) 0x4e0202 0xf000005 of:IBM,FW950.A0 (VL950_144) hv:phyp pSeries
[   85.391096] NIP:  c0000000000089a4 LR: 000000000001703c CTR: c000000000008980
[   85.391101] REGS: c00000000f76fd60 TRAP: 0700   Not tainted (6.9.0+)
[   85.391106] MSR:  8000000000021031 <SF,ME,IR,DR,LE>  CR: 240022d4  XER: 00000000
[   85.391116] CFAR: c00000000000899c IRQMASK: 0
[   85.391116] GPR00: 0000000000000003 00007fffc4f783a0 00007fff9f0a7200 0000010014331bb8
[   85.391116] GPR04: 00007fffc4f7b078 000000000000c4f6 00007fffc4f7b1d0 00000100143469a0
[   85.391116] GPR08: 00007fff9f489268 00000000440022d4 00007fffc4f78670 00000000000ac588
[   85.391116] GPR12: 8000000000009003 c000000002f50000 0000000000000000 0000000000000000
[   85.391116] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   85.391116] GPR20: 0000000000000000 0000000000000000 0000000127117b48 00000001271185b8
[   85.391116] GPR24: 0000000127117b90 00007fffc4f7b070 0000010014331540 00007fffc4f7b078
[   85.391116] GPR28: 0000000000000000 00007fffc4f78f80 000000000000c4f6 0000010014331ba0
[   85.391173] NIP [c0000000000089a4] data_access_common_virt+0x14/0x220
[   85.391181] LR [000000000001703c] 0x1703c
[   85.391186] Call Trace:
[   85.391189] Code: 48024df9 48000000 60000000 e94d0020 694a0002 7d400164 60000000 718a4000 7c2a0b78 3821fd30 41c20008 e82d0910 <0981fd30> f9210160 f9610130 f9810138
[   85.391208] ---[ end trace 0000000000000000 ]---
[   85.394302] pstore: backend (nvram) writing error (-1)
[   85.394306]
[   86.394309] Kernel panic - not syncing: Fatal exception
[   86.399970] Rebooting in 10 seconds..
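
For reference, the LR value in the trace (0x1703c) can be compared against the real-mode scv vector base quoted in the patch description (0x17000). The following is only a minimal Python sketch of that arithmetic; the helper name is illustrative and not taken from the kernel, and the offset by itself does not prove what executed at that address.

```python
# Real-mode scv interrupt vector base, as quoted in the patch description.
SCV_REAL_MODE_VECTOR = 0x17000


def offset_from_scv_vector(addr: int) -> int:
    """Return the byte offset of addr from the real-mode scv vector base.

    Hypothetical helper for reading the oops, not a kernel function.
    """
    return addr - SCV_REAL_MODE_VECTOR


# LR value reported in the oops: "LR [000000000001703c] 0x1703c"
lr = 0x1703C
print(hex(offset_from_scv_vector(lr)))
```

The result places LR a small offset past the 0x17000 base, which seems consistent with the unexpected entry location described in the patch.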


Thanks,
Sourabh Jain
