On 02/08/2017 20:03, Nathan Fontenot wrote:
> When DLPAR adding or removing memory we need to check the device
> offline status before trying to online/offline the memory. This is
> needed because calls to device_online() and device_offline() will return
> non-zero for memory that is already online and offline respectively.
> 
> This update resolves two scenarios. First, for a kernel built with
> auto-online memory enabled, memory will be onlined as part of calls
> to add_memory(). After adding the memory the pseries dlpar code tries
> to online it and fails since the memory is already online. The dlpar
> code then tries to remove the memory which produces the oops message
> below because the memory is not offline.
> 
> The second scenario occurs when removing memory that is already offline,
> i.e. marking memory offline (via sysfs) and then trying to remove that
> memory. This doesn't work because offlining the already offline memory
> does not succeed and the dlpar code then fails the dlpar remove operation.
> 
> The fix for both scenarios is to check the device.offline status before
> making the calls to device_online() or device_offline().
> 
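For readers following the thread, the check described above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the patch itself: struct fake_device, fake_device_online(), fake_device_offline() and the two checked_* wrappers are hypothetical stand-ins. They are assumed to behave as the commit message describes for device_online() and device_offline(), i.e. they return non-zero when the device is already in the requested state.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's struct device;
 * only the offline flag matters for this sketch. */
struct fake_device {
    bool offline;
};

/* Assumed behaviour: non-zero if the device is already online. */
int fake_device_online(struct fake_device *dev)
{
    if (!dev->offline)
        return 1;
    dev->offline = false;
    return 0;
}

/* Assumed behaviour: non-zero if the device is already offline. */
int fake_device_offline(struct fake_device *dev)
{
    if (dev->offline)
        return 1;
    dev->offline = true;
    return 0;
}

/* Check the offline status first, so already-online memory
 * (e.g. auto-onlined during add) is not treated as a failure. */
int checked_online(struct fake_device *dev)
{
    if (!dev->offline)
        return 0;
    return fake_device_online(dev);
}

/* Same idea for removal: memory already offlined via sysfs
 * should not make the remove path fail. */
int checked_offline(struct fake_device *dev)
{
    if (dev->offline)
        return 0;
    return fake_device_offline(dev);
}
```

With the unchecked helpers, onlining an already-online device reports an error, which is what sends the add path into its failing cleanup in the first scenario. The checked wrappers return 0 in both "already done" cases.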
> kernel BUG at mm/memory_hotplug.c:2189!
> Oops: Exception in kernel mode, sig: 5 [#1]
> SMP NR_CPUS=2048
> NUMA
> pSeries
> CPU: 0 PID: 5 Comm: kworker/u129:0 Not tainted 4.12.0-rc3 #272
> Workqueue: pseries hotplug workque .pseries_hp_work_fn
> task: c0000003f9c89200 task.stack: c0000003f9d10000
> NIP: c0000000002ca428 LR: c0000000002ca3cc CTR: c000000000ba16a0
> REGS: c0000003f9d13630 TRAP: 0700   Not tainted  (4.12.0-rc3)
> MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>
>   CR: 22002024  XER: 0000000a
> CFAR: c0000000002ca3d0 SOFTE: 1
> GPR00: c0000000002ca3cc c0000003f9d138b0 c000000001bb0200 0000000000000001
> GPR04: c0000003fb143c80 c0000003fef21630 0000000000000003 0000000000000002
> GPR08: 0000000000000003 0000000000000003 0000000000000003 00000000000031b1
> GPR12: 0000000028002042 c00000000fd80000 c000000000118ae0 c0000003fb170180
> GPR16: 0000000000000000 0000000000000004 0000000000000010 c0000003ffff79c8
> GPR20: c0000003ffff7b68 c0000003f728ff84 0000000000000002 0000000000000010
> GPR24: 0000000000000002 c0000003f728ff80 0000000000000002 0000000000000001
> GPR28: c0000003fb143c38 0000000000000002 0000000010000000 0000000020000000
> NIP [c0000000002ca428] .remove_memory+0xb8/0xc0
> LR [c0000000002ca3cc] .remove_memory+0x5c/0xc0
> Call Trace:
> [c0000003f9d138b0] [c0000000002ca3cc] .remove_memory+0x5c/0xc0 (unreliable)
> [c0000003f9d13940] [c0000000000938a4] .dlpar_add_lmb+0x384/0x400
> [c0000003f9d13a30] [c00000000009456c] .dlpar_memory+0x5dc/0xca0
> [c0000003f9d13af0] [c00000000008ce84] .handle_dlpar_errorlog+0x74/0xe0
> [c0000003f9d13b70] [c00000000008cf1c] .pseries_hp_work_fn+0x2c/0x90
> [c0000003f9d13bf0] [c000000000110a5c] .process_one_work+0x17c/0x460
> [c0000003f9d13c90] [c000000000110dc8] .worker_thread+0x88/0x500
> [c0000003f9d13d70] [c000000000118c3c] .kthread+0x15c/0x1a0
> [c0000003f9d13e30] [c00000000000ba18] .ret_from_kernel_thread+0x58/0xc0
> Instruction dump:
> 7fe3fb78 4bd7c845 60000000 7fa3eb78 4bfdd3c9 38210090 e8010010 eba1ffe8
> ebc1fff0 ebe1fff8 7c0803a6 4bfdc2ac <0fe00000> 00000000 7c0802a6 fb01ffc0
> 
> Fixes: 943db62c316c ("powerpc/pseries: Revert 'Auto-online hotplugged memory'")
> Signed-off-by: Nathan Fontenot <nf...@linux.vnet.ibm.com>

Tested the first scenario with 4.13.0-rc4 and qemu 2.10.0-rc2.

Tested-by: Laurent Vivier <lviv...@redhat.com>
Reviewed-by: Laurent Vivier <lviv...@redhat.com>
