Re: [PATCH 2/2] powerpc: Fix DABR write in the case of DAWR disabled
On Sun, 1 Apr 2018 15:50:36 +1000 Nicholas Piggin wrote:

> flush_thread will call set_breakpoint via set_debug_reg_defaults,
> and cause POWER8 and above CPUs without the DAWR feature to try
> to write to DAWR.

DABR

> Cc: Michael Neuling
> Signed-off-by: Nicholas Piggin
> ---
>  arch/powerpc/kernel/process.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 24a591b4dbe9..bfcf5437083c 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -718,7 +718,8 @@ static void set_debug_reg_defaults(struct thread_struct *thread)
>  {
>  	thread->hw_brk.address = 0;
>  	thread->hw_brk.type = 0;
> -	set_breakpoint(&thread->hw_brk);
> +	if (ppc_breakpoint_available())
> +		set_breakpoint(&thread->hw_brk);
>  }
>  #endif /* !CONFIG_HAVE_HW_BREAKPOINT */
>  #endif /* CONFIG_PPC_ADV_DEBUG_REGS */
> @@ -814,10 +815,14 @@ void __set_breakpoint(struct arch_hw_breakpoint *brk)
>  {
>  	memcpy(this_cpu_ptr(&current_brk), brk, sizeof(*brk));
>
> -	if (cpu_has_feature(CPU_FTR_DAWR))
> -		set_dawr(brk);
> -	else
> +	if (cpu_has_feature(CPU_FTR_ARCH_207S)) {
> +		if (cpu_has_feature(CPU_FTR_DAWR))
> +			set_dawr(brk);
> +		else
> +			WARN_ON_ONCE(1);
> +	} else {
>  		set_dabr(brk);
> +	}
>  }
>
>  void set_breakpoint(struct arch_hw_breakpoint *brk)
[PATCH 0/3] stop secondaries for reboot/shutdown
I'm rebasing and resending this series because the hard lockup messages
are being seen in the field. Since last time, one of the patches was
merged, and patches were split and reordered a bit. No major changes.

This still hasn't been tested with the FSP firmware update.

Thanks,
Nick

Nicholas Piggin (3):
  powerpc: use NMI IPI for smp_send_stop
  powerpc: hard disable irqs in smp_send_stop loop
  powerpc/powernv: Always stop secondaries before reboot/shutdown

 arch/powerpc/include/asm/opal.h             |  2 +-
 arch/powerpc/kernel/smp.c                   | 13 +++--
 arch/powerpc/platforms/powernv/opal-flash.c | 28 +---
 arch/powerpc/platforms/powernv/setup.c      | 15 +--
 4 files changed, 18 insertions(+), 40 deletions(-)

-- 
2.16.3
[PATCH v2 1/3] powerpc: use NMI IPI for smp_send_stop
Use the NMI IPI rather than smp_call_function for smp_send_stop, and have
stopped CPUs hard disable interrupts rather than just soft disable.

This function is used in crash/panic/shutdown paths to bring other CPUs
down as quickly and reliably as possible, while minimizing their
potential to cause trouble. Avoiding the Linux smp_call_function
infrastructure and (if supported) using true NMI IPIs makes this more
robust.

Signed-off-by: Nicholas Piggin
---
 arch/powerpc/kernel/smp.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index cfc08b099c49..db88660bf6bd 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -565,7 +565,11 @@ void crash_send_ipi(void (*crash_ipi_callback)(struct pt_regs *))
 }
 #endif
 
+#ifdef CONFIG_NMI_IPI
+static void stop_this_cpu(struct pt_regs *regs)
+#else
 static void stop_this_cpu(void *dummy)
+#endif
 {
 	/* Remove this CPU */
 	set_cpu_online(smp_processor_id(), false);
@@ -577,7 +581,11 @@ static void stop_this_cpu(void *dummy)
 
 void smp_send_stop(void)
 {
+#ifdef CONFIG_NMI_IPI
+	smp_send_nmi_ipi(NMI_IPI_ALL_OTHERS, stop_this_cpu, 100);
+#else
 	smp_call_function(stop_this_cpu, NULL, 0);
+#endif
 }
 
 struct thread_info *current_set[NR_CPUS];
-- 
2.16.3
[PATCH v2 2/3] powerpc: hard disable irqs in smp_send_stop loop
The hard lockup watchdog can fire under local_irq_disable on platforms
with irq soft masking.

Signed-off-by: Nicholas Piggin
---
 arch/powerpc/kernel/smp.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index db88660bf6bd..e16ec7b3b427 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -574,9 +574,10 @@ static void stop_this_cpu(void *dummy)
 	/* Remove this CPU */
 	set_cpu_online(smp_processor_id(), false);
 
-	local_irq_disable();
+	hard_irq_disable();
+	spin_begin();
 	while (1)
-		;
+		spin_cpu_relax();
 }
 
 void smp_send_stop(void)
-- 
2.16.3
[PATCH v2 3/3] powerpc/powernv: Always stop secondaries before reboot/shutdown
Currently powernv reboot and shutdown requests just leave secondaries
to do their own things. This is undesirable because they can trigger
any number of watchdogs while waiting for reboot, but also we don't
know what else they might be doing -- they might be causing trouble,
trampling memory, etc.

The opal scheduled flash update code already ran into watchdog problems
due to flashing taking a long time, and it was fixed with 2196c6f1ed
("powerpc/powernv: Return secondary CPUs to firmware before FW update"),
which returns secondaries to opal. It's been found that regular reboots
can take over 10 seconds, which can result in the hard lockup watchdog
firing,

  reboot: Restarting system
  [  360.038896709,5] OPAL: Reboot request...
      Watchdog CPU:0 Hard LOCKUP
      Watchdog CPU:44 detected Hard LOCKUP other CPUS:16
      Watchdog CPU:16 Hard LOCKUP
  watchdog: BUG: soft lockup - CPU#16 stuck for 3s! [swapper/16:0]

This patch removes the special case for flash update, and calls
smp_send_stop in all cases before calling reboot/shutdown.

smp_send_stop could return CPUs to OPAL; the main reason not to is
that the request could come from a NMI that interrupts OPAL code, so
re-entry to OPAL can cause a number of problems. Putting secondaries
into simple spin loops improves the chances of a successful reboot.
Cc: Vasant Hegde
Signed-off-by: Nicholas Piggin
---
 arch/powerpc/include/asm/opal.h             |  2 +-
 arch/powerpc/platforms/powernv/opal-flash.c | 28 +---
 arch/powerpc/platforms/powernv/setup.c      | 15 +--
 3 files changed, 7 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index dde60089d0d4..7159e1a6a61a 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -325,7 +325,7 @@ struct rtc_time;
 extern unsigned long opal_get_boot_time(void);
 extern void opal_nvram_init(void);
 extern void opal_flash_update_init(void);
-extern void opal_flash_term_callback(void);
+extern void opal_flash_update_print_message(void);
 extern int opal_elog_init(void);
 extern void opal_platform_dump_init(void);
 extern void opal_sys_param_init(void);
diff --git a/arch/powerpc/platforms/powernv/opal-flash.c b/arch/powerpc/platforms/powernv/opal-flash.c
index 1cb0b895a236..b37015101bf6 100644
--- a/arch/powerpc/platforms/powernv/opal-flash.c
+++ b/arch/powerpc/platforms/powernv/opal-flash.c
@@ -303,26 +303,9 @@ static int opal_flash_update(int op)
 	return rc;
 }
 
-/* Return CPUs to OPAL before starting FW update */
-static void flash_return_cpu(void *info)
-{
-	int cpu = smp_processor_id();
-
-	if (!cpu_online(cpu))
-		return;
-
-	/* Disable IRQ */
-	hard_irq_disable();
-
-	/* Return the CPU to OPAL */
-	opal_return_cpu();
-}
-
 /* This gets called just before system reboots */
-void opal_flash_term_callback(void)
+void opal_flash_update_print_message(void)
 {
-	struct cpumask mask;
-
 	if (update_flash_data.status != FLASH_IMG_READY)
 		return;
 
@@ -333,15 +316,6 @@ void opal_flash_term_callback(void)
 
 	/* Small delay to help getting the above message out */
 	msleep(500);
-
-	/* Return secondary CPUs to firmware */
-	cpumask_copy(&mask, cpu_online_mask);
-	cpumask_clear_cpu(smp_processor_id(), &mask);
-	if (!cpumask_empty(&mask))
-		smp_call_function_many(&mask,
-				       flash_return_cpu, NULL, false);
-	/* Hard disable interrupts */
-	hard_irq_disable();
 }
 
 /*
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 5f963286232f..ef8c9ce53a61 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -201,17 +201,12 @@ static void pnv_prepare_going_down(void)
 	 */
 	opal_event_shutdown();
 
-	/* Soft disable interrupts */
-	local_irq_disable();
+	/* Print flash update message if one is scheduled. */
+	opal_flash_update_print_message();
 
-	/*
-	 * Return secondary CPUs to firwmare if a flash update
-	 * is pending otherwise we will get all sort of error
-	 * messages about CPU being stuck etc.. This will also
-	 * have the side effect of hard disabling interrupts so
-	 * past this point, the kernel is effectively dead.
-	 */
-	opal_flash_term_callback();
+	smp_send_stop();
+
+	hard_irq_disable();
 }
 
 static void __noreturn pnv_restart(char *cmd)
-- 
2.16.3
Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()
On Tue, Mar 27, 2018 at 11:21:17AM +0100, Will Deacon wrote:
> On Sun, Mar 25, 2018 at 08:50:04PM +0300, Yury Norov wrote:
> > kick_all_cpus_sync() forces all CPUs to sync caches by sending broadcast IPI.
> > If CPU is in extended quiescent state (idle task or nohz_full userspace), this
> > work may be done at the exit of this state. Delaying synchronization helps to
> > save power if CPU is in idle state and decrease latency for real-time tasks.
> >
> > This patch introduces kick_active_cpus_sync() and uses it in mm/slab and arm64
> > code to delay syncronization.
> >
> > For task isolation (https://lkml.org/lkml/2017/11/3/589), IPI to the CPU running
> > isolated task would be fatal, as it breaks isolation. The approach with delaying
> > of synchronization work helps to maintain isolated state.
> >
> > I've tested it with test from task isolation series on ThunderX2 for more than
> > 10 hours (10k giga-ticks) without breaking isolation.
> >
> > Signed-off-by: Yury Norov
> > ---
> >  arch/arm64/kernel/insn.c | 2 +-
> >  include/linux/smp.h      | 2 ++
> >  kernel/smp.c             | 24
> >  mm/slab.c                | 2 +-
> >  4 files changed, 28 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
> > index 2718a77da165..9d7c492e920e 100644
> > --- a/arch/arm64/kernel/insn.c
> > +++ b/arch/arm64/kernel/insn.c
> > @@ -291,7 +291,7 @@ int __kprobes aarch64_insn_patch_text(void *addrs[], u32 insns[], int cnt)
> >  			 * synchronization.
> >  			 */
> >  			ret = aarch64_insn_patch_text_nosync(addrs[0], insns[0]);
> > -			kick_all_cpus_sync();
> > +			kick_active_cpus_sync();
> >  			return ret;
> >  		}
> >  	}
>
> I think this means that runtime modifications to the kernel text might not
> be picked up by CPUs coming out of idle. Shouldn't we add an ISB on that
> path to avoid executing stale instructions?
>
> Will

commit 153ae9d5667e7baab4d48c48e8ec30fbcbd86f1e
Author: Yury Norov
Date:   Sat Mar 31 15:05:23 2018 +0300

Hi Will, Paul,

On my system there are 3 paths that go thru rcu_dynticks_eqs_exit(),
and so require isb().

First path starts at gic_handle_irq() on secondary_start_kernel stack.
gic_handle_irq() already issues isb(), and so we can do nothing.

Second path starts at el0_svc entry; and third path is the exit from
do_idle() on secondary_start_kernel stack.

For do_idle() path there is arch_cpu_idle_exit() hook that is not used by
arm64 at now, so I picked it. And for el0_svc, I've introduced isb_if_eqs
macro and call it at the beginning of el0_svc_naked.

I've tested it on ThunderX2 machine, and it works for me.

Below is my call traces and patch for them. If you OK with it, I think I'm
ready to submit v2 (but maybe split this patch for better readability).

Yury

[  585.412095] Call trace:
[  585.412097] [] dump_backtrace+0x0/0x380
[  585.412099] [] show_stack+0x14/0x20
[  585.412101] [] dump_stack+0x98/0xbc
[  585.412104] [] rcu_dynticks_eqs_exit+0x68/0x70
[  585.412105] [] rcu_irq_enter+0x48/0x50
[  585.412106] [] irq_enter+0xc/0x70
[  585.412108] [] __handle_domain_irq+0x3c/0x120
[  585.412109] [] gic_handle_irq+0xc4/0x180
[  585.412110] Exception stack(0xfc001130fe20 to 0xfc001130ff60)
[  585.412112] fe20: 00a0 0001
[  585.412113] fe40: 028f6f0b 0020 0013cd6f53963b31
[  585.412144] fe60: 0002 fc001130fed0 0b80 3400
[  585.412146] fe80: 0001 01db
[  585.412147] fea0: fc0008247a78 03ff86dc61f8 0014 fc0008fc
[  585.412149] fec0: fc00090143e8 fc0009014000 fc0008fc94a0
[  585.412150] fee0: fe8f46bb1700
[  585.412152] ff00: fc001130ff60 fc0008085034 fc001130ff60
[  585.412153] ff20: fc0008085038 00400149 fc0009014000 fc0008fc94a0
[  585.412155] ff40: fc001130ff60 fc0008085038
[  585.412156] [] el1_irq+0xb0/0x124
[  585.412158] [] arch_cpu_idle+0x10/0x18
[  585.412159] [] do_idle+0x10c/0x1d8
[  585.412160] [] cpu_startup_entry+0x24/0x28
[  585.412162] [] secondary_start_kernel+0x15c/0x1a0
[  585.412164] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.0-isolation-160735-g59b71c1-dirty #18
[  585.412058] Call trace:
[  585.412060] [] dump_backtrace+0x0/0x380
[  585.412062] [] show_stack+0x14/0x20
[  585.412064] [] dump_stack+0x98/0xbc
[  585.412066] [] rcu_dynticks_eqs_exit+0x68/0x70
[  585.412068] [] rcu_eqs_exit.isra.23+0x64/0x80
[  585.412069] [] rcu_user_exit+0xc/0x18
[  585.412071] [] __context_tracking_exit.part.2+0x74/0x98
[  585.412072] [] contex
Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()
On Sun, Apr 01, 2018 at 02:11:08PM +0300, Yury Norov wrote:
> On Tue, Mar 27, 2018 at 11:21:17AM +0100, Will Deacon wrote:
> > On Sun, Mar 25, 2018 at 08:50:04PM +0300, Yury Norov wrote:
> > > kick_all_cpus_sync() forces all CPUs to sync caches by sending broadcast IPI.
> > > If CPU is in extended quiescent state (idle task or nohz_full userspace), this
> > > work may be done at the exit of this state. Delaying synchronization helps to
> > > save power if CPU is in idle state and decrease latency for real-time tasks.
> > >
> > > This patch introduces kick_active_cpus_sync() and uses it in mm/slab and arm64
> > > code to delay syncronization.
> > >
> > > For task isolation (https://lkml.org/lkml/2017/11/3/589), IPI to the CPU running
> > > isolated task would be fatal, as it breaks isolation. The approach with delaying
> > > of synchronization work helps to maintain isolated state.
> > >
> > > I've tested it with test from task isolation series on ThunderX2 for more than
> > > 10 hours (10k giga-ticks) without breaking isolation.
> > >
> > > Signed-off-by: Yury Norov
> > > ---
> > >  arch/arm64/kernel/insn.c | 2 +-
> > >  include/linux/smp.h      | 2 ++
> > >  kernel/smp.c             | 24
> > >  mm/slab.c                | 2 +-
> > >  4 files changed, 28 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
> > > index 2718a77da165..9d7c492e920e 100644
> > > --- a/arch/arm64/kernel/insn.c
> > > +++ b/arch/arm64/kernel/insn.c
> > > @@ -291,7 +291,7 @@ int __kprobes aarch64_insn_patch_text(void *addrs[], u32 insns[], int cnt)
> > >  			 * synchronization.
> > >  			 */
> > >  			ret = aarch64_insn_patch_text_nosync(addrs[0], insns[0]);
> > > -			kick_all_cpus_sync();
> > > +			kick_active_cpus_sync();
> > >  			return ret;
> > >  		}
> > >  	}
> >
> > I think this means that runtime modifications to the kernel text might not
> > be picked up by CPUs coming out of idle. Shouldn't we add an ISB on that
> > path to avoid executing stale instructions?
> >
> > Will
>
> commit 153ae9d5667e7baab4d48c48e8ec30fbcbd86f1e
> Author: Yury Norov
> Date:   Sat Mar 31 15:05:23 2018 +0300
>
> Hi Will, Paul,
>
> On my system there are 3 paths that go thru rcu_dynticks_eqs_exit(),
> and so require isb().
>
> First path starts at gic_handle_irq() on secondary_start_kernel stack.
> gic_handle_irq() already issues isb(), and so we can do nothing.
>
> Second path starts at el0_svc entry; and third path is the exit from
> do_idle() on secondary_start_kernel stack.
>
> For do_idle() path there is arch_cpu_idle_exit() hook that is not used by
> arm64 at now, so I picked it. And for el0_svc, I've introduced isb_if_eqs
> macro and call it at the beginning of el0_svc_naked.
>
> I've tested it on ThunderX2 machine, and it works for me.
>
> Below is my call traces and patch for them. If you OK with it, I think I'm
> ready to submit v2 (but maybe split this patch for better readability).

I must defer to Will on this one.
							Thanx, Paul

> Yury
>
> [  585.412095] Call trace:
> [  585.412097] [] dump_backtrace+0x0/0x380
> [  585.412099] [] show_stack+0x14/0x20
> [  585.412101] [] dump_stack+0x98/0xbc
> [  585.412104] [] rcu_dynticks_eqs_exit+0x68/0x70
> [  585.412105] [] rcu_irq_enter+0x48/0x50
> [  585.412106] [] irq_enter+0xc/0x70
> [  585.412108] [] __handle_domain_irq+0x3c/0x120
> [  585.412109] [] gic_handle_irq+0xc4/0x180
> [  585.412110] Exception stack(0xfc001130fe20 to 0xfc001130ff60)
> [  585.412112] fe20: 00a0 0001
> [  585.412113] fe40: 028f6f0b 0020 0013cd6f53963b31
> [  585.412144] fe60: 0002 fc001130fed0 0b80 3400
> [  585.412146] fe80: 0001 01db
> [  585.412147] fea0: fc0008247a78 03ff86dc61f8 0014 fc0008fc
> [  585.412149] fec0: fc00090143e8 fc0009014000 fc0008fc94a0
> [  585.412150] fee0: fe8f46bb1700
> [  585.412152] ff00: fc001130ff60 fc0008085034 fc001130ff60
> [  585.412153] ff20: fc0008085038 00400149 fc0009014000 fc0008fc94a0
> [  585.412155] ff40: fc001130ff60 fc0008085038
> [  585.412156] [] el1_irq+0xb0/0x124
> [  585.412158] [] arch_cpu_idle+0x10/0x18
> [  585.412159] [] do_idle+0x10c/0x1d8
> [  585.412160] [] cpu_startup_entry+0x24/0x28
> [  585.412162] [] secondary_start_kernel+0x15c/0x1a0
> [  585.412164] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.0-isolation-160735-g59b
[PATCH v6 14/16] powerpc: Switch to generic free_initrd_mem.
The first patch in this series added a weakly-defined generic
implementation, which is functionally identical to the
architecture-specific one removed here.

Series boot-tested on RISC-V (which now uses the generic implementation)
and x86_64 (which doesn't).

Signed-off-by: Shea Levy
---
 arch/powerpc/mm/mem.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index fe8c61149fb8..e85b2a3cd264 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -404,13 +404,6 @@ void free_initmem(void)
 	free_initmem_default(POISON_FREE_INITMEM);
 }
 
-#ifdef CONFIG_BLK_DEV_INITRD
-void __init free_initrd_mem(unsigned long start, unsigned long end)
-{
-	free_reserved_area((void *)start, (void *)end, -1, "initrd");
-}
-#endif
-
 /*
  * This is called when a page has been modified by the kernel.
  * It just marks the page as not i-cache clean.  We do the i-cache
-- 
2.16.2
Re: [PATCH v4 14/16] powerpc: Use generic free_initrd_mem.
Hi Michael,

Michael Ellerman writes:
> Shea Levy writes:
>> Joe Perches writes:
>>> On Wed, 2018-03-28 at 16:36 -0400, Shea Levy wrote:
>>>> Signed-off-by: Shea Levy
>>>
>>> Most people seem to want some form of commit message
>>> and not just your sign-off.
>>
>> Ah, if the subject is insufficient I can add some more detail.
>
> Yeah please do.
>
> Seeing this patch in isolation, with no change log, I might think it's
> safe for me to just apply it.
>
> But that would break the build because I don't have patch 1.
>
> So for starters you need to explain that part, eg something like:
>
>   A previous patch in the series added a weak definition of
>   free_initrd_mem() in init/initramfs.c.
>
>   The powerpc implementation is identical, so it can be removed allowing
>   the generic version to be used.
>
> Then you could also tell me if you did/didn't build/boot test it.

Thanks for the feedback, can you let me know if the recently posted v6
fits the bill?

> cheers

Thanks,
Shea
Re: [PATCH] Extract initrd free logic from arch-specific code.
Hi Ingo,

Ingo Molnar writes:
> * Shea Levy wrote:
>
>> Now only those architectures that have custom initrd free requirements
>> need to define free_initrd_mem.
>>
>> Signed-off-by: Shea Levy
>
> Please put the Kconfig symbol name this patch introduces both into the
> title, so that people know what to grep for.
>
>> ---
>>  arch/alpha/mm/init.c      |  8
>>  arch/arc/mm/init.c        |  7 ---
>>  arch/arm/Kconfig          |  1 +
>>  arch/arm64/Kconfig        |  1 +
>>  arch/blackfin/Kconfig     |  1 +
>>  arch/c6x/mm/init.c        |  7 ---
>>  arch/cris/Kconfig         |  1 +
>>  arch/frv/mm/init.c        | 11 ---
>>  arch/h8300/mm/init.c      |  7 ---
>>  arch/hexagon/Kconfig      |  1 +
>>  arch/ia64/Kconfig         |  1 +
>>  arch/m32r/Kconfig         |  1 +
>>  arch/m32r/mm/init.c       | 11 ---
>>  arch/m68k/mm/init.c       |  7 ---
>>  arch/metag/Kconfig        |  1 +
>>  arch/microblaze/mm/init.c |  7 ---
>>  arch/mips/Kconfig         |  1 +
>>  arch/mn10300/Kconfig      |  1 +
>>  arch/nios2/mm/init.c      |  7 ---
>>  arch/openrisc/mm/init.c   |  7 ---
>>  arch/parisc/mm/init.c     |  7 ---
>>  arch/powerpc/mm/mem.c     |  7 ---
>>  arch/riscv/mm/init.c      |  6 --
>>  arch/s390/Kconfig         |  1 +
>>  arch/score/Kconfig        |  1 +
>>  arch/sh/mm/init.c         |  7 ---
>>  arch/sparc/Kconfig        |  1 +
>>  arch/tile/Kconfig         |  1 +
>>  arch/um/kernel/mem.c      |  7 ---
>>  arch/unicore32/Kconfig    |  1 +
>>  arch/x86/Kconfig          |  1 +
>>  arch/xtensa/Kconfig       |  1 +
>>  init/initramfs.c          |  7 +++
>>  usr/Kconfig               |  4
>>  34 files changed, 28 insertions(+), 113 deletions(-)
>
> Please also put it into Documentation/features/.

I switched this patch series (the latest revision v6 was just posted) to
using weak symbols instead of Kconfig. Does it still warrant
documentation?

>> diff --git a/usr/Kconfig b/usr/Kconfig
>> index 43658b8a975e..7a94f6df39bf 100644
>> --- a/usr/Kconfig
>> +++ b/usr/Kconfig
>> @@ -233,3 +233,7 @@ config INITRAMFS_COMPRESSION
>>  	default ".lzma" if RD_LZMA
>>  	default ".bz2" if RD_BZIP2
>>  	default ""
>> +
>> +config HAVE_ARCH_FREE_INITRD_MEM
>> +	bool
>> +	default n
>
> Help text would be nice, to tell arch maintainers what the purpose of
> this switch is.
> Also, a nit, I think this should be named "ARCH_HAS_FREE_INITRD_MEM",
> which is the dominant pattern:
>
>  triton:~/tip> git grep 'select.*ARCH' arch/x86/Kconfig* | cut -f2 | cut -d_ -f1-2 | sort | uniq -c | sort -n
>  ...
>       2 select ARCH_USES
>       2 select ARCH_WANTS
>       3 select ARCH_MIGHT
>       3 select ARCH_WANT
>       4 select ARCH_SUPPORTS
>       4 select ARCH_USE
>      16 select HAVE_ARCH
>      23 select ARCH_HAS
>
> It also reads nicely in English:
>
>   "arch has free_initrd_mem()"
>
> While the other makes little sense:
>
>   "have arch free_initrd_mem()"
>
> ?
>
> Thanks,
>
> 	Ingo

Thanks,
Shea
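For reference, the shape Ingo is asking for — ARCH_HAS_* naming plus
help text — would look something like the fragment below. This is a
hypothetical sketch, not the patch as posted (and the series ultimately
moved to weak symbols instead of a Kconfig switch):

```kconfig
config ARCH_HAS_FREE_INITRD_MEM
	bool
	help
	  Select this option if the architecture needs its own
	  free_initrd_mem() implementation (for example, to unmap or
	  poison the initrd area) instead of the generic one in
	  init/initramfs.c, which simply returns the initrd range to
	  the page allocator with free_reserved_area().
```

An architecture would then `select ARCH_HAS_FREE_INITRD_MEM` from its
top-level Kconfig entry; `default n` is redundant, since bool options
default to n.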
[PATCH v3 0/7] powerpc/fadump: Improvements and fixes for firmware-assisted dump.
One of the primary issues with Firmware Assisted Dump (fadump) on Power
is that it needs a large amount of memory to be reserved. This reserved
memory is used for saving the contents of the old crashed kernel's
memory before the fadump capture kernel uses the old kernel's memory
area to boot. However, this reserved memory area stays unused until a
system crash and isn't available for the production kernel to use.

Instead of setting aside a significant chunk of memory that nobody can
use, take advantage of the Linux kernel's Contiguous Memory Allocator
(CMA) feature to reserve a significant chunk of memory that the kernel
is prevented from using, but applications are free to use.

Patch 1 moves the metadata region to the start of the reserved area for
easy handling/detection of the metadata region in the second kernel.
Patch 3 implements the usage of the CMA region to allow the production
kernel to use that memory for applications, making fadump
reservationless. Patches 4-7 fix various fadump issues and bugs.

Changes in V3:
- patch 1 & 2: move metadata region and documentation update.
- patch 7: Un-register fadump on the kexec path.

---

Mahesh Salgaonkar (7):
      powerpc/fadump: Move the metadata region to start of the reserved area.
      powerpc/fadump: Update documentation to reflect the metadata region movement.
      powerpc/fadump: Reservationless firmware assisted dump
      powerpc/fadump: exclude memory holes while reserving memory in second kernel.
      powerpc/fadump: throw proper error message on fadump registration failure.
      powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.
      powerpc/fadump: un-register fadump on kexec path.

 Documentation/powerpc/firmware-assisted-dump.txt |  31 ++
 arch/powerpc/include/asm/fadump.h                |   6
 arch/powerpc/kernel/fadump.c                     | 275 +++---
 arch/powerpc/platforms/pseries/hotplug-memory.c  |   7 -
 4 files changed, 271 insertions(+), 48 deletions(-)

--
Signature
[PATCH v3 1/7] powerpc/fadump: Move the metadata region to start of the reserved area.
From: Mahesh Salgaonkar

Currently the metadata region that holds the crash info structure and
ELF core header is placed towards the end of the reserved memory area.
This patch places it at the beginning of the reserved memory area.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/include/asm/fadump.h |  4 ++
 arch/powerpc/kernel/fadump.c      | 92 -
 2 files changed, 83 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 5a23010af600..44b6ebfe9be6 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -61,6 +61,9 @@
 #define FADUMP_UNREGISTER	2
 #define FADUMP_INVALIDATE	3
 
+/* Number of dump sections requested by kernel */
+#define FADUMP_NUM_SECTIONS	4
+
 /* Dump status flag */
 #define FADUMP_ERROR_FLAG	0x2000
 
@@ -119,6 +122,7 @@ struct fadump_mem_struct {
 	struct fadump_section		cpu_state_data;
 	struct fadump_section		hpte_region;
 	struct fadump_section		rmr_region;
+	struct fadump_section		metadata_region;
 };
 
 /* Firmware-assisted dump configuration details. */
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 3c2c2688918f..80552fd7d5f8 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -188,17 +188,48 @@ static void fadump_show_config(void)
 	pr_debug("Boot memory size  : %lx\n", fw_dump.boot_memory_size);
 }
 
+static unsigned long get_fadump_metadata_size(void)
+{
+	unsigned long size = 0;
+
+	/*
+	 * If fadump is active then look into fdm_active to get metadata
+	 * size set by previous kernel.
+	 */
+	if (fw_dump.dump_active) {
+		size = fdm_active->cpu_state_data.destination_address -
+			fdm_active->metadata_region.source_address;
+		goto out;
+	}
+	size += sizeof(struct fadump_crash_info_header);
+	size += sizeof(struct elfhdr); /* ELF core header.*/
+	size += sizeof(struct elf_phdr); /* place holder for cpu notes */
+	/* Program headers for crash memory regions. */
+	size += sizeof(struct elf_phdr) * (memblock_num_regions(memory) + 2);
+
+	size = PAGE_ALIGN(size);
+out:
+	pr_debug("fadump metadata size is %ld\n", size);
+	return size;
+}
+
 static unsigned long init_fadump_mem_struct(struct fadump_mem_struct *fdm,
 				unsigned long addr)
 {
+	uint16_t num_sections = 0;
+	unsigned long metadata_base = 0;
+
 	if (!fdm)
 		return 0;
 
+	/* Skip the fadump metadata area. */
+	metadata_base = addr;
+	addr += get_fadump_metadata_size();
+
 	memset(fdm, 0, sizeof(struct fadump_mem_struct));
 	addr = addr & PAGE_MASK;
 
 	fdm->header.dump_format_version = cpu_to_be32(0x0001);
-	fdm->header.dump_num_sections = cpu_to_be16(3);
 	fdm->header.dump_status_flag = 0;
 	fdm->header.offset_first_dump_section =
 		cpu_to_be32((u32)offsetof(struct fadump_mem_struct, cpu_state_data));
@@ -222,6 +253,7 @@ static unsigned long init_fadump_mem_struct(struct fadump_mem_struct *fdm,
 	fdm->cpu_state_data.source_address = 0;
 	fdm->cpu_state_data.source_len = cpu_to_be64(fw_dump.cpu_state_data_size);
 	fdm->cpu_state_data.destination_address = cpu_to_be64(addr);
+	num_sections++;
 	addr += fw_dump.cpu_state_data_size;
 
 	/* hpte region section */
@@ -230,6 +262,7 @@ static unsigned long init_fadump_mem_struct(struct fadump_mem_struct *fdm,
 	fdm->hpte_region.source_address = 0;
 	fdm->hpte_region.source_len = cpu_to_be64(fw_dump.hpte_region_size);
 	fdm->hpte_region.destination_address = cpu_to_be64(addr);
+	num_sections++;
 	addr += fw_dump.hpte_region_size;
 
 	/* RMA region section */
@@ -238,8 +271,25 @@ static unsigned long init_fadump_mem_struct(struct fadump_mem_struct *fdm,
 	fdm->rmr_region.source_address = cpu_to_be64(RMA_START);
 	fdm->rmr_region.source_len = cpu_to_be64(fw_dump.boot_memory_size);
 	fdm->rmr_region.destination_address = cpu_to_be64(addr);
+	num_sections++;
 	addr += fw_dump.boot_memory_size;
 
+	/*
+	 * fadump metadata section.
+	 * Add an entry with source len as zero. We are only interested in
+	 * source address which will help us to detect the location of
+	 * metadata area where fadump header and elf core header is placed.
+	 */
+	fdm->metadata_region.request_flag = cpu_to_be32(FADUMP_REQUEST_FLAG);
+	fdm->metadata_region.source_data_type =
+		cpu_to_be16(FADUMP_REAL_MODE_REGION);
+	fdm->metadata_region.source_address = cpu_to_be64(metadata_base);
+	fdm->metadata_region.source_len = 0;
+	fdm->metadata_region.destination_ad
[PATCH v3 2/7] powerpc/fadump: Update documentation to reflect the metadata region movement.
From: Mahesh Salgaonkar

The metadata region that holds the fadump header and ELF header is now
placed at the start of the reserved memory area. Update the
documentation accordingly.

Signed-off-by: Mahesh Salgaonkar
---
 Documentation/powerpc/firmware-assisted-dump.txt | 31 +-
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt
index bdd344aa18d9..c77f8dc9231f 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -115,17 +115,24 @@ be kept permanently reserved, so that it can act as a receptacle
 for a copy of the boot memory content in addition to CPU state
 and HPTE region, in the case a crash does occur.
 
+The first kernel, during fadump registration, prepares an ELF core
+header that contains necessary information for the coredump of the
+first kernel, which includes its physical memory layout. This ELF
+header and some additional data is stored in the area called the
+metadata region, at the start of the reserved memory area.
+
 o Memory Reservation during first kernel
 
   Low memory                                         Top of memory
   0      boot memory size                                       |
   |           |                |<--Reserved dump area -->|      |
   V           V                |   Permanent Reservation |      V
-  +-----------+----------/ /---+---+----+------+----+----------+
-  |           |                |CPU|HPTE| DUMP |ELF |          |
-  +-----------+----------/ /---+---+----+------+----+----------+
-        |                                  ^
-        |                                  |
+  +-----------+----------/ /---+----+---+----+------+----------+
+  |           |                |ELF |CPU|HPTE| DUMP |          |
+  +-----------+----------/ /---+----+---+----+------+----------+
+        |                      ^           ^
+        |                      |           |
+        |              metadata region     |
         \                                  /
          ----------------------------------
           Boot memory content gets transferred to
@@ -137,14 +144,14 @@ and HPTE region, in the case a crash does occur.
 
   Low memory                                        Top of memory
   0      boot memory size                                      |
-  |           |<-------------- Reserved dump area ------------>|
+  |           |<-------------- Reserved dump area ------------>|
   V           V                                                V
-  +-----------+----------/ /---+---+----+------+----+----------+
-  |           |                |CPU|HPTE| DUMP |ELF |          |
-  +-----------+----------/ /---+---+----+------+----+----------+
-        |                                       |
-        V                                       V
-   Used by second                          /proc/vmcore
+  +-----------+----------/ /---+----+---+----+------+----------+
+  |           |                |ELF |CPU|HPTE| DUMP |          |
+  +-----------+----------/ /---+----+---+----+------+----------+
+        |                        |
+        V                        V
+   Used by second           /proc/vmcore
    kernel to boot
                    Fig. 2
[PATCH v3 3/7] powerpc/fadump: Reservationless firmware assisted dump
From: Mahesh Salgaonkar

One of the primary issues with Firmware Assisted Dump (fadump) on Power
is that it needs a large amount of memory to be reserved. On large
systems with TeraBytes of memory, this reservation can be quite
significant.

In some cases, fadump fails if the memory reserved is insufficient, or
if the reserved memory was DLPAR hot-removed.

In the normal case, post reboot, the preserved memory is filtered to
extract only relevant areas of interest using the makedumpfile tool.
While the tool provides flexibility to determine what needs to be part
of the dump and what memory to filter out, all supported distributions
default this to "Capture only kernel data and nothing else".

We take advantage of this default and the Linux kernel's Contiguous
Memory Allocator (CMA) to fundamentally change the memory reservation
model for fadump. Fadump can now only capture kernel memory.

Instead of setting aside a significant chunk of memory nobody can use,
this patch uses CMA instead, to reserve a significant chunk of memory
that the kernel is prevented from using (due to MIGRATE_CMA), but
applications are free to use it.

Essentially, on a P9 LPAR with 2 cores, 8GB RAM and current upstream:

[root@zzxx-yy10 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           7557         193        6822          12         541         6725
Swap:          4095           0        4095

With this patch:

[root@zzxx-yy10 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           8133         194        7464          12         475         7338
Swap:          4095           0        4095

Changes made here are completely transparent to how fadump has
traditionally worked.

Thanks to Aneesh Kumar and Anshuman Khandual for helping us understand
CMA and its usage.

TODO:
- Handle case where CMA reservation spans nodes.
Signed-off-by: Ananth N Mavinakayanahalli
Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/kernel/fadump.c | 122 ++++++++++++++++++++++++++++++------
 1 file changed, 104 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 80552fd7d5f8..011f1aa7abab 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -34,6 +34,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -45,11 +46,57 @@
 static struct fw_dump fw_dump;
 static struct fadump_mem_struct fdm;
 static const struct fadump_mem_struct *fdm_active;
+static struct cma *fadump_cma;
 
 static DEFINE_MUTEX(fadump_mutex);
 struct fad_crash_memory_ranges crash_memory_ranges[INIT_CRASHMEM_RANGES];
 int crash_mem_ranges;
 
+/*
+ * fadump_cma_reserve() - reserve area for fadump memory reservation
+ *
+ * This function reserves memory from the early allocator. It should be
+ * called by arch-specific code once the memblock allocator
+ * has been activated.
+ */
+int __init fadump_cma_reserve(void)
+{
+	unsigned long long base, size;
+	int rc;
+
+	if (!fw_dump.fadump_enabled)
+		return 0;
+
+	base = fw_dump.reserve_dump_area_start;
+	size = fw_dump.reserve_dump_area_size;
+	pr_debug("Original reserve area base %ld, size %ld\n",
+		 (unsigned long)base >> 20,
+		 (unsigned long)size >> 20);
+	if (!size)
+		return 0;
+
+	rc = cma_declare_contiguous(base, size, 0, 0, 0, false,
+				    "fadump_cma", &fadump_cma);
+	if (rc) {
+		printk(KERN_ERR "fadump: Failed to reserve cma area for "
+		       "firmware-assisted dump, %d\n", rc);
+		fw_dump.reserve_dump_area_size = 0;
+		return 0;
+	}
+	/*
+	 * So we now have cma area reserved for fadump. base may be different
+	 * from what we requested.
+	 */
+	fw_dump.reserve_dump_area_start = cma_get_base(fadump_cma);
+	fw_dump.reserve_dump_area_size = cma_get_size(fadump_cma);
+	printk("Reserved %ldMB cma area at %ldMB for firmware-assisted dump "
+	       "(System RAM: %ldMB)\n",
+	       cma_get_size(fadump_cma) >> 20,
+	       (unsigned long)cma_get_base(fadump_cma) >> 20,
+	       (unsigned long)(memblock_phys_mem_size() >> 20));
+	return 1;
+}
+
 /* Scan the Firmware Assisted dump configuration details. */
 int __init early_init_dt_scan_fw_dump(unsigned long node,
 			const char *uname, int depth, void *data)
@@ -381,7 +428,7 @@ static unsigned long get_fadump_area_size(void)
 }
 
 static inline unsigned long get_fadump_metadata_base(
-					const struct fadump_mem_struct *fdm)
+				const struct fadump_mem_struct *fdm)
 {
 	return be64_to_cpu(fdm->metadata_region.source_address);
 }
@@ -449,8 +496,9 @@ int
[PATCH v3 4/7] powerpc/fadump: exclude memory holes while reserving memory in second kernel.
From: Mahesh Salgaonkar

The second kernel, during early boot after the crash, reserves the rest
of the memory above boot memory size to make sure it does not touch any
of the dump memory area. It uses memblock_reserve(), which reserves the
specified memory region irrespective of any memory holes present within
that region. There are chances that the previous kernel would have
hot-removed some of its memory, leaving memory holes behind. In such
cases the fadump kernel reports an incorrect number of reserved pages
through the arch_reserved_kernel_pages() hook, causing the kernel to
hang or panic.

Fix this by excluding memory holes while reserving the rest of the
memory above boot memory size during second kernel boot after crash.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/kernel/fadump.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 011f1aa7abab..a497e9fb93fe 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -433,6 +433,21 @@ static inline unsigned long get_fadump_metadata_base(
 	return be64_to_cpu(fdm->metadata_region.source_address);
 }
 
+static void fadump_memblock_reserve(unsigned long base, unsigned long size)
+{
+	struct memblock_region *reg;
+	unsigned long start, end;
+
+	for_each_memblock(memory, reg) {
+		start = max(base, (unsigned long)reg->base);
+		end = reg->base + reg->size;
+		end = min(base + size, end);
+
+		if (start < end)
+			memblock_reserve(start, end - start);
+	}
+}
+
 int __init fadump_reserve_mem(void)
 {
 	unsigned long base, size, memory_boundary;
@@ -487,7 +502,7 @@ int __init fadump_reserve_mem(void)
 	 */
 	base = fw_dump.boot_memory_size;
 	size = memory_boundary - base;
-	memblock_reserve(base, size);
+	fadump_memblock_reserve(base, size);
 	printk(KERN_INFO "Reserved %ldMB of memory at %ldMB "
 	       "for saving crash dump\n", (unsigned long)(size >> 20),
[PATCH v3 5/7] powerpc/fadump: throw proper error message on fadump registration failure.
From: Mahesh Salgaonkar

fadump fails to register when there are holes in the reserved memory
area. This can happen if the user has hot-removed memory that falls in
the fadump reserved memory area. Throw a meaningful error message to the
user in such a case.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/kernel/fadump.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index a497e9fb93fe..59aaf0357a52 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -216,6 +216,36 @@ static int is_boot_memory_area_contiguous(void)
 	return ret;
 }
 
+/*
+ * Returns 1, if there are no holes in reserved memory area,
+ * 0 otherwise.
+ */
+static int is_reserved_memory_area_contiguous(void)
+{
+	struct memblock_region *reg;
+	unsigned long start, end;
+	unsigned long d_start = fw_dump.reserve_dump_area_start;
+	unsigned long d_end = d_start + fw_dump.reserve_dump_area_size;
+	int ret = 0;
+
+	for_each_memblock(memory, reg) {
+		start = max(d_start, (unsigned long)reg->base);
+		end = min(d_end, (unsigned long)(reg->base + reg->size));
+		if (d_start < end) {
+			/* Memory hole from d_start to start */
+			if (start > d_start)
+				break;
+
+			if (end == d_end) {
+				ret = 1;
+				break;
+			}
+			d_start = end + 1;
+		}
+	}
+	return ret;
+}
+
 /* Print firmware assisted dump configurations for debugging purpose. */
 static void fadump_show_config(void)
 {
@@ -602,6 +632,9 @@ static int register_fw_dump(struct fadump_mem_struct *fdm)
 		if (!is_boot_memory_area_contiguous())
 			pr_err("Can't have holes in boot memory area while "
 			       "registering fadump\n");
+		else if (!is_reserved_memory_area_contiguous())
+			pr_err("Can't have holes in reserved memory area while"
+			       " registering fadump\n");
 
 		printk(KERN_ERR "Failed to register firmware-assisted kernel"
 		       " dump. Parameter Error(%d).\n", rc);
[PATCH v3 6/7] powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.
From: Mahesh Salgaonkar

For fadump to work successfully there should not be any holes in the
reserved memory ranges where the kernel has asked firmware to move the
content of the old kernel memory in the event of a crash. But this
memory area is currently not protected from hot-remove operations.
Hence, the fadump service can fail to re-register after a hot-remove
operation if the hot-removed memory belongs to the fadump reserved
region. To avoid this, make sure that memory from the fadump reserved
area is not hot-removable while fadump is registered.

However, if the user still wants to remove that memory, they can do so
by manually stopping the fadump service before the hot-remove operation.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/include/asm/fadump.h               |  2 +-
 arch/powerpc/kernel/fadump.c                    | 10 ++++++++--
 arch/powerpc/platforms/pseries/hotplug-memory.c |  7 +++++--
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 44b6ebfe9be6..d16dc77107a8 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -207,7 +207,7 @@ struct fad_crash_memory_ranges {
 	unsigned long long	size;
 };
 
-extern int is_fadump_boot_memory_area(u64 addr, ulong size);
+extern int is_fadump_memory_area(u64 addr, ulong size);
 extern int early_init_dt_scan_fw_dump(unsigned long node,
 		const char *uname, int depth, void *data);
 extern int fadump_reserve_mem(void);
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 59aaf0357a52..2c3c7e655eec 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -162,13 +162,19 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,
 
 /*
  * If fadump is registered, check if the memory provided
- * falls within boot memory area.
+ * falls within boot memory area and reserved memory area.
  */
-int is_fadump_boot_memory_area(u64 addr, ulong size)
+int is_fadump_memory_area(u64 addr, ulong size)
 {
+	u64 d_start = fw_dump.reserve_dump_area_start;
+	u64 d_end = d_start + fw_dump.reserve_dump_area_size;
+
 	if (!fw_dump.dump_registered)
 		return 0;
 
+	if (((addr + size) > d_start) && (addr <= d_end))
+		return 1;
+
 	return (addr + size) > RMA_START && addr <= fw_dump.boot_memory_size;
 }
 
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c1578f54c626..e4c658cda3a7 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -389,8 +389,11 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
 	phys_addr = lmb->base_addr;
 
 #ifdef CONFIG_FA_DUMP
-	/* Don't hot-remove memory that falls in fadump boot memory area */
-	if (is_fadump_boot_memory_area(phys_addr, block_sz))
+	/*
+	 * Don't hot-remove memory that falls in fadump boot memory area
+	 * and memory that is reserved for capturing old kernel memory.
+	 */
+	if (is_fadump_memory_area(phys_addr, block_sz))
 		return false;
 #endif
[PATCH v3 7/7] powerpc/fadump: un-register fadump on kexec path.
From: Mahesh Salgaonkar

Un-register fadump on the kexec path; otherwise the fadump registration
in the new kexec-ed kernel complains that fadump is already registered.
That makes the new kernel continue using the fadump registered by the
previous kernel, which may lead to invalid vmcore generation. Fix this
by un-registering fadump in fadump_cleanup(), which is called during the
kexec path, so that the new kernel can register fadump with new valid
values.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/kernel/fadump.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 2c3c7e655eec..cb496c8f40f3 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -1345,6 +1345,9 @@ void fadump_cleanup(void)
 		init_fadump_mem_struct(&fdm,
 				get_fadump_metadata_base(fdm_active));
 		fadump_invalidate_dump(&fdm);
+	} else if (fw_dump.dump_registered) {
+		/* Un-register Firmware-assisted dump if it was registered. */
+		fadump_unregister_dump(&fdm);
 	}
 }
Re: [PATCH] Extract initrd free logic from arch-specific code.
* Shea Levy wrote:

> > Please also put it into Documentation/features/.
>
> I switched this patch series (the latest revision v6 was just posted) to
> using weak symbols instead of Kconfig. Does it still warrant documentation?

Probably not.

Thanks,

	Ingo