[PATCH v2] lazy tlb: fix hotplug exit race with MMU_LAZY_TLB_SHOOTDOWN
From: Nicholas Piggin

CPU unplug first calls __cpu_disable(), and that's where powerpc calls
cleanup_cpu_mmu_context(), which clears this CPU from mm_cpumask() of all
mms in the system. However this CPU may still be using a lazy tlb mm, and
its mm_cpumask bit will be cleared from it. The CPU does not switch away
from the lazy tlb mm until arch_cpu_idle_dead() calls idle_task_exit().

If that user mm exits in this window, it will not be subject to the lazy
tlb mm shootdown and may be freed while in use as a lazy mm by the CPU
that is being unplugged.

cleanup_cpu_mmu_context() could be moved later, but it looks better to
move the lazy tlb mm switching earlier. The problem with doing the lazy mm
switching in idle_task_exit() is explained in commit bf2c59fce4074
("sched/core: Fix illegal RCU from offline CPUs"), which added a wart to
switch away from the mm but leave it set in active_mm to be cleaned up
later.

So instead, switch away from the lazy tlb mm at sched_cpu_wait_empty(),
which is the last hotplug state before teardown
(CPUHP_AP_SCHED_WAIT_EMPTY). This CPU will never switch to a user thread
from this point, so it has no chance to pick up a new lazy tlb mm. This
removes the lazy tlb mm handling wart in CPU unplug.

With this, idle_task_exit() is not needed anymore and can be cleaned up.
This leaves the prototype alone, to be cleaned up after this change.

herton: took the suggestions from
https://lore.kernel.org/all/87jzvyprsw.ffs@tglx/ and made adjustments to
the initial patch proposed by Nicholas.

Link: https://lkml.kernel.org/r/20230524060455.147699-1-npig...@gmail.com
Link: https://lore.kernel.org/all/20230525205253.e2faec43...@smtp.kernel.org/
Fixes: 2655421ae69fa ("lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme")
Signed-off-by: Nicholas Piggin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Suggested-by: Thomas Gleixner
Signed-off-by: Herton R. Krzesinski
---
 include/linux/sched/hotplug.h |  4 ----
 kernel/cpu.c                  | 11 ++-
 kernel/sched/core.c           | 22 +++---
 3 files changed, 21 insertions(+), 16 deletions(-)

Herton: I contacted Nicholas by email; he was ok with me going ahead and
posting this, as the original patch was stalled/didn't go forward. Thus
I'm posting this but keeping his From/authorship, since he is the
original author of the patch, so we can have this moving forward.

I have a report and also reproduced a warning similar to the one reported
at https://github.com/linuxppc/issues/issues/469 - which can be triggered
by doing a cpu offline/online loop with CONFIG_DEBUG_VM enabled. This
patch fixes the problem. I updated the changelog/patch based on the
suggestions given and to the best of my knowledge/investigation of this
issue; thorough review is appreciated. If this is ok then I can submit a
followup to clean up idle_task_exit().

v2: fix warning reported by kernel test robot
    https://lore.kernel.org/oe-kbuild-all/20241100.0u2cxcam-...@intel.com/
    - sched_force_init_mm is only used under CONFIG_HOTPLUG_CPU at
      sched_cpu_wait_empty, so we don't need to define it for
      !CONFIG_HOTPLUG_CPU

diff --git a/include/linux/sched/hotplug.h b/include/linux/sched/hotplug.h
index 412cdaba33eb..17e04859b9a4 100644
--- a/include/linux/sched/hotplug.h
+++ b/include/linux/sched/hotplug.h
@@ -18,10 +18,6 @@ extern int sched_cpu_dying(unsigned int cpu);
 # define sched_cpu_dying	NULL
 #endif
 
-#ifdef CONFIG_HOTPLUG_CPU
-extern void idle_task_exit(void);
-#else
 static inline void idle_task_exit(void) {}
-#endif
 
 #endif /* _LINUX_SCHED_HOTPLUG_H */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index d293d52a3e00..fb4f46885cb2 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -904,13 +904,14 @@ static int finish_cpu(unsigned int cpu)
 	struct task_struct *idle = idle_thread_get(cpu);
 	struct mm_struct *mm = idle->active_mm;
 
-	/*
-	 * idle_task_exit() will have switched to &init_mm, now
-	 * clean up any remaining active_mm state.
+	/*
+	 * sched_force_init_mm() ensured the use of &init_mm,
+	 * drop that refcount now that the CPU has stopped.
 	 */
-	if (mm != &init_mm)
-		idle->active_mm = &init_mm;
+	WARN_ON(mm != &init_mm);
+	idle->active_mm = NULL;
 	mmdrop_lazy_tlb(mm);
+
 	return 0;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dbfb5717d6af..7d8f47a8f000 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7826,19 +7826,26 @@ void sched_setnuma(struct task_struct *p, int nid)
 #ifdef CONFIG_HOTPLUG_CPU
 /*
- * Ensure that the idle task is using init_mm right before its CPU goes
- * offline.
+ * Invoked on the outgoing CPU in context of the CPU hotplug thread
+ * after ensuring that there are no user space tasks left on the CPU.
+ *
+ * If there is a lazy mm in use on the hotplug thread, drop it and
+ * switch to init_mm.
+ *
+ * The reference count on init_mm is dropped in finish_cpu().
  */
-void idle_task_exit(void)
+static void sched_force_init_mm(void)
 {
 	struct mm_struct *mm = current->active_mm;
 
-	BUG_ON(cpu_online(smp_processor_id()));
-