Re: WARN: Interrupts were enabled early [bisected cfeb6ae8bcb9 ("maple_tree: disable mas_wr_append() when other readers are possible")]
* Christophe Leroy [230912 06:56]: > Hi, > > With pmac32_defconfig on QEMU I get the following WARN with 6.6-rc1 > > Bisected to cfeb6ae8bcb9 ("maple_tree: disable mas_wr_append() when > other readers are possible") > > I have absolutely no idea what it can be, do you ? Please see the discussion on the mailing list [1]. We are looking at it there and would appreciate anything you can add. [1]. https://lore.kernel.org/linux-mm/3f86d58e-7f36-c6b4-c43a-2a7bcffd...@linux-m68k.org/ Thanks, Liam > > [ cut here ] > Interrupts were enabled early > WARNING: CPU: 0 PID: 0 at init/main.c:992 start_kernel+0x4d8/0x5c0 > Modules linked in: > CPU: 0 PID: 0 Comm: swapper Not tainted 6.6.0-rc1 #480 > Hardware name: PowerMac3,1 7400 0xc0209 PowerMac > NIP: c0a6052c LR: c0a6052c CTR: > REGS: c0c4dee0 TRAP: 0700 Not tainted (6.6.0-rc1) > MSR: 00029032 CR: 24000282 XER: 2000 > > GPR00: c0a6052c c0c4dfa0 c0b92580 001d c0b9d128 0001 c0b9d148 > 3dff > GPR08: c0ba80f0 3e00 44000282 > 0199abf8 > GPR16: 0199b0a0 7fde7fa4 7fc5ac0c 00bb 4100 01c690c8 c0b92014 > c09b4bcc > GPR24: c0c55220 c0ac efff9109 efff9100 000a c0c6d000 > c0b920a0 > NIP [c0a6052c] start_kernel+0x4d8/0x5c0 > LR [c0a6052c] start_kernel+0x4d8/0x5c0 > Call Trace: > [c0c4dfa0] [c0a6052c] start_kernel+0x4d8/0x5c0 (unreliable) > [c0c4dff0] [3540] 0x3540 > Code: 480037b1 48023c05 4bab88cd 90620260 480139e9 4b657ccd 7d2000a6 > 71298000 41a20014 3c60c09a 3863b788 4b5e9ccd <0fe0> 3920 > 99380008 7d2000a6 > ---[ end trace ]--- > > Christophe
Re: [PATCH v2 1/2] maple_tree: Disable mas_wr_append() when other readers are possible
* Andreas Schwab [230912 14:15]:
> Any news? This is still broken.

I have a proposed fix.  I seem to have caused a pre-existing problem to
show up.  Please see if the attached works for you, and I'll send it to
a lot of people.

Thanks,
Liam

From 9ef8f834bb0342dc26464b9dd0165929d3e6a7e5 Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett"
Date: Tue, 12 Sep 2023 13:45:29 -0400
Subject: [PATCH] init/main: Clear boot task idle flag

Initial booting was setting the task flag to idle (PF_IDLE) by the call
path sched_init() -> init_idle().  Having the task idle and calling
call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
set.  Subsequent calls to any cond_resched() will enable IRQs,
potentially earlier than the IRQ setup has completed.  This causes a
warning later in start_kernel() as interrupts are enabled before they
are fully set up.

Fix this issue by clearing the PF_IDLE flag on return from sched_init()
and restoring the flag in rest_init().

Signed-off-by: Liam R. Howlett
---
 init/main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/init/main.c b/init/main.c
index ad920fac325c..46b35be8f00a 100644
--- a/init/main.c
+++ b/init/main.c
@@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void)
 	 */
 	rcu_read_lock();
 	tsk = find_task_by_pid_ns(pid, &init_pid_ns);
-	tsk->flags |= PF_NO_SETAFFINITY;
+	tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE;
 	set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
 	rcu_read_unlock();
 
@@ -938,6 +938,7 @@ void start_kernel(void)
 	 * time - but meanwhile we still have a functioning scheduler.
 	 */
 	sched_init();
+	current->flags &= ~PF_IDLE;
 
 	if (WARN(!irqs_disabled(),
 		 "Interrupts were enabled *very* early, fixing it\n"))
-- 
2.39.2
Re: [Bisected] PowerMac G4 getting "BUG: Unable to handle kernel data access on write at 0x00001ff0" at boot with CONFIG_VMAP_STACK=y on kernels 6.5.x (regression over 6.4.x)
* Erhard Furtner [230925 19:02]: > Greetings! > > Had a chat on #gentoo-powerpc with another user whose G4 Mini fails booting > kernel 6.5.0 when CONFIG_VMAP_STACK=y is enabled. I was able to replicate the > issue on my PowerMac G4. Also I was able to bisect the issue. > > Kernels 6.4.x boot ok with CONFIG_VMAP_STACK=y but on 6.5.5 I get: > > [...] > Kernel attempted to write user page (1ff0) - exploit attempt? (uid: 0) > BUG: Unable to handle kernel data access on write at 0x1ff0 > Faulting instruction address: 0xc0008750 > Oops: Kernel access of bad area, sig: 11 [#1] > BE PAGE_SIZE=4K MMU=Hash PowerMac > Modules linked in: > CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.5-PMacG4 #5 > Hardware name: PowerMac3,6 7455 0x80010303 PowerMac > NIP: c0008750 LR: c0041848 CTR: c0070988 > REGS: c0d3dcd0 TRAP: 0300 Not tainted (6.5.5-PMacG4) > MSR: 1032 CR: 22d3ddc0 XER: 2000 > DAR: 1ff0 DSISR: 4200 > GPR00: c0041848 c0d3dd90 c0d06360 c0d3ddd0 c0d06360 c0d3dea8 c0d3adc0 > GPR08: c0d4 c0d3ddc0 0004 > GPR16: 0002 0002 00402dc2 00402dc2 2000 f1004000 > GPR24: c0d45220 c0d06644 c0843c34 0002 c0d06360 c0d0ce00 c0d06360 > NIP [c0008750] do_softirq_own_stack+0x18/0x3c > LR [c0041848] irq_exit+0x98/0xc4 > Call Trace: > [c0d3dd90] [c0d69564] 0xc0d69564 (unreliable) > [c0d3ddb0] [c0041848] irq_exit+0x98/0xc4 > [c0d3ddc0] [c0004a98] Decrementer_virt+0x108/0x10c > --- interrupt: 900 at __schedule+0x43c/0x4e0 > NIP: c0843940 LR: c084398c CTR: c0070988 > REGS: c0d3ddd0 TRAP: 0900 Not tainted (6.5.5-PMacG4) > MSR: 9032 CR: 22024484 XER: > > GPR00: c0843574 c0d3de90 c0d06360 c0d06360 c0d06360 c0d3dea8 0001 > GPR08: 9032 c099ce2c 0007ffbf 22024484 0004 > GPR16: 0002 0002 00402dc2 00402dc2 2000 f1004000 > GPR24: c0d45220 c0d06644 c0843c34 0002 c0d06360 c0d0ce00 c0d06360 c0d063ac > NIP [c0843940] __schedule+0x43c/0x4e0 > LR [c084390c] __schedule+0x408/0x4e0 > --- interrupt: 900 > [c0d3de90] [c0843574] __schedule+0x70/0x4e0 (unreliable) > [c0d3ded0] [c0843c34] __cond_resched+0x34/0x54 > [c0d3dee0] [c0141068] __vmalloc_node_range+0x27c/0x64c > [c0d3de60] [c0141794] __vmalloc_node+0x44/0x54 > [c8d3df80] [c0c06510] init_IRQ+0x34/0xd4 > [c8d3dfa0] [c0c03440] start_kernel+0x424/0x558 > [c8d3dff0] [3540] 0x3540 > Code: 39490999 7d4901a4 39290aaa 7d2a01a4 4c00012c 4b20 9421ffe0 > 7c08002a6 3d20c0d4 93e1001c 90010024 83e95278 <943f1ff0> 7fe1fb78 48840c6d > 8021 > ---[ end trace ]--- > > Kernel panic - not syncing: Attempted to kill the idle task! > Rebooting in 48 seconds.. This looks very close to the crash a few weeks ago which bisected to the same commit [1]. Can you try applying this fix [2] which is on its way upstream? [1] https://lore.kernel.org/linux-mm/3f86d58e-7f36-c6b4-c43a-2a7bcffd...@linux-m68k.org/ [2] https://lore.kernel.org/lkml/2023091517.2835306-1-liam.howl...@oracle.com/ > > > The bisect revealed this commit: > # git bisect good > cfeb6ae8bcb96ccf674724f223661bbcef7b0d0b is the first bad commit > commit cfeb6ae8bcb96ccf674724f223661bbcef7b0d0b > Author: Liam R. Howlett > Date: Fri Aug 18 20:43:55 2023 -0400 > > maple_tree: disable mas_wr_append() when other readers are possible > > The current implementation of append may cause duplicate data and/or > incorrect ranges to be returned to a reader during an update. Although > this has not been reported or seen, disable the append write operation > while the tree is in rcu mode out of an abundance of caution. 
>
> During the analysis of the mas_next_slot() the following was
> artificially created by separating the writer and reader code:
>
> Writer:                                 reader:
> mas_wr_append
>     set end pivot
>     updates end metadata
>     Detects write to last slot
>     last slot write is to start of slot
>     store current contents in slot
>     overwrite old end pivot
>                                         mas_next_slot():
>                                                 read end metadata
>                                                 read old end pivot
>                                                 return with incorrect
>                                                 range
>     store new value
>
> Alternatively:
>
> Writer:                                 reader:
> mas_wr_append
>     set end pivot
>     updates end metadata
>     Detects write to last slot
>     last lost write
Re: [Bisected] PowerMac G4 getting "BUG: Unable to handle kernel data access on write at 0x00001ff0" at boot with CONFIG_VMAP_STACK=y on kernels 6.5.x (regression over 6.4.x)
* Bagas Sanjaya [230925 20:03]: > On Tue, Sep 26, 2023 at 01:01:59AM +0200, Erhard Furtner wrote: > > Greetings! > > > > Had a chat on #gentoo-powerpc with another user whose G4 Mini fails booting > > kernel 6.5.0 when CONFIG_VMAP_STACK=y is enabled. I was able to replicate > > the issue on my PowerMac G4. Also I was able to bisect the issue. > > > > Kernels 6.4.x boot ok with CONFIG_VMAP_STACK=y but on 6.5.5 I get: > > > > [...] > > Kernel attempted to write user page (1ff0) - exploit attempt? (uid: 0) > > BUG: Unable to handle kernel data access on write at 0x1ff0 > > Faulting instruction address: 0xc0008750 > > Oops: Kernel access of bad area, sig: 11 [#1] > > BE PAGE_SIZE=4K MMU=Hash PowerMac > > Modules linked in: > > CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.5-PMacG4 #5 > > Hardware name: PowerMac3,6 7455 0x80010303 PowerMac > > NIP: c0008750 LR: c0041848 CTR: c0070988 > > REGS: c0d3dcd0 TRAP: 0300 Not tainted (6.5.5-PMacG4) > > MSR: 1032 CR: 22d3ddc0 XER: 2000 > > DAR: 1ff0 DSISR: 4200 > > GPR00: c0041848 c0d3dd90 c0d06360 c0d3ddd0 c0d06360 c0d3dea8 c0d3adc0 > > > > GPR08: c0d4 c0d3ddc0 > > 0004 > > GPR16: 0002 0002 00402dc2 00402dc2 2000 f1004000 > > > > GPR24: c0d45220 c0d06644 c0843c34 0002 c0d06360 c0d0ce00 c0d06360 > > > > NIP [c0008750] do_softirq_own_stack+0x18/0x3c > > LR [c0041848] irq_exit+0x98/0xc4 > > Call Trace: > > [c0d3dd90] [c0d69564] 0xc0d69564 (unreliable) > > [c0d3ddb0] [c0041848] irq_exit+0x98/0xc4 > > [c0d3ddc0] [c0004a98] Decrementer_virt+0x108/0x10c > > --- interrupt: 900 at __schedule+0x43c/0x4e0 > > NIP: c0843940 LR: c084398c CTR: c0070988 > > REGS: c0d3ddd0 TRAP: 0900 Not tainted (6.5.5-PMacG4) > > MSR: 9032 CR: 22024484 XER: > > > > GPR00: c0843574 c0d3de90 c0d06360 c0d06360 c0d06360 c0d3dea8 0001 > > > > GPR08: 9032 c099ce2c 0007ffbf 22024484 > > 0004 > > GPR16: 0002 0002 00402dc2 00402dc2 2000 f1004000 > > > > GPR24: c0d45220 c0d06644 c0843c34 0002 c0d06360 c0d0ce00 c0d06360 > > c0d063ac > > NIP [c0843940] __schedule+0x43c/0x4e0 > > LR [c084390c] __schedule+0x408/0x4e0 > > --- interrupt: 900 > > [c0d3de90] [c0843574] __schedule+0x70/0x4e0 (unreliable) > > [c0d3ded0] [c0843c34] __cond_resched+0x34/0x54 > > [c0d3dee0] [c0141068] __vmalloc_node_range+0x27c/0x64c > > [c0d3de60] [c0141794] __vmalloc_node+0x44/0x54 > > [c8d3df80] [c0c06510] init_IRQ+0x34/0xd4 > > [c8d3dfa0] [c0c03440] start_kernel+0x424/0x558 > > [c8d3dff0] [3540] 0x3540 > > Code: 39490999 7d4901a4 39290aaa 7d2a01a4 4c00012c 4b20 9421ffe0 > > 7c08002a6 3d20c0d4 93e1001c 90010024 83e95278 <943f1ff0> 7fe1fb78 48840c6d > > 8021 > > ---[ end trace ]--- > > > > Kernel panic - not syncing: Attempted to kill the idle task! > > Rebooting in 48 seconds.. > > > > > > The bisect revealed this commit: > > # git bisect good > > cfeb6ae8bcb96ccf674724f223661bbcef7b0d0b is the first bad commit > > commit cfeb6ae8bcb96ccf674724f223661bbcef7b0d0b > > Author: Liam R. Howlett > > Date: Fri Aug 18 20:43:55 2023 -0400 > > > > maple_tree: disable mas_wr_append() when other readers are possible > > > > The current implementation of append may cause duplicate data and/or > > incorrect ranges to be returned to a reader during an update. Although > > this has not been reported or seen, disable the append write operation > > while the tree is in rcu mode out of an abundance of caution. 
> > > > During the analysis of the mas_next_slot() the following was > > artificially created by separating the writer and reader code: > > > > Writer: reader: > > mas_wr_append > > set end pivot > > updates end metata > > Detects write to last slot > > last slot write is to start of slot > > store current contents in slot > > overwrite old end pivot > > mas_next_slot(): > > read end metadata > > read old end pivot > > return with incorrect > > range > &g
Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
* Matthew Wilcox [230120 11:50]: > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote: > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan > > wrote: > > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko wrote: > > > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote: > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko wrote: > > > > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > > > call_rcu() can take a long time when callback offloading is > > > > > > > enabled. > > > > > > > Its use in the vm_area_free can cause regressions in the exit > > > > > > > path when > > > > > > > multiple VMAs are being freed. To minimize that impact, place > > > > > > > VMAs into > > > > > > > a list and free them in groups using one call_rcu() call per > > > > > > > group. > > > > > > > > > > > > After some more clarification I can understand how call_rcu might > > > > > > not be > > > > > > super happy about thousands of callbacks to be invoked and I do > > > > > > agree > > > > > > that this is not really optimal. > > > > > > > > > > > > On the other hand I do not like this solution much either. > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that > > > > > > much with processes with a huge number of vmas either. It would > > > > > > still be > > > > > > in housands of callbacks to be scheduled without a good reason. > > > > > > > > > > > > Instead, are there any other cases than remove_vma that need this > > > > > > batching? We could easily just link all the vmas into linked list > > > > > > and > > > > > > use a single call_rcu instead, no? This would both simplify the > > > > > > implementation, remove the scaling issue as well and we do not have > > > > > > to > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + > > > > > > 1. > > > > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs > > > > > on the list without hooking up a shrinker (additional complexity) does > > > > > not sound too appealing either. > > > > > > > > I suspect you have missed my idea. I do not really want to keep the list > > > > around or any shrinker. It is dead simple. Collect all vmas in > > > > remove_vma and then call_rcu the whole list at once after the whole list > > > > (be it from exit_mmap or remove_mt). See? > > > > > > Yes, I understood your idea but keeping dead objects until the process > > > exits even when the system is low on memory (no shrinkers attached) > > > seems too wasteful. If we do this I would advocate for attaching a > > > shrinker. > > > > Maybe even simpler, since we are hit with this VMA freeing flood > > during exit_mmap (when all VMAs are destroyed), we pass a hint to > > vm_area_free to batch the destruction and all other cases call > > call_rcu()? I don't think there will be other cases of VMA destruction > > floods. > > ... or have two different call_rcu functions; one for munmap() and > one for exit. It'd be nice to use kmem_cache_free_bulk(). Do we even need a call_rcu on exit? At the point of freeing the VMAs we have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock. Once we have obtained the write lock again, I think it's safe to say we can just go ahead and free the VMAs directly.
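[The batching idea discussed above - link the dead VMAs together and free
the whole set with a single call_rcu(), possibly ending in
kmem_cache_free_bulk() - can be sketched roughly as follows.  This is an
illustration of the concept only, not code from the series: the
vma_free_batch structure and the vm_free_next link field are invented for
the sketch, and vm_area_cachep is private to kernel/fork.c upstream.]

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/mm_types.h>

/*
 * Illustration only: batch every VMA removed in one teardown and free the
 * batch with a single RCU callback instead of one call_rcu() per VMA.
 */
struct vma_free_batch {
	struct rcu_head rcu;
	struct vm_area_struct *head;	/* chain of detached VMAs */
};

static void vma_batch_free_rcu(struct rcu_head *rcu)
{
	struct vma_free_batch *batch =
		container_of(rcu, struct vma_free_batch, rcu);
	struct vm_area_struct *vma, *next;

	for (vma = batch->head; vma; vma = next) {
		next = vma->vm_free_next;	/* hypothetical link field */
		kmem_cache_free(vm_area_cachep, vma);
	}
	kfree(batch);
}

/* Called once from exit_mmap()/remove_mt() after all VMAs are unlinked. */
static void vma_batch_free(struct vma_free_batch *batch)
{
	call_rcu(&batch->rcu, vma_batch_free_rcu);
}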
Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
* Suren Baghdasaryan [230120 12:50]: > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox wrote: > > > > On Fri, Jan 20, 2023 at 09:17:46AM -0800, Suren Baghdasaryan wrote: > > > On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett > > > wrote: > > > > > > > > * Matthew Wilcox [230120 11:50]: > > > > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote: > > > > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan > > > > > > wrote: > > > > > > > > > > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko > > > > > > > wrote: > > > > > > > > > > > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote: > > > > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > > > > > > > call_rcu() can take a long time when callback offloading > > > > > > > > > > > is enabled. > > > > > > > > > > > Its use in the vm_area_free can cause regressions in the > > > > > > > > > > > exit path when > > > > > > > > > > > multiple VMAs are being freed. To minimize that impact, > > > > > > > > > > > place VMAs into > > > > > > > > > > > a list and free them in groups using one call_rcu() call > > > > > > > > > > > per group. > > > > > > > > > > > > > > > > > > > > After some more clarification I can understand how call_rcu > > > > > > > > > > might not be > > > > > > > > > > super happy about thousands of callbacks to be invoked and > > > > > > > > > > I do agree > > > > > > > > > > that this is not really optimal. > > > > > > > > > > > > > > > > > > > > On the other hand I do not like this solution much either. > > > > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help > > > > > > > > > > all that > > > > > > > > > > much with processes with a huge number of vmas either. It > > > > > > > > > > would still be > > > > > > > > > > in housands of callbacks to be scheduled without a good > > > > > > > > > > reason. > > > > > > > > > > > > > > > > > > > > Instead, are there any other cases than remove_vma that > > > > > > > > > > need this > > > > > > > > > > batching? We could easily just link all the vmas into > > > > > > > > > > linked list and > > > > > > > > > > use a single call_rcu instead, no? This would both simplify > > > > > > > > > > the > > > > > > > > > > implementation, remove the scaling issue as well and we do > > > > > > > > > > not have to > > > > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or > > > > > > > > > > epsilon + 1. > > > > > > > > > > > > > > > > > > Yes, I agree the solution is not stellar. I wanted something > > > > > > > > > simple > > > > > > > > > but this is probably too simple. OTOH keeping all dead > > > > > > > > > vm_area_structs > > > > > > > > > on the list without hooking up a shrinker (additional > > > > > > > > > complexity) does > > > > > > > > > not sound too appealing either. > > > > > > > > > > > > > > > > I suspect you have missed my idea. I do not really want to keep > > > > > > > > the list > > > > > > > > around or any shrinker. It is dead simple. Collect all vmas in > > > > > > > > remove_vma and then call_rcu the whole list at once after the > > > > > > > > whole list > > > > &
Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
* Michal Hocko [230123 15:00]: > On Mon 23-01-23 19:30:43, Matthew Wilcox wrote: > > On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote: > > > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote: > > > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote: > > > [...] > > > > > Yes, batching the vmas into a list and draining it in remove_mt() and > > > > > exit_mmap() as you suggested makes sense to me and is quite simple. > > > > > Let's do that if nobody has objections. > > > > > > > > I object. We *know* nobody has a reference to any of the VMAs because > > > > you have to have a refcount on the mm before you can get a reference > > > > to a VMA. If Michal is saying that somebody could do: > > > > > > > > mmget(mm); > > > > vma = find_vma(mm); > > > > lock_vma(vma); > > > > mmput(mm); > > > > vma->a = b; > > > > unlock_vma(mm, vma); > > > > > > > > then that's something we'd catch in review -- you obviously can't use > > > > the mm after you've dropped your reference to it. > > > > > > I am not claiming this is possible now. I do not think we want to have > > > something like that in the future either but that is really hard to > > > envision. I am claiming that it is subtle and potentially error prone to > > > have two different ways of mass vma freeing wrt. locking. Also, don't we > > > have a very similar situation during last munmaps? > > > > We shouldn't have two ways of mass VMA freeing. Nobody's suggesting that. > > There are two cases; there's munmap(), which typically frees a single > > VMA (yes, theoretically, you can free hundreds of VMAs with a single > > call which spans multiple VMAs, but in practice that doesn't happen), > > and there's exit_mmap() which happens on exec() and exit(). > > This requires special casing remove_vma for those two different paths > (exit_mmap and remove_mt). If you ask me that sounds like a suboptimal > code to even not handle potential large munmap which might very well be > a rare thing as you say. But haven't we learned that sooner or later we > will find out there is somebody that cares afterall? Anyway, this is not > something I care about all that much. It is just weird to special case > exit_mmap, if you ask me. exit_mmap() is already a special case for locking (and statistics). This exists today to optimize the special exit scenario. I don't think it's a question of sub-optimal code but what we can get away without doing in the case of the process exit.
Re: [PATCH v4 4/7] mm: replace vma->vm_flags direct modifications with modifier calls
* Suren Baghdasaryan [230126 14:38]: > Replace direct modifications to vma->vm_flags with calls to modifier > functions to be able to track flag changes and to keep vma locking > correctness. > That was long, but it all looks good! Reviewed-by: Liam R. Howlett > Signed-off-by: Suren Baghdasaryan > Acked-by: Michal Hocko > Acked-by: Mel Gorman > Acked-by: Mike Rapoport (IBM) > Acked-by: Sebastian Reichel > --- > arch/arm/kernel/process.c | 2 +- > arch/ia64/mm/init.c| 8 > arch/loongarch/include/asm/tlb.h | 2 +- > arch/powerpc/kvm/book3s_xive_native.c | 2 +- > arch/powerpc/mm/book3s64/subpage_prot.c| 2 +- > arch/powerpc/platforms/book3s/vas-api.c| 2 +- > arch/powerpc/platforms/cell/spufs/file.c | 14 +++--- > arch/s390/mm/gmap.c| 3 +-- > arch/x86/entry/vsyscall/vsyscall_64.c | 2 +- > arch/x86/kernel/cpu/sgx/driver.c | 2 +- > arch/x86/kernel/cpu/sgx/virt.c | 2 +- > arch/x86/mm/pat/memtype.c | 6 +++--- > arch/x86/um/mem_32.c | 2 +- > drivers/acpi/pfr_telemetry.c | 2 +- > drivers/android/binder.c | 3 +-- > drivers/char/mspec.c | 2 +- > drivers/crypto/hisilicon/qm.c | 2 +- > drivers/dax/device.c | 2 +- > drivers/dma/idxd/cdev.c| 2 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c| 2 +- > drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4 ++-- > drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 4 ++-- > drivers/gpu/drm/amd/amdkfd/kfd_events.c| 4 ++-- > drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 ++-- > drivers/gpu/drm/drm_gem.c | 2 +- > drivers/gpu/drm/drm_gem_dma_helper.c | 3 +-- > drivers/gpu/drm/drm_gem_shmem_helper.c | 2 +- > drivers/gpu/drm/drm_vm.c | 8 > drivers/gpu/drm/etnaviv/etnaviv_gem.c | 2 +- > drivers/gpu/drm/exynos/exynos_drm_gem.c| 4 ++-- > drivers/gpu/drm/gma500/framebuffer.c | 2 +- > drivers/gpu/drm/i810/i810_dma.c| 2 +- > drivers/gpu/drm/i915/gem/i915_gem_mman.c | 4 ++-- > drivers/gpu/drm/mediatek/mtk_drm_gem.c | 2 +- > drivers/gpu/drm/msm/msm_gem.c | 2 +- > drivers/gpu/drm/omapdrm/omap_gem.c | 3 +-- > drivers/gpu/drm/rockchip/rockchip_drm_gem.c| 3 +-- > drivers/gpu/drm/tegra/gem.c| 5 ++--- > drivers/gpu/drm/ttm/ttm_bo_vm.c| 3 +-- > drivers/gpu/drm/virtio/virtgpu_vram.c | 2 +- > drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c | 2 +- > drivers/gpu/drm/xen/xen_drm_front_gem.c| 3 +-- > drivers/hsi/clients/cmt_speech.c | 2 +- > drivers/hwtracing/intel_th/msu.c | 2 +- > drivers/hwtracing/stm/core.c | 2 +- > drivers/infiniband/hw/hfi1/file_ops.c | 4 ++-- > drivers/infiniband/hw/mlx5/main.c | 4 ++-- > drivers/infiniband/hw/qib/qib_file_ops.c | 13 ++--- > drivers/infiniband/hw/usnic/usnic_ib_verbs.c | 2 +- > drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c| 2 +- > .../media/common/videobuf2/videobuf2-dma-contig.c | 2 +- > drivers/media/common/videobuf2/videobuf2-vmalloc.c | 2 +- > drivers/media/v4l2-core/videobuf-dma-contig.c | 2 +- > drivers/media/v4l2-core/videobuf-dma-sg.c | 4 ++-- > drivers/media/v4l2-core/videobuf-vmalloc.c | 2 +- > drivers/misc/cxl/context.c | 2 +- > drivers/misc/habanalabs/common/memory.c| 2 +- > drivers/misc/habanalabs/gaudi/gaudi.c | 4 ++-- > drivers/misc/habanalabs/gaudi2/gaudi2.c| 8 > drivers/misc/habanalabs/goya/goya.c| 4 ++-- > drivers/misc/ocxl/context.c| 4 ++-- > drivers/misc/ocxl/sysfs.c | 2 +- > drivers/misc/open-dice.c | 4 ++-- > drivers/misc/sgi-gru/grufile.c | 4 ++-- > drivers/misc/uacce/uacce.c | 2 +- > drivers/sbus/char/oradax.c | 2 +- > drivers/scsi/cxlflash/ocxl_hw.c| 2 +- > drivers/scsi/sg.c | 2 +- > drivers/staging/media/atomi
Re: [PATCH v3 21/35] mm/mmap: write-lock adjacent VMAs if they can grow into unmapped area
First, sorry I didn't see this before v3.. * Suren Baghdasaryan [230216 00:18]: > While unmapping VMAs, adjacent VMAs might be able to grow into the area > being unmapped. In such cases write-lock adjacent VMAs to prevent this > growth. > > Signed-off-by: Suren Baghdasaryan > --- > mm/mmap.c | 8 +--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > index 118b2246bba9..00f8c5798936 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -2399,11 +2399,13 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct > vm_area_struct *vma, >* down_read(mmap_lock) and collide with the VMA we are about to unmap. >*/ > if (downgrade) { > - if (next && (next->vm_flags & VM_GROWSDOWN)) > + if (next && (next->vm_flags & VM_GROWSDOWN)) { > + vma_start_write(next); > downgrade = false; If the mmap write lock is insufficient to protect us from next/prev modifications then we need to move *most* of this block above the maple tree write operation, otherwise we have a race here. When I say most, I mean everything besides the call to mmap_write_downgrade() needs to be moved. If the mmap write lock is sufficient to protect us from next/prev modifications then we don't need to write lock the vmas themselves. I believe this is for expand_stack() protection, so I believe it's okay to not vma write lock these vmas.. I don't think there are other areas where we can modify the vmas without holding the mmap lock, but others on the CC list please chime in if I've forgotten something. So, if I am correct, then you shouldn't lock next/prev and allow the vma locking fault method on these vmas. This will work because lock_vma_under_rcu() uses mas_walk() on the faulting address. That is, your lock_vma_under_rcu() will fail to find anything that needs to be grown and go back to mmap lock protection. As it is written today, the vma locking fault handler will fail and we will wait for the mmap lock to be released even when the vma isn't going to expand. > - else if (prev && (prev->vm_flags & VM_GROWSUP)) > + } else if (prev && (prev->vm_flags & VM_GROWSUP)) { > + vma_start_write(prev); > downgrade = false; > - else > + } else > mmap_write_downgrade(mm); > } > > -- > 2.39.1
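[A rough sketch of the behaviour described above, based on how
lock_vma_under_rcu() is described in this series and not a verbatim copy
of it (the real function does additional re-validation after taking the
lock): the walk happens at the faulting address, so an address in the
unmapped gap beside a VM_GROWSDOWN/VM_GROWSUP VMA finds nothing and the
caller falls back to the mmap-lock path, which is the only place
expand_stack() runs.]

#include <linux/maple_tree.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>

struct vm_area_struct *lock_vma_under_rcu_sketch(struct mm_struct *mm,
						 unsigned long address)
{
	MA_STATE(mas, &mm->mm_mt, address, address);
	struct vm_area_struct *vma;

	rcu_read_lock();
	vma = mas_walk(&mas);		/* NULL when address is in a gap */
	if (!vma || !vma_start_read(vma))
		goto fallback;		/* no VMA, or a writer holds its lock */
	if (vma->detached) {		/* raced with removal from the tree */
		vma_end_read(vma);
		goto fallback;
	}
	rcu_read_unlock();
	return vma;			/* read-locked VMA covering address */

fallback:
	rcu_read_unlock();
	return NULL;	/* caller takes mmap_read_lock(), may expand_stack() */
}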
Re: [PATCH v3 21/35] mm/mmap: write-lock adjacent VMAs if they can grow into unmapped area
* Suren Baghdasaryan [230216 14:36]: > On Thu, Feb 16, 2023 at 7:34 AM Liam R. Howlett > wrote: > > > > > > First, sorry I didn't see this before v3.. > > Feedback at any time is highly appreciated! > > > > > * Suren Baghdasaryan [230216 00:18]: > > > While unmapping VMAs, adjacent VMAs might be able to grow into the area > > > being unmapped. In such cases write-lock adjacent VMAs to prevent this > > > growth. > > > > > > Signed-off-by: Suren Baghdasaryan > > > --- > > > mm/mmap.c | 8 +--- > > > 1 file changed, 5 insertions(+), 3 deletions(-) > > > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > index 118b2246bba9..00f8c5798936 100644 > > > --- a/mm/mmap.c > > > +++ b/mm/mmap.c > > > @@ -2399,11 +2399,13 @@ do_vmi_align_munmap(struct vma_iterator *vmi, > > > struct vm_area_struct *vma, > > >* down_read(mmap_lock) and collide with the VMA we are about to > > > unmap. > > >*/ > > > if (downgrade) { > > > - if (next && (next->vm_flags & VM_GROWSDOWN)) > > > + if (next && (next->vm_flags & VM_GROWSDOWN)) { > > > + vma_start_write(next); > > > downgrade = false; > > > > If the mmap write lock is insufficient to protect us from next/prev > > modifications then we need to move *most* of this block above the maple > > tree write operation, otherwise we have a race here. When I say most, I > > mean everything besides the call to mmap_write_downgrade() needs to be > > moved. > > Which prior maple tree write operation are you referring to? I see > __split_vma() and munmap_sidetree() which both already lock the VMAs > they operate on, so page faults can't happen in those VMAs. The write that removes the VMAs from the maple tree a few lines above.. /* Point of no return */ If the mmap lock is not sufficient, then we need to move the vma_start_write() of prev/next to above the call to vma_iter_clear_gfp() in do_vmi_align_munmap(). But I still think it IS enough. > > > > > If the mmap write lock is sufficient to protect us from next/prev > > modifications then we don't need to write lock the vmas themselves. > > mmap write lock is not sufficient because with per-VMA locks we do not > take mmap lock at all. Understood, but it also does not expand VMAs. > > > > > I believe this is for expand_stack() protection, so I believe it's okay > > to not vma write lock these vmas.. I don't think there are other areas > > where we can modify the vmas without holding the mmap lock, but others > > on the CC list please chime in if I've forgotten something. > > > > So, if I am correct, then you shouldn't lock next/prev and allow the > > vma locking fault method on these vmas. This will work because > > lock_vma_under_rcu() uses mas_walk() on the faulting address. That is, > > your lock_vma_under_rcu() will fail to find anything that needs to be > > grown and go back to mmap lock protection. As it is written today, the > > vma locking fault handler will fail and we will wait for the mmap lock > > to be released even when the vma isn't going to expand. > > So, let's consider a case when the next VMA is not being removed (so > it was neither removed nor locked by munmap_sidetree()) and it is > found by lock_vma_under_rcu() in the page fault handling path. By this point next VMA is either NULL or outside the munmap area, so what you said here is always true. >Page > fault handler can now expand it and push into the area we are > unmapping in unmap_region(). That is the race I'm trying to prevent > here by locking the next/prev VMAs which can be expanded before > unmap_region() unmaps them. Am I missing something? 
Yes, I think the part you are missing (or I am missing..) is that expand_stack() will never be called without the mmap lock. We don't use the vma locking to expand the stack. ...
Re: [PATCH v3 23/35] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration
* Suren Baghdasaryan [230216 00:18]: > Page fault handlers might need to fire MMU notifications while a new > notifier is being registered. Modify mm_take_all_locks to write-lock all > VMAs and prevent this race with page fault handlers that would hold VMA > locks. VMAs are locked before i_mmap_rwsem and anon_vma to keep the same > locking order as in page fault handlers. > > Signed-off-by: Suren Baghdasaryan > --- > mm/mmap.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/mm/mmap.c b/mm/mmap.c > index 00f8c5798936..801608726be8 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3501,6 +3501,7 @@ static void vm_lock_mapping(struct mm_struct *mm, > struct address_space *mapping) > * of mm/rmap.c: > * - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for > * hugetlb mapping); > + * - all vmas marked locked > * - all i_mmap_rwsem locks; > * - all anon_vma->rwseml > * > @@ -3523,6 +3524,13 @@ int mm_take_all_locks(struct mm_struct *mm) > > mutex_lock(&mm_all_locks_mutex); > > + mas_for_each(&mas, vma, ULONG_MAX) { > + if (signal_pending(current)) > + goto out_unlock; > + vma_start_write(vma); > + } > + > + mas_set(&mas, 0); > mas_for_each(&mas, vma, ULONG_MAX) { > if (signal_pending(current)) > goto out_unlock; Do we need a vma_end_write_all(mm) in the out_unlock unrolling? Also, does this need to honour the strict locking order that we have to add an entire new loop? This function is...suboptimal today, but if we could get away with not looping through every VMA for a 4th time, that would be nice. > @@ -3612,6 +3620,7 @@ void mm_drop_all_locks(struct mm_struct *mm) > if (vma->vm_file && vma->vm_file->f_mapping) > vm_unlock_mapping(vma->vm_file->f_mapping); > } > + vma_end_write_all(mm); > > mutex_unlock(&mm_all_locks_mutex); > } > -- > 2.39.1 >
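[For context on why the unlock is collective rather than per-VMA, here is
a paraphrase of the mechanism this series uses, as a sketch rather than
the exact patch code: a VMA counts as write-locked while its vm_lock_seq
matches mm->mm_lock_seq, and vma_end_write_all() releases every
write-locked VMA at once by bumping the sequence under the exclusive
mmap_lock.]

static inline void vma_start_write_sketch(struct vm_area_struct *vma)
{
	struct mm_struct *mm = vma->vm_mm;

	mmap_assert_write_locked(mm);
	if (vma->vm_lock_seq == mm->mm_lock_seq)
		return;			/* already write-locked this cycle */

	down_write(&vma->lock);		/* drain current per-VMA readers */
	vma->vm_lock_seq = mm->mm_lock_seq;
	up_write(&vma->lock);
}

static inline void vma_end_write_all_sketch(struct mm_struct *mm)
{
	mmap_assert_write_locked(mm);
	/* Exclusive mmap_lock, so the sequence update cannot race. */
	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
}

[In the quoted patch, mm_drop_all_locks() ends with vma_end_write_all(mm),
which is that collective release.]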
Re: [PATCH v3 24/35] mm: introduce vma detached flag
Can we change this to active/inactive? I think there is potential for reusing this functionality to even larger degrees and that name would fit better and would still make sense in this context. ie: vma_mark_active() and vma_mark_inactive() ? * Suren Baghdasaryan [230216 00:18]: > Per-vma locking mechanism will search for VMA under RCU protection and > then after locking it, has to ensure it was not removed from the VMA > tree after we found it. To make this check efficient, introduce a > vma->detached flag to mark VMAs which were removed from the VMA tree. > > Signed-off-by: Suren Baghdasaryan > --- > include/linux/mm.h | 11 +++ > include/linux/mm_types.h | 3 +++ > mm/mmap.c| 2 ++ > 3 files changed, 16 insertions(+) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index f4f702224ec5..3f98344f829c 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -693,6 +693,14 @@ static inline void vma_assert_write_locked(struct > vm_area_struct *vma) > VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), > vma); > } > > +static inline void vma_mark_detached(struct vm_area_struct *vma, bool > detached) > +{ > + /* When detaching vma should be write-locked */ > + if (detached) > + vma_assert_write_locked(vma); > + vma->detached = detached; > +} > + > #else /* CONFIG_PER_VMA_LOCK */ > > static inline void vma_init_lock(struct vm_area_struct *vma) {} > @@ -701,6 +709,8 @@ static inline bool vma_start_read(struct vm_area_struct > *vma) > static inline void vma_end_read(struct vm_area_struct *vma) {} > static inline void vma_start_write(struct vm_area_struct *vma) {} > static inline void vma_assert_write_locked(struct vm_area_struct *vma) {} > +static inline void vma_mark_detached(struct vm_area_struct *vma, > + bool detached) {} > > #endif /* CONFIG_PER_VMA_LOCK */ > > @@ -712,6 +722,7 @@ static inline void vma_init(struct vm_area_struct *vma, > struct mm_struct *mm) > vma->vm_mm = mm; > vma->vm_ops = &dummy_vm_ops; > INIT_LIST_HEAD(&vma->anon_vma_chain); > + vma_mark_detached(vma, false); > vma_init_lock(vma); > } > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index e268723eaf44..939f4f5a1115 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -511,6 +511,9 @@ struct vm_area_struct { > #ifdef CONFIG_PER_VMA_LOCK > int vm_lock_seq; > struct rw_semaphore lock; > + > + /* Flag to indicate areas detached from the mm->mm_mt tree */ > + bool detached; > #endif > > /* > diff --git a/mm/mmap.c b/mm/mmap.c > index 801608726be8..adf40177e68f 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -593,6 +593,7 @@ static inline void vma_complete(struct vma_prepare *vp, > > if (vp->remove) { > again: > + vma_mark_detached(vp->remove, true); > if (vp->file) { > uprobe_munmap(vp->remove, vp->remove->vm_start, > vp->remove->vm_end); > @@ -2267,6 +2268,7 @@ static inline int munmap_sidetree(struct vm_area_struct > *vma, > if (mas_store_gfp(mas_detach, vma, GFP_KERNEL)) > return -ENOMEM; > > + vma_mark_detached(vma, true); > if (vma->vm_flags & VM_LOCKED) > vma->vm_mm->locked_vm -= vma_pages(vma); > > -- > 2.39.1 >
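[To make the naming suggestion concrete, the renamed helpers could mirror
vma_mark_detached() from the patch roughly as below.  This is a
hypothetical shape for the suggested names only, not code from the patch
or from upstream.]

static inline void vma_mark_inactive(struct vm_area_struct *vma)
{
	/* Removal from the tree requires the VMA write lock, as in the patch. */
	vma_assert_write_locked(vma);
	vma->detached = true;		/* field name kept from the patch */
}

static inline void vma_mark_active(struct vm_area_struct *vma)
{
	vma->detached = false;
}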
Re: [PATCH v3 17/35] mm/mmap: write-lock VMA before shrinking or expanding it
Reviewed-by: Liam R. Howlett * Suren Baghdasaryan [230216 00:18]: > vma_expand and vma_shrink change VMA boundaries. Expansion might also > result in freeing of an adjacent VMA. Write-lock affected VMAs to prevent > concurrent page faults. > > Signed-off-by: Suren Baghdasaryan > --- > mm/mmap.c | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/mm/mmap.c b/mm/mmap.c > index ec2f8d0af280..f079e5bbcd57 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct > vm_area_struct *vma, > ret = dup_anon_vma(vma, next); > if (ret) > return ret; > + > + /* Lock the VMA before removing it */ > + vma_start_write(next); > } > > init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL); > @@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct > vm_area_struct *vma, > if (vma_iter_prealloc(vmi)) > goto nomem; > > + vma_start_write(vma); > vma_adjust_trans_huge(vma, start, end, 0); > /* VMA iterator points to previous, so set to start if necessary */ > if (vma_iter_addr(vmi) != start) > @@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct > vm_area_struct *vma, > if (vma_iter_prealloc(vmi)) > return -ENOMEM; > > + vma_start_write(vma); > init_vma_prep(&vp, vma); > vma_adjust_trans_huge(vma, start, end, 0); > vma_prepare(&vp); > -- > 2.39.1 >
Re: [PATCH v3 17/35] mm/mmap: write-lock VMA before shrinking or expanding it
Wait, I figured a better place to do this. init_multi_vma_prep() should vma_start_write() on any VMA that is passed in.. that we we catch any modifications here & in vma_merge(), which I think is missed in this patch set? * Liam R. Howlett [230223 15:20]: > Reviewed-by: Liam R. Howlett > > * Suren Baghdasaryan [230216 00:18]: > > vma_expand and vma_shrink change VMA boundaries. Expansion might also > > result in freeing of an adjacent VMA. Write-lock affected VMAs to prevent > > concurrent page faults. > > > > Signed-off-by: Suren Baghdasaryan > > --- > > mm/mmap.c | 5 + > > 1 file changed, 5 insertions(+) > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > index ec2f8d0af280..f079e5bbcd57 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct > > vm_area_struct *vma, > > ret = dup_anon_vma(vma, next); > > if (ret) > > return ret; > > + > > + /* Lock the VMA before removing it */ > > + vma_start_write(next); > > } > > > > init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL); > > @@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct > > vm_area_struct *vma, > > if (vma_iter_prealloc(vmi)) > > goto nomem; > > > > + vma_start_write(vma); > > vma_adjust_trans_huge(vma, start, end, 0); > > /* VMA iterator points to previous, so set to start if necessary */ > > if (vma_iter_addr(vmi) != start) > > @@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct > > vm_area_struct *vma, > > if (vma_iter_prealloc(vmi)) > > return -ENOMEM; > > > > + vma_start_write(vma); > > init_vma_prep(&vp, vma); > > vma_adjust_trans_huge(vma, start, end, 0); > > vma_prepare(&vp); > > -- > > 2.39.1 > >
Re: [PATCH v3 17/35] mm/mmap: write-lock VMA before shrinking or expanding it
* Suren Baghdasaryan [230223 16:16]: > On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett > wrote: > > > > > > Wait, I figured a better place to do this. > > > > init_multi_vma_prep() should vma_start_write() on any VMA that is passed > > in.. that we we catch any modifications here & in vma_merge(), which I > > think is missed in this patch set? > > Hmm. That looks like a good idea but in that case, why not do the > locking inside vma_prepare() itself? From the description of that > function it sounds like it was designed to acquire locks before VMA > modifications, so would be the ideal location for doing that. WDYT? That might be even better. I think it will result in even less code. There is also a vma_complete() which might work to call vma_end_write_all() as well? > The only concern is vma_adjust_trans_huge() being called before > vma_prepare() but I *think* that's safe because > vma_adjust_trans_huge() does its modifications after acquiring PTL > lock, which page fault handlers also have to take. Does that sound > right? I am not sure. We are certainly safe the way it is, and the PTL has to be safe for concurrent faults.. but this could alter the walk to a page table while that walk is occurring and I don't think that happens today. It might be best to leave the locking order the way you have it, unless someone can tell us it's safe? We could pass through the three extra variables that are needed to move the vma_adjust_trans_huge() call within that function as well? This would have the added benefit of having all locking grouped in the one location, but the argument list would be getting long, however we could use the struct. remove & remove2 should be be detached in vma_prepare() or vma_complete() as well? > > > > > > > * Liam R. Howlett [230223 15:20]: > > > Reviewed-by: Liam R. Howlett > > > > > > * Suren Baghdasaryan [230216 00:18]: > > > > vma_expand and vma_shrink change VMA boundaries. Expansion might also > > > > result in freeing of an adjacent VMA. Write-lock affected VMAs to > > > > prevent > > > > concurrent page faults. > > > > > > > > Signed-off-by: Suren Baghdasaryan > > > > --- > > > > mm/mmap.c | 5 + > > > > 1 file changed, 5 insertions(+) > > > > > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > > index ec2f8d0af280..f079e5bbcd57 100644 > > > > --- a/mm/mmap.c > > > > +++ b/mm/mmap.c > > > > @@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct > > > > vm_area_struct *vma, > > > > ret = dup_anon_vma(vma, next); > > > > if (ret) > > > > return ret; > > > > + > > > > + /* Lock the VMA before removing it */ > > > > + vma_start_write(next); > > > > } > > > > > > > > init_multi_vma_prep(&vp, vma, NULL, remove_next ? 
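[An outline of where this discussion is heading, assuming the per-VMA
locks move into vma_prepare(); the field names follow the struct
vma_prepare used by these helpers, but treat this as a sketch of the
proposal rather than the final upstream code.]

static inline void vma_prepare_lock_vmas(struct vma_prepare *vp)
{
	vma_start_write(vp->vma);
	if (vp->adj_next)
		vma_start_write(vp->adj_next);
	if (vp->insert)
		vma_start_write(vp->insert);
	if (vp->remove)
		vma_start_write(vp->remove);
	if (vp->remove2)
		vma_start_write(vp->remove2);
	/* vma_adjust_trans_huge() would then run after these locks are held. */
}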
next : NULL, > > > > NULL); > > > > @@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct > > > > vm_area_struct *vma, > > > > if (vma_iter_prealloc(vmi)) > > > > goto nomem; > > > > > > > > + vma_start_write(vma); > > > > vma_adjust_trans_huge(vma, start, end, 0); > > > > /* VMA iterator points to previous, so set to start if necessary */ > > > > if (vma_iter_addr(vmi) != start) > > > > @@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct > > > > vm_area_struct *vma, > > > > if (vma_iter_prealloc(vmi)) > > > > return -ENOMEM; > > > > > > > > + vma_start_write(vma); > > > > init_vma_prep(&vp, vma); > > > > vma_adjust_trans_huge(vma, start, end, 0); > > > > vma_prepare(&vp); > > > > -- > > > > 2.39.1 > > > > > > > > -- > > To unsubscribe from this group and stop receiving emails from it, send an > > email to kernel-team+unsubscr...@android.com. > >
Re: [PATCH v3 17/35] mm/mmap: write-lock VMA before shrinking or expanding it
* Suren Baghdasaryan [230223 21:06]: > On Thu, Feb 23, 2023 at 5:46 PM Liam R. Howlett > wrote: > > > > * Suren Baghdasaryan [230223 16:16]: > > > On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett > > > wrote: > > > > > > > > > > > > Wait, I figured a better place to do this. > > > > > > > > init_multi_vma_prep() should vma_start_write() on any VMA that is passed > > > > in.. that we we catch any modifications here & in vma_merge(), which I > > > > think is missed in this patch set? > > > > > > Hmm. That looks like a good idea but in that case, why not do the > > > locking inside vma_prepare() itself? From the description of that > > > function it sounds like it was designed to acquire locks before VMA > > > modifications, so would be the ideal location for doing that. WDYT? > > > > That might be even better. I think it will result in even less code. > > Yes. > > > > > There is also a vma_complete() which might work to call > > vma_end_write_all() as well? > > If there are other VMAs already locked before vma_prepare() then we > would unlock them too. Safer to just let mmap_unlock do > vma_end_write_all(). > > > > > > The only concern is vma_adjust_trans_huge() being called before > > > vma_prepare() but I *think* that's safe because > > > vma_adjust_trans_huge() does its modifications after acquiring PTL > > > lock, which page fault handlers also have to take. Does that sound > > > right? > > > > I am not sure. We are certainly safe the way it is, and the PTL has to > > be safe for concurrent faults.. but this could alter the walk to a page > > table while that walk is occurring and I don't think that happens today. > > > > It might be best to leave the locking order the way you have it, unless > > someone can tell us it's safe? > > Yes, I have the same feelings about changing this. > > > > > We could pass through the three extra variables that are needed to move > > the vma_adjust_trans_huge() call within that function as well? This > > would have the added benefit of having all locking grouped in the one > > location, but the argument list would be getting long, however we could > > use the struct. > > Any issues if I change the order to have vma_prepare() called always > before vma_adjust_trans_huge()? That way the VMA will always be locked > before vma_adjust_trans_huge() executes and we don't need any > additional arguments. I preserved the locking order from __vma_adjust() to ensure there was no issues. I am not sure but, looking through the page table information [1], it seems that vma_adjust_trans_huge() uses the pmd lock, which is part of the split page table lock. According to the comment in rmap, it should be fine to reverse the ordering here. Instead of: mmap_lock() vma_adjust_trans_huge() pte_lock pte_unlock vma_prepare() mapping->i_mmap_rwsem lock anon_vma->rwsem lock vma_complete() anon_vma->rwsem unlock mapping->i_mmap_rwsem unlock mmap_unlock() - We would have: mmap_lock() vma_prepare() mapping->i_mmap_rwsem lock anon_vma->rwsem lock vma_adjust_trans_huge() pte_lock pte_unlock vma_complete() anon_vma->rwsem unlock mapping->i_mmap_rwsem unlock mmap_unlock() Essentially, increasing the nesting of the pte lock, but not violating the ordering. 1. https://docs.kernel.org/mm/split_page_table_lock.html > > > > > remove & remove2 should be be detached in vma_prepare() or > > vma_complete() as well? > > They are marked detached in vma_complete() (see > https://lore.kernel.org/all/20230216051750.3125598-25-sur...@google.com/) > and that should be enough. 
> We should be safe as long as we mark them
> detached before unlocking the VMA.

Right, Thanks.

...
Re: [PATCH v3 1/3] mm/hugetlb: Allow arch to override and call the weak function
* Aneesh Kumar K.V [170727 02:18]: > For ppc64, we want to call this function when we are not running as guest. > Also, if we failed to allocate hugepages, let the user know. > [...] > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index bc48ee783dd9..a3a7a7e6339e 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -2083,7 +2083,9 @@ struct page *alloc_huge_page_noerr(struct > vm_area_struct *vma, > return page; > } > > -int __weak alloc_bootmem_huge_page(struct hstate *h) > +int alloc_bootmem_huge_page(struct hstate *h) > + __attribute__ ((weak, alias("__alloc_bootmem_huge_page"))); > +int __alloc_bootmem_huge_page(struct hstate *h) > { > struct huge_bootmem_page *m; > int nr_nodes, node; > @@ -2104,6 +2106,7 @@ int __weak alloc_bootmem_huge_page(struct hstate *h) > goto found; > } > } > + pr_info("Failed to allocate hugepage of size %ld\n", huge_page_size(h)); > return 0; > > found: There is already a call to warn the user in the hugetlb_hstate_alloc_pages function. If you look there, you will see that the huge_page_size was translated into a more user friendly format and the count prior to the failure is included. What call path are you trying to cover? Also, you may want your print to be a pr_warn since it is a failure? Thanks, Liam
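[The weak-alias construct used in the quoted hunk can be demonstrated
outside the kernel with a small standalone C example; the names below
(alloc_page_example, __alloc_page_generic) are made up for illustration.]

/* generic.c: the "generic" side of the pattern */
#include <stdio.h>

int __alloc_page_generic(void)
{
	puts("generic allocator");
	return 0;
}

/*
 * The public name is only a weak alias of the strong __alloc_page_generic
 * symbol.  A strong definition of alloc_page_example() in another object
 * file (the "arch" override) replaces it at link time, yet can still call
 * __alloc_page_generic() explicitly, which is what the quoted hugetlb
 * change enables for powerpc.
 */
int alloc_page_example(void)
	__attribute__((weak, alias("__alloc_page_generic")));

int main(void)
{
	return alloc_page_example();
}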
Re: [PATCH v3 1/3] mm/hugetlb: Allow arch to override and call the weak function
* Aneesh Kumar K.V [170727 12:12]: > > > On 07/27/2017 08:55 PM, Liam R. Howlett wrote: > > * Aneesh Kumar K.V [170727 02:18]: > > > For ppc64, we want to call this function when we are not running as guest. > > > Also, if we failed to allocate hugepages, let the user know. > > > > > [...] > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > index bc48ee783dd9..a3a7a7e6339e 100644 > > > --- a/mm/hugetlb.c > > > +++ b/mm/hugetlb.c > > > @@ -2083,7 +2083,9 @@ struct page *alloc_huge_page_noerr(struct > > > vm_area_struct *vma, > > > return page; > > > } > > > -int __weak alloc_bootmem_huge_page(struct hstate *h) > > > +int alloc_bootmem_huge_page(struct hstate *h) > > > + __attribute__ ((weak, alias("__alloc_bootmem_huge_page"))); > > > +int __alloc_bootmem_huge_page(struct hstate *h) > > > { > > > struct huge_bootmem_page *m; > > > int nr_nodes, node; > > > @@ -2104,6 +2106,7 @@ int __weak alloc_bootmem_huge_page(struct hstate *h) > > > goto found; > > > } > > > } > > > + pr_info("Failed to allocate hugepage of size %ld\n", huge_page_size(h)); > > > return 0; > > > found: > > > > There is already a call to warn the user in the > > hugetlb_hstate_alloc_pages function. If you look there, you will see > > that the huge_page_size was translated into a more user friendly format > > and the count prior to the failure is included. What call path are you > > trying to cover? Also, you may want your print to be a pr_warn since it > > is a failure? > > > > Sorry I missed that in the recent kernel. I wrote the above before the > mentioned changes was done. I will drop the pr_info from the patch. Okay, thanks. I didn't think there was a code path that was missed on boot. Cheers, Liam
Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
* LEROY Christophe [240710 08:59]: > ... > > Assuming the removal of the vdso does not cause the application to seg > fault, then the user visible change is that any vdso call after a failed > mmap(MAP_FIXED) call would result in a seg fault. The only reason it > would fail is if the mapping process was attempting to map a large > enough area over the vdso (which is accounted and in the vma tree, > afaict) and ran out of memory. Note that this situation could arise > already since we could run out of memory (not accounting) after the > arch_unmap() call within the kernel. > > The code today can suffer the same fate, but not by the accounting > failure. It can happen due to failure to allocate a new vma, > do_vmi_munmap() failure after the arch_unmap() call, or any of the other > failure scenarios later in the mmap_region() function. > > At the very least, this requires an expanded change log. > >>> ... > >>> I mean why are they unmapping the VDSO, why is that valid, why does it > >>> need > >>> that field to be set to NULL, is it possible to signify that in some other > >>> way etc.? > >> > >> It was originally for CRIU. So a niche workload on a niche architecture. > >> > >> But from the commit that added it, it sounds like CRIU was using mremap, > >> which should be handled these days by vdso_mremap(). So it could be that > >> arch_unmap() is not actually needed for CRIU anymore. > > > > Oh that's interesting! > > > >> > >> Then I guess we have to decide if removing our arch_unmap() would be an > >> ABI break, regardless of whether CRIU needs it or not. > > > > Seems to me like an internal implementation detail that should hopefully > > not result in anything that should have visible ABI impact? > > > > I guess this is something we ought to assess. It would be useful to > > eliminate hooks where we can so we can better control VMA behaviour without > > having to worry about an arch being able to do arbitrary things at > > unexpected times, especially pertinent where we change the order of things. > > > > I see you are talking about arch_unmap(). I didn't follow the entire > discussion but we have some related stuff here: > https://github.com/linuxppc/issues/issues/241 > > If I remember correctly arch_unmap() should have gone away we Dmitry's > series > https://lore.kernel.org/lkml/20210611180242.711399-1-d...@arista.com/#r > but it hasn't been applied yet. > That is good news! To review, ppc is the only arch using this now and it sounds like you want to remove it too. Considering the age of that thread and the possibility of conflict with my series, can I drop the entire arch_unmap() function instead of modifying the handling in core mm? I'm going to assume that's okay and start working on this for v4 (because there hasn't been a public reply for v4 since 2023/10/11). This would mean less arch-specific hooks and that's always a good idea. Thanks, Liam
[PATCH v4 17/21] mm/mmap: Drop arch_unmap() call from all archs
From: "Liam R. Howlett" The arch_unmap call was previously moved above the rbtree modifications in commit 5a28fc94c914 ("x86/mpx, mm/core: Fix recursive munmap() corruption"). The move was motivated by an issue with calling arch_unmap() after the rbtree was modified. Since the above commit, mpx was dropped from the kernel in 45fc24e89b7c ("x86/mpx: remove MPX from arch/x86"), so the motivation for calling arch_unmap() prior to modifying the vma tree no longer exists (regardless of rbtree or maple tree implementations). Furthermore, the powerpc implementation is also no longer needed as per [1] and [2]. So the arch_unmap() function can be completely removed. Link: https://lore.kernel.org/lkml/20210611180242.711399-1-d...@arista.com/ Link: https://github.com/linuxppc/issues/issues/241 Signed-off-by: Liam R. Howlett Cc: Dave Hansen Cc: LEROY Christophe Cc: linuxppc-dev@lists.ozlabs.org Cc: Dmitry Safonov Cc: Michael Ellerman --- arch/powerpc/include/asm/mmu_context.h | 9 - arch/x86/include/asm/mmu_context.h | 5 - include/asm-generic/mm_hooks.h | 11 +++ mm/mmap.c | 12 ++-- 4 files changed, 5 insertions(+), 32 deletions(-) diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index 37bffa0f7918..a334a1368848 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -260,15 +260,6 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, extern void arch_exit_mmap(struct mm_struct *mm); -static inline void arch_unmap(struct mm_struct *mm, - unsigned long start, unsigned long end) -{ - unsigned long vdso_base = (unsigned long)mm->context.vdso; - - if (start <= vdso_base && vdso_base < end) - mm->context.vdso = NULL; -} - #ifdef CONFIG_PPC_MEM_KEYS bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write, bool execute, bool foreign); diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 8dac45a2c7fc..80f2a3187aa6 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -232,11 +232,6 @@ static inline bool is_64bit_mm(struct mm_struct *mm) } #endif -static inline void arch_unmap(struct mm_struct *mm, unsigned long start, - unsigned long end) -{ -} - /* * We only want to enforce protection keys on the current process * because we effectively have no access to PKRU for other diff --git a/include/asm-generic/mm_hooks.h b/include/asm-generic/mm_hooks.h index 4dbb177d1150..f7996376baf9 100644 --- a/include/asm-generic/mm_hooks.h +++ b/include/asm-generic/mm_hooks.h @@ -1,8 +1,8 @@ /* SPDX-License-Identifier: GPL-2.0 */ /* - * Define generic no-op hooks for arch_dup_mmap, arch_exit_mmap - * and arch_unmap to be included in asm-FOO/mmu_context.h for any - * arch FOO which doesn't need to hook these. + * Define generic no-op hooks for arch_dup_mmap and arch_exit_mmap to be + * included in asm-FOO/mmu_context.h for any arch FOO which doesn't need to hook + * these. 
*/ #ifndef _ASM_GENERIC_MM_HOOKS_H #define _ASM_GENERIC_MM_HOOKS_H @@ -17,11 +17,6 @@ static inline void arch_exit_mmap(struct mm_struct *mm) { } -static inline void arch_unmap(struct mm_struct *mm, - unsigned long start, unsigned long end) -{ -} - static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write, bool execute, bool foreign) { diff --git a/mm/mmap.c b/mm/mmap.c index d5bd404893a8..df565f51971d 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2652,6 +2652,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, mm = vms->mm; mm->map_count -= vms->vma_count; mm->locked_vm -= vms->locked_vm; + if (vms->unlock) mmap_write_downgrade(mm); @@ -2879,7 +2880,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, * * This function takes a @mas that is either pointing to the previous VMA or set * to MA_START and sets it up to remove the mapping(s). The @len will be - * aligned and any arch_unmap work will be preformed. + * aligned. * * Return: 0 on success and drops the lock if so directed, error and leaves the * lock held otherwise. @@ -2899,16 +2900,12 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, return -EINVAL; /* -* Check if memory is sealed before arch_unmap. * Prevent unmapping a sealed VMA. * can_modify_mm assumes we have acquired the lock on MM. */ if (unlikely(!can_modify_mm(mm, start, end))) return -EPERM; -/* arch_unmap() might do unmaps itself. */ - arch_unmap(mm, start, end); - /* Find t
Re: [PATCH v4 17/21] mm/mmap: Drop arch_unmap() call from all archs
* LEROY Christophe [240710 17:02]: > > > Le 10/07/2024 à 21:22, Liam R. Howlett a écrit : > > From: "Liam R. Howlett" > > > > The arch_unmap call was previously moved above the rbtree modifications > > in commit 5a28fc94c914 ("x86/mpx, mm/core: Fix recursive munmap() > > corruption"). The move was motivated by an issue with calling > > arch_unmap() after the rbtree was modified. > > > > Since the above commit, mpx was dropped from the kernel in 45fc24e89b7c > > ("x86/mpx: remove MPX from arch/x86"), so the motivation for calling > > arch_unmap() prior to modifying the vma tree no longer exists > > (regardless of rbtree or maple tree implementations). > > > > Furthermore, the powerpc implementation is also no longer needed as per > > [1] and [2]. So the arch_unmap() function can be completely removed. > > I'm not sure to understand. Is it replaced by something else ? > We wanted to get rid of arch_unmap() but it was supposed to be replaced > by some core function because the functionnality itself is still > required and indeed all the discussion around [2] demonstrated that not > only powerpc but at least arm and probably others needed to properly > clean-up reference to VDSO mappings on unmapping. > > So as mentioned by Michael you can't just drop that without replacing it > by something else. We need the VDSO signal handling to properly fallback > on stack-based trampoline when the VDSO trampoline gets mapped out. I'll address this after the part I missed.. > > Or did I miss something ? > I think I missed something in regards to what you need in ppc. >From what I understand, other platforms still map and use the vdso (context.vdso is set), but unmap_arch() does nothing. It is only the powerpc version that clears the vdso pointer if it is unmapped. git grep -w arch_unmap shows: arch/powerpc/include/asm/mmu_context.h arch/x86/include/asm/mmu_context.h include/asm-generic/mm_hooks.h mm/mmap.c The generic and x86 versions are empty. >From the patch set you referenced, we see changes related to the files modified, but I don't think any of them did anything with unmap_arch(). arm: a0d2fcd62ac2 ("vdso/ARM: Make union vdso_data_store available for all architectures") arm64: d0fba04847ae ("arm64: vdso: Use generic union vdso_data_store") mips: d697a9997a0d ("MIPS: vdso: Use generic union vdso_data_store") s390: cb3444cfdb48 ("s390/vdso: Use generic union vdso_data_store") riscv: eba755314fa7 ("riscv: vdso: Use generic union vdso_data_store") ia64 is dead nds32 is dead hexagon has a bunch of vdso work in the logs as well. There is also a6c19dfe3994 ("arm64,ia64,ppc,s390,sh,tile,um,x86,mm: remove default gate area") I do not see sparc changing away from what the patches were doing, but again, the arch_unmap() seems to do nothing there as well. So, what I was looking to do is to avoid a call to arch specific functions that does nothing but set the vdso pointer to NULL for powerpc. The thread referenced in the git bug [1] seems to indicate this is for CRIU unmapping/restoring a task, but CRIU now just moves the vdso mapping (or just works on ppc at this point?). Since [2] hasn't landed, isn't this still broken for CRIU on powerpc as it is? So, are we keeping the unmap_arch() function around, which has errors that were never fixed, for a single application that utilizes a newer method of moving the vdso anyways? On the note of CRIU, it seems it cannot handle tasks which don't have the vdso mapped anymore [3], so setting it to NULL is probably a bad plan even for that one application? 
So, I think this just leaves the fallback when the VDSO is unmapped. Well, it seems what people have been doing is to unmap the vdso to stop these functions from working [4]. At least this is what some users are doing. The ability to replace this vma with a guard vma leads me to believe that other archs don't fall back at all - please correct me if I'm wrong! I also cannot find any reference to other archs clearing context.vdso (aside from failures in __setup_additional_pages). But maybe I don't fully understand how this works?

Thanks, Liam

[1] https://lore.kernel.org/lkml/87d0lht1c0@concordia.ellerman.id.au/
[2] https://lore.kernel.org/lkml/9c2b2826-4083-fc9c-5a4d-c101858dd...@linux.vnet.ibm.com/
[3] https://github.com/checkpoint-restore/criu/issues/488
[4] https://github.com/insanitybit/void-ship
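For reference, the powerpc arch_unmap() under discussion is the only non-empty implementation in the tree; it does nothing beyond clearing the cached VDSO pointer when the VDSO base falls inside the unmapped range (it also appears as the removed hunk in Michael Ellerman's patch later in this thread):

static inline void arch_unmap(struct mm_struct *mm,
			      unsigned long start, unsigned long end)
{
	unsigned long vdso_base = (unsigned long)mm->context.vdso;

	if (start <= vdso_base && vdso_base < end)
		mm->context.vdso = NULL;
}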
Re: [PATCH v4 17/21] mm/mmap: Drop arch_unmap() call from all archs
* LEROY Christophe [240711 04:28]: > > > Le 11/07/2024 à 01:26, Liam R. Howlett a écrit : > > * LEROY Christophe [240710 17:02]: > >> > >> > >> Le 10/07/2024 à 21:22, Liam R. Howlett a écrit : > >>> From: "Liam R. Howlett" > >>> > >>> The arch_unmap call was previously moved above the rbtree modifications > >>> in commit 5a28fc94c914 ("x86/mpx, mm/core: Fix recursive munmap() > >>> corruption"). The move was motivated by an issue with calling > >>> arch_unmap() after the rbtree was modified. > >>> > >>> Since the above commit, mpx was dropped from the kernel in 45fc24e89b7c > >>> ("x86/mpx: remove MPX from arch/x86"), so the motivation for calling > >>> arch_unmap() prior to modifying the vma tree no longer exists > >>> (regardless of rbtree or maple tree implementations). > >>> > >>> Furthermore, the powerpc implementation is also no longer needed as per > >>> [1] and [2]. So the arch_unmap() function can be completely removed. > >> > >> I'm not sure to understand. Is it replaced by something else ? > >> We wanted to get rid of arch_unmap() but it was supposed to be replaced > >> by some core function because the functionnality itself is still > >> required and indeed all the discussion around [2] demonstrated that not > >> only powerpc but at least arm and probably others needed to properly > >> clean-up reference to VDSO mappings on unmapping. > >> > >> So as mentioned by Michael you can't just drop that without replacing it > >> by something else. We need the VDSO signal handling to properly fallback > >> on stack-based trampoline when the VDSO trampoline gets mapped out. > > > > I'll address this after the part I missed.. > > After ? What do you mean ? It needs to be addressed _before_ removing > arch_unmap() After the later comments in this email, sorry that wasn't clear. > > > > >> > >> Or did I miss something ? > >> > > > > I think I missed something in regards to what you need in ppc. > > It is not only powerpc. Powerpc is the only one doing it at the moment > but investigation has demonstrated that other architectures are affected. > > > > > From what I understand, other platforms still map and use the vdso > > (context.vdso is set), but unmap_arch() does nothing. It is only the > > powerpc version that clears the vdso pointer if it is unmapped. > > Yes on powerpc it works. On other platforms like arm it segfaults so it > should be fixed > (https://lore.kernel.org/lkml/87imd5h5kb@mpe.ellerman.id.au/) > > Could be fixed by properly implementing arch_unmap() on every arch, or > carry-on with Dmitry's series. Okay, I understand what you are saying now. I'm not going to tackle that change within this series, so I'll just relocate the arch_munmap() back to where it was, after the removal of the vmas in v5. > I think you fully understand that it doesn't work as it is except on > powerpc. Again the goal should be to make it work on all architectures. Got it, thanks for clarifying. Regards, Liam
[PATCH v5 17/21] mm/mmap: Relocate arch_unmap() to vms_complete_munmap_vmas()
From: "Liam R. Howlett" The arch_unmap call was previously moved above the rbtree modifications in commit 5a28fc94c914 ("x86/mpx, mm/core: Fix recursive munmap() corruption"). The move was motivated by an issue with calling arch_unmap() after the rbtree was modified. Since the above commit, mpx was dropped from the kernel in 45fc24e89b7c ("x86/mpx: remove MPX from arch/x86"), so the motivation for calling arch_unmap() prior to modifying the vma tree no longer exists (regardless of rbtree or maple tree implementations). Signed-off-by: Liam R. Howlett Cc: LEROY Christophe Cc: linuxppc-dev@lists.ozlabs.org Cc: Dave Hansen --- mm/mmap.c | 12 ++-- 1 file changed, 2 insertions(+), 10 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 9f870e715a47..117e8240f697 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2680,6 +2680,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, mm = vms->mm; mm->map_count -= vms->vma_count; mm->locked_vm -= vms->locked_vm; + arch_unmap(mm, vms->start, vms->end); /* write lock needed */ if (vms->unlock) mmap_write_downgrade(mm); @@ -2907,7 +2908,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, * * This function takes a @mas that is either pointing to the previous VMA or set * to MA_START and sets it up to remove the mapping(s). The @len will be - * aligned and any arch_unmap work will be preformed. + * aligned prior to munmap. * * Return: 0 on success and drops the lock if so directed, error and leaves the * lock held otherwise. @@ -2927,16 +2928,12 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, return -EINVAL; /* -* Check if memory is sealed before arch_unmap. * Prevent unmapping a sealed VMA. * can_modify_mm assumes we have acquired the lock on MM. */ if (unlikely(!can_modify_mm(mm, start, end))) return -EPERM; -/* arch_unmap() might do unmaps itself. */ - arch_unmap(mm, start, end); - /* Find the first overlapping VMA */ vma = vma_find(vmi, end); if (!vma) { @@ -2997,9 +2994,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr, if (unlikely(!can_modify_mm(mm, addr, end))) return -EPERM; -/* arch_unmap() might do unmaps itself. */ - arch_unmap(mm, addr, end); - /* Find the first overlapping VMA */ vma = vma_find(&vmi, end); init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false); @@ -3377,14 +3371,12 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, struct mm_struct *mm = vma->vm_mm; /* -* Check if memory is sealed before arch_unmap. * Prevent unmapping a sealed VMA. * can_modify_mm assumes we have acquired the lock on MM. */ if (unlikely(!can_modify_mm(mm, start, end))) return -EPERM; - arch_unmap(mm, start, end); return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock); } -- 2.43.0
Re: [PATCH v5 17/21] mm/mmap: Relocate arch_unmap() to vms_complete_munmap_vmas()
* Lorenzo Stoakes [240722 10:25]: > On Wed, Jul 17, 2024 at 04:07:05PM GMT, Liam R. Howlett wrote: > > From: "Liam R. Howlett" > > > > The arch_unmap call was previously moved above the rbtree modifications > > in commit 5a28fc94c914 ("x86/mpx, mm/core: Fix recursive munmap() > > corruption"). The move was motivated by an issue with calling > > arch_unmap() after the rbtree was modified. > > > > Since the above commit, mpx was dropped from the kernel in 45fc24e89b7c > > ("x86/mpx: remove MPX from arch/x86"), so the motivation for calling > > arch_unmap() prior to modifying the vma tree no longer exists > > (regardless of rbtree or maple tree implementations). > > > > Signed-off-by: Liam R. Howlett > > Cc: LEROY Christophe > > Cc: linuxppc-dev@lists.ozlabs.org > > Cc: Dave Hansen > > --- > > mm/mmap.c | 12 ++-- > > 1 file changed, 2 insertions(+), 10 deletions(-) > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 9f870e715a47..117e8240f697 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -2680,6 +2680,7 @@ static void vms_complete_munmap_vmas(struct > > vma_munmap_struct *vms, > > mm = vms->mm; > > mm->map_count -= vms->vma_count; > > mm->locked_vm -= vms->locked_vm; > > + arch_unmap(mm, vms->start, vms->end); /* write lock needed */ > > Worth having a mmap_assert_write_locked() here? Would make this > self-documenting also. No, this is just to point out it cannot be lowered further in this function. > > > if (vms->unlock) > > mmap_write_downgrade(mm); > > > > @@ -2907,7 +2908,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct > > vm_area_struct *vma, > > * > > * This function takes a @mas that is either pointing to the previous VMA > > or set > > * to MA_START and sets it up to remove the mapping(s). The @len will be > > - * aligned and any arch_unmap work will be preformed. > > + * aligned prior to munmap. > > * > > * Return: 0 on success and drops the lock if so directed, error and > > leaves the > > * lock held otherwise. > > @@ -2927,16 +2928,12 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct > > mm_struct *mm, > > return -EINVAL; > > > > /* > > -* Check if memory is sealed before arch_unmap. > > * Prevent unmapping a sealed VMA. > > * can_modify_mm assumes we have acquired the lock on MM. > > */ > > if (unlikely(!can_modify_mm(mm, start, end))) > > return -EPERM; > > > > -/* arch_unmap() might do unmaps itself. */ > > - arch_unmap(mm, start, end); > > - > > /* Find the first overlapping VMA */ > > vma = vma_find(vmi, end); > > if (!vma) { > > @@ -2997,9 +2994,6 @@ unsigned long mmap_region(struct file *file, unsigned > > long addr, > > if (unlikely(!can_modify_mm(mm, addr, end))) > > return -EPERM; > > > > -/* arch_unmap() might do unmaps itself. */ > > - arch_unmap(mm, addr, end); > > - > > It seems to me that the intent of this particular invocation was to ensure > we have done what we can to unmap before trying to unmap ourselves. > > However this seems stupid to me anyway - 'hey maybe the arch will do this > for us' - yeah probably not. > > So this should definitely go regardless, given we will invoke it later now > anyway. This was covered in the commit message, it was because we needed to remove the VMAs earlier for a dead feature (mpx). 
> > > /* Find the first overlapping VMA */ > > vma = vma_find(&vmi, end); > > init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false); > > @@ -3377,14 +3371,12 @@ int do_vma_munmap(struct vma_iterator *vmi, struct > > vm_area_struct *vma, > > struct mm_struct *mm = vma->vm_mm; > > > > /* > > -* Check if memory is sealed before arch_unmap. > > * Prevent unmapping a sealed VMA. > > * can_modify_mm assumes we have acquired the lock on MM. > > */ > > if (unlikely(!can_modify_mm(mm, start, end))) > > return -EPERM; > > > > - arch_unmap(mm, start, end); > > return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock); > > } > > > > -- > > 2.43.0 > > > > I hope we can find a way to eliminate these kind of hooks altogether as > they reduce our control over how VMA operations are performed. Agreed. I see a path forward on doing just that. > > LGTM, > > Reviewed-by: Lorenzo Stoakes
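For completeness, the assertion Lorenzo floated (and which was declined, since the comment only documents that the call cannot move below the downgrade) would have looked roughly like this at the relocated call site - a sketch of the suggestion, not code that was merged:

	mmap_assert_write_locked(mm);		/* suggested self-documenting form */
	arch_unmap(mm, vms->start, vms->end);
	if (vms->unlock)
		mmap_write_downgrade(mm);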
Re: [PATCH 1/4] mm: Add optional close() to struct vm_special_mapping
* Michael Ellerman [240807 08:41]: > Add an optional close() callback to struct vm_special_mapping. It will > be used, by powerpc at least, to handle unmapping of the VDSO. > > Suggested-by: Linus Torvalds > Signed-off-by: Michael Ellerman > --- > include/linux/mm_types.h | 2 ++ > mm/mmap.c| 3 +++ > 2 files changed, 5 insertions(+) > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 485424979254..ef32d87a3adc 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -1313,6 +1313,8 @@ struct vm_special_mapping { > > int (*mremap)(const struct vm_special_mapping *sm, >struct vm_area_struct *new_vma); nit: missing new line? > + void (*close)(const struct vm_special_mapping *sm, > + struct vm_area_struct *vma); > }; > > enum tlb_flush_reason { > diff --git a/mm/mmap.c b/mm/mmap.c > index d0dfc85b209b..24bd6aa9155c 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3624,6 +3624,9 @@ static vm_fault_t special_mapping_fault(struct vm_fault > *vmf); > */ The above comment should probably be expanded to explain what this is about, or removed. > static void special_mapping_close(struct vm_area_struct *vma) > { > + const struct vm_special_mapping *sm = vma->vm_private_data; > + if (sm->close) > + sm->close(sm, vma); Right now we have the same sort of situation for mremap calls on special: we have a call to the specific vma mremap() function. However, every single one of the vdso mremap() calls that I see: s390, riscv, powerppc, parisc, loongarch, arm64, arm seems to do the same thing, except ppc which verifies the size is okay before doing the same thing. So, are we missing an opportunity to avoid every arch having the same implementation here (that will evolve into random bugs existing in some archs for years before someone realises the cloned code wasn't fixed)? Do we already have a fix in ppc for the size checking that doesn't exist in the other archs in the case of mremap? That is, if it's a special mapping that has the same start as the vdso, can't all platforms do the same thing and set it to NULL and avoid every platform cloning the same function? Since this deals with mm_context_t, which is per-platform data, I think the easiest way to make this more generic is to make a generic_vdso_close() and set it in specific vmas on a per-platform basis. At least then we can use the same close function across multiple platforms and make this less error prone to cloned code not receiving fixes. ... Thanks, Liam
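A minimal sketch of the generic_vdso_close() idea suggested above. Nothing here exists today: the helper name and the arch_vdso_base_ptr() accessor are hypothetical, and each architecture would still need to point .close at the shared helper in its vdso vm_special_mapping instead of cloning the same function.

/* Hypothetical sketch only: one shared close() hook for vdso mappings.
 * arch_vdso_base_ptr() is an assumed per-arch accessor for the per-mm
 * vdso pointer kept in mm_context_t. */
static void generic_vdso_close(const struct vm_special_mapping *sm,
			       struct vm_area_struct *vma)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long *vdso = arch_vdso_base_ptr(mm);

	/* mremap() has already updated the pointer; only clear on unmap. */
	if (vma->vm_start != *vdso)
		return;

	*vdso = 0;
}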
Re: [PATCH 2/4] powerpc/mm: Handle VDSO unmapping via close() rather than arch_unmap()
* Jeff Xu [240807 11:44]: > On Wed, Aug 7, 2024 at 5:41 AM Michael Ellerman wrote: > > > > Add a close() callback to the VDSO special mapping to handle unmapping > > of the VDSO. That will make it possible to remove the arch_unmap() hook > > entirely in a subsequent patch. > > > > Suggested-by: Linus Torvalds > > Signed-off-by: Michael Ellerman > > --- > > arch/powerpc/include/asm/mmu_context.h | 4 > > arch/powerpc/kernel/vdso.c | 17 + > > 2 files changed, 17 insertions(+), 4 deletions(-) > > > > diff --git a/arch/powerpc/include/asm/mmu_context.h > > b/arch/powerpc/include/asm/mmu_context.h > > index 37bffa0f7918..9b8c1555744e 100644 > > --- a/arch/powerpc/include/asm/mmu_context.h > > +++ b/arch/powerpc/include/asm/mmu_context.h > > @@ -263,10 +263,6 @@ extern void arch_exit_mmap(struct mm_struct *mm); > > static inline void arch_unmap(struct mm_struct *mm, > > unsigned long start, unsigned long end) > > { > > - unsigned long vdso_base = (unsigned long)mm->context.vdso; > > - > > - if (start <= vdso_base && vdso_base < end) > > - mm->context.vdso = NULL; > > } > > > > #ifdef CONFIG_PPC_MEM_KEYS > > diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c > > index 7a2ff9010f17..220a76cae7c1 100644 > > --- a/arch/powerpc/kernel/vdso.c > > +++ b/arch/powerpc/kernel/vdso.c > > @@ -81,6 +81,21 @@ static int vdso64_mremap(const struct vm_special_mapping > > *sm, struct vm_area_str > > return vdso_mremap(sm, new_vma, &vdso64_end - &vdso64_start); > > } > > > > +static void vdso_close(const struct vm_special_mapping *sm, struct > > vm_area_struct *vma) > > +{ > > + struct mm_struct *mm = vma->vm_mm; > > + > > + /* > > +* close() is called for munmap() but also for mremap(). In the > > mremap() > > +* case the vdso pointer has already been updated by the mremap() > > hook > > +* above, so it must not be set to NULL here. > > +*/ > > + if (vma->vm_start != (unsigned long)mm->context.vdso) > > + return; > > + > > + mm->context.vdso = NULL; > > +} > > + > > static vm_fault_t vvar_fault(const struct vm_special_mapping *sm, > > struct vm_area_struct *vma, struct vm_fault > > *vmf); > > > > @@ -92,11 +107,13 @@ static struct vm_special_mapping vvar_spec > > __ro_after_init = { > > static struct vm_special_mapping vdso32_spec __ro_after_init = { > > .name = "[vdso]", > > .mremap = vdso32_mremap, > > + .close = vdso_close, > IIUC, only CHECKPOINT_RESTORE requires this, and > CHECKPOINT_RESTORE is in init/Kconfig, with default N > > Can we add #ifdef CONFIG_CHECKPOINT_RESTORE here ? > No, these can be unmapped and it needs to be cleaned up. Valgrind is one application that is known to unmap the vdso and runs into issues on platforms that do not handle the removal correctly. Thanks, Liam
Re: [PATCH 2/4] powerpc/mm: Handle VDSO unmapping via close() rather than arch_unmap()
* Jeff Xu [240807 12:37]: > On Wed, Aug 7, 2024 at 8:56 AM Liam R. Howlett > wrote: > > > > * Jeff Xu [240807 11:44]: > > > On Wed, Aug 7, 2024 at 5:41 AM Michael Ellerman > > > wrote: > > > > > > > > Add a close() callback to the VDSO special mapping to handle unmapping > > > > of the VDSO. That will make it possible to remove the arch_unmap() hook > > > > entirely in a subsequent patch. > > > > > > > > Suggested-by: Linus Torvalds > > > > Signed-off-by: Michael Ellerman > > > > --- > > > > arch/powerpc/include/asm/mmu_context.h | 4 > > > > arch/powerpc/kernel/vdso.c | 17 + > > > > 2 files changed, 17 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/arch/powerpc/include/asm/mmu_context.h > > > > b/arch/powerpc/include/asm/mmu_context.h > > > > index 37bffa0f7918..9b8c1555744e 100644 > > > > --- a/arch/powerpc/include/asm/mmu_context.h > > > > +++ b/arch/powerpc/include/asm/mmu_context.h > > > > @@ -263,10 +263,6 @@ extern void arch_exit_mmap(struct mm_struct *mm); > > > > static inline void arch_unmap(struct mm_struct *mm, > > > > unsigned long start, unsigned long end) > > > > { > > > > - unsigned long vdso_base = (unsigned long)mm->context.vdso; > > > > - > > > > - if (start <= vdso_base && vdso_base < end) > > > > - mm->context.vdso = NULL; > > > > } > > > > > > > > #ifdef CONFIG_PPC_MEM_KEYS > > > > diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c > > > > index 7a2ff9010f17..220a76cae7c1 100644 > > > > --- a/arch/powerpc/kernel/vdso.c > > > > +++ b/arch/powerpc/kernel/vdso.c > > > > @@ -81,6 +81,21 @@ static int vdso64_mremap(const struct > > > > vm_special_mapping *sm, struct vm_area_str > > > > return vdso_mremap(sm, new_vma, &vdso64_end - &vdso64_start); > > > > } > > > > > > > > +static void vdso_close(const struct vm_special_mapping *sm, struct > > > > vm_area_struct *vma) > > > > +{ > > > > + struct mm_struct *mm = vma->vm_mm; > > > > + > > > > + /* > > > > +* close() is called for munmap() but also for mremap(). In the > > > > mremap() > > > > +* case the vdso pointer has already been updated by the > > > > mremap() hook > > > > +* above, so it must not be set to NULL here. > > > > +*/ > > > > + if (vma->vm_start != (unsigned long)mm->context.vdso) > > > > + return; > > > > + > > > > + mm->context.vdso = NULL; > > > > +} > > > > + > > > > static vm_fault_t vvar_fault(const struct vm_special_mapping *sm, > > > > struct vm_area_struct *vma, struct > > > > vm_fault *vmf); > > > > > > > > @@ -92,11 +107,13 @@ static struct vm_special_mapping vvar_spec > > > > __ro_after_init = { > > > > static struct vm_special_mapping vdso32_spec __ro_after_init = { > > > > .name = "[vdso]", > > > > .mremap = vdso32_mremap, > > > > + .close = vdso_close, > > > IIUC, only CHECKPOINT_RESTORE requires this, and > > > CHECKPOINT_RESTORE is in init/Kconfig, with default N > > > > > > Can we add #ifdef CONFIG_CHECKPOINT_RESTORE here ? > > > > > > > No, these can be unmapped and it needs to be cleaned up. Valgrind is > > one application that is known to unmap the vdso and runs into issues on > > platforms that do not handle the removal correctly. > > > Maybe Valgrind needs that exactly for checkpoint restore ? [1] Maybe, and maybe there are 100 other applications unmapping the vdso for some other reason? > > "CRIU fails to restore applications that have unmapped the vDSO > segment. One such > application is Valgrind." > > Usually when the kernel accepts new functionality, the patch needs to > state the user case. 
This isn't new functionality; arch_unmap() existed before and this change moves that functionality. We can't just disable it because we assume we know it's only used once. I had planned to do this sort of patch set as a follow-up to my patch set [1], so I was
Re: [PATCH 2/4] powerpc/mm: Handle VDSO unmapping via close() rather than arch_unmap()
* Jeff Xu [240807 16:12]: > On Wed, Aug 7, 2024 at 10:11 AM Liam R. Howlett > wrote: > > > > * Jeff Xu [240807 12:37]: > > > On Wed, Aug 7, 2024 at 8:56 AM Liam R. Howlett > > > wrote: > > > > > > > > * Jeff Xu [240807 11:44]: > > > > > On Wed, Aug 7, 2024 at 5:41 AM Michael Ellerman > > > > > wrote: > > > > > > > > > > > > Add a close() callback to the VDSO special mapping to handle > > > > > > unmapping > > > > > > of the VDSO. That will make it possible to remove the arch_unmap() > > > > > > hook > > > > > > entirely in a subsequent patch. > > > > > > > > > > > > Suggested-by: Linus Torvalds > > > > > > Signed-off-by: Michael Ellerman > > > > > > --- > > > > > > arch/powerpc/include/asm/mmu_context.h | 4 > > > > > > arch/powerpc/kernel/vdso.c | 17 + > > > > > > 2 files changed, 17 insertions(+), 4 deletions(-) > > > > > > > > > > > > diff --git a/arch/powerpc/include/asm/mmu_context.h > > > > > > b/arch/powerpc/include/asm/mmu_context.h > > > > > > index 37bffa0f7918..9b8c1555744e 100644 > > > > > > --- a/arch/powerpc/include/asm/mmu_context.h > > > > > > +++ b/arch/powerpc/include/asm/mmu_context.h > > > > > > @@ -263,10 +263,6 @@ extern void arch_exit_mmap(struct mm_struct > > > > > > *mm); > > > > > > static inline void arch_unmap(struct mm_struct *mm, > > > > > > unsigned long start, unsigned long > > > > > > end) > > > > > > { > > > > > > - unsigned long vdso_base = (unsigned long)mm->context.vdso; > > > > > > - > > > > > > - if (start <= vdso_base && vdso_base < end) > > > > > > - mm->context.vdso = NULL; > > > > > > } > > > > > > > > > > > > #ifdef CONFIG_PPC_MEM_KEYS > > > > > > diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c > > > > > > index 7a2ff9010f17..220a76cae7c1 100644 > > > > > > --- a/arch/powerpc/kernel/vdso.c > > > > > > +++ b/arch/powerpc/kernel/vdso.c > > > > > > @@ -81,6 +81,21 @@ static int vdso64_mremap(const struct > > > > > > vm_special_mapping *sm, struct vm_area_str > > > > > > return vdso_mremap(sm, new_vma, &vdso64_end - > > > > > > &vdso64_start); > > > > > > } > > > > > > > > > > > > +static void vdso_close(const struct vm_special_mapping *sm, struct > > > > > > vm_area_struct *vma) > > > > > > +{ > > > > > > + struct mm_struct *mm = vma->vm_mm; > > > > > > + > > > > > > + /* > > > > > > +* close() is called for munmap() but also for mremap(). In > > > > > > the mremap() > > > > > > +* case the vdso pointer has already been updated by the > > > > > > mremap() hook > > > > > > +* above, so it must not be set to NULL here. > > > > > > +*/ > > > > > > + if (vma->vm_start != (unsigned long)mm->context.vdso) > > > > > > + return; > > > > > > + > > > > > > + mm->context.vdso = NULL; > > > > > > +} > > > > > > + > > > > > > static vm_fault_t vvar_fault(const struct vm_special_mapping *sm, > > > > > > struct vm_area_struct *vma, struct > > > > > > vm_fault *vmf); > > > > > > > > > > > > @@ -92,11 +107,13 @@ static struct vm_special_mapping vvar_spec > > > > > > __ro_after_init = { > > > > > > static struct vm_special_mapping vdso32_spec __ro_after_init = { > > > > > > .name = "[vdso]", > > > > > > .mremap = vdso32_mremap, > > > > > > + .close = vdso_close, > > > > > IIUC, only CHECKPOINT_RESTORE requires this, and > > > > > CHECKPOINT_RESTORE is in init/Kconfig, with default N > > > > > >
Re: [PATCH 2/4] powerpc/mm: Handle VDSO unmapping via close() rather than arch_unmap()
* Linus Torvalds [240807 23:21]: > On Wed, 7 Aug 2024 at 16:20, Liam R. Howlett wrote: > > ... > > That said, I don't love how special powerpc is here. I think more (all?) archs should be doing something like ppc when the vdso is removed. If someone removes the vdso, then the speed up provided should just go away and the function calls shouldn't try to use the quick look up and crash. I view this as another 'caching of a vma pointer' issue that isn't cleaned up when the vma goes away. > > What we could do is to is > > - stop calling these things "special mappings", and just admit that > it's for different vdso mappings and nothing else (for some odd reason > arm and nios2 calls it a "kuser helper" rather than vdso, but it's the > exact same thing) But isn't it a special mapping? We don't allow for merging of the vma, the mlock handling has some odd behaviour with this vma, and there is the comment in mm/internal.h's mlock_vma_folio about ignoring these special vmas in a race. There is also some other 'special mapping' of vvars too? I haven't looked deeply into this yet as my investigation was preempted by vacation. > > - don't do this whole indirect function pointer thing with mremap and > close at all, and just do this all unapologetically and for all > architectures in the generic VM layer together with "if (vma->vm_start > == mm->context.vdso)" etc. > > that would get rid of the conceptual complexity of having different > architectures doing different things (and the unnecessary overhead of > having an indirect function pointer that just points to one single > thing). > > But I think the current "clean up the existing mess" is probably the > less invasive one over "make the existing mess be explicitly about > vdso and avoid unnecessary per-architecture differences". Okay, sure. > > If people want to, we can do the unification (and stop pretending the > "special mappings" could be something else) later. > I was planning to use the regular vma vm_ops to jump into the 'special unmap code' and then do all the checks there. IOW, keep the vma flagged as VM_SPECIAL and call the special_mapping_whatever() function as a regular vmops for, say, ->remove_vma() or ->mremap(). Keeping the flag means all the race avoidance/locking/merging works the same as it does today. What I am trying to avoid is another arch_get_unmapped_area() scenario where a bug exists for a decade in some versions of the cloned code. Thanks, Liam
Re: [PATCH 2/4] powerpc/mm: Handle VDSO unmapping via close() rather than arch_unmap()
* Jeff Xu [240807 23:37]: > On Wed, Aug 7, 2024 at 8:21 PM Linus Torvalds > wrote: > > > > On Wed, 7 Aug 2024 at 16:20, Liam R. Howlett > > wrote: > > > > > > Okay, I'm going to try one more time here. You are suggesting to have a > > > conf flag to leave the vdso pointer unchanged when it is unmapped. > > > Having the close behind the conf option will not prevent it from being > > > unmapped or mapped over, so what you are suggesting is have a > > > configuration option that leaves a pointer, mm->context.vdso, to be > > > unsafe if it is unmapped if you disable checkpoint restore. > > > This is a new point that I didn't realize before, if we are going to handle > unmap vdso safely, yes, this is a bugfix that should be applied everywhere > for all arch, without CHECKPOINT_RESTORE config. > > Do we need to worry about mmap(fixed) ? which can have the same effect > as mremap. Yes, but it should be handled by vm_ops->close() when MAP_FIXED unmaps the vdso. Note that you cannot MAP_FIXED over half of the vma as the vm_ops->may_split() is special_mapping_split(), which just returns -EINVAL. Thanks, Liam
Re: [PATCH 2/4] powerpc/mm: Handle VDSO unmapping via close() rather than arch_unmap()
* Jeff Xu [240808 14:37]: > On Thu, Aug 8, 2024 at 11:08 AM Liam R. Howlett > wrote: > > > > * Jeff Xu [240807 23:37]: > > > On Wed, Aug 7, 2024 at 8:21 PM Linus Torvalds > > > wrote: > > > > > > > > On Wed, 7 Aug 2024 at 16:20, Liam R. Howlett > > > > wrote: > > > > > > > > > > Okay, I'm going to try one more time here. You are suggesting to > > > > > have a > > > > > conf flag to leave the vdso pointer unchanged when it is unmapped. > > > > > Having the close behind the conf option will not prevent it from being > > > > > unmapped or mapped over, so what you are suggesting is have a > > > > > configuration option that leaves a pointer, mm->context.vdso, to be > > > > > unsafe if it is unmapped if you disable checkpoint restore. > > > > > > > This is a new point that I didn't realize before, if we are going to > > > handle > > > unmap vdso safely, yes, this is a bugfix that should be applied everywhere > > > for all arch, without CHECKPOINT_RESTORE config. > > > > > > Do we need to worry about mmap(fixed) ? which can have the same effect > > > as mremap. > > > > Yes, but it should be handled by vm_ops->close() when MAP_FIXED unmaps > > the vdso. Note that you cannot MAP_FIXED over half of the vma as the > > vm_ops->may_split() is special_mapping_split(), which just returns > > -EINVAL. > > > The may_split() failure logic is specific to vm_special_mapping, right ? Not really, it's just what exists for these vmas vm_ops struct. It's called on every vma for every split in __split_vma(). > > Do we still need to keep vm_special_mapping struct , if we are going to > treat special vma as normal vma ? No, just set the vm_ops may_split to something that returns -EINVAL.
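For reference, the -EINVAL behaviour referred to above comes from the vm_ops already attached to special mappings in mm/mmap.c; simplified, it looks like the sketch below, which is why a partial MAP_FIXED over the vdso vma fails while a full unmap goes through ->close():

static int special_mapping_split(struct vm_area_struct *vma, unsigned long addr)
{
	/* Refusing the split means the vma can only be unmapped whole. */
	return -EINVAL;
}

static const struct vm_operations_struct special_mapping_vmops = {
	.close		= special_mapping_close,
	.fault		= special_mapping_fault,
	.mremap		= special_mapping_mremap,
	.name		= special_mapping_name,
	.may_split	= special_mapping_split,
};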
Re: [PATCH v2 1/4] mm: Add optional close() to struct vm_special_mapping
* Michael Ellerman [240812 04:26]: > Add an optional close() callback to struct vm_special_mapping. It will > be used, by powerpc at least, to handle unmapping of the VDSO. > > Although support for unmapping the VDSO was initially added > for CRIU[1], it is not desirable to guard that support behind > CONFIG_CHECKPOINT_RESTORE. > > There are other known users of unmapping the VDSO which are not related > to CRIU, eg. Valgrind [2] and void-ship [3]. > > The powerpc arch_unmap() hook has been in place for ~9 years, with no > ifdef, so there may be other unknown users that have come to rely on > unmapping the VDSO. Even if the code was behind an ifdef, major distros > enable CHECKPOINT_RESTORE so users may not realise unmapping the VDSO > depends on that configuration option. > > It's also undesirable to have such core mm behaviour behind a relatively > obscure CONFIG option. > > Longer term the unmap behaviour should be standardised across > architectures, however that is complicated by the fact the VDSO pointer > is stored differently across architectures. There was a previous attempt > to unify that handling [4], which could be revived. > > See [5] for further discussion. Thank you for tackling this issue, it's much improved. For the whole series: Reviewed-by: Liam R. Howlett > > [1]: commit 83d3f0e90c6c ("powerpc/mm: tracking vDSO remap") > [2]: > https://sourceware.org/git/?p=valgrind.git;a=commit;h=3a004915a2cbdcdebafc1612427576bf3321eef5 > [3]: https://github.com/insanitybit/void-ship > [4]: https://lore.kernel.org/lkml/20210611180242.711399-17-d...@arista.com/ > [5]: > https://lore.kernel.org/linuxppc-dev/shiq5v3jrmyi6ncwke7wgl76ojysgbhrchsk32q4lbx2hadqqc@kzyy2igem256 > > Suggested-by: Linus Torvalds > Signed-off-by: Michael Ellerman > Reviewed-by: David Hildenbrand > --- > include/linux/mm_types.h | 3 +++ > mm/mmap.c| 6 ++ > 2 files changed, 9 insertions(+) > > v2: > - Add some blank lines as requested. > - Expand special_mapping_close() comment. > - Add David's reviewed-by. > - Expand change log to capture review discussion. > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 485424979254..78bdfc59abe5 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -1313,6 +1313,9 @@ struct vm_special_mapping { > > int (*mremap)(const struct vm_special_mapping *sm, >struct vm_area_struct *new_vma); > + > + void (*close)(const struct vm_special_mapping *sm, > + struct vm_area_struct *vma); > }; > > enum tlb_flush_reason { > diff --git a/mm/mmap.c b/mm/mmap.c > index d0dfc85b209b..af4dbf0d3bd4 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3620,10 +3620,16 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t > flags, long npages) > static vm_fault_t special_mapping_fault(struct vm_fault *vmf); > > /* > + * Close hook, called for unmap() and on the old vma for mremap(). > + * > * Having a close hook prevents vma merging regardless of flags. > */ > static void special_mapping_close(struct vm_area_struct *vma) > { > + const struct vm_special_mapping *sm = vma->vm_private_data; > + > + if (sm->close) > + sm->close(sm, vma); > } > > static const char *special_mapping_name(struct vm_area_struct *vma) > -- > 2.45.2 >
Re: [PATCH 00/16] mm: Introduce MAP_BELOW_HINT
* Charlie Jenkins [240828 01:49]: > Some applications rely on placing data in free bits addresses allocated > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > address returned by mmap to be less than the maximum address space, > unless the hint address is greater than this value. Wait, what arch(s) allows for greater than the max? The passed hint should be where we start searching, but we go to the lower limit then start at the hint and search up (or vice-versa on the directions). I don't understand how unmapping works on a higher address; we would fail to free it on termination of the application. Also, there are archs that map outside of the VMAs, which are freed by freeing from the prev->vm_end to next->vm_start, so I don't understand what that looks like in this reality as well. > > On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This > flag allows applications a way to specify exactly how many bits they > want to be left unused by mmap. This eliminates the need for > applications to know the page table hierarchy of the system to be able > to reason which addresses mmap will be allowed to return. But, why do they need to know today? We have a limit for this don't we? Also, these upper limits are how some archs use the upper bits that you are trying to use. > > --- > riscv made this feature of mmap returning addresses less than the hint > address the default behavior. This was in contrast to the implementation > of x86/arm64 that have a single boundary at the 5-level page table > region. However this restriction proved too great -- the reduced > address space when using a hint address was too small. Yes, the hint is used to group things close together so it would literally be random chance on if you have enough room or not (aslr and all). > > A patch for riscv [1] reverts the behavior that broke userspace. This > series serves to make this feature available to all architectures. I don't fully understand this statement, you say it broke userspace so now you are porting it to everyone? This reads as if you are braking the userspace on all architectures :) If you fail to find room below, then your application fails as there is no way to get the upper bits you need. It would be better to fix this in userspace - if your application is returned too high an address, then free it and exit because it's going to fail anyways. > > I have only tested on riscv and x86. This should be an RFC then. > There is a tremendous amount of > duplicated code in mmap so the implementations across architectures I > believe should be mostly consistent. I added this feature to all > architectures that implement either > arch_get_mmap_end()/arch_get_mmap_base() or > arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added > it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base(). Way too much duplicate code. We should be figuring out how to make this all work with the same code. This is going to make the cloned code problem worse. > > Link: > https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe...@rivosinc.com/T/ > [1] > > To: Arnd Bergmann > To: Paul Walmsley > To: Palmer Dabbelt > To: Albert Ou > To: Catalin Marinas > To: Will Deacon > To: Michael Ellerman > To: Nicholas Piggin > To: Christophe Leroy > To: Naveen N Rao > To: Muchun Song > To: Andrew Morton > To: Liam R. Howlett > To: Vlastimil Babka > To: Lorenzo Stoakes > To: Thomas Gleixner > To: Ingo Molnar > To: Borislav Petkov > To: Dave Hansen > To: x...@kernel.org > To: H. 
Peter Anvin > To: Huacai Chen > To: WANG Xuerui > To: Russell King > To: Thomas Bogendoerfer > To: James E.J. Bottomley > To: Helge Deller > To: Alexander Gordeev > To: Gerald Schaefer > To: Heiko Carstens > To: Vasily Gorbik > To: Christian Borntraeger > To: Sven Schnelle > To: Yoshinori Sato > To: Rich Felker > To: John Paul Adrian Glaubitz > To: David S. Miller > To: Andreas Larsson > To: Shuah Khan > To: Alexandre Ghiti > Cc: linux-a...@vger.kernel.org > Cc: linux-ker...@vger.kernel.org > Cc: Palmer Dabbelt > Cc: linux-ri...@lists.infradead.org > Cc: linux-arm-ker...@lists.infradead.org > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux...@kvack.org > Cc: loonga...@lists.linux.dev > Cc: linux-m...@vger.kernel.org > Cc: linux-par...@vger.kernel.org > Cc: linux-s...@vger.kernel.org > Cc: linux...@vger.kernel.org > Cc: sparcli...@vger.kernel.org > Cc: linux-kselft...@vger.kernel.org > Signed-off-by: Charlie Jenkins > > --- > Charlie Jenkins (16): > mm: Add MAP_BELOW_HINT > riscv: mm: Do not restrict mmap address based on hin
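Coming back to the point above about handling this in userspace: a sketch (the 47-bit limit and the helper name are illustrative assumptions, not a proposed API) of requesting a low address via the hint and bailing out if the kernel hands back something the application cannot tag:

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define TAG_SAFE_LIMIT	(1UL << 47)	/* illustrative: highest address the app tolerates */

static void *alloc_tag_safe(size_t len)
{
	/* The hint is only a search starting point, so verify the result. */
	void *p = mmap((void *)(TAG_SAFE_LIMIT / 2), len,
		       PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	if ((uintptr_t)p + len > TAG_SAFE_LIMIT) {
		munmap(p, len);
		return NULL;	/* caller treats this as fatal, as suggested above */
	}
	return p;
}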
Re: [PATCH 00/16] mm: Introduce MAP_BELOW_HINT
* Dave Hansen [240829 12:54]: > On 8/28/24 13:15, Charlie Jenkins wrote: > > A way to restrict mmap() to return LAM compliant addresses in an entire > > address space also doesn't have to be mutually exclusive with this flag. > > This flag allows for the greatest degree of control from applications. > > I don't believe there is additionally performance saving that could be > > achieved by having this be on a per address space basis. > > I agree with you in general. The MAP_BELOW_HINT _is_ the most flexible. > But it's also rather complicated. There is a (seldom used?) feature of mmap_min_addr, it seems like we could have an mmap_max_addr. Would something like that work for your use case? Perhaps it would be less intrusive to do something in this way? I haven't looked at it in depth and this affects all address spaces as well (new allocations only). There is a note on mmap_min_addr about applications that require the lower addresses, would this mean we'll now have a note about upper limits? I really don't understand why you need this at all, to be honest. If you know the upper limit you could just MAP_FIXED map a huge guard at the top of your address space then do whatever you want with those bits. This will create an entry in the vma tree that no one else will be able to use, and you can do this in any process you want, for as many bits as you want. > > My _hope_ would be that a per-address-space property could share at > least some infrastructure with what x86/LAM and arm/TBI do to the > address space. Basically put the restrictions in place for purely > software reasons instead of the mostly hardware reasons for LAM/TBI. > > Lorenzo also raised some very valid points about a having a generic > address-restriction ABI. I'm certainly not discounting those concerns. > It's not something that can be done lightly. Yes, I am concerned about supporting this (probably forever) and dancing around special code that may cause issues, perhaps on an arch that few have for testing. I already have so many qemu images for testing, some of which no longer have valid install media - and basically none of them use the same code in this area (or have special cases already). I think you understand what we are dealing with considering your comments in your cover letter. Thanks, Liam
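A sketch of the guard-mapping approach described above: reserve everything above the highest address the process wants to see so no later allocation can land there. The limits are illustrative assumptions (the real per-arch ceiling differs), and MAP_FIXED_NOREPLACE is used here instead of plain MAP_FIXED so an existing mapping is reported rather than silently replaced:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define APP_VA_LIMIT	(1UL << 47)		/* highest address the app wants to see */
#define ASSUMED_VA_END	((1UL << 56) - 4096UL)	/* assumed user VA ceiling; arch dependent */

int main(void)
{
	void *guard = mmap((void *)APP_VA_LIMIT, ASSUMED_VA_END - APP_VA_LIMIT,
			   PROT_NONE,
			   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE |
			   MAP_FIXED_NOREPLACE, -1, 0);

	if (guard == MAP_FAILED)
		perror("guard mmap");

	/* From here on, ordinary mmap() calls cannot be placed above APP_VA_LIMIT. */
	return 0;
}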
Re: [PATCH 1/3] mm: Make arch_get_unmapped_area() take vm_flags by default
* Mark Brown [240902 15:09]: > When we introduced arch_get_unmapped_area_vmflags() in 961148704acd > ("mm: introduce arch_get_unmapped_area_vmflags()") we did so as part of > properly supporting guard pages for shadow stacks on x86_64, which uses > a custom arch_get_unmapped_area(). Equivalent features are also present > on both arm64 and RISC-V, both of which use the generic implementation > of arch_get_unmapped_area() and will require equivalent modification > there. Rather than continue to deal with having two versions of the > functions let's bite the bullet and have all implementations of > arch_get_unmapped_area() take vm_flags as a parameter. > > The new parameter is currently ignored by all implementations other than > x86. The only caller that doesn't have a vm_flags available is > mm_get_unmapped_area(), as for the x86 implementation and the wrapper used > on other architectures this is modified to supply no flags. > > No functional changes. > > Signed-off-by: Mark Brown I don't love sparc32/sparc64 requires a wide screen monitor, but it already broke the 80 char limit. Reviewed-by: Liam R. Howlett > --- > arch/alpha/kernel/osf_sys.c | 2 +- > arch/arc/mm/mmap.c| 3 ++- > arch/arm/mm/mmap.c| 7 --- > arch/csky/abiv1/mmap.c| 3 ++- > arch/loongarch/mm/mmap.c | 5 +++-- > arch/mips/mm/mmap.c | 2 +- > arch/parisc/kernel/sys_parisc.c | 5 +++-- > arch/parisc/mm/hugetlbpage.c | 2 +- > arch/powerpc/mm/book3s64/slice.c | 6 -- > arch/s390/mm/mmap.c | 4 ++-- > arch/sh/mm/mmap.c | 5 +++-- > arch/sparc/kernel/sys_sparc_32.c | 2 +- > arch/sparc/kernel/sys_sparc_64.c | 4 ++-- > arch/x86/include/asm/pgtable_64.h | 1 - > arch/x86/kernel/sys_x86_64.c | 21 +++-- > arch/xtensa/kernel/syscall.c | 3 ++- > include/linux/sched/mm.h | 23 --- > mm/mmap.c | 31 +++ > 18 files changed, 49 insertions(+), 80 deletions(-) > > diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c > index e5f881bc8288..8886ab539273 100644 > --- a/arch/alpha/kernel/osf_sys.c > +++ b/arch/alpha/kernel/osf_sys.c > @@ -1229,7 +1229,7 @@ arch_get_unmapped_area_1(unsigned long addr, unsigned > long len, > unsigned long > arch_get_unmapped_area(struct file *filp, unsigned long addr, > unsigned long len, unsigned long pgoff, > -unsigned long flags) > +unsigned long flags, vm_flags_t vm_flags) > { > unsigned long limit; > > diff --git a/arch/arc/mm/mmap.c b/arch/arc/mm/mmap.c > index 69a915297155..2185afe8d59f 100644 > --- a/arch/arc/mm/mmap.c > +++ b/arch/arc/mm/mmap.c > @@ -23,7 +23,8 @@ > */ > unsigned long > arch_get_unmapped_area(struct file *filp, unsigned long addr, > - unsigned long len, unsigned long pgoff, unsigned long flags) > + unsigned long len, unsigned long pgoff, > + unsigned long flags, vm_flags_t vm_flags) > { > struct mm_struct *mm = current->mm; > struct vm_area_struct *vma; > diff --git a/arch/arm/mm/mmap.c b/arch/arm/mm/mmap.c > index d65d0e6ed10a..3dbb383c26d5 100644 > --- a/arch/arm/mm/mmap.c > +++ b/arch/arm/mm/mmap.c > @@ -28,7 +28,8 @@ > */ > unsigned long > arch_get_unmapped_area(struct file *filp, unsigned long addr, > - unsigned long len, unsigned long pgoff, unsigned long flags) > + unsigned long len, unsigned long pgoff, > + unsigned long flags, vm_flags_t vm_flags) > { > struct mm_struct *mm = current->mm; > struct vm_area_struct *vma; > @@ -78,8 +79,8 @@ arch_get_unmapped_area(struct file *filp, unsigned long > addr, > > unsigned long > arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, > - const unsigned long len, const unsigned long pgoff, > - const unsigned long 
flags) > + const unsigned long len, const unsigned long pgoff, > + const unsigned long flags, vm_flags_t vm_flags) > { > struct vm_area_struct *vma; > struct mm_struct *mm = current->mm; > diff --git a/arch/csky/abiv1/mmap.c b/arch/csky/abiv1/mmap.c > index 7f826331d409..1047865e82a9 100644 > --- a/arch/csky/abiv1/mmap.c > +++ b/arch/csky/abiv1/mmap.c > @@ -23,7 +23,8 @@ > */ > unsigned long > arch_get_unmapped_area(struct file *filp, unsigned long addr, > - unsigned long len, unsigned long pgoff, unsigned long flags) > +
Re: [PATCH 2/3] mm: Pass vm_flags to generic_get_unmapped_area()
* Mark Brown [240902 15:09]: > In preparation for using vm_flags to ensure guard pages for shadow stacks > supply them as an argument to generic_get_unmapped_area(). The only user > outside of the core code is the PowerPC book3s64 implementation which is > trivially wrapping the generic implementation in the radix_enabled() case. > > Signed-off-by: Mark Brown It is interesting that book3s64 ppc is special in this regard. Reviewed-by: Liam R. Howlett > --- > arch/powerpc/mm/book3s64/slice.c | 4 ++-- > include/linux/sched/mm.h | 4 ++-- > mm/mmap.c| 10 ++ > 3 files changed, 10 insertions(+), 8 deletions(-) > > diff --git a/arch/powerpc/mm/book3s64/slice.c > b/arch/powerpc/mm/book3s64/slice.c > index ada6bf896ef8..87307d0fc3b8 100644 > --- a/arch/powerpc/mm/book3s64/slice.c > +++ b/arch/powerpc/mm/book3s64/slice.c > @@ -641,7 +641,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, >vm_flags_t vm_flags) > { > if (radix_enabled()) > - return generic_get_unmapped_area(filp, addr, len, pgoff, flags); > + return generic_get_unmapped_area(filp, addr, len, pgoff, flags, > vm_flags); > > return slice_get_unmapped_area(addr, len, flags, > > mm_ctx_user_psize(¤t->mm->context), 0); > @@ -655,7 +655,7 @@ unsigned long arch_get_unmapped_area_topdown(struct file > *filp, >vm_flags_t vm_flags) > { > if (radix_enabled()) > - return generic_get_unmapped_area_topdown(filp, addr0, len, > pgoff, flags); > + return generic_get_unmapped_area_topdown(filp, addr0, len, > pgoff, flags, vm_flags); > > return slice_get_unmapped_area(addr0, len, flags, > > mm_ctx_user_psize(¤t->mm->context), 1); > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h > index c4d34abc45d4..07bb8d4181d7 100644 > --- a/include/linux/sched/mm.h > +++ b/include/linux/sched/mm.h > @@ -204,11 +204,11 @@ unsigned long mm_get_unmapped_area_vmflags(struct > mm_struct *mm, > unsigned long > generic_get_unmapped_area(struct file *filp, unsigned long addr, > unsigned long len, unsigned long pgoff, > - unsigned long flags); > + unsigned long flags, vm_flags_t vm_flags); > unsigned long > generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr, > unsigned long len, unsigned long pgoff, > - unsigned long flags); > + unsigned long flags, vm_flags_t vm_flags); > #else > static inline void arch_pick_mmap_layout(struct mm_struct *mm, >struct rlimit *rlim_stack) {} > diff --git a/mm/mmap.c b/mm/mmap.c > index 7528146f886f..b06ba847c96e 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1789,7 +1789,7 @@ unsigned long vm_unmapped_area(struct > vm_unmapped_area_info *info) > unsigned long > generic_get_unmapped_area(struct file *filp, unsigned long addr, > unsigned long len, unsigned long pgoff, > - unsigned long flags) > + unsigned long flags, vm_flags_t vm_flags) > { > struct mm_struct *mm = current->mm; > struct vm_area_struct *vma, *prev; > @@ -1823,7 +1823,8 @@ arch_get_unmapped_area(struct file *filp, unsigned long > addr, > unsigned long len, unsigned long pgoff, > unsigned long flags, vm_flags_t vm_flags) > { > - return generic_get_unmapped_area(filp, addr, len, pgoff, flags); > + return generic_get_unmapped_area(filp, addr, len, pgoff, flags, > + vm_flags); > } > #endif > > @@ -1834,7 +1835,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long > addr, > unsigned long > generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr, > unsigned long len, unsigned long pgoff, > - unsigned long flags) > + unsigned long flags, vm_flags_t vm_flags) > { > struct vm_area_struct *vma, *prev; > struct mm_struct 
*mm = current->mm; > @@ -1887,7 +1888,8 @@ arch_get_unmapped_area_topdown(struct file *filp, > unsigned long addr, > unsigned long len, unsigned long pgoff, > unsigned long flags, vm_flags_t vm_flags) > { > - return generic_get_unmapped_area_topdown(filp, addr, len, pgoff, flags); > + return generic_get_unmapped_area_topdown(filp, addr, len, pgoff, flags, > + vm_flags); > } > #endif > > > -- > 2.39.2 >
Re: [PATCH 3/3] mm: Care about shadow stack guard gap when getting an unmapped area
* Mark Brown [240902 15:09]: > As covered in the commit log for c44357c2e76b ("x86/mm: care about shadow > stack guard gap during placement") our current mmap() implementation does > not take care to ensure that a new mapping isn't placed with existing > mappings inside it's own guard gaps. This is particularly important for > shadow stacks since if two shadow stacks end up getting placed adjacent to > each other then they can overflow into each other which weakens the > protection offered by the feature. > > On x86 there is a custom arch_get_unmapped_area() which was updated by the > above commit to cover this case by specifying a start_gap for allocations > with VM_SHADOW_STACK. Both arm64 and RISC-V have equivalent features and > use the generic implementation of arch_get_unmapped_area() so let's make > the equivalent change there so they also don't get shadow stack pages > placed without guard pages. > > Architectures which do not have this feature will define VM_SHADOW_STACK > to VM_NONE and hence be unaffected. > > Suggested-by: Rick Edgecombe > Signed-off-by: Mark Brown > --- > mm/mmap.c | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/mm/mmap.c b/mm/mmap.c > index b06ba847c96e..902c482b6084 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1753,6 +1753,14 @@ static unsigned long unmapped_area_topdown(struct > vm_unmapped_area_info *info) > return gap; > } > > +static inline unsigned long stack_guard_placement(vm_flags_t vm_flags) > +{ > + if (vm_flags & VM_SHADOW_STACK) > + return PAGE_SIZE; Is PAGE_SIZE is enough? > + > + return 0; > +} > + > /* > * Search for an unmapped address range. > * > @@ -1814,6 +1822,7 @@ generic_get_unmapped_area(struct file *filp, unsigned > long addr, > info.length = len; > info.low_limit = mm->mmap_base; > info.high_limit = mmap_end; > + info.start_gap = stack_guard_placement(vm_flags); > return vm_unmapped_area(&info); > } > > @@ -1863,6 +1872,7 @@ generic_get_unmapped_area_topdown(struct file *filp, > unsigned long addr, > info.length = len; > info.low_limit = PAGE_SIZE; > info.high_limit = arch_get_mmap_base(addr, mm->mmap_base); > + info.start_gap = stack_guard_placement(vm_flags); > addr = vm_unmapped_area(&info); > > /* > > -- > 2.39.2 >
Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits
* Catalin Marinas [240906 07:44]: > On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote: > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote: > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann wrote: > > >> It's also unclear to me how we want this flag to interact with > > >> the existing logic in arch_get_mmap_end(), which attempts to > > >> limit the default mapping to a 47-bit address space already. > > > > > > To optimize RISC-V progress, I recommend: > > > > > > Step 1: Approve the patch. > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it. > > > Step 3: Wait approximately several iterations for Go & OpenJDK > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end() > > > > I really want to first see a plausible explanation about why > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW > > like all the other major architectures (x86, arm64, powerpc64), > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default > configuration. We end up with a 47-bit with 16K pages but for a > different reason that has to do with LPA2 support (I doubt we need this > for the user mapping but we need to untangle some of the macros there; > that's for a separate discussion). > > That said, we haven't encountered any user space problems with a 48-bit > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar > approach (47 or 48 bit default limit). Better to have some ABI > consistency between architectures. One can still ask for addresses above > this default limit via mmap(). I think that is best as well. Can we please just do what x86 and arm64 does? Thanks, Liam
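To make the consistency point concrete: with the x86/arm64 behaviour being recommended here, an application only receives addresses above the default window when it explicitly asks for them via the hint. A small illustration (the 2^50 hint value is arbitrary):

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 20;

	/* No hint: the kernel keeps this below the 47/48-bit default window. */
	void *low = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* A hint above the window opts this one mapping in to the larger
	 * address space on kernels/configs that support it. */
	void *high = mmap((void *)(1UL << 50), len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	printf("low=%p high=%p\n", low, high);
	return 0;
}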
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
* Steven Price [241021 09:23]: > On 09/09/2024 10:46, Kirill A. Shutemov wrote: > > On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote: > >> On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote: > >>> On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote: > Some applications rely on placing data in free bits addresses allocated > by mmap. Various architectures (eg. x86, arm64, powerpc) restrict the > address returned by mmap to be less than the 48-bit address space, > unless the hint address uses more than 47 bits (the 48th bit is reserved > for the kernel address space). > > The riscv architecture needs a way to similarly restrict the virtual > address space. On the riscv port of OpenJDK an error is thrown if > attempted to run on the 57-bit address space, called sv57 [1]. golang > has a comment that sv57 support is not complete, but there are some > workarounds to get it to mostly work [2]. > > > > I also saw libmozjs crashing with 57-bit address space on x86. > > > These applications work on x86 because x86 does an implicit 47-bit > restriction of mmap() address that contain a hint address that is less > than 48 bits. > > Instead of implicitly restricting the address space on riscv (or any > current/future architecture), a flag would allow users to opt-in to this > behavior rather than opt-out as is done on other architectures. This is > desirable because it is a small class of applications that do pointer > masking. > > > > You reiterate the argument about "small class of applications". But it > > makes no sense to me. > > Sorry to chime in late on this - I had been considering implementing > something like MAP_BELOW_HINT and found this thread. > > While the examples of applications that want to use high VA bits and get > bitten by future upgrades is not very persuasive. It's worth pointing > out that there are a variety of somewhat horrid hacks out there to work > around this feature not existing. > > E.g. from my brief research into other code: > > * Box64 seems to have a custom allocator based on reading > /proc/self/maps to allocate a block of VA space with a low enough > address [1] > > * PHP has code reading /proc/self/maps - I think this is to find a > segment which is close enough to the text segment [2] > > * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit > addresses [3][4] Can't the limited number of applications that need to restrict the upper bound use an LD_PRELOAD compatible library to do this? > > * pmdk has some funky code to find the lowest address that meets > certain requirements - this does look like an ALSR alternative and > probably couldn't directly use MAP_BELOW_HINT, although maybe this > suggests we need a mechanism to map without a VA-range? [5] > > * MIT-Scheme parses /proc/self/maps to find the lowest mapping within > a range [6] > > * LuaJIT uses an approach to 'probe' to find a suitable low address > for allocation [7] > Although I did not take a deep dive into each example above, there are some very odd things being done, we will never cover all the use cases with an exact API match. What we have today can be made to work for these users as they have figured ways to do it. Are they pretty? no. Are they common? no. I'm not sure it's worth plumbing in new MM code in for these users. > The biggest benefit I see of MAP_BELOW_HINT is that it would allow a > library to get low addresses without causing any problems for the rest > of the application. 
The use case I'm looking at is in a library and > therefore a personality mode wouldn't be appropriate (because I don't > want to affect the rest of the application). Reading /proc/self/maps > is also problematic because other threads could be allocating/freeing > at the same time. As long as you don't exhaust the lower limit you are trying to allocate within - which is exactly the issue riscv is hitting. I understand that you are providing examples to prove that this is needed, but I feel like you are better demonstrating the flexibility exists to implement solutions in different ways using todays API. I think it would be best to use the existing methods and work around the issue that was created in riscv while future changes could mirror amd64 and arm64. ... > > > [1] https://sources.debian.org/src/box64/0.3.0+dfsg-1/src/custommem.c/ > [2] > https://sources.debian.org/src/php8.2/8.2.24-1/ext/opcache/shared_alloc_mmap.c/#L62 > [3] > https://github.com/FEX-Emu/FEX/blob/main/FEXCore/Source/Utils/Allocator.cpp > [4] > https://github.com/FEX-Emu/FEX/commit/df2f1ad074e5cdfb19a0bd4639b7604f777fb05c > [5] > https://sources.debian.org/src/pmdk/1.13.1-1.1/src/common/mmap_posix.c/?hl=29#L29 > [6] https://sources.debian.org/src/mit-scheme/12.1-3/src/microcode/ux.c/#L826 > [7] > https://sources.debi
Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT
* Steven Price [241023 05:31]: > >> * Box64 seems to have a custom allocator based on reading > >> /proc/self/maps to allocate a block of VA space with a low enough > >> address [1] > >> > >> * PHP has code reading /proc/self/maps - I think this is to find a > >> segment which is close enough to the text segment [2] > >> > >> * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit > >> addresses [3][4] > > > > Can't the limited number of applications that need to restrict the upper > > bound use an LD_PRELOAD compatible library to do this? > > I'm not entirely sure what point you are making here. Yes an LD_PRELOAD > approach could be used instead of a personality type as a 'hack' to > preallocate the upper address space. The obvious disadvantage is that > you can't (easily) layer LD_PRELOAD so it won't work in the general case. My point is that riscv could work around the limited number of applications that requires this. It's not really viable for you. > > >> > >> * pmdk has some funky code to find the lowest address that meets > >> certain requirements - this does look like an ALSR alternative and > >> probably couldn't directly use MAP_BELOW_HINT, although maybe this > >> suggests we need a mechanism to map without a VA-range? [5] > >> > >> * MIT-Scheme parses /proc/self/maps to find the lowest mapping within > >> a range [6] > >> > >> * LuaJIT uses an approach to 'probe' to find a suitable low address > >> for allocation [7] > >> > > > > Although I did not take a deep dive into each example above, there are > > some very odd things being done, we will never cover all the use cases > > with an exact API match. What we have today can be made to work for > > these users as they have figured ways to do it. > > > > Are they pretty? no. Are they common? no. I'm not sure it's worth > > plumbing in new MM code in for these users. > > My issue with the existing 'solutions' is that they all seem to be fragile: > > * Using /proc/self/maps is inherently racy if there could be any other > code running in the process at the same time. Yes, it is not thread safe. Parsing text is also undesirable. > > * Attempting to map the upper part of the address space only works if > done early enough - once an allocation arrives there, there's very > little you can robustly do (because the stray allocation might be freed). > > * LuaJIT's probing mechanism is probably robust, but it's inefficient - > LuaJIT has a fallback of linear probing, following by no hint (ASLR), > followed by pseudo-random probing. I don't know the history of the code > but it looks like it's probably been tweaked to try to avoid performance > issues. > > >> The biggest benefit I see of MAP_BELOW_HINT is that it would allow a > >> library to get low addresses without causing any problems for the rest > >> of the application. The use case I'm looking at is in a library and > >> therefore a personality mode wouldn't be appropriate (because I don't > >> want to affect the rest of the application). Reading /proc/self/maps > >> is also problematic because other threads could be allocating/freeing > >> at the same time. > > > > As long as you don't exhaust the lower limit you are trying to allocate > > within - which is exactly the issue riscv is hitting. > > Obviously if you actually exhaust the lower limit then any > MAP_BELOW_HINT API would also fail - there's really not much that can be > done in that case. 
Today we reverse the search, so you end up at a higher address (bottom-up vs top-down) - although the direction is arch dependent. If the allocation is too high/low then you could detect, free, and handle the failure. > > > I understand that you are providing examples to prove that this is > > needed, but I feel like you are better demonstrating the flexibility > > exists to implement solutions in different ways using todays API. > > My intention is to show that today's API doesn't provide a robust way of > doing this. Although I'm quite happy if you can point me at a robust way > with the current API. As I mentioned my goal is to be able to map memory > in a (multithreaded) library with a (ideally configurable) number of VA > bits. I don't particularly want to restrict the whole process, just > specific allocations. If you don't need to restrict everything, won't the hint work for your use case? I must be missing something from your requirements. > > I had typed up a series similar to this one as a MAP_BELOW flag would > fit my use-case well. > > > I think it would be best to use the existing methods and work around the > > issue that was created in riscv while future changes could mirror amd64 > > and arm64. > > The riscv issue is a different issue to the one I'm trying to solve. I > agree MAP_BELOW_HINT isn't a great fix for that because we already have > differences between amd64 and arm64 and obviously no software currently > out there uses this
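The "use the hint, then detect, free, and retry" workaround discussed above can be sketched in userspace roughly as follows. The 39-bit upper bound, the probe step, and the map_below_limit() helper are assumptions made purely for this illustration; none of this is part of a proposed MAP_BELOW_HINT API.

/*
 * Minimal sketch: request a low mapping via the mmap() hint and keep
 * probing if the kernel places the mapping above the assumed limit.
 */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define VA_LIMIT   (1UL << 39)  /* assumed upper bound for this example */
#define PROBE_STEP (1UL << 30)  /* assumed probing granularity */

static void *map_below_limit(size_t len)
{
        uintptr_t hint;

        for (hint = PROBE_STEP; hint + len <= VA_LIMIT; hint += PROBE_STEP) {
                void *p = mmap((void *)hint, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED)
                        continue;
                if ((uintptr_t)p + len <= VA_LIMIT)
                        return p;       /* low enough - keep it */
                munmap(p, len);         /* too high - free and keep probing */
        }
        return NULL;
}

int main(void)
{
        void *p = map_below_limit(1UL << 20);

        printf("mapped at %p\n", p);
        return p ? 0 : 1;
}

As noted in the thread, this only works while the low range is not exhausted; the detect-and-free step simply discards placements the kernel chose above the limit.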
Re: [PATCH v3 7/8] execmem: add support for cache of large ROX pages
* Mike Rapoport [240909 02:49]: > From: "Mike Rapoport (Microsoft)" > > Using large pages to map text areas reduces iTLB pressure and improves > performance. > > Extend execmem_alloc() with an ability to use huge pages with ROX > permissions as a cache for smaller allocations. > > To populate the cache, a writable large page is allocated from vmalloc with > VM_ALLOW_HUGE_VMAP, filled with invalid instructions and then remapped as > ROX. > > Portions of that large page are handed out to execmem_alloc() callers > without any changes to the permissions. > > When the memory is freed with execmem_free() it is invalidated again so > that it won't contain stale instructions. > > The cache is enabled when an architecture sets EXECMEM_ROX_CACHE flag in > definition of an execmem_range. I am not sure you need to convert to xa entries. > > Signed-off-by: Mike Rapoport (Microsoft) > --- > include/linux/execmem.h | 2 + > mm/execmem.c| 289 +++- > 2 files changed, 286 insertions(+), 5 deletions(-) > > diff --git a/include/linux/execmem.h b/include/linux/execmem.h > index dfdf19f8a5e8..7436aa547818 100644 > --- a/include/linux/execmem.h > +++ b/include/linux/execmem.h > @@ -77,12 +77,14 @@ struct execmem_range { > > /** > * struct execmem_info - architecture parameters for code allocations > + * @fill_trapping_insns: set memory to contain instructions that will trap > * @ranges: array of parameter sets defining architecture specific > * parameters for executable memory allocations. The ranges that are not > * explicitly initialized by an architecture use parameters defined for > * @EXECMEM_DEFAULT. > */ > struct execmem_info { > + void (*fill_trapping_insns)(void *ptr, size_t size, bool writable); > struct execmem_rangeranges[EXECMEM_TYPE_MAX]; > }; > > diff --git a/mm/execmem.c b/mm/execmem.c > index 0f6691e9ffe6..f547c1f3c93d 100644 > --- a/mm/execmem.c > +++ b/mm/execmem.c > @@ -7,28 +7,88 @@ > */ > > #include > +#include > #include > #include > +#include > #include > #include > > +#include > + > +#include "internal.h" > + > static struct execmem_info *execmem_info __ro_after_init; > static struct execmem_info default_execmem_info __ro_after_init; > > -static void *__execmem_alloc(struct execmem_range *range, size_t size) > +#ifdef CONFIG_MMU > +struct execmem_cache { > + struct mutex mutex; > + struct maple_tree busy_areas; > + struct maple_tree free_areas; > +}; > + > +static struct execmem_cache execmem_cache = { > + .mutex = __MUTEX_INITIALIZER(execmem_cache.mutex), > + .busy_areas = MTREE_INIT_EXT(busy_areas, MT_FLAGS_LOCK_EXTERN, > + execmem_cache.mutex), > + .free_areas = MTREE_INIT_EXT(free_areas, MT_FLAGS_LOCK_EXTERN, > + execmem_cache.mutex), > +}; > + > +static void execmem_cache_clean(struct work_struct *work) > +{ > + struct maple_tree *free_areas = &execmem_cache.free_areas; > + struct mutex *mutex = &execmem_cache.mutex; > + MA_STATE(mas, free_areas, 0, ULONG_MAX); > + void *area; > + > + mutex_lock(mutex); > + mas_for_each(&mas, area, ULONG_MAX) { > + size_t size; > + > + if (!xa_is_value(area)) > + continue; > + > + size = xa_to_value(area); > + > + if (IS_ALIGNED(size, PMD_SIZE) && > + IS_ALIGNED(mas.index, PMD_SIZE)) { > + void *ptr = (void *)mas.index; If you store this pointer then it would be much nicer. > + > + mas_erase(&mas); mas_store_gfp() would probably be better here to store a null. 
> + vfree(ptr); > + } > + } > + mutex_unlock(mutex); > +} > + > +static DECLARE_WORK(execmem_cache_clean_work, execmem_cache_clean); > + > +static void execmem_fill_trapping_insns(void *ptr, size_t size, bool > writable) > +{ > + if (execmem_info->fill_trapping_insns) > + execmem_info->fill_trapping_insns(ptr, size, writable); > + else > + memset(ptr, 0, size); > +} > + > +static void *execmem_vmalloc(struct execmem_range *range, size_t size, > + pgprot_t pgprot, unsigned long vm_flags) > { > bool kasan = range->flags & EXECMEM_KASAN_SHADOW; > - unsigned long vm_flags = VM_FLUSH_RESET_PERMS; > gfp_t gfp_flags = GFP_KERNEL | __GFP_NOWARN; > + unsigned int align = range->alignment; > unsigned long start = range->start; > unsigned long end = range->end; > - unsigned int align = range->alignment; > - pgprot_t pgprot = range->pgprot; > void *p; > > if (kasan) > vm_flags |= VM_DEFER_KMEMLEAK; > > + if (vm_flags & VM_ALLOW_HUGE_VMAP) > + align = PMD_SIZE; > + > p = __vmalloc_node_range(size, align, start, end, gfp_flags, >pgprot, vm_flags, N
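To make the two review suggestions above concrete, the cleanup loop might end up looking roughly like this: the free_areas tree stores the area pointer itself (so the size falls out of the stored range rather than an encoded xa value), and a NULL is written with mas_store_gfp() instead of calling mas_erase(). This is only a sketch against the structures quoted above, not the posted patch; it assumes the stored pointer equals the start of its range and that GFP_KERNEL is acceptable in this workqueue context.

static void execmem_cache_clean(struct work_struct *work)
{
        struct maple_tree *free_areas = &execmem_cache.free_areas;
        struct mutex *mutex = &execmem_cache.mutex;
        MA_STATE(mas, free_areas, 0, ULONG_MAX);
        void *area;

        mutex_lock(mutex);
        mas_for_each(&mas, area, ULONG_MAX) {
                /* The stored range itself gives the size; no xa value encoding. */
                size_t size = mas.last - mas.index + 1;

                if (IS_ALIGNED(size, PMD_SIZE) &&
                    IS_ALIGNED(mas.index, PMD_SIZE)) {
                        /* Clear the range by storing NULL, then free the area. */
                        if (mas_store_gfp(&mas, NULL, GFP_KERNEL))
                                continue;
                        vfree(area);
                }
        }
        mutex_unlock(mutex);
}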
Re: [PATCH] xarray: port tests to kunit
* Geert Uytterhoeven [250130 08:26]: > Hi Liam, > > On Thu, 30 Jan 2025 at 13:52, Liam R. Howlett wrote: > > * Geert Uytterhoeven [250130 03:21]: > > > On Wed, 29 Jan 2025 at 23:26, Liam R. Howlett > > > wrote: > > > > I've never used the kunit testing of xarray and have used the userspace > > > > testing instead, so I can't speak to the obscure invocation as both > > > > commands seem insanely long and obscure to me. > > > > > > The long and obscure command line is a red herring: a simple > > > "modprobe test_xarray" is all it takes... > > > > That command worked before too... > > Exactly, great! > > > > > You should look at the userspace testing (that this broke) as it has > > > > been really useful in certain scenarios. > > > > > > BTW, how do I even build tools/testing/radix-tree? > > > "make tools/help" doesn't show the radix-tree test. > > > "make tools/all" doesn't seem to try to build it. > > > Same for "make kselftest-all". > > > > make > > Where? > > > BTW, how do I even build tools/testing/radix-tree? ^^^ > > > Or look at the make file and stop guessing. Considering how difficult > > There is no Makefile referencing tools/testing/radix-tree or the > radix-tree subdir. That's why I asked... > > Oh, I am supposed to run make in tools/testing/radix-tree/? > What a surprise! > > Which is a pain when building in a separate output directory, as you > cannot just do "make -C tools/testing/radix-tree" there, but have to > type the full "make -C tools/testing/radix-tree O=..." (and optionally > ARCH=... and CROSS_COMPILE=...; oh wait, these are ignored :-( in the > source directory instead... I'll await your patch to link all this together. Please Cc the authors. > > If these tests are not integrated into the normal build system (see > also [1]), I am not so surprised the auto-builders don't build them, > and breakages are introduced... > > > it is to get m68k to build, you should probably know how to read a > > makefile. > > Like all other kernel cross-compilation? Usually you don't even have > to know where your cross-compiler is living: > > make ARCH=m68k Ignoring that I had to make a config - which asked challenging questions... And ignoring the steps to get m68k compiler... > > > When trying the above, and ignoring failures due to missing packages > > > on my host: > > > - there are several weird build errors, > > > - this doesn't play well with O=, > > > - lots of scary warnings when building for 32-bit, > > > - ... > > > In file included from ./include/linux/sched.h:12, from arch/m68k/kernel/asm-offsets.c:15: ./arch/m68k/include/asm/current.h:7:30: error: invalid register name for ‘current’ 7 | register struct task_struct *current __asm__("%a2"); > > > > At least the kunit tests build (and run[1] ;-) most of the time... > > > > Do they? How about you break something in xarray and then try to boot > > the kunit, or try to boot to load that module. > > If you break the kernel beyond the point of booting, you can indeed > not run any test modules... Which is extremely easy when you are changing code that runs so early in the boot. My code found a compiler issue because it's the first function that returns a boolean. This is stupid. > > Which does _not_ mean the userspace tests are not useful, and that I > approve breaking the userspace tests... Perfect, let's revert the patch then. This is such a waste of time.
Re: [PATCH] xarray: port tests to kunit
* Geert Uytterhoeven [250130 09:25]: > Hi Liam, Hi Geert, I'd like to say sorry for getting upset about this. > > On Thu, 30 Jan 2025 at 15:06, Liam R. Howlett wrote: > > > > I'll await your patch to link all this together. Please Cc the authors. > > I gave it a try for kselftests a few years ago. > https://lore.kernel.org/all/20190114135144.26096-1-geert+rene...@glider.be > Unfortunately only one patch was applied... It is difficult to integrate the test framework due to the nature of it stubbing out a lot of the kernel code it uses. This is also a strength as it can be used to test unlikely failure scenarios that are difficult or impossible to recreate in kunit or a running kernel - even with injections of failures. It can also be used to isolate issues for recreating, which is very valuable in larger, longer running issues. I also use the userspace testing to test rcu using threads and I think that would be even more challenging on a booted system. > > > > > it is to get m68k to build, you should probably know how to read a > > > > makefile. > > > > > > Like all other kernel cross-compilation? Usually you don't even have > > > to know where your cross-compiler is living: > > > > > > make ARCH=m68k > > > > Ignoring that I had to make a config - which asked challenging > > questions... > > make ARCH=m68k defconfig That also prompts, defoldconfig did not. > > > And ignoring the steps to get m68k compiler... > > apt install gcc-m68k-linux-gnu? There are a few compilers, multilib or such? I've had issues with getting all the archs working for cross compile on the same machine (arm, arm64, riscv, m68k, ppc, ppc64, parisc). > > > > > > When trying the above, and ignoring failures due to missing packages > > > > > on my host: > > > > > - there are several weird build errors, > > > > > - this doesn't play well with O=, > > > > > - lots of scary warnings when building for 32-bit, > > > > > - ... > > > > > > > > > In file included from ./include/linux/sched.h:12, > > from arch/m68k/kernel/asm-offsets.c:15: > > ./arch/m68k/include/asm/current.h:7:30: error: invalid register name for > > ‘current’ > > 7 | register struct task_struct *current __asm__("%a2"); > > Which compiler are you using? I've had a hard time getting m68k to boot in qemu because of the lack of userspace. I use m68k for nommu testing, but have a hard time getting the buildroot to work correctly to build what I need. This was a grumpy, pre-coffee way of saying that some tasks are not straight forward and are extremely difficult to make straight forward. Sorry, I should not have made such rash comments. I respect you and your work and appreciate the help you have given me in the past. I would also like to thank you for your earlier statements. After rereading them, I believe I misunderstood what you were saying. I've been trying to make these tests a part of automatic testing somehow, even getting them to run in userspace. > > > > > > At least the kunit tests build (and run[1] ;-) most of the time... > > > > > > > > Do they? How about you break something in xarray and then try to boot > > > > the kunit, or try to boot to load that module. > > > > > > If you break the kernel beyond the point of booting, you can indeed > > > not run any test modules... > > > > Which is extremely easy when you are changing code that runs so early in > > the boot. > > > > My code found a compiler issue because it's the first function that > > returns a boolean. This is stupid. > > Sorry. I don't understand this comment. 
I had a bug a few years ago that turned out to be a clang compiler issue with booleans. It turns out my code was the first function to return a boolean and that wasn't being handled properly by the compiler (thanks to a mitigation that clears registers under certain .config/compiler flags). It's not important. More importantly, I think I get your point: you think the testing should be integrated and should complain if it's broken - at least by bots. I don't think this is practical in all cases, unfortunately. Although I would like to strive for this - and I do, by keeping any tests I can as a kernel module while keeping the harder-to-recreate issues in user space. I think we do a good job keeping testing up to date and adding test cases as new issues are discovered. Thanks, Liam
Re: [PATCH] xarray: port tests to kunit
* Geert Uytterhoeven [250130 03:21]: > Hi Liam, > > On Wed, 29 Jan 2025 at 23:26, Liam R. Howlett wrote: > > I've never used the kunit testing of xarray and have used the userspace > > testing instead, so I can't speak to the obscure invocation as both > > commands seem insanely long and obscure to me. > > The long and obscure command line is a red herring: a simple > "modprobe test_xarray" is all it takes... That command worked before too... > > > You should look at the userspace testing (that this broke) as it has > > been really useful in certain scenarios. > > BTW, how do I even build tools/testing/radix-tree? > "make tools/help" doesn't show the radix-tree test. > "make tools/all" doesn't seem to try to build it. > Same for "make kselftest-all". make Or look at the make file and stop guessing. Considering how difficult it is to get m68k to build, you should probably know how to read a makefile. > When trying the above, and ignoring failures due to missing packages > on my host: > - there are several weird build errors, > - this doesn't play well with O=, > - lots of scary warnings when building for 32-bit, > - ... > > At least the kunit tests build (and run[1] ;-) most of the time... Do they? How about you break something in xarray and then try to boot the kunit, or try to boot to load that module. This exchange is a good argument against doing any kunit work on my tests. > > [1] test_xarray started failing on m68k recently > > https://lore.kernel.org/all/CAMuHMdU_bfadUO=0OZ=aoq9eamqpa4wslcbqohxr+qceckr...@mail.gmail.com/ > > Gr{oetje,eeting}s, > > Geert > > > -- > Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- > ge...@linux-m68k.org > > In personal conversations with technical people, I call myself a hacker. But > when I'm talking to journalists I just say "programmer" or something like > that. > -- Linus Torvalds
Re: [PATCH mm-unstable v2 06/16] mm: csky: Introduce arch_mmap_hint()
* Kalesh Singh [241211 18:28]: > Introduce csky arch_mmap_hint() and define HAVE_ARCH_MMAP_HINT. > This is a preparatory patch, no functional change is introduced. This also looks like it has changed the validation order and potentially introduced functional changes? All these stem from the same cloned code (sparc32 iirc), but were not updated when the cloned code was updated. This is why I am against arch_* code. We should find a better way to unify the code so that there is nothing different. You seem to have gotten some of the shared code together, but some still exists. In the addresses, there are upper and lower limits, and sometimes "colours". Could we not just define the upper/lower limits in each arch and if colour is used? Maybe this is complicated with 32/64 handled both in the 64 bit code. Is there any plan to unite this code further? We have had errors for many years in cloned but not updated code. I really wish there was more information in the cover letter on what is going on here. I'd like to try and reduce the arch_ code to, basically nothing. I was also disappointed that I wasn't Cc'ed because I've spent a lot of time in this code and this area. I am probably the last one to crawl through and change any of this. > > Signed-off-by: Kalesh Singh > --- > > Changes in v2: > - MAP_FIXED case is also handled in arch_mmap_hint() since this is just a > special case of the hint addr being "enforced", per Yang Shi. > - Consolidate error handling in arch_mmap_hint(). > > arch/csky/abiv1/inc/abi/pgtable-bits.h | 1 + > arch/csky/abiv1/mmap.c | 68 ++ > 2 files changed, 38 insertions(+), 31 deletions(-) > > diff --git a/arch/csky/abiv1/inc/abi/pgtable-bits.h > b/arch/csky/abiv1/inc/abi/pgtable-bits.h > index ae7a2f76dd42..c346a9fcb522 100644 > --- a/arch/csky/abiv1/inc/abi/pgtable-bits.h > +++ b/arch/csky/abiv1/inc/abi/pgtable-bits.h > @@ -51,5 +51,6 @@ > ((offset) << 10)}) > > #define HAVE_ARCH_UNMAPPED_AREA > +#define HAVE_ARCH_MMAP_HINT > > #endif /* __ASM_CSKY_PGTABLE_BITS_H */ > diff --git a/arch/csky/abiv1/mmap.c b/arch/csky/abiv1/mmap.c > index 1047865e82a9..0c5c51a081e4 100644 > --- a/arch/csky/abiv1/mmap.c > +++ b/arch/csky/abiv1/mmap.c > @@ -13,6 +13,39 @@ > addr)+SHMLBA-1)&~(SHMLBA-1)) + \ >(((pgoff)< > +unsigned long arch_mmap_hint(struct file *filp, unsigned long addr, > + unsigned long len, unsigned long pgoff, > + unsigned long flags) > +{ > + bool do_align; > + > + if (len > TASK_SIZE) > + return -ENOMEM; > + > + /* > + * We only need to do colour alignment if either the I or D > + * caches alias. > + */ > + do_align = filp || (flags & MAP_SHARED); > + > + /* > + * We enforce the MAP_FIXED case. > + */ > + if (flags & MAP_FIXED) { > + if (flags & MAP_SHARED && > + (addr - (pgoff << PAGE_SHIFT)) & (SHMLBA - 1)) > + return -EINVAL; > + return addr; > + } > + > + if (do_align) > + addr = COLOUR_ALIGN(addr, pgoff); > + else > + addr = PAGE_ALIGN(addr); > + > + return generic_mmap_hint(filp, addr, len, pgoff, flags); > +} > + > /* > * We need to ensure that shared mappings are correctly aligned to > * avoid aliasing issues with VIPT caches. 
We need to ensure that > @@ -27,8 +60,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long > addr, > unsigned long flags, vm_flags_t vm_flags) > { > struct mm_struct *mm = current->mm; > - struct vm_area_struct *vma; > - int do_align = 0; > + bool do_align; > struct vm_unmapped_area_info info = { > .length = len, > .low_limit = mm->mmap_base, > @@ -36,37 +68,11 @@ arch_get_unmapped_area(struct file *filp, unsigned long > addr, > .align_offset = pgoff << PAGE_SHIFT > }; > > - /* > - * We only need to do colour alignment if either the I or D > - * caches alias. > - */ This seems like useful information to keep around? > - do_align = filp || (flags & MAP_SHARED); > - > - /* > - * We enforce the MAP_FIXED case. > - */ > - if (flags & MAP_FIXED) { > - if (flags & MAP_SHARED && > - (addr - (pgoff << PAGE_SHIFT)) & (SHMLBA - 1)) > - return -EINVAL; > + addr = arch_mmap_hint(filp, addr, len, pgoff, flags); > + if (addr) > return addr; > - } > - > - if (len > TASK_SIZE) > - return -ENOMEM; > - > - if (addr) { > - if (do_align) > - addr = COLOUR_ALIGN(addr, pgoff); > - else > - addr = PAGE_ALIGN(addr); > - > - vma = find_vma(mm, addr); > - if (TASK_SIZE - len >= addr && > - (!vma
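To illustrate the direction floated above - each architecture declaring only its limits and whether cache colouring applies, with a single shared implementation - a hypothetical shape could look like the following. The struct and function names are invented for this sketch and are not part of the series under review; it reuses SHMLBA/COLOUR_ALIGN from the quoted code and the generic_mmap_hint() helper the series proposes, and a real version would still need to handle topdown layouts and 32-bit tasks on 64-bit kernels.

struct mmap_arch_params {
        unsigned long task_size;        /* upper bound for a hinted mapping */
        bool colour_align;              /* VIPT aliasing requires SHMLBA alignment */
};

static unsigned long mmap_hint_common(const struct mmap_arch_params *p,
                                      struct file *filp, unsigned long addr,
                                      unsigned long len, unsigned long pgoff,
                                      unsigned long flags)
{
        bool do_align = p->colour_align && (filp || (flags & MAP_SHARED));

        if (len > p->task_size)
                return -ENOMEM;

        /* MAP_FIXED is enforced, but shared mappings must still be colour aligned. */
        if (flags & MAP_FIXED) {
                if (p->colour_align && (flags & MAP_SHARED) &&
                    ((addr - (pgoff << PAGE_SHIFT)) & (SHMLBA - 1)))
                        return -EINVAL;
                return addr;
        }

        addr = do_align ? COLOUR_ALIGN(addr, pgoff) : PAGE_ALIGN(addr);

        return generic_mmap_hint(filp, addr, len, pgoff, flags);
}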
Re: [PATCH mm-unstable v2 05/16] mm: arc: Use generic_mmap_hint()
* Kalesh Singh [241211 18:28]: > Introduce arc arch_mmap_hint() and define HAVE_ARCH_MMAP_HINT. > This is a preparatory patch, no functional change is introduced. You have changed the order of the map fixed check and the len check. I *think* what you have done is correct, but your comment is certainly wrong. Your generic call also has more checks than exist in this version of the code - which, again is probably good, but a functional change surely? > > Signed-off-by: Kalesh Singh > --- > > Changes in v2: > - MAP_FIXED case is also handled in arch_mmap_hint() since this is just a > special case of the hint addr being "enforced", per Yang Shi. > - Consolidate error handling in arch_mmap_hint(). > > arch/arc/include/asm/pgtable.h | 1 + > arch/arc/mm/mmap.c | 43 +- > 2 files changed, 23 insertions(+), 21 deletions(-) > > diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h > index 4cf45a99fd79..af3210ea4888 100644 > --- a/arch/arc/include/asm/pgtable.h > +++ b/arch/arc/include/asm/pgtable.h > @@ -28,6 +28,7 @@ extern pgd_t swapper_pg_dir[] __aligned(PAGE_SIZE); > > /* to cope with aliasing VIPT cache */ > #define HAVE_ARCH_UNMAPPED_AREA > +#define HAVE_ARCH_MMAP_HINT > > #endif /* __ASSEMBLY__ */ > > diff --git a/arch/arc/mm/mmap.c b/arch/arc/mm/mmap.c > index 2185afe8d59f..df01d4d9964b 100644 > --- a/arch/arc/mm/mmap.c > +++ b/arch/arc/mm/mmap.c > @@ -14,6 +14,26 @@ > > #include > > +unsigned long arch_mmap_hint(struct file *filp, unsigned long addr, > + unsigned long len, unsigned long pgoff, > + unsigned long flags) > +{ > + if (len > TASK_SIZE) > + return -ENOMEM; > + > + /* > + * We enforce the MAP_FIXED case. > + */ > + if (flags & MAP_FIXED) { > + if (flags & MAP_SHARED && > + (addr - (pgoff << PAGE_SHIFT)) & (SHMLBA - 1)) > + return -EINVAL; > + return addr; > + } > + > + return generic_mmap_hint(filp, addr, len, pgoff, flags); > +} > + > /* > * Ensure that shared mappings are correctly aligned to > * avoid aliasing issues with VIPT caches. > @@ -27,30 +47,11 @@ arch_get_unmapped_area(struct file *filp, unsigned long > addr, > unsigned long flags, vm_flags_t vm_flags) > { > struct mm_struct *mm = current->mm; > - struct vm_area_struct *vma; > struct vm_unmapped_area_info info = {}; > > - /* > - * We enforce the MAP_FIXED case. > - */ > - if (flags & MAP_FIXED) { > - if (flags & MAP_SHARED && > - (addr - (pgoff << PAGE_SHIFT)) & (SHMLBA - 1)) > - return -EINVAL; > + addr = arch_mmap_hint(filp, addr, len, pgoff, flags); > + if (addr) > return addr; > - } > - > - if (len > TASK_SIZE) > - return -ENOMEM; > - > - if (addr) { > - addr = PAGE_ALIGN(addr); > - > - vma = find_vma(mm, addr); > - if (TASK_SIZE - len >= addr && > - (!vma || addr + len <= vm_start_gap(vma))) > - return addr; > - } > > info.length = len; > info.low_limit = mm->mmap_base; > -- > 2.47.0.338.g60cca15819-goog > >
Re: [PATCH mm-unstable v2 00/16] mm: Introduce arch_mmap_hint()
+ Lorenzo Can you please Cc the people listed in the maintainers on the files you are submitting against? You seemed to Cc everyone but the mmap.c file maintainers? * Kalesh Singh [241211 18:28]: > Hi all, > > This is v2 othe the arch_mmap_hint() series. > > Changes in v2: > - MAP_FIXED case is also handled in arch_mmap_hint() since this is just a > special case of the hint addr being "enforced", per Yang Shi. > - Consolidate most of the error handling in arch_mmap_hint(). > - Patch 16 ("mm: Fallback to generic_mmap_hint()") was folded into > Patch 2 ("mm: x86: Introduce arch_mmap_hint()") > > v1: https://lore.kernel.org/r/20241210024119.2488608-1-kaleshsi...@google.com/ > > === > > This series introduces arch_mmap_hint() to handle allocating VA space > for the hint address. Why? Could we get more details in your cover letter please? This entire email has as much detail as the subject line. I don't want more arch_ anything. If we can do this in a more generic way, then we should. > > Patches 1-16 introduce this new helper and Patch 17 uses it to fix the > issue of mmap hint being ignored in some cases due to THP alignment [1] > > [1] https://lore.kernel.org/r/20241118214650.3667577-1-kaleshsi...@google.com/ > > Thanks, > Kalesh > > > Kalesh Singh (16): > mm: Introduce generic_mmap_hint() > mm: x86: Introduce arch_mmap_hint() > mm: arm: Introduce arch_mmap_hint() > mm: alpha: Introduce arch_mmap_hint() > mm: arc: Use generic_mmap_hint() > mm: csky: Introduce arch_mmap_hint() > mm: loongarch: Introduce arch_mmap_hint() > mm: mips: Introduce arch_align_mmap_hint() > mm: parisc: Introduce arch_align_mmap_hint() > mm: s390: Use generic_mmap_hint() > mm: sh: Introduce arch_mmap_hint() > mm: sparc32: Introduce arch_mmap_hint() > mm: sparc64: Introduce arch_mmap_hint() > mm: xtensa: Introduce arch_mmap_hint() > mm: powerpc: Introduce arch_mmap_hint() > mm: Respect mmap hint before THP alignment if allocation is possible > > arch/alpha/include/asm/pgtable.h | 1 + > arch/alpha/kernel/osf_sys.c| 31 +++--- > arch/arc/include/asm/pgtable.h | 1 + > arch/arc/mm/mmap.c | 43 + > arch/arm/include/asm/pgtable.h | 1 + > arch/arm/mm/mmap.c | 107 + > arch/csky/abiv1/inc/abi/pgtable-bits.h | 1 + > arch/csky/abiv1/mmap.c | 68 +++-- > arch/loongarch/include/asm/pgtable.h | 1 + > arch/loongarch/mm/mmap.c | 49 +- > arch/mips/include/asm/pgtable.h| 1 + > arch/mips/mm/mmap.c| 50 +- > arch/parisc/include/asm/pgtable.h | 1 + > arch/parisc/kernel/sys_parisc.c| 53 +- > arch/powerpc/include/asm/book3s/64/slice.h | 1 + > arch/powerpc/mm/book3s64/slice.c | 31 ++ > arch/s390/include/asm/pgtable.h| 1 + > arch/s390/mm/mmap.c| 51 +- > arch/sh/include/asm/pgtable.h | 1 + > arch/sh/mm/mmap.c | 83 ++-- > arch/sparc/include/asm/pgtable_32.h| 1 + > arch/sparc/include/asm/pgtable_64.h| 1 + > arch/sparc/kernel/sys_sparc_32.c | 33 --- > arch/sparc/kernel/sys_sparc_64.c | 96 +++--- > arch/x86/include/asm/pgtable_64.h | 1 + > arch/x86/kernel/sys_x86_64.c | 64 ++-- > arch/xtensa/kernel/syscall.c | 31 -- > include/linux/sched/mm.h | 9 ++ > mm/huge_memory.c | 17 ++-- > mm/mmap.c | 86 +++-- > 30 files changed, 491 insertions(+), 424 deletions(-) > > -- > 2.47.0.338.g60cca15819-goog > >
Re: [PATCH mm-unstable v2 00/16] mm: Introduce arch_mmap_hint()
* Lorenzo Stoakes [241213 10:16]: > On Fri, Dec 13, 2024 at 10:06:55AM -0500, Kalesh Singh wrote: > > On Fri, Dec 13, 2024 at 4:00 AM Lorenzo Stoakes > > wrote: ... > > > > On the technical side, Liam is right that the copy-pasted arch code > > has inconsistencies (missing checks, order of checks, ...). I agree > > there’s room for further consolidation. I’ll take another stab at it > > and resend it as an RFC with an updated cover letter, as Lorenzo and > > others suggested. Thanks. Can you please include what platforms you have tested in your cover letter (and level of testing - booting, running something, etc). If you have not tested them, then it might be worth it to have it as an RFC to point this out - at least initially. Some of these are very difficult to set up for testing, but it is also possible that you did that and the maintainers/people who usually test these things will assume it's fine if you don't spell out what's going on. > > The most useful thing here as well to help us understand would be to write > more in the cover letter to expand on what it is you ultimately what to > achieve here - it seems like an extension on the existing THP work on a > per-arch basis (I may be wrong)? So adding more detail would be super > useful here! :) > > We do hope to avoid arch hooks if at all possible explicitly for the reason > that they can be applied at unfortunate times in terms of locking/whether > the objects in question are fully/partially instantiated, VMA visibility > etc. etc. based on having had issues in these areas before. > > Also if a hook means 'anything' can happen at a certain point, it means we > can't make any assumptions about what has/hasn't and have to account for > anything which seriously complicates things. > > Ideally we'd find a means to achieve the same thing while also exposing us > as little as possible to what may be mutated. Yes, I'm not sure of what your plans are, but I would like to see all of these custom functions removed, if at all possible. Thanks, Liam
Re: [PATCH] xarray: port tests to kunit
* Tamir Duberstein [250129 16:29]: > On Wed, Jan 29, 2025 at 4:26 PM Liam R. Howlett > wrote: > > > > * Sidhartha Kumar [250129 16:02]: > > > + Liam, Matthew > > > > + linux-mm > > > > Thank you Sid. > > > > > > > > Hello, > > > > > > I believe this patch needs to be reverted for now as it breaks the > > > user-space build of /tools/testing/radix-tree with: > > > > > > In file included from xarray.c:11: > > > ../../../lib/test_xarray.c:9:10: fatal error: kunit/test.h: No such file > > > or directory > > > 9 | #include > > >| ^~ > > > compilation terminated. > > > make: *** [: xarray.o] Error 1 > > > make: *** Waiting for unfinished jobs > > > > > > this then prevents the maple tree test suite from building. > > > > How are grammar corrections going to the right person (but not the > > mailing list) while an entire conversion to kunit is not [1]? > > Very simple: the tests are not properly included in MAINTAINERS. I > sent > https://lore.kernel.org/all/20250129-xarray-test-maintainer-v1-1-482e31f30...@gmail.com/ > a few minutes ago for this reason. Fair enough, but from the patch: @@ -6,11 +6,10 @@ * Author: Matthew Wilcox */ -#include -#include +#include ... -module_init(xarray_checks); -module_exit(xarray_exit); MODULE_AUTHOR("Matthew Wilcox "); MODULE_DESCRIPTION("XArray API test module"); MODULE_LICENSE("GPL"); I don't get why the huge list of Cc's didn't include the author who is in the git commit signers: $ ./scripts/get_maintainer.pl --git lib/xarray.c Matthew Wilcox (supporter:XARRAY,commit_signer:1/3=33%,authored:1/3=33%,added_lines:19/52=37%,removed_lines:4/23=17%) Andrew Morton (supporter:LIBRARY CODE,commit_signer:3/3=100%) ... > > > Does the patch really need to drop the module testing too? > > > > What exactly is the point of converting one testing system to another > > besides disruption of actual work? Who asked for this? What is the > > point? > > All this is described in the commit message. The commit message says you like the output more and implies you like the command better. I've never used the kunit testing of xarray and have used the userspace testing instead, so I can't speak to the obscure invocation as both commands seem insanely long and obscure to me. > > > Is anyone doing work on the xarray running the kunit tests? > > I was doing work on xarray and I was running the kunit tests. ... You should look at the userspace testing (that this broke) as it has been really useful in certain scenarios. Thanks, Liam
Re: [PATCH] xarray: port tests to kunit
* Sidhartha Kumar [250129 16:02]: > + Liam, Matthew + linux-mm Thank you Sid. > > Hello, > > I believe this patch needs to be reverted for now as it breaks the > user-space build of /tools/testing/radix-tree with: > > In file included from xarray.c:11: > ../../../lib/test_xarray.c:9:10: fatal error: kunit/test.h: No such file > or directory > 9 | #include >| ^~ > compilation terminated. > make: *** [: xarray.o] Error 1 > make: *** Waiting for unfinished jobs > > this then prevents the maple tree test suite from building. How are grammar corrections going to the right person (but not the mailing list) while an entire conversion to kunit is not [1]? Does the patch really need to drop the module testing too? What exactly is the point of converting one testing system to another besides disruption of actual work? Who asked for this? What is the point? Is anyone doing work on the xarray running the kunit tests? This should be reverted as it should never have been merged. Regards, Liam [1]. https://lore.kernel.org/all/20241009193602.41797-2-tam...@gmail.com/
Re: [RFC PATCH v1 3/6] mm: Avoid calling page allocator from apply_to_page_range()
* Ryan Roberts [250530 10:05]: > Lazy mmu mode applies to the current task and permits pte modifications > to be deferred and updated at a later time in a batch to improve > performance. apply_to_page_range() calls its callback in lazy mmu mode > and some of those callbacks call into the page allocator to either > allocate or free pages. > > This is problematic with CONFIG_DEBUG_PAGEALLOC because > debug_pagealloc_[un]map_pages() calls the arch implementation of > __kernel_map_pages() which must modify the ptes for the linear map. > > There are two possibilities at this point: > > - If the arch implementation modifies the ptes directly without first >entering lazy mmu mode, the pte modifications may get deferred until >the existing lazy mmu mode is exited. This could result in taking >spurious faults for example. > > - If the arch implementation enters a nested lazy mmu mode before >modification of the ptes (many arches use apply_to_page_range()), >then the linear map updates will definitely be applied upon leaving >the inner lazy mmu mode. But because lazy mmu mode does not support >nesting, the remainder of the outer user is no longer in lazy mmu >mode and the optimization opportunity is lost. > > So let's just ensure that the page allocator is never called from within > lazy mmu mode. New "_nolazy" variants of apply_to_page_range() and > apply_to_existing_page_range() are introduced which don't enter lazy mmu > mode. Then users which need to call into the page allocator within their > callback are updated to use the _nolazy variants. > > Signed-off-by: Ryan Roberts > --- > include/linux/mm.h | 6 ++ > kernel/bpf/arena.c | 6 +++--- > mm/kasan/shadow.c | 2 +- > mm/memory.c| 54 +++--- > 4 files changed, 51 insertions(+), 17 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e51dba8398f7..11cae6ce04ff 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -3743,9 +3743,15 @@ static inline bool gup_can_follow_protnone(struct > vm_area_struct *vma, > typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data); > extern int apply_to_page_range(struct mm_struct *mm, unsigned long address, > unsigned long size, pte_fn_t fn, void *data); > +extern int apply_to_page_range_nolazy(struct mm_struct *mm, > + unsigned long address, unsigned long size, > + pte_fn_t fn, void *data); We are removing externs as things are edited, so probably drop them here. > extern int apply_to_existing_page_range(struct mm_struct *mm, > unsigned long address, unsigned long size, > pte_fn_t fn, void *data); > +extern int apply_to_existing_page_range_nolazy(struct mm_struct *mm, > +unsigned long address, unsigned long size, > +pte_fn_t fn, void *data); > > #ifdef CONFIG_PAGE_POISONING > extern void __kernel_poison_pages(struct page *page, int numpages); > diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c > index 0d56cea71602..ca833cfeefb7 100644 > --- a/kernel/bpf/arena.c > +++ b/kernel/bpf/arena.c > @@ -187,10 +187,10 @@ static void arena_map_free(struct bpf_map *map) > /* >* free_vm_area() calls remove_vm_area() that calls > free_unmap_vmap_area(). >* It unmaps everything from vmalloc area and clears pgtables. > - * Call apply_to_existing_page_range() first to find populated ptes and > - * free those pages. > + * Call apply_to_existing_page_range_nolazy() first to find populated > + * ptes and free those pages. 
>*/ > - apply_to_existing_page_range(&init_mm, > bpf_arena_get_kern_vm_start(arena), > + apply_to_existing_page_range_nolazy(&init_mm, > bpf_arena_get_kern_vm_start(arena), >KERN_VM_SZ - GUARD_SZ, existing_page_cb, > NULL); > free_vm_area(arena->kern_vm); > range_tree_destroy(&arena->rt); > diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c > index d2c70cd2afb1..2325c5166c3a 100644 > --- a/mm/kasan/shadow.c > +++ b/mm/kasan/shadow.c > @@ -590,7 +590,7 @@ void kasan_release_vmalloc(unsigned long start, unsigned > long end, > > > if (flags & KASAN_VMALLOC_PAGE_RANGE) > - apply_to_existing_page_range(&init_mm, > + apply_to_existing_page_range_nolazy(&init_mm, >(unsigned long)shadow_start, >size, kasan_depopulate_vmalloc_pte, >NULL); > diff --git a/mm/memory.c b/mm/memory.c > index 49199410805c..24436074ce48 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2913,7 +2913,7 @@ EXPORT_SYMBOL(vm_iomap_memory); > static int apply_to_pte_range(struct mm_struct
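For reference, a cosmetic sketch of the header declarations with the externs dropped, as suggested above (not the posted patch):

typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);

int apply_to_page_range(struct mm_struct *mm, unsigned long address,
                        unsigned long size, pte_fn_t fn, void *data);
int apply_to_page_range_nolazy(struct mm_struct *mm, unsigned long address,
                               unsigned long size, pte_fn_t fn, void *data);
int apply_to_existing_page_range(struct mm_struct *mm, unsigned long address,
                                 unsigned long size, pte_fn_t fn, void *data);
int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
                                        unsigned long address,
                                        unsigned long size, pte_fn_t fn,
                                        void *data);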
Re: [RFC PATCH v1 3/6] mm: Avoid calling page allocator from apply_to_page_range()
* Ryan Roberts [250530 12:50]: ... > > > > > > These wrappers are terrible for readability and annoying for argument > > lists too. > Agreed. > > > > Could we do something like the pgtbl_mod_mask or zap_details and pass > > through a struct or one unsigned int for create and lazy_mmu? > > Or just create some enum flags? > > > > > At least we'd have better self-documenting code in the wrappers.. and if > > we ever need a third boolean, we could avoid multiplying the wrappers > > again. > > > > WDYT? > > I'm happy with either approach. I was expecting more consternation about the idea > of being able to disable lazy mode though, so perhaps I'll wait and see if any > arrives. If it doesn't... flags? Yes, that works as well. Please use pmd_flags or anything more descriptive than just 'flags' :) I wonder which approach is better in terms of generated asm instructions and self-documenting code. Regards, Liam
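A rough sketch of the flags variant discussed above, using a parameter name more descriptive than plain 'flags'. The enum and its values are invented for illustration and are not from the posted series, and the lazy MMU handling is shown at the top level purely for brevity - in the code being discussed it is entered around the PTE loop:

enum apply_pgtbl_flags {
        APPLY_PGTBL_CREATE      = 1U << 0,      /* allocate missing page tables */
        APPLY_PGTBL_LAZY_MMU    = 1U << 1,      /* wrap the walk in lazy MMU mode */
};

static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
                                 unsigned long size, pte_fn_t fn, void *data,
                                 unsigned int pgtbl_flags)
{
        int err = 0;

        if (pgtbl_flags & APPLY_PGTBL_LAZY_MMU)
                arch_enter_lazy_mmu_mode();

        /* ... existing walk, creating tables only if APPLY_PGTBL_CREATE ... */

        if (pgtbl_flags & APPLY_PGTBL_LAZY_MMU)
                arch_leave_lazy_mmu_mode();

        return err;
}

int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
                        unsigned long size, pte_fn_t fn, void *data)
{
        return __apply_to_page_range(mm, addr, size, fn, data,
                                     APPLY_PGTBL_CREATE | APPLY_PGTBL_LAZY_MMU);
}

int apply_to_page_range_nolazy(struct mm_struct *mm, unsigned long addr,
                               unsigned long size, pte_fn_t fn, void *data)
{
        return __apply_to_page_range(mm, addr, size, fn, data,
                                     APPLY_PGTBL_CREATE);
}

A single flag word avoids multiplying the wrappers again if a third option ever shows up, which is the concern raised in the quoted discussion.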