Re: [PATCH 12/44] mfd: ab8500-sysctrl: Register with kernel poweroff handler

2014-10-09 Thread Catalin Marinas
On Tue, Oct 07, 2014 at 09:00:48AM +0100, Lee Jones wrote:
> On Mon, 06 Oct 2014, Guenter Roeck wrote:
> > --- a/drivers/mfd/ab8500-sysctrl.c
> > +++ b/drivers/mfd/ab8500-sysctrl.c
> > @@ -6,6 +6,7 @@
> 
> [...]
> 
> > +static int ab8500_power_off(struct notifier_block *this, unsigned long unused1,
> > +			    void *unused2)
> >  {
> > sigset_t old;
> > sigset_t all;
> > @@ -34,11 +36,6 @@ static void ab8500_power_off(void)
> > struct power_supply *psy;
> > int ret;
> >  
> > -   if (sysctrl_dev == NULL) {
> > -   pr_err("%s: sysctrl not initialized\n", __func__);
> > -   return;
> > -   }
> 
> Can you explain the purpose of this change please?

I guess it's because sysctrl_dev is already initialised by the time the
power_off handler is registered, so there is no way to call the above
function with a NULL sysctrl_dev. Probably even with the original code
you didn't need this check (after some race fix in
ab8500_sysctrl_remove; races, though, are one of the things Guenter's
patches try to address).
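
For concreteness, the ordering being relied on is roughly the following
(a sketch only; the registration call is named after Guenter's series
and the rest of the probe code is elided):

	static int ab8500_sysctrl_probe(struct platform_device *pdev)
	{
		sysctrl_dev = &pdev->dev;	/* assigned first... */

		/* ...so the handler, registered afterwards, can never
		 * observe a NULL sysctrl_dev */
		return register_poweroff_handler(&ab8500_poweroff_nb);
	}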

-- 
Catalin

Re: [PATCH 32/44] arm64: psci: Register with kernel poweroff handler

2014-10-09 Thread Catalin Marinas
On Tue, Oct 07, 2014 at 06:28:34AM +0100, Guenter Roeck wrote:
> Register with kernel poweroff handler instead of setting pm_power_off
> directly.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Signed-off-by: Guenter Roeck 
> ---
>  arch/arm64/kernel/psci.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kernel/psci.c b/arch/arm64/kernel/psci.c
> index 5539547..c1f3d09 100644
> --- a/arch/arm64/kernel/psci.c
> +++ b/arch/arm64/kernel/psci.c
> @@ -286,7 +286,7 @@ static int __init psci_0_2_init(struct device_node *np)
>  
>   arm_pm_restart = psci_sys_reset;
>  
> - pm_power_off = psci_sys_poweroff;
> + register_poweroff_handler_simple(psci_sys_poweroff, 128);
>  
>  out_put_node:
>   of_node_put(np);

Acked-by: Catalin Marinas 

Re: [PATCH 08/44] kernel: Move pm_power_off to common code

2014-10-09 Thread Catalin Marinas
On Tue, Oct 07, 2014 at 06:28:10AM +0100, Guenter Roeck wrote:
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index e0ef8ba..db396bb 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -94,8 +94,6 @@ void soft_restart(unsigned long addr)
>  /*
>   * Function pointers to optional machine specific functions
>   */
> -void (*pm_power_off)(void);
> -EXPORT_SYMBOL_GPL(pm_power_off);
> 
>  void (*arm_pm_restart)(enum reboot_mode reboot_mode, const char *cmd);
> 
> @@ -155,8 +153,7 @@ void machine_power_off(void)
>  {
> local_irq_disable();
> smp_send_stop();
> -   if (pm_power_off)
> -   pm_power_off();
> +   do_kernel_poweroff();
>  }

Acked-by: Catalin Marinas 

Re: [PATCH v2 02/10] arm64: uaccess s/might_sleep/might_fault/

2013-05-16 Thread Catalin Marinas
On 16 May 2013 12:10, Michael S. Tsirkin  wrote:
> The only reason uaccess routines might sleep
> is if they fault. Make this explicit.
>
> Signed-off-by: Michael S. Tsirkin 
> ---
>  arch/arm64/include/asm/uaccess.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH V3 2/5] mm: mlock: Add new mlock, munlock, and munlockall system calls

2015-07-08 Thread Catalin Marinas
On Tue, Jul 07, 2015 at 01:03:40PM -0400, Eric B Munson wrote:
> diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
> index 05745eb..514e77b 100644
> --- a/arch/arm/kernel/calls.S
> +++ b/arch/arm/kernel/calls.S
> @@ -397,6 +397,9 @@
> >  /* 385 */	CALL(sys_memfd_create)
>   CALL(sys_bpf)
>   CALL(sys_execveat)
> + CALL(sys_mlock2)
> + CALL(sys_munlock2)
> > +/* 400 */	CALL(sys_munlockall2)

s/400/390/

> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index cef934a..318072aa 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -797,3 +797,9 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
>  __SYSCALL(__NR_bpf, sys_bpf)
>  #define __NR_execveat 387
>  __SYSCALL(__NR_execveat, compat_sys_execveat)
> +#define __NR_mlock2 388
> +__SYSCALL(__NR_mlock2, sys_mlock2)
> +#define __NR_munlock2 389
> +__SYSCALL(__NR_munlock2, sys_munlock2)
> +#define __NR_munlockall2 390
> +__SYSCALL(__NR_munlockall2, sys_munlockall2)

These look fine.

Catalin

Re: [PATCH 31/31] dma-mapping-common: skip kmemleak checks for page-less SG entries

2015-08-12 Thread Catalin Marinas
Christoph,

On 12 August 2015 at 08:05, Christoph Hellwig  wrote:
> Signed-off-by: Christoph Hellwig 
> ---
>  include/asm-generic/dma-mapping-common.h | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/include/asm-generic/dma-mapping-common.h b/include/asm-generic/dma-mapping-common.h
> index 940d5ec..afc3eaf 100644
> --- a/include/asm-generic/dma-mapping-common.h
> +++ b/include/asm-generic/dma-mapping-common.h
> @@ -51,8 +51,10 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
> int i, ents;
> struct scatterlist *s;
>
> -   for_each_sg(sg, s, nents, i)
> -   kmemcheck_mark_initialized(sg_virt(s), s->length);
> +   for_each_sg(sg, s, nents, i) {
> +   if (sg_has_page(s))
> +   kmemcheck_mark_initialized(sg_virt(s), s->length);
> +   }

Just a nitpick for the subject, it should say "kmemcheck" rather than
"kmemleak" (different features ;)).

-- 
Catalin

Re: kmemleak: Unable to handle kernel paging request

2014-06-13 Thread Catalin Marinas
On Fri, Jun 13, 2014 at 08:12:08AM +0100, Denis Kirjanov wrote:
> On 6/12/14, Catalin Marinas  wrote:
> > On Thu, Jun 12, 2014 at 01:00:57PM +0100, Denis Kirjanov wrote:
> >> On 6/12/14, Denis Kirjanov  wrote:
> >> > On 6/12/14, Catalin Marinas  wrote:
> >> >> On 11 Jun 2014, at 21:04, Denis Kirjanov wrote:
> >> >>> On 6/11/14, Catalin Marinas  wrote:
> >> >>>> On Wed, Jun 11, 2014 at 04:13:07PM +0400, Denis Kirjanov wrote:
> >> >>>>> I got a trace while running 3.15.0-08556-gdfb9454:
> >> >>>>>
> >> >>>>> [  104.534026] Unable to handle kernel paging request for data at
> >> >>>>> address 0xc0007f00
> >> >>>>
> >> >>>> Were there any kmemleak messages prior to this, like "kmemleak
> >> >>>> disabled"? There could be a race when kmemleak is disabled because
> >> >>>> of
> >> >>>> some fatal (for kmemleak) error while the scanning is taking place
> >> >>>> (which needs some more thinking to fix properly).
> >> >>>
> >> >>> No. I checked for the similar problem and didn't find anything
> >> >>> relevant.
> >> >>> I'll try to bisect it.
> >> >>
> >> >> Does this happen soon after boot? I guess it’s the first scan
> >> >> (scheduled at around 1min after boot). Something seems to be telling
> >> >> kmemleak that there is a valid memory block at 0xc0007f00.
> >> >
> >> > Yeah, it happens after a while with a booted system so that's the
> >> > first kmemleak scan.
> >>
> >> I've bisected to this commit: d4c54919ed86302094c0ca7d48a8cbd4ee753e92
> >> "mm: add !pte_present() check on existing hugetlb_entry callbacks".
> >> Reverting the commit fixes the issue
> >
> > I can't figure how this causes the problem but I have more questions. Is
> > 0xc0007f00 address always the same in all crashes? If yes, you
> > could comment out start_scan_thread() in kmemleak_late_init() to avoid
> > the scanning thread starting. Once booted, you can run:
> >
> >   echo dump=0xc0007f00 > /sys/kernel/debug/kmemleak
> >
> > and check the dmesg for what kmemleak knows about that address, when it
> > was allocated and whether it should be mapped or not.
> 
> The address is always the same.
> 
> [  179.466239] kmemleak: Object 0xc0007f00 (size 16777216):
> [  179.466503] kmemleak:   comm "swapper/0", pid 0, jiffies 4294892300
> [  179.466508] kmemleak:   min_count = 0
> [  179.466512] kmemleak:   count = 0
> [  179.466517] kmemleak:   flags = 0x1
> [  179.466522] kmemleak:   checksum = 0
> [  179.466526] kmemleak:   backtrace:
> [  179.466531]  [] .memblock_alloc_range_nid+0x68/0x88
> [  179.466544]  [] .memblock_alloc_base+0x20/0x58
> [  179.466553]  [] .alloc_dart_table+0x5c/0xb0
> [  179.466561]  [] .pmac_probe+0x38/0xa0
> [  179.466569]  [<0002166c>] 0x2166c
> [  179.466579]  [<00ae0e68>] 0xae0e68
> [  179.466587]  [<9bc4>] 0x9bc4

OK, so that's the DART table allocated via alloc_dart_table(). Is
dart_tablebase removed from the kernel linear mapping after allocation?
If that's the case, we need to tell kmemleak to ignore this block (see
patch below, untested). But I still can't explain how commit
d4c54919ed863020 causes this issue.

(also cc'ing the powerpc list and maintainers)

---8<--

From 09a7f1c97166c7bdca7ca4e8a4ff2774f3706ea3 Mon Sep 17 00:00:00 2001
From: Catalin Marinas 
Date: Fri, 13 Jun 2014 09:44:21 +0100
Subject: [PATCH] powerpc/kmemleak: Do not scan the DART table

The DART table allocation is registered to kmemleak via the
memblock_alloc_base() call. However, the DART table is later unmapped
and dart_tablebase VA no longer accessible. This patch tells kmemleak
not to scan this block and avoid an unhandled paging request.

Signed-off-by: Catalin Marinas 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
---
 arch/powerpc/sysdev/dart_iommu.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/sysdev/dart_iommu.c b/arch/powerpc/sysdev/dart_iommu.c
index 62c47bb76517..9e5353ff6d1b 100644
--- a/arch/powerpc/sysdev/dart_iommu.c
+++ b/arch/powerpc/sysdev/dart_iommu.c
@@ -476,6 +476,11 @@ void __init alloc_dart_table(void)
 	 */
 	dart_tablebase = (unsigned long)
 		__va(memblock_alloc_base(1UL<<24, 1UL<<24, 0x80000000L));
+	/*
+	 * The DART space is later unmapped from the kernel linear mapping and
+	 * accessing dart_tablebase during kmemleak scanning will fault.
+	 */
+	kmemleak_no_scan((void *)dart_tablebase);
 
 	printk(KERN_INFO "DART table allocated at: %lx\n", dart_tablebase);
 }

Re: kmemleak: Unable to handle kernel paging request

2014-06-14 Thread Catalin Marinas
On 13 Jun 2014, at 22:44, Benjamin Herrenschmidt wrote:
> On Fri, 2014-06-13 at 09:56 +0100, Catalin Marinas wrote:
> 
>> OK, so that's the DART table allocated via alloc_dart_table(). Is
>> dart_tablebase removed from the kernel linear mapping after allocation?
> 
> Yes.
> 
>> If that's the case, we need to tell kmemleak to ignore this block (see
>> patch below, untested). But I still can't explain how commit
>> d4c54919ed863020 causes this issue.
>> 
>> (also cc'ing the powerpc list and maintainers)
> 
> We remove the DART from the linear mapping because it has to be mapped
> non-cachable and having it in the linear mapping would cause cache
> paradoxes. We also can't just change the caching attributes in the
> linear mapping because we use 16M pages for it and 970 CPUs don't
> support cache-inhibited 16M pages :-( And due to the MMU segmentation
> model, we also can't mix & match page sizes in that area.
> 
> So we just unmap it, and ioremap it elsewhere.

OK, thanks for the explanation. So the kmemleak annotation makes sense.

Would you please take the patch I posted earlier (I guess with Denis'
tested-by)? I can send it separately if more convenient.

Thanks,

Catalin

Re: [PATCH 2/9] ARM64: get rid of arch_cpu_idle_prepare()

2014-01-27 Thread Catalin Marinas
On Mon, Jan 27, 2014 at 06:08:17AM +, Nicolas Pitre wrote:
> ARM and ARM64 are the only two architectures implementing
> arch_cpu_idle_prepare() simply to call local_fiq_enable().
> 
> We have secondary_start_kernel() already calling local_fiq_enable() and
> this is done a second time in arch_cpu_idle_prepare() in that case. And
> enabling FIQs has nothing to do with idling the CPU to start with.
> 
> So let's introduce init_fiq_boot_cpu() to take care of FIQs on the boot
> CPU and remove arch_cpu_idle_prepare(). This is now done a bit earlier
> at late_initcall time but this shouldn't make a difference in practice
> given that FIQs are not currently used on ARM64.
> 
> Signed-off-by: Nicolas Pitre 

For arm64, we could simply remove any reference to FIQs. I'm not aware
of anyone using them.

-- 
Catalin


Re: [PATCH 2/9] ARM64: get rid of arch_cpu_idle_prepare()

2014-01-27 Thread Catalin Marinas
On Mon, Jan 27, 2014 at 03:51:02PM +, Nicolas Pitre wrote:
> On Mon, 27 Jan 2014, Catalin Marinas wrote:
> 
> > On Mon, Jan 27, 2014 at 06:08:17AM +, Nicolas Pitre wrote:
> > > ARM and ARM64 are the only two architectures implementing
> > > arch_cpu_idle_prepare() simply to call local_fiq_enable().
> > > 
> > > We have secondary_start_kernel() already calling local_fiq_enable() and
> > > this is done a second time in arch_cpu_idle_prepare() in that case. And
> > > enabling FIQs has nothing to do with idling the CPU to start with.
> > > 
> > > So let's introduce init_fiq_boot_cpu() to take care of FIQs on the boot
> > > CPU and remove arch_cpu_idle_prepare(). This is now done a bit earlier
> > > at late_initcall time but this shouldn't make a difference in practice
> > > given that FIQs are not currently used on ARM64.
> > > 
> > > Signed-off-by: Nicolas Pitre 
> > 
> > For arm64, we could simply remove any reference to FIQs. I'm not aware
> > of anyone using them.
> 
> OK. What if I simply remove arch_cpu_idle_prepare() and let you
> remove the rest?
> 
> IMHO I'd simply remove local_fiq_{enable/disable}() from 
> arm64/kernel/smp.c and leave the infrastructure in place in case someone 
> needs it eventually.  In which case I could include that into my patch 
> as well.

Sounds good. We can keep the local_fiq_*() functions but remove about 4
calling sites (process.c and smp.c) until needed.

Thanks.

-- 
Catalin


Re: [PATCH 2/2] ARM64: powernv: remove redundant cpuidle_idle_call()

2014-02-07 Thread Catalin Marinas
On 6 February 2014 14:16, Nicolas Pitre  wrote:
> The core idle loop now takes care of it.
>
> Signed-off-by: Nicolas Pitre 

Acked-by: Catalin Marinas 


Re: [PATCH] powerpc: msi: mark bitmap with kmemleak_not_leak()

2015-09-14 Thread Catalin Marinas
On Mon, Sep 14, 2015 at 04:04:37PM +1000, Michael Ellerman wrote:
> On Sun, 2015-09-13 at 21:36 +0300, Denis Kirjanov wrote:
> > During the MSI bitmap test on boot kmemleak spews the following trace:
> > 
> > unreferenced object 0xc0016e86c900 (size 64):
> > comm "swapper/0", pid 1, jiffies 4294893173 (age 518.024s)
> > hex dump (first 32 bytes):
> > 00 00 01 ff 7f ff 7f 37 00 00 00 00 00 00 00 00  .......7........
> > ff ff ff ff ff ff ff ff 01 ff ff ff ff ff ff ff  ................
> > 
> > backtrace:
> > [] .zalloc_maybe_bootmem+0x3c/0x380
> > [] .msi_bitmap_alloc+0x3c/0xb0
> > [] .msi_bitmap_selftest+0x30/0x2b4
> > [] .do_one_initcall+0xd4/0x270
> > [] .kernel_init_freeable+0x1a0/0x280
> > [] .kernel_init+0x1c/0x120
> > [] .ret_from_kernel_thread+0x58/0x9c
> > 
> > The comment in msi_bitmap_free() states that we can't free
> > the bitmap so mark it with the kmemleak_not_leak().
> 
> Yeah, and I've always hated that comment :)
> 
> I assume it's still true now that we don't use bootmem?

If it's not bootmem, it's probably memblock. Looking at the code, it
seems that msi_bitmap_free() cannot make the distinction between a slab
allocation and a bootmem/memblock one (allocated via
zalloc_maybe_bootmem).

You could add some flag to struct msi_bitmap based on mem_init_done to
be able to reclaim some slab memory later. If the bitmap is small and
such allocation doesn't happen outside boot, it may not be worth the
effort.

-- 
Catalin


Re: [PATCH] powerpc: msi: mark bitmap with kmemleak_not_leak()

2015-09-14 Thread Catalin Marinas
On Mon, Sep 14, 2015 at 07:36:49PM +1000, Michael Ellerman wrote:
> On Mon, 2015-09-14 at 10:15 +0100, Catalin Marinas wrote:
> > You could add some flag to struct msi_bitmap based on mem_init_done to
> > be able to reclaim some slab memory later. If the bitmap is small and
> > such allocation doesn't happen outside boot, it may not be worth the
> > effort.
> 
> Right, I think that's the only solution, and it's not quite worth the trouble
> because it's generally not freed at all, except via error paths.
> 
> Still I think it would be better to move the kmemleak annotation into the msi
> bitmap test code, or maybe a wrapper that's called by the test code.

Other kmemleak annotations throughout the kernel are usually added
immediately after the allocation place (since that's what kmemleak
reports as a leak).

> That way if someone starts calling alloc/free from a hotplug path for example,
> kmemleak will still notice that free isn't really freeing.

Kmemleak would only notice the moment you clear the last reference to
the allocated memory block (like bmp->bitmap = NULL), so this patch
should work as long as "freeing" is done via msi_bitmap_free().

BTW, you can even use kmemleak_ignore(). The difference is that the
bitmap won't be scanned by kmemleak and that's fine since it doesn't
contain any pointers.
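
To illustrate at a hypothetical allocation site (not part of the patch
under discussion):

	bmp->bitmap = memblock_virt_alloc(size, 0);
	/* never freed by design: suppress the leak report, keep scanning */
	kmemleak_not_leak(bmp->bitmap);
	/* alternatively: suppress the report and skip scanning too, which
	 * is safe here since the bitmap contains no pointers */
	kmemleak_ignore(bmp->bitmap);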

-- 
Catalin


Re: [PATCH v2] powerpc: msi: mark bitmap with kmemleak_not_leak()

2015-09-16 Thread Catalin Marinas
On Tue, Sep 15, 2015 at 08:14:08PM +0300, Denis Kirjanov wrote:
> diff --git a/arch/powerpc/include/asm/msi_bitmap.h b/arch/powerpc/include/asm/msi_bitmap.h
> index 97ac3f4..9a1d2fb 100644
> --- a/arch/powerpc/include/asm/msi_bitmap.h
> +++ b/arch/powerpc/include/asm/msi_bitmap.h
> @@ -19,6 +19,7 @@ struct msi_bitmap {
>   unsigned long   *bitmap;
>   spinlock_t  lock;
> >  	unsigned int	irq_count;
> > +	bool	bitmap_from_slab;

Nitpick: same alignment for bitmap_from_slab with irq_count etc. (unless
it's just my mail client).

>  };
>  
>  int msi_bitmap_alloc_hwirqs(struct msi_bitmap *bmp, int num);
> diff --git a/arch/powerpc/sysdev/msi_bitmap.c b/arch/powerpc/sysdev/msi_bitmap.c
> index 73b64c7..305ebe3 100644
> --- a/arch/powerpc/sysdev/msi_bitmap.c
> +++ b/arch/powerpc/sysdev/msi_bitmap.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -122,7 +123,12 @@ int msi_bitmap_alloc(struct msi_bitmap *bmp, unsigned int irq_count,
>   size = BITS_TO_LONGS(irq_count) * sizeof(long);
>   pr_debug("msi_bitmap: allocator bitmap size is 0x%x bytes\n", size);
>  
> - bmp->bitmap = zalloc_maybe_bootmem(size, GFP_KERNEL);
> + if (slab_is_available()) {
> + bmp->bitmap = kzalloc(size, GFP_KERNEL);
> + bmp->bitmap_from_slab = true;
> + } else
> + bmp->bitmap = memblock_virt_alloc(size, 0);

I don't think bmp->bitmap_from_slab is always initialised, so you need
to explicitly set it to false here.

> +
>   if (!bmp->bitmap) {
>   pr_debug("msi_bitmap: ENOMEM allocating allocator bitmap!\n");
>   return -ENOMEM;
> @@ -138,7 +144,8 @@ int msi_bitmap_alloc(struct msi_bitmap *bmp, unsigned int irq_count,
>  
>  void msi_bitmap_free(struct msi_bitmap *bmp)
>  {
> - /* we can't free the bitmap we don't know if it's bootmem etc. */
> + if (bmp->bitmap_from_slab)
> + kfree(bmp->bitmap);
>   of_node_put(bmp->of_node);
>   bmp->bitmap = NULL;
>  }
> @@ -200,11 +207,11 @@ static void __init test_basics(void)
>   WARN_ON(rc < 0 && rc % 128 != 0);
>  
>   msi_bitmap_free(&bmp);
> + if (!bmp.bitmap_from_slab)
> + kmemleak_not_leak(bmp.bitmap);

As I mentioned in the other thread, I think you can call
kmemleak_not_leak() immediately after memblock_virt_alloc() (together
with a comment that this is never going to be freed). That would match
the other kmemleak API uses throughout the kernel.

-- 
Catalin


Re: [PATCH v3] powerpc: msi: mark bitmap with kmemleak_not_leak()

2015-09-17 Thread Catalin Marinas
On Wed, Sep 16, 2015 at 10:26:14PM +0300, Denis Kirjanov wrote:
> During the MSI bitmap test on boot kmemleak spews the following trace:
> 
> unreferenced object 0xc0016e86c900 (size 64):
> comm "swapper/0", pid 1, jiffies 4294893173 (age 518.024s)
> hex dump (first 32 bytes):
>   00 00 01 ff 7f ff 7f 37 00 00 00 00 00 00 00 00  .......7........
>   ff ff ff ff ff ff ff ff 01 ff ff ff ff ff ff ff  ................
>   backtrace:
>   [] .zalloc_maybe_bootmem+0x3c/0x380
>   [] .msi_bitmap_alloc+0x3c/0xb0
>   [] .msi_bitmap_selftest+0x30/0x2b4
>   [] .do_one_initcall+0xd4/0x270
>   [] .kernel_init_freeable+0x1a0/0x280
>   [] .kernel_init+0x1c/0x120
>   [] .ret_from_kernel_thread+0x58/0x9c
> 
> Added a flag to msi_bitmap for tracking allocations
> from slab and memblock so we can properly free/handle
> memory in msi_bitmap_free().
> 
> Signed-off-by: Denis Kirjanov 

Reviewed-by: Catalin Marinas 


Re: [PATCH v2] kmemleak: skip scanning holes in the .bss section

2019-03-19 Thread Catalin Marinas
Hi Qian,

On Wed, Mar 13, 2019 at 10:57:17AM -0400, Qian Cai wrote:
> @@ -1531,7 +1547,14 @@ static void kmemleak_scan(void)
>  
>   /* data/bss scanning */
>   scan_large_block(_sdata, _edata);
> - scan_large_block(__bss_start, __bss_stop);
> +
> + if (bss_hole_start) {
> + scan_large_block(__bss_start, bss_hole_start);
> + scan_large_block(bss_hole_stop, __bss_stop);
> + } else {
> + scan_large_block(__bss_start, __bss_stop);
> + }
> +
>   scan_large_block(__start_ro_after_init, __end_ro_after_init);

I'm not a fan of this approach but I couldn't come up with anything
better. I was hoping we could check for PageReserved() in scan_block()
but on arm64 it ends up not scanning the .bss at all.

Until another user appears, I'm ok with this patch.

Acked-by: Catalin Marinas 


Re: [PATCH v2] kmemleak: skip scanning holes in the .bss section

2019-03-20 Thread Catalin Marinas
On Thu, Mar 21, 2019 at 12:15:46AM +1100, Michael Ellerman wrote:
> Catalin Marinas  writes:
> > On Wed, Mar 13, 2019 at 10:57:17AM -0400, Qian Cai wrote:
> >> @@ -1531,7 +1547,14 @@ static void kmemleak_scan(void)
> >>  
> >>/* data/bss scanning */
> >>scan_large_block(_sdata, _edata);
> >> -  scan_large_block(__bss_start, __bss_stop);
> >> +
> >> +  if (bss_hole_start) {
> >> +  scan_large_block(__bss_start, bss_hole_start);
> >> +  scan_large_block(bss_hole_stop, __bss_stop);
> >> +  } else {
> >> +  scan_large_block(__bss_start, __bss_stop);
> >> +  }
> >> +
> >>scan_large_block(__start_ro_after_init, __end_ro_after_init);
> >
> > I'm not a fan of this approach but I couldn't come up with anything
> > better. I was hoping we could check for PageReserved() in scan_block()
> > but on arm64 it ends up not scanning the .bss at all.
> >
> > Until another user appears, I'm ok with this patch.
> >
> > Acked-by: Catalin Marinas 
> 
> I actually would like to rework this kvm_tmp thing to not be in bss at
> all. It's a bit of a hack and is incompatible with strict RWX.
> 
> If we size it a bit more conservatively we can hopefully just reserve
> some space in the text section for it.
> 
> I'm not going to have time to work on that immediately though, so if
> people want this fixed now then this patch could go in as a temporary
> solution.

I think I have a simpler idea. Kmemleak allows punching holes in
allocated objects, so just turn the data/bss sections into dedicated
kmemleak objects. This happens when kmemleak is initialised, before the
initcalls are invoked. The kvm_free_tmp() would just free the
corresponding part of the bss.

Patch below, only tested briefly on arm64. Qian, could you give it a try
on powerpc? Thanks.

8<--
diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index 683b5b3805bd..c4b8cb3c298d 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -712,6 +712,8 @@ static void kvm_use_magic_page(void)
 
 static __init void kvm_free_tmp(void)
 {
+	kmemleak_free_part(&kvm_tmp[kvm_tmp_index],
+			   ARRAY_SIZE(kvm_tmp) - kvm_tmp_index);
 	free_reserved_area(&kvm_tmp[kvm_tmp_index],
 			   &kvm_tmp[ARRAY_SIZE(kvm_tmp)], -1, NULL);
 }
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 707fa5579f66..0f6adcbfc2c7 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1529,11 +1529,6 @@ static void kmemleak_scan(void)
 	}
 	rcu_read_unlock();
 
-	/* data/bss scanning */
-	scan_large_block(_sdata, _edata);
-	scan_large_block(__bss_start, __bss_stop);
-	scan_large_block(__start_ro_after_init, __end_ro_after_init);
-
 #ifdef CONFIG_SMP
 	/* per-cpu sections scanning */
 	for_each_possible_cpu(i)
@@ -2071,6 +2066,15 @@ void __init kmemleak_init(void)
 	}
 	local_irq_restore(flags);
 
+	/* register the data/bss sections */
+	create_object((unsigned long)_sdata, _edata - _sdata,
+		      KMEMLEAK_GREY, GFP_ATOMIC);
+	create_object((unsigned long)__bss_start, __bss_stop - __bss_start,
+		      KMEMLEAK_GREY, GFP_ATOMIC);
+	create_object((unsigned long)__start_ro_after_init,
+		      __end_ro_after_init - __start_ro_after_init,
+		      KMEMLEAK_GREY, GFP_ATOMIC);
+
 	/*
 	 * This is the point where tracking allocations is safe. Automatic
 	 * scanning is started during the late initcall. Add the early logged


[PATCH] kmemleak: powerpc: skip scanning holes in the .bss section

2019-03-21 Thread Catalin Marinas
The commit 2d4f567103ff ("KVM: PPC: Introduce kvm_tmp framework") adds
kvm_tmp[] into the .bss section and then frees the rest of the unused
space back to the page allocator.

kernel_init
  kvm_guest_init
kvm_free_tmp
  free_reserved_area
free_unref_page
  free_unref_page_prepare

With DEBUG_PAGEALLOC=y, those pages are unmapped from the kernel. As a
result, a kmemleak scan will trigger a panic when it reaches the
unmapped pages in the .bss section.

This patch creates dedicated kmemleak objects for the .data, .bss and
potentially .data..ro_after_init sections to allow partial freeing via
the kmemleak_free_part() in the powerpc kvm_free_tmp() function.

Acked-by: Michael Ellerman  (powerpc)
Reported-by: Qian Cai 
Signed-off-by: Catalin Marinas 
---

Posting as a proper patch following the inlined one here:

http://lkml.kernel.org/r/20190320181656.gb38...@arrakis.emea.arm.com

Changes from the above:

- Added comment to the powerpc kmemleak_free_part() call

- Only register the .data..ro_after_init in kmemleak if not contained
  within the .data sections (which seems to be the case for lots of
  architectures)

I preserved part of Qian's original commit message but changed the
author since I rewrote the patch.

 arch/powerpc/kernel/kvm.c |  7 +++
 mm/kmemleak.c | 16 +++-
 2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index 683b5b3805bd..cd381e2291df 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -712,6 +713,12 @@ static void kvm_use_magic_page(void)
 
 static __init void kvm_free_tmp(void)
 {
+	/*
+	 * Inform kmemleak about the hole in the .bss section since the
+	 * corresponding pages will be unmapped with DEBUG_PAGEALLOC=y.
+	 */
+	kmemleak_free_part(&kvm_tmp[kvm_tmp_index],
+			   ARRAY_SIZE(kvm_tmp) - kvm_tmp_index);
 	free_reserved_area(&kvm_tmp[kvm_tmp_index],
 			   &kvm_tmp[ARRAY_SIZE(kvm_tmp)], -1, NULL);
 }
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 707fa5579f66..6c318f5ac234 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1529,11 +1529,6 @@ static void kmemleak_scan(void)
 	}
 	rcu_read_unlock();
 
-	/* data/bss scanning */
-	scan_large_block(_sdata, _edata);
-	scan_large_block(__bss_start, __bss_stop);
-	scan_large_block(__start_ro_after_init, __end_ro_after_init);
-
 #ifdef CONFIG_SMP
 	/* per-cpu sections scanning */
 	for_each_possible_cpu(i)
@@ -2071,6 +2066,17 @@ void __init kmemleak_init(void)
 	}
 	local_irq_restore(flags);
 
+	/* register the data/bss sections */
+	create_object((unsigned long)_sdata, _edata - _sdata,
+		      KMEMLEAK_GREY, GFP_ATOMIC);
+	create_object((unsigned long)__bss_start, __bss_stop - __bss_start,
+		      KMEMLEAK_GREY, GFP_ATOMIC);
+	/* only register .data..ro_after_init if not within .data */
+	if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata)
+		create_object((unsigned long)__start_ro_after_init,
+			      __end_ro_after_init - __start_ro_after_init,
+			      KMEMLEAK_GREY, GFP_ATOMIC);
+
 	/*
 	 * This is the point where tracking allocations is safe. Automatic
 	 * scanning is started during the late initcall. Add the early logged


Re: [PATCH 1/5] arm64: Fix vDSO clock_getres()

2019-04-15 Thread Catalin Marinas
On Mon, Apr 01, 2019 at 12:51:48PM +0100, Vincenzo Frascino wrote:
> diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> index 2d419006ad43..47ba72345739 100644
> --- a/arch/arm64/kernel/vdso.c
> +++ b/arch/arm64/kernel/vdso.c
> @@ -245,6 +245,8 @@ void update_vsyscall(struct timekeeper *tk)
>   vdso_data->cs_shift = tk->tkr_mono.shift;
>   }
>  
> + vdso_data->hrtimer_res  = hrtimer_resolution;
> +
>   smp_wmb();
>   ++vdso_data->tb_seq_count;
>  }
> diff --git a/arch/arm64/kernel/vdso/gettimeofday.S b/arch/arm64/kernel/vdso/gettimeofday.S
> index c39872a7b03c..7a2cd2f8e13a 100644
> --- a/arch/arm64/kernel/vdso/gettimeofday.S
> +++ b/arch/arm64/kernel/vdso/gettimeofday.S
> @@ -296,32 +296,35 @@ ENDPROC(__kernel_clock_gettime)
>  /* int __kernel_clock_getres(clockid_t clock_id, struct timespec *res); */
>  ENTRY(__kernel_clock_getres)
>   .cfi_startproc
> + adr vdso_data, _vdso_data
>   cmp w0, #CLOCK_REALTIME
> >   ccmp	w0, #CLOCK_MONOTONIC, #0x4, ne
> >   ccmp	w0, #CLOCK_MONOTONIC_RAW, #0x4, ne
> - b.ne1f
> + b.ne2f
>  
> - ldr x2, 5f
> - b   2f
> -1:
> +1:   /* Get hrtimer_res */
> + seqcnt_acquire
> + syscall_check fail=5f
> + ldr x2, [vdso_data, #CLOCK_REALTIME_RES]
> + seqcnt_check fail=1b
> + b   3f
> +2:

We talked briefly but I'm still confused why we need the fallback to the
syscall here if archdata.vdso_direct is false. Is it because if the
timer driver code sets vdso_direct to false, we don't support
highres timers? If my understanding is correct, you may want to move the
hrtimer_res setting in update_vsyscall() to the !use_syscall block.

-- 
Catalin


Re: [PATCH] [v2] arch: add pidfd and io_uring syscalls everywhere

2019-04-16 Thread Catalin Marinas
On Mon, Apr 15, 2019 at 04:22:57PM +0200, Arnd Bergmann wrote:
> Add the io_uring and pidfd_send_signal system calls to all architectures.
> 
> These system calls are designed to handle both native and compat tasks,
> so all entries are the same across architectures, only arm-compat and
> the generic table still use an old format.
> 
> Acked-by: Michael Ellerman  (powerpc)
> Acked-by: Heiko Carstens  (s390)
> Acked-by: Geert Uytterhoeven 
> Signed-off-by: Arnd Bergmann 
> ---
> Changes since v1:
> - fix s390 table
> - use 'n64' tag in mips-n64 instead of common.
> ---
>  arch/alpha/kernel/syscalls/syscall.tbl  | 4 
>  arch/arm/tools/syscall.tbl  | 4 
>  arch/arm64/include/asm/unistd.h | 2 +-
>  arch/arm64/include/asm/unistd32.h   | 8 
>  arch/ia64/kernel/syscalls/syscall.tbl   | 4 
>  arch/m68k/kernel/syscalls/syscall.tbl   | 4 
>  arch/microblaze/kernel/syscalls/syscall.tbl | 4 
>  arch/mips/kernel/syscalls/syscall_n32.tbl   | 4 
>  arch/mips/kernel/syscalls/syscall_n64.tbl   | 4 
>  arch/mips/kernel/syscalls/syscall_o32.tbl   | 4 
>  arch/parisc/kernel/syscalls/syscall.tbl | 4 
>  arch/powerpc/kernel/syscalls/syscall.tbl| 4 
>  arch/s390/kernel/syscalls/syscall.tbl   | 4 
>  arch/sh/kernel/syscalls/syscall.tbl | 4 
>  arch/sparc/kernel/syscalls/syscall.tbl  | 4 
>  arch/xtensa/kernel/syscalls/syscall.tbl     | 4 
>  16 files changed, 65 insertions(+), 1 deletion(-)

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v2 1/5] arm64: Fix vDSO clock_getres()

2019-04-16 Thread Catalin Marinas
On Tue, Apr 16, 2019 at 05:14:30PM +0100, Vincenzo Frascino wrote:
> diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> index 2d419006ad43..5f5759d51c33 100644
> --- a/arch/arm64/kernel/vdso.c
> +++ b/arch/arm64/kernel/vdso.c
> @@ -245,6 +245,8 @@ void update_vsyscall(struct timekeeper *tk)
>   vdso_data->cs_shift = tk->tkr_mono.shift;
>   }
>  
> + vdso_data->hrtimer_res  = hrtimer_resolution;

This should be a WRITE_ONCE(), just in case.
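I.e. something like (sketch):

	WRITE_ONCE(vdso_data->hrtimer_res, hrtimer_resolution);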

> +
>   smp_wmb();
>   ++vdso_data->tb_seq_count;
>  }
> diff --git a/arch/arm64/kernel/vdso/gettimeofday.S b/arch/arm64/kernel/vdso/gettimeofday.S
> index c39872a7b03c..e2e9dfe9ba4a 100644
> --- a/arch/arm64/kernel/vdso/gettimeofday.S
> +++ b/arch/arm64/kernel/vdso/gettimeofday.S
> @@ -296,32 +296,32 @@ ENDPROC(__kernel_clock_gettime)
>  /* int __kernel_clock_getres(clockid_t clock_id, struct timespec *res); */
>  ENTRY(__kernel_clock_getres)
>   .cfi_startproc
> + adr vdso_data, _vdso_data
>   cmp w0, #CLOCK_REALTIME
> >   ccmp	w0, #CLOCK_MONOTONIC, #0x4, ne
> >   ccmp	w0, #CLOCK_MONOTONIC_RAW, #0x4, ne
> - b.ne1f
> + b.ne2f
>  
> - ldr x2, 5f
> - b   2f
> -1:
> +1:   /* Get hrtimer_res */
> + ldr x2, [vdso_data, #CLOCK_REALTIME_RES]

And here we need an "ldr w2, ..." since hrtimer_res is u32.

With the above (which Will can fix up):

Reviewed-by: Catalin Marinas 


Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-03 Thread Catalin Marinas
On Fri, Apr 03, 2020 at 01:58:31AM +0100, Al Viro wrote:
> On Thu, Apr 02, 2020 at 11:35:57AM -0700, Kees Cook wrote:
> > Yup, I think it's a weakness of the ARM implementation and I'd like to
> > not extend it further. AFAIK we should never nest, but I would not be
> > surprised at all if we did.
> > 
> > If we were looking at a design goal for all architectures, I'd like
> > to be doing what the public PaX patchset did for their memory access
> > switching, which is to alarm if calling into "enable" found the access
> > already enabled, etc. Such a condition would show an unexpected nesting
> > (like we've seen with similar constructs with set_fs() not getting reset
> > during an exception handler, etc etc).
> 
> FWIW, maybe I'm misreading the ARM uaccess logics, but... it smells like
> KERNEL_DS is somewhat more dangerous there than on e.g. x86.
> 
> Look: with CONFIG_CPU_DOMAINS, set_fs(KERNEL_DS) tells MMU to ignore
> per-page permission bits in DOMAIN_KERNEL (i.e. for kernel address
> ranges), allowing them even if they would normally be denied.  We need
> that for actual uaccess loads/stores, since those use insns that pretend
> to be done in user mode and we want them to access the kernel pages.
> But that affects the normal loads/stores as well; unless I'm misreading
> that code, it will ignore (supervisor) r/o on a page.  And that's not
> just for the code inside the uaccess blocks; *everything* done under
> KERNEL_DS is subject to that.

That's correct. Luckily this only affects ARMv5 and earlier. From ARMv6
onwards, CONFIG_CPU_USE_DOMAINS is no longer selected and the uaccess
instructions are just plain ldr/str.

Russell should know the details on whether there was much choice. Since
the kernel was living in the linear map with full rwx permissions, the
KERNEL_DS overriding was probably not a concern and the ldrt/strt for
uaccess deemed more secure. We also have weird permission setting
pre-ARMv6 (or rather v6k) where RO user pages are writable from the
kernel with standard str instructions (breaking CoW). I don't recall
whether it was a choice made by the kernel or something the architecture
enforced. The vectors page has to be kernel writable (and user RO) to
store the TLS value in the absence of a TLS register but maybe we could
do this via the linear alias together with the appropriate cache
maintenance.

From ARMv6, the domain overriding had the side-effect of ignoring the XN
bit and causing random instruction fetches from ioremap() areas. So we
had to remove the domain switching. We also gained a dedicated TLS
register.

> Why do we do that (modify_domain(), that is) inside set_fs() and not
> in uaccess_enable() et.al.?

I think uaccess_enable() could indeed switch the kernel domain if
KERNEL_DS is set and move this out of set_fs(). It would reduce the
window the kernel domain permissions are overridden. Anyway,
uaccess_enable() appeared much later on arm when Russell introduced PAN
(SMAP) like support by switching the user domain.
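
A rough sketch of that idea for the CONFIG_CPU_USE_DOMAINS case (helper
names as in arch/arm, but this is untested and not an actual patch):

	/* set_fs() would only record the limit... */
	static inline void set_fs(mm_segment_t fs)
	{
		current_thread_info()->addr_limit = fs;
	}

	/* ...and the DOMAIN_KERNEL override would be confined to the
	 * uaccess window */
	static inline unsigned int uaccess_save_and_enable(void)
	{
		if (uaccess_kernel())
			modify_domain(DOMAIN_KERNEL, DOMAIN_MANAGER);
		return 0;
	}

	static inline void uaccess_restore(unsigned int flags)
	{
		if (uaccess_kernel())
			modify_domain(DOMAIN_KERNEL, DOMAIN_CLIENT);
	}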

-- 
Catalin


Re: [PATCH] dma/direct: turn ARCH_ZONE_DMA_BITS into a variable

2019-10-31 Thread Catalin Marinas
On Thu, Oct 31, 2019 at 05:58:53PM +0100, Christoph Hellwig wrote:
> On Thu, Oct 31, 2019 at 05:22:59PM +0100, Nicolas Saenz Julienne wrote:
> > OK, I see what you mean now. It's wrong indeed.
> > 
> > The trouble is the ZONE_DMA series[1] in arm64, also due for v5.5, will be
> > affected by this patch. I don't know the right way to approach this problem
> > since depending on the merge order, this patch should be updated or the 
> > arm64
> > ZONE_DMA series fixed.
> > 
> > Maybe it's easier to just wait for v5.6.
> 
> Ok, I can wait.  Or the arm64 maintainers can pick up this patch if
> you want to add it to that series.

This branch is stable (I may add a fix but I'm not rebasing it) if you
want to base this patch on top:

https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/zone-dma

Otherwise, with your ack, I can add it on top of the above branch (aimed
at 5.5).

-- 
Catalin


Re: [PATCH 02/31] arm64: fix the flush_icache_range arguments in machine_kexec

2020-05-11 Thread Catalin Marinas
On Mon, May 11, 2020 at 08:51:15AM +0100, Will Deacon wrote:
> On Sun, May 10, 2020 at 09:54:41AM +0200, Christoph Hellwig wrote:
> > The second argument is the end "pointer", not the length.
> > 
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  arch/arm64/kernel/machine_kexec.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> > index 8e9c924423b4e..a0b144cfaea71 100644
> > --- a/arch/arm64/kernel/machine_kexec.c
> > +++ b/arch/arm64/kernel/machine_kexec.c
> > @@ -177,6 +177,7 @@ void machine_kexec(struct kimage *kimage)
> >  * the offline CPUs. Therefore, we must use the __* variant here.
> >  */
> > __flush_icache_range((uintptr_t)reboot_code_buffer,
> > +(uintptr_t)reboot_code_buffer +
> >  arm64_relocate_new_kernel_size);
> 
> Urgh, well spotted. It's annoyingly different from __flush_dcache_area().
> 
> But now I'm wondering what this code actually does... the loop condition
> in invalidate_icache_by_line works with 64-bit arithmetic, so we could
> spend a /very/ long time here afaict.

I think it goes through the loop only once. The 'b.lo' saves us here.
OTOH, there is no I-cache maintenance done.

> It's also a bit annoying that we do a bunch of redundant D-cache
> maintenance too. Should we use invalidate_icache_range() here instead?

Since we have the __flush_dcache_area() above it for cleaning to PoC, we
could use invalidate_icache_range() here. We probably didn't have this
function at the time, it was added for KVM (commit 4fee94736603cd6).

> (and why does that thing need to toggle uaccess)?

invalidate_icache_range() doesn't need to, it works on the kernel linear
map.

__flush_icache_range() doesn't need to either, that's a side-effect of
the fall-through implementation.

Anyway, I think Christoph's patch needs to go in with a fixes tag:

Fixes: d28f6df1305a ("arm64/kexec: Add core kexec support")
Cc:  # 4.8.x-

and we'll change these functions/helpers going forward for arm64.

Happy to pick this up via the arm64 for-next/fixes branch.

-- 
Catalin


Re: [PATCH] powerpc/kvm: silence kmemleak false positives

2020-05-11 Thread Catalin Marinas
On Mon, May 11, 2020 at 09:15:55PM +1000, Michael Ellerman wrote:
> Qian Cai  writes:
> > kvmppc_pmd_alloc() and kvmppc_pte_alloc() allocate some memory but then
> > pud_populate() and pmd_populate() will use __pa() to reference the newly
> > allocated memory. The same is in xive_native_provision_pages().
> >
> > Since kmemleak is unable to track the physical memory resulting in false
> > positives, silence those by using kmemleak_ignore().
> 
> There is kmemleak_alloc_phys(), which according to the docs can be used
> for tracking a phys address.

This won't help. While kmemleak_alloc_phys() allows passing a physical
address, it doesn't track physical address references to the object: it
still expects a VA pointing to it, otherwise the object would be
reported as a leak.

We currently only call this from the memblock code with a min_count of
0, meaning it will not be reported as a leak if no references are found.

We don't have this issue with page tables on other architectures since
most of them use whole page allocations which aren't tracked by
kmemleak. These powerpc functions use kmem_cache_alloc() which would be
tracked automatically by kmemleak. While we could add a phys alias to
kmemleak (another search tree), I think the easiest is as per Qian's
patch, just ignore those objects.
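
In other words, Qian's patch amounts to something like this at the
allocation sites (sketch; error handling elided):

	pte = kmem_cache_alloc(kvm_pte_cache, GFP_KERNEL);
	/* only referenced via __pa() from the higher-level table, which
	 * kmemleak cannot follow, so suppress the false positive */
	if (pte)
		kmemleak_ignore(pte);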

-- 
Catalin


Re: [PATCH] powerpc/kvm: silence kmemleak false positives

2020-05-12 Thread Catalin Marinas
On Mon, May 11, 2020 at 07:43:30AM -0400, Qian Cai wrote:
> On May 11, 2020, at 7:15 AM, Michael Ellerman  wrote:
> > There is kmemleak_alloc_phys(), which according to the docs can be used
> > for tracking a phys address.
> > 
> > Did you try that?
> 
> Catalin, feel free to give your thoughts here.
> 
> My understanding is that it seems the doc is a bit misleading.
> kmemleak_alloc_phys() allocates kmemleak objects for a physical
> address range, so that kmemleak can scan the pointers within that
> memory for references to other memory. It has only been used in
> memblock so far, but the new memory allocations here contain no
> references to other memory.
> 
> In this case, we already have kmemleak objects for those memory
> allocations. It is just that other pointers reference that memory by
> its physical address, and it is a known kmemleak limitation that the
> connection cannot be tracked. Thus, we use kmemleak_ignore() to avoid
> reporting those as leaks and to skip scanning them, because they
> contain no references to other memory.

Indeed. I replied directly to Michael along the same lines.

-- 
Catalin


Re: Several suspected memory leaks

2018-08-07 Thread Catalin Marinas
(catching up with emails)

On Wed, 11 Jul 2018 at 00:40, Benjamin Herrenschmidt wrote:
> On Tue, 2018-07-10 at 17:17 +0200, Paul Menzel wrote:
> > On a the IBM S822LC (8335-GTA) with Ubuntu 18.04 I built Linux master
> > – 4.18-rc4+, commit 092150a2 (Merge branch 'for-linus'
> > of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid) – with
> > kmemleak. Several issues are found.
>
> Some of these are completely uninteresting though and look like
> kmemleak bugs to me :-)
>
> > [] __pud_alloc+0x80/0x270
> > [<07135d64>] hash__map_kernel_page+0x30c/0x4d0
> > [<71677858>] __ioremap_at+0x108/0x140
> > [<0023e921>] __ioremap_caller+0x130/0x180
> > [<9dbc3923>] icp_native_init_one_node+0x5cc/0x760
> > [<15f3168a>] icp_native_init+0x70/0x13c
> > [<60ed>] xics_init+0x38/0x1ac
> > [<88dbf9d1>] pnv_init_IRQ+0x30/0x5c
>
> This is the interrupt controller mapping its registers, why on earth
> would that be considered a leak ? kmemleak needs to learn to ignore
> kernel page tables allocations.

Indeed, that's just a false positive for powerpc. Kmemleak ignores
page allocations and most architectures use __get_free_pages() for the
page table. In this particular case, the powerpc code uses
kmem_cache_alloc() and that's tracked by kmemleak. Since the pgd
stores the __pa(pud), kmemleak doesn't detect this pointer and reports
it as a leak. To work around this, you can pass SLAB_NOLEAKTRACE to
kmem_cache_create() in pgtable_cache_add()
(arch/powerpc/mm/init-common.c).
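
Something along these lines (sketch, arguments abbreviated):

	/* SLAB_NOLEAKTRACE makes kmemleak skip objects from this cache,
	 * avoiding the __pa()-reference false positive described above */
	cache = kmem_cache_create(name, table_size, align,
				  SLAB_NOLEAKTRACE, ctor);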

-- 
Catalin


Re: [REPOST PATCH v6 0/4] kgdb: Fix kgdb_roundup_cpus()

2018-12-07 Thread Catalin Marinas
On Tue, Dec 04, 2018 at 07:38:24PM -0800, Douglas Anderson wrote:
> Douglas Anderson (4):
>   kgdb: Remove irq flags from roundup
>   kgdb: Fix kgdb_roundup_cpus() for arches who used smp_call_function()
>   kgdb: Don't round up a CPU that failed rounding up before
>   kdb: Don't back trace on a cpu that didn't round up

FWIW, trying these on arm64 (ThunderX2) with CONFIG_KGDB_TESTS_ON_BOOT=y
on top of 4.20-rc5 doesn't boot. It looks like they leave interrupts
disabled when they shouldn't and it trips over the BUG at
mm/vmalloc.c:1380 (called via do_fork -> copy_process).

Now, I don't think these patches make things worse on arm64 since prior
to them the kgdb boot tests on arm64 were stuck in a loop (RUN
singlestep).

-- 
Catalin


Re: [REPOST PATCH v6 0/4] kgdb: Fix kgdb_roundup_cpus()

2018-12-10 Thread Catalin Marinas
Hi Doug,

On Fri, Dec 07, 2018 at 10:40:24AM -0800, Doug Anderson wrote:
> On Fri, Dec 7, 2018 at 9:42 AM Catalin Marinas wrote:
> > On Tue, Dec 04, 2018 at 07:38:24PM -0800, Douglas Anderson wrote:
> > > Douglas Anderson (4):
> > >   kgdb: Remove irq flags from roundup
> > >   kgdb: Fix kgdb_roundup_cpus() for arches who used smp_call_function()
> > >   kgdb: Don't round up a CPU that failed rounding up before
> > >   kdb: Don't back trace on a cpu that didn't round up
> >
> > FWIW, trying these on arm64 (ThunderX2) with CONFIG_KGDB_TESTS_ON_BOOT=y
> > on top of 4.20-rc5 doesn't boot. It looks like they leave interrupts
> > disabled when they shouldn't and it trips over the BUG at
> > mm/vmalloc.c:1380 (called via do_fork -> copy_process).
> >
> > Now, I don't think these patches make things worse on arm64 since prior
> > to them the kgdb boot tests on arm64 were stuck in a loop (RUN
> > singlestep).
> 
> Thanks for the report!  ...actually, I'd never tried CONFIG_KGDB_TESTS
> before.  ...so I tried them now:
> 
> A) chromeos-4.19 tree on qcom-sdm845 without this series: booted up OK
> B) chromeos-4.19 tree on qcom-sdm845 with this series: booted up OK
> C) v4.20-rc5-90-g30002dd008ed on rockchip-rk3399 (kevin) with this
> series: booted up OK
> 
> Example output from B) above:
> 
> localhost ~ # dmesg | grep kgdbts
> [2.139814] KGDB: Registered I/O driver kgdbts
> [2.144582] kgdbts:RUN plant and detach test
> [2.165333] kgdbts:RUN sw breakpoint test
> [2.172990] kgdbts:RUN bad memory access test
> [2.178640] kgdbts:RUN singlestep test 1000 iterations
> [2.187765] kgdbts:RUN singlestep [0/1000]
> [2.559596] kgdbts:RUN singlestep [100/1000]
> [2.931419] kgdbts:RUN singlestep [200/1000]
> [3.303474] kgdbts:RUN singlestep [300/1000]
> [3.675121] kgdbts:RUN singlestep [400/1000]
> [4.046867] kgdbts:RUN singlestep [500/1000]
> [4.418920] kgdbts:RUN singlestep [600/1000]
> [4.790824] kgdbts:RUN singlestep [700/1000]
> [5.162479] kgdbts:RUN singlestep [800/1000]
> [5.534103] kgdbts:RUN singlestep [900/1000]
> [5.902299] kgdbts:RUN do_fork for 100 breakpoints
> [8.463900] KGDB: Unregistered I/O driver kgdbts, debugger disabled
> 
> ...so I guess I'm a little confused.  Either I have a different config
> than you do or something is special about your machine?

I tried it now on a Juno board, both as host and as guest, and it boots
fine. It must be something that only triggers on ThunderX2. Ignore the
report for now; if I find anything interesting I'll let you know.

-- 
Catalin


Re: [PATCH v2 1/3] arm64: mm: Add p?d_large() definitions

2019-07-01 Thread Catalin Marinas
On Mon, Jul 01, 2019 at 04:40:24PM +1000, Nicholas Piggin wrote:
> walk_page_range() is going to be allowed to walk page tables other than
> those of user space. For this it needs to know when it has reached a
> 'leaf' entry in the page tables. This information will be provided by the
> p?d_large() functions/macros.
> 
> For arm64, we already have p?d_sect() macros which we can reuse for
> p?d_large().
> 
> pud_sect() is defined as a dummy function when CONFIG_PGTABLE_LEVELS < 3
> or CONFIG_ARM64_64K_PAGES is defined. However when the kernel is
> configured this way then architecturally it isn't allowed to have a
> large page that this level, and any code using these page walking macros
> is implicitly relying on the page size/number of levels being the same as
> the kernel. So it is safe to reuse this for p?d_large() as it is an
> architectural restriction.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Signed-off-by: Steven Price 
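
Concretely, the reuse described above amounts to something like (sketch
of the mapping, not the full patch):

	/* arm64: a section mapping is a leaf ("large") entry */
	#define pmd_large(pmd)	pmd_sect(pmd)
	#define pud_large(pud)	pud_sect(pud)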

Acked-by: Catalin Marinas 


Re: [PATCH 1/2] dma-mapping: fix page attributes for dma_mmap_*

2019-08-05 Thread Catalin Marinas
On Mon, Aug 05, 2019 at 11:01:44AM +0300, Christoph Hellwig wrote:
> All the way back to introducing dma_common_mmap we've defaulyed to mark

s/defaulyed/defaulted/

> the pages as uncached.  But this is wrong for DMA coherent devices.
> Later on DMA_ATTR_WRITE_COMBINE also got incorrect treatment as that
> flag is only treated special on the alloc side for non-coherent devices.
> 
> Introduce a new dma_pgprot helper that deals with the check for coherent
> devices so that only the remapping cases even reach arch_dma_mmap_pgprot

s/even/ever/

> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index 1d3f0b5a9940..bd2b039f43a6 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -14,9 +14,7 @@
>  pgprot_t arch_dma_mmap_pgprot(struct device *dev, pgprot_t prot,
>   unsigned long attrs)
>  {
> - if (!dev_is_dma_coherent(dev) || (attrs & DMA_ATTR_WRITE_COMBINE))
> - return pgprot_writecombine(prot);
> - return prot;
> + return pgprot_writecombine(prot);
>  }

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v2 06/29] ARM: add migrate_pages() system call

2019-01-25 Thread Catalin Marinas
On Fri, Jan 18, 2019 at 05:18:12PM +0100, Arnd Bergmann wrote:
> The migrate_pages system call has an assigned number on all architectures
> except ARM. When it got added initially in commit d80ade7b3231 ("ARM:
> Fix warning: #warning syscall migrate_pages not implemented"), it was
> intentionally left out based on the observation that there are no 32-bit
> ARM NUMA systems.
> 
> However, there are now arm64 NUMA machines that can in theory run 32-bit
> kernels (actually enabling NUMA there would require additional work)
> as well as 32-bit user space on 64-bit kernels, so that argument is no
> longer very strong.
> 
> Assigning the number lets us use the system call on 64-bit kernels as well
> as providing a more consistent set of syscalls across architectures.
> 
> Signed-off-by: Arnd Bergmann 

For the arm64 part:

Acked-by: Catalin Marinas 


Re: [PATCH v2 07/29] ARM: add kexec_file_load system call number

2019-01-25 Thread Catalin Marinas
On Fri, Jan 18, 2019 at 05:18:13PM +0100, Arnd Bergmann wrote:
> diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> index 86de9eb34296..20ed7e026723 100644
> --- a/arch/arm/tools/syscall.tbl
> +++ b/arch/arm/tools/syscall.tbl
> @@ -415,3 +415,4 @@
>  398  common  rseqsys_rseq
>  399  common  io_pgetevents   sys_io_pgetevents
>  400  common  migrate_pages   sys_migrate_pages
> +401  common  kexec_file_load sys_kexec_file_load

I presume on arm32 this would still return -ENOSYS.

> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index 261216c3336e..2c30e6f145ff 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -44,7 +44,7 @@
>  #define __ARM_NR_compat_set_tls  (__ARM_NR_COMPAT_BASE + 5)
>  #define __ARM_NR_COMPAT_END  (__ARM_NR_COMPAT_BASE + 0x800)
>  
> -#define __NR_compat_syscalls 401
> +#define __NR_compat_syscalls 402
>  #endif
>  
>  #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index f15bcbacb8f6..8ca1d4c304f4 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -823,6 +823,8 @@ __SYSCALL(__NR_rseq, sys_rseq)
>  __SYSCALL(__NR_io_pgetevents, compat_sys_io_pgetevents)
>  #define __NR_migrate_pages 400
>  __SYSCALL(__NR_migrate_pages, compat_sys_migrate_pages)
> +#define __NR_kexec_file_load 401
> +__SYSCALL(__NR_kexec_file_load, sys_kexec_file_load)

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v2 25/29] y2038: syscalls: rename y2038 compat syscalls

2019-01-25 Thread Catalin Marinas
On Fri, Jan 18, 2019 at 05:18:31PM +0100, Arnd Bergmann wrote:
> A lot of system calls that pass a time_t somewhere have an implementation
> using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
> been reworked so that this implementation can now be used on 32-bit
> architectures as well.
> 
> The missing step is to redefine them using the regular SYSCALL_DEFINEx()
> to get them out of the compat namespace and make it possible to build them
> on 32-bit architectures.
> 
> Any system call that ends in 'time' gets a '32' suffix on its name for
> that version, while the others get a '_time32' suffix, to distinguish
> them from the normal version, which takes a 64-bit time argument in the
> future.
> 
> In this step, only 64-bit architectures are changed, doing this rename
> first lets us avoid touching the 32-bit architectures twice.
> 
> Signed-off-by: Arnd Bergmann 

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v2 29/29] y2038: add 64-bit time_t syscalls to all 32-bit architectures

2019-01-25 Thread Catalin Marinas
On Fri, Jan 18, 2019 at 05:18:35PM +0100, Arnd Bergmann wrote:
> This adds 21 new system calls on each ABI that has 32-bit time_t
> today. All of these have the exact same semantics as their existing
> counterparts, and the new ones all have macro names that end in 'time64'
> for clarification.
> 
> This gets us to the point of being able to safely use a C library
> that has 64-bit time_t in user space. There are still a couple of
> loose ends to tie up in various areas of the code, but this is the
> big one, and should be entirely uncontroversial at this point.
> 
> In particular, there are four system calls (getitimer, setitimer,
> waitid, and getrusage) that don't have a 64-bit counterpart yet,
> but these can all be safely implemented in the C library by wrapping
> around the existing system calls because the 32-bit time_t they
> pass only counts elapsed time, not time since the epoch. They
> will be dealt with later.
> 
> Signed-off-by: Arnd Bergmann 

Acked-by: Catalin Marinas 

(as long as compat follows the arm32 syscall numbers)


Re: [PATCH v2 6/9] arm64: kexec: no need to ClearPageReserved()

2019-01-25 Thread Catalin Marinas
On Mon, Jan 14, 2019 at 01:59:00PM +0100, David Hildenbrand wrote:
> This will be done by free_reserved_page().
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Bhupesh Sharma 
> Cc: James Morse 
> Cc: Marc Zyngier 
> Cc: Dave Kleikamp 
> Cc: Mark Rutland 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Matthew Wilcox 
> Acked-by: James Morse 
> Signed-off-by: David Hildenbrand 

Acked-by: Catalin Marinas 


Re: [PATCH v2 7/9] arm64: kdump: No need to mark crashkernel pages manually PG_reserved

2019-01-25 Thread Catalin Marinas
On Mon, Jan 14, 2019 at 01:59:01PM +0100, David Hildenbrand wrote:
> The crashkernel is reserved via memblock_reserve(). memblock_free_all()
> will call free_low_memory_core_early(), which will go over all reserved
> memblocks, marking the pages as PG_reserved.
> 
> So manually marking pages as PG_reserved is not necessary, they are
> already in the desired state (otherwise they would have been handed over
> to the buddy as free pages and bad things would happen).
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: James Morse 
> Cc: Bhupesh Sharma 
> Cc: David Hildenbrand 
> Cc: Mark Rutland 
> Cc: Dave Kleikamp 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Michal Hocko 
> Cc: Florian Fainelli 
> Cc: Stefan Agner 
> Cc: Laura Abbott 
> Cc: Greg Hackmann 
> Cc: Johannes Weiner 
> Cc: Kristina Martsenko 
> Cc: CHANDAN VN 
> Cc: AKASHI Takahiro 
> Cc: Logan Gunthorpe 
> Reviewed-by: Matthias Brugger 
> Signed-off-by: David Hildenbrand 

Acked-by: Catalin Marinas 


Re: [PATCH v2 06/21] memblock: memblock_phys_alloc_try_nid(): don't panic

2019-01-25 Thread Catalin Marinas
On Mon, Jan 21, 2019 at 10:03:53AM +0200, Mike Rapoport wrote:
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index ae34e3a..2c61ea4 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -237,6 +237,10 @@ static void __init setup_node_data(int nid, u64 
> start_pfn, u64 end_pfn)
>   pr_info("Initmem setup node %d []\n", nid);
>  
>   nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> + if (!nd_pa)
> + panic("Cannot allocate %zu bytes for node %d data\n",
> +   nd_size, nid);
> +
>   nd = __va(nd_pa);
>  
>   /* report and initialize */

Does it mean that memblock_phys_alloc_try_nid() never returns valid
physical memory starting at 0?

-- 
Catalin


Re: [PATCH v5 00/11] hugetlb: Factorize hugetlb architecture primitives

2018-07-31 Thread Catalin Marinas
On Tue, Jul 31, 2018 at 06:01:44AM +, Alexandre Ghiti wrote:
> Alexandre Ghiti (11):
>   hugetlb: Harmonize hugetlb.h arch specific defines with pgtable.h
>   hugetlb: Introduce generic version of hugetlb_free_pgd_range
>   hugetlb: Introduce generic version of set_huge_pte_at
>   hugetlb: Introduce generic version of huge_ptep_get_and_clear
>   hugetlb: Introduce generic version of huge_ptep_clear_flush
>   hugetlb: Introduce generic version of huge_pte_none
>   hugetlb: Introduce generic version of huge_pte_wrprotect
>   hugetlb: Introduce generic version of prepare_hugepage_range
>   hugetlb: Introduce generic version of huge_ptep_set_wrprotect
>   hugetlb: Introduce generic version of huge_ptep_set_access_flags
>   hugetlb: Introduce generic version of huge_ptep_get
[...]
>  arch/arm64/include/asm/hugetlb.h | 39 +++-

For the arm64 bits in this series:

Acked-by: Catalin Marinas 


Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-01-28 Thread Catalin Marinas
On Mon, Jan 27, 2020 at 09:11:53PM -0500, Qian Cai wrote:
> On Jan 27, 2020, at 8:28 PM, Anshuman Khandual  
> wrote:
> > This adds tests which will validate architecture page table helpers and
> > other accessors in their compliance with expected generic MM semantics.
> > This will help various architectures in validating changes to existing
> > page table helpers or addition of new ones.
[...]
> What’s the value of this block of new code? It only supports x86 and
> arm64 which are supposed to be good now. Did those tests ever find any
> regression or is this almost only useful for new architectures, which
> only happens once in a few years?

The primary goal here is not finding regressions but having clearly
defined semantics of the page table accessors across architectures. x86
and arm64 are a good starting point and other architectures will be
enabled as they are aligned to the same semantics.

See for example this past discussion:

https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/

These tests should act as the 'contract' between the generic mm code and
the architecture port. Without clear semantics, some bugs may be a lot
subtler than a boot failure.

FTR, I fully support this patch (and I should get around to review it
properly; thanks for the reminder ;)).

-- 
Catalin


Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-01-29 Thread Catalin Marinas
On Tue, Jan 28, 2020 at 02:07:10PM -0500, Qian Cai wrote:
> On Jan 28, 2020, at 12:47 PM, Catalin Marinas  wrote:
> > The primary goal here is not finding regressions but having clearly
> > defined semantics of the page table accessors across architectures. x86
> > and arm64 are a good starting point and other architectures will be
> > enabled as they are aligned to the same semantics.
> 
> This still does not answer the fundamental question. If this test is
> simply inefficient at finding bugs,

Who said this is inefficient (other than you)?

> who wants to spend time to use it regularly? 

Arch maintainers, mm maintainers introducing new macros or assuming
certain new semantics of the existing macros.

> If this is just a one-off test that may get run once in a few years
> (when introducing a new arch), how does it justify the ongoing cost to
> maintain it?

You are really missing the point. It's not only for a new arch but
changes to existing arch code. And if the arch code churn in this area
is relatively small, I'd expect a similarly small cost of maintaining
this test.

If you only turn DEBUG_VM on once every few years, don't generalise this
to the rest of the kernel developers (as others pointed out, this test
is default y if DEBUG_VM).

Anyway, I think that's a pointless discussion, so not going to reply
further (unless you have technical content to add).

-- 
Catalin


Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-02-10 Thread Catalin Marinas
On Tue, Jan 28, 2020 at 06:57:53AM +0530, Anshuman Khandual wrote:
> This gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
> CONFIG_VM_DEBUG. Architectures willing to subscribe to this test also need to
> select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE which for now is limited to x86 and
> arm64. Going forward, other architectures too can enable this after fixing
> build or runtime problems (if any) with their page table helpers.

It may be worth posting the next version to linux-arch to reach out to
other arch maintainers.

Also I've seen that you posted a v13 but it hasn't reached
linux-arm-kernel (likely held in moderation because of the large number
of addresses cc'ed) and I don't normally follow LKML. I'm not cc'ed on
this patch either (which is fine as long as you post to a list that I
read).

Since I started the reply on v12 about a week ago, I'll follow up here.
When you post a v14, please trim the people on cc to only those strictly
necessary (e.g. arch maintainers, linux-mm, linux-arch and lkml).

> diff --git a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt 
> b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
> new file mode 100644
> index ..f3f8111edbe3
> --- /dev/null
> +++ b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
> @@ -0,0 +1,35 @@
> +#
> +# Feature name:  debug-vm-pgtable
> +# Kconfig:   ARCH_HAS_DEBUG_VM_PGTABLE
> +# description:   arch supports pgtable tests for semantics compliance
> +#
> +---
> +| arch |status|
> +---
> +|   alpha: | TODO |
> +| arc: |  ok  |
> +| arm: | TODO |

I'm sure you can find some arm32 hardware around (or a VM) to give this
a try ;).

> diff --git a/arch/x86/include/asm/pgtable_64.h 
> b/arch/x86/include/asm/pgtable_64.h
> index 0b6c4042942a..fb0e76d254b3 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
[...]
> @@ -1197,6 +1197,7 @@ static noinline void __init kernel_init_freeable(void)
>   sched_init_smp();
>  
>   page_alloc_init_late();
> + debug_vm_pgtable();
>   /* Initialize page ext after all struct pages are initialized. */
>   page_ext_init();

I guess you could even make debug_vm_pgtable() an early_initcall(). I
don't have a strong opinion either way.
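A minimal sketch of that variant (assuming debug_vm_pgtable() keeps its
current void signature; the wrapper name is made up):

	static int __init debug_vm_pgtable_init(void)
	{
		debug_vm_pgtable();
		return 0;
	}
	early_initcall(debug_vm_pgtable_init);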

> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> new file mode 100644
> index ..0f37f32d15f1
> --- /dev/null
> +++ b/mm/debug_vm_pgtable.c
> @@ -0,0 +1,388 @@
[...]
> +/*
> + * Basic operations
> + *
> + * mkold(entry)  = An old and not a young entry
> + * mkyoung(entry)= A young and not an old entry
> + * mkdirty(entry)= A dirty and not a clean entry
> + * mkclean(entry)= A clean and not a dirty entry
> + * mkwrite(entry)= A write and not a write protected entry
> + * wrprotect(entry)  = A write protected and not a write entry
> + * pxx_bad(entry)= A mapped and non-table entry
> + * pxx_same(entry1, entry2)  = Both entries hold the exact same value
> + */
> +#define VMFLAGS  (VM_READ|VM_WRITE|VM_EXEC)
> +
> +/*
> + * On s390 platform, the lower 12 bits are used to identify given page table
> + * entry type and for other arch specific requirements. But these bits might
> + * affect the ability to clear entries with pxx_clear(). So while loading up
> + * the entries skip all lower 12 bits in order to accommodate s390 platform.
> + * It does not affect any other platform.
> + */
> +#define RANDOM_ORVALUE   (0xfffffffffffff000UL)

I'd suggest you generate this mask with something like
GENMASK(BITS_PER_LONG, PAGE_SHIFT).

> +#define RANDOM_NZVALUE   (0xff)
> +
> +static void __init pte_basic_tests(unsigned long pfn, pgprot_t prot)
> +{
> + pte_t pte = pfn_pte(pfn, prot);
> +
> + WARN_ON(!pte_same(pte, pte));
> + WARN_ON(!pte_young(pte_mkyoung(pte)));
> + WARN_ON(!pte_dirty(pte_mkdirty(pte)));
> + WARN_ON(!pte_write(pte_mkwrite(pte)));
> + WARN_ON(pte_young(pte_mkold(pte)));
> + WARN_ON(pte_dirty(pte_mkclean(pte)));
> + WARN_ON(pte_write(pte_wrprotect(pte)));

Given that you start with rwx permissions set,
some of these ops would not have any effect. For example, on arm64 at
least, mkwrite clears a bit already cleared here. You could try with
multiple rwx combinations values (e.g. all set and all cleared) or maybe
something like below:

WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte))));

You could also try something like this:

WARN_ON(!pte_same(pte_wrprotect(pte), pte_wrprotect(pte_mkwrite(pte))));

though the above approach may not work for arm64 ptep_set_wrprotect() on
a dirty pte (if you extend these tests later).
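Folded into the test, the suggestion could look something like this
(sketch only, reusing the pfn and prot the test already receives):

	static void __init pte_rwx_transition_tests(unsigned long pfn, pgprot_t prot)
	{
		pte_t pte = pfn_pte(pfn, prot);

		/* start from a write-protected entry so mkwrite has an effect */
		WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte))));
		/* and from a writable entry so wrprotect has an effect */
		WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte))));
	}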

> +}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static void __init pmd_basic_tests(unsigned long pfn, pgprot_t prot)
> +{
> + pmd_t pmd = pfn_pmd(pfn, prot);

Re: [RFC PATCH] Kconfig: Enable Kconfig fragments to be used for defconfig

2010-07-16 Thread Catalin Marinas
On Wed, 2010-07-14 at 00:04 +0100, Grant Likely wrote:
> - It still doesn't resolve dependencies.  A solver would help with this.
>   For the time being I work around the problem by running the generated
>   config through 'oldconfig' and looking for differences.  If the files
>   differ (ignoring comments and generateconfig_* options) after oldconfig,
>   then the _defconfig target returns a failure.  (but leaves the
>   new .config intact so the user can resolve it with menuconfig).  This
>   way at least the user is told when a Kconfig fragment is invalid.

It's not a solver but I'm pushing a patch to warn on selecting symbols
with unmet dependencies so that you can select further symbols (manual
solving). The patch is in linux-next but you also can grab it from:

http://git.kernel.org/?p=linux/kernel/git/cmarinas/linux-2.6-cm.git;a=commitdiff_plain;h=5d87db2d2a332784bbf2b1ec3e141486f4d41d6f

-- 
Catalin



Re: [RFC PATCH] Kconfig: Enable Kconfig fragments to be used for defconfig

2010-07-16 Thread Catalin Marinas
On Fri, 2010-07-16 at 19:46 +0100, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 11:40 AM, Nicolas Pitre  wrote:
> >
> > DOH.
> 
> Well, it's possible that the correct approach is a mixture.
> 
> Automatically do the trivial cases (recursive selects, dependencies
> that are simple or of the form "x && y" etc), and warn about the cases
> that aren't trivial (where "not trivial" may not necessarily be about
> fundamentally ambiguous ones, but just "complex enough that I won't
> even try").

There is still a risk with this approach when the Kconfig isn't entirely
correct. For example, on ARM we have (I pushed a patch already):

config CPU_32v6K
depends on CPU_V6

config CPU_V7
select CPU_32v6K

In this simple approach, we end up selecting CPU_V6 when we only need
CPU_V7. There are other places like this in the kernel.

Of course, kbuild could still warn but if people rely on this feature to
select options automatically I suspect they would ignore the warnings.

-- 
Catalin



Re: [RFC PATCH] Kconfig: Enable Kconfig fragments to be used for defconfig

2010-07-16 Thread Catalin Marinas
On Fri, 2010-07-16 at 21:17 +0100, Grant Likely wrote:
> On Fri, Jul 16, 2010 at 2:09 PM, Catalin Marinas
>  wrote:
> > On Fri, 2010-07-16 at 19:46 +0100, Linus Torvalds wrote:
> >> On Fri, Jul 16, 2010 at 11:40 AM, Nicolas Pitre  wrote:
> >> >
> >> > DOH.
> >>
> >> Well, it's possible that the correct approach is a mixture.
> >>
> >> Automatically do the trivial cases (recursive selects, dependencies
> >> that are simple or of the form "x && y" etc), and warn about the cases
> >> that aren't trivial (where "not trivial" may not necessarily be about
> >> fundamentally ambiguous ones, but just "complex enough that I won't
> >> even try").
> >
> > There is still a risk with this approach when the Kconfig isn't entirely
> > correct. For example, on ARM we have (I pushed a patch already):
> >
> > config CPU_32v6K
> >depends on CPU_V6
> >
> > config CPU_V7
> >select CPU_32v6K
> >
> > In this simple approach, we end up selecting CPU_V6 when we only need
> > CPU_V7. There are other places like this in the kernel.
> >
> > Of course, kbuild could still warn but if people rely on this feature to
> > select options automatically I suspect they would ignore the warnings.
> 
> In my first patch, I made Kconfig problems errors instead of warnings.
>  That would prevent people from ignoring them.

My point was that if we allow kbuild to select dependencies
automatically (as per Nico's initial suggestion, followed up by Linus),
in the above situation CPU_V7 would trigger the selection of CPU_V6 and
I don't want this. If we rely on such automatic selection of the
"depends on" options, we can't make the warnings be errors.

-- 
Catalin



Re: [PATCH] kmemleak: Allow kmemleak to be built on powerpc

2009-07-16 Thread Catalin Marinas
On Thu, 2009-07-16 at 07:31 -0400, Josh Boyer wrote:
> On Thu, Jul 16, 2009 at 05:43:50PM +1000, Michael Ellerman wrote:
> >On Thu, 2009-07-16 at 11:25 +1000, Michael Ellerman wrote:
> >> Very lightly tested, doesn't crash the kernel.
> >> 
> >> Signed-off-by: Michael Ellerman 
> >> ---
> >> 
> >> It doesn't look like we actually need to add any support in the
> >> arch code - or is there something I'm missing?
> >
> >Hmm, I think we want to add annotations in lib/lmb.c don't we? That's
> >our low-level pre-bootmem allocator.
> >
> >And we have the same problem with _edata as x86.
> 
> I'll point out that the Fedora developers enabled this in the rawhide kernels
> for 3 days, and then turned it back off because most of the things found were
> false positives.

Yes, but with 2.6.31-rc3 the number of false positives dropped
considerably. I don't plan to push any new kmemleak patches for 2.6.31
(probably only a bug-fix).

Anyway, even when it reports real leaks, it is very time consuming to
investigate.

I don't have any PPC hardware but as long as someone sorts out things
like _edata or other PPC-specific allocators which aren't currently
tracked by kmemleak, I'm OK with the original patch:

Acked-by: Catalin Marinas 

-- 
Catalin



Re: [PATCH] kmemleak: Allow kmemleak to be built on powerpc

2009-07-16 Thread Catalin Marinas
On Thu, 2009-07-16 at 17:43 +1000, Michael Ellerman wrote:
> On Thu, 2009-07-16 at 11:25 +1000, Michael Ellerman wrote:
> > Very lightly tested, doesn't crash the kernel.
> > 
> > Signed-off-by: Michael Ellerman 
> > ---
> > 
> > It doesn't look like we actually need to add any support in the
> > arch code - or is there something I'm missing?
> 
> Hmm, I think we want to add annotations in lib/lmb.c don't we? That's
> our low-level pre-bootmem allocator.

Yes, I think so (I'm not using this on ARM or x86 so I can't really test
it). Without these hooks, the kmemleak reports aren't that useful
(probably too many).

-- 
Catalin



Re: [PATCH] kmemleak: Allow kmemleak to be built on powerpc

2009-07-17 Thread Catalin Marinas
On Fri, 2009-07-17 at 10:29 +1000, Michael Ellerman wrote:
> On Thu, 2009-07-16 at 18:52 +0100, Catalin Marinas wrote:
> > On Thu, 2009-07-16 at 17:43 +1000, Michael Ellerman wrote:
> > > On Thu, 2009-07-16 at 11:25 +1000, Michael Ellerman wrote:
> > > > Very lightly tested, doesn't crash the kernel.
> > > > 
> > > > Signed-off-by: Michael Ellerman 
> > > > ---
> > > > 
> > > > It doesn't look like we actually need to add any support in the
> > > > arch code - or is there something I'm missing?
> > > 
> > > Hmm, I think we want to add annotations in lib/lmb.c don't we? That's
> > > our low-level pre-bootmem allocator.
> > 
> > Yes, I think so (I'm not using this on ARM or x86 so I can't really test
> > it). Without these hooks, the kmemleak reports aren't that useful
> > (probably too many).
> 
> The wrinkle is that lmb never frees, so by definition it can't leak :)

You can pass min_count = 0 to kmemleak_alloc() so that it would never
report such blocks as leaks (see the alloc_bootmem hooks).

> But we could have memory allocated with lmb that has pointers to other
> objects allocated later, and we want kmemleak to scan the lmb allocated
> blocks to find those references.
> 
> So the question is do we need to annotate lmb so that will happen, or
> does kmemleak scan all kernel memory, regardless of where it's
> allocated?

Apart from the data and bss sections, it only scans the memory which was
explicitly allocated. So, you need these hooks in the lmb code.

The mainline kernel scans for the task stacks by going through the full
tasks list. However, traversing this list requires a lock which
increases the latency quite a lot. For the next merging window, I added
hooks to the alloc_thread_info/free_thread_info functions. If the ppc
code has __HAVE_ARCH_THREAD_INFO_ALLOCATOR defined, you may need to add
some hooks in there as well (but note that kmalloc/kmem_cache_alloc are
tracked by kmemleak already, so you only need the hooks if the
thread_info allocator uses __get_free_pages).
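For illustration, such hooks could look roughly like this, assuming the
arch allocator uses __get_free_pages() (a sketch only; the names follow
the usual conventions rather than the actual ppc code):

	struct thread_info *alloc_thread_info(struct task_struct *tsk)
	{
		struct thread_info *ti;

		ti = (struct thread_info *)__get_free_pages(GFP_KERNEL,
							    THREAD_SIZE_ORDER);
		/* min_count = 1: report a leak if no pointer to it is found */
		kmemleak_alloc(ti, THREAD_SIZE, 1, GFP_KERNEL);
		return ti;
	}

	void free_thread_info(struct thread_info *ti)
	{
		kmemleak_free(ti);
		free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
	}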

-- 
Catalin



Re: [PATCH] kmemleak: Allow kmemleak to be built on powerpc

2009-07-17 Thread Catalin Marinas
On Fri, 2009-07-17 at 18:32 +1000, Michael Ellerman wrote:
> On Fri, 2009-07-17 at 09:26 +0100, Catalin Marinas wrote:
> > On Fri, 2009-07-17 at 10:29 +1000, Michael Ellerman wrote:
> > > The wrinkle is that lmb never frees, so by definition it can't leak :)
> > 
> > You can pass min_count = 0 to kmemleak_alloc() so that it would never
> > report such blocks as leaks (see the alloc_bootmem hooks).
> 
> OK. I couldn't see any description of what min_count meant anywhere,

There isn't any, indeed. I'll try to add some description to the
kmemleak API. But it basically means the minimum number of times a
pointer value should be found during scanning so that the block is not
reported as a leak. If this value is 0, the block is never reported. The
kmalloc/kmem_cache_alloc hooks have this value set to 1, and vmalloc's
is higher because of other structures like vm_area holding pointers to
the same block.
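As a sketch, the lmb annotation being discussed would then be along
these lines (call site assumed; GFP_NOWAIT because this runs early):

	/*
	 * min_count = 0: scan the block for pointers to other objects
	 * but never report the block itself as a leak.
	 */
	u64 base = lmb_alloc(size, align);

	kmemleak_alloc(__va(base), size, 0, GFP_NOWAIT);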

-- 
Catalin



Re: [PATCH] Add kmemleak annotations to lmb.c

2009-08-13 Thread Catalin Marinas
On Thu, 2009-08-13 at 13:01 +1000, Michael Ellerman wrote:
> We don't actually want kmemleak to track the lmb allocations, so we
> pass min_count as 0. However telling kmemleak about lmb allocations
> allows it to scan that memory for pointers to other memory that is
> tracked by kmemleak, ie. slab allocations etc.

Looks alright to me (though I haven't tested it). You can add a
Reviewed-by: Catalin Marinas 

-- 
Catalin



Re: [PATCH] Add kmemleak annotations to lmb.c

2009-08-14 Thread Catalin Marinas
On Fri, 2009-08-14 at 17:56 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2009-08-13 at 16:40 +0100, Catalin Marinas wrote:
> > On Thu, 2009-08-13 at 13:01 +1000, Michael Ellerman wrote:
> > > We don't actually want kmemleak to track the lmb allocations, so we
> > > pass min_count as 0. However telling kmemleak about lmb allocations
> > > allows it to scan that memory for pointers to other memory that is
> > > tracked by kmemleak, ie. slab allocations etc.
> > 
> > Looks alright to me (though I haven't tested it). You can add a
> > Reviewed-by: Catalin Marinas 
> 
> Actually, Milton pointed out to me that we may not want to allow all
> LMB chunks to be scanned by kmemleak; things like the DART hole
> that's taken out of the linear mapping, for example, may need to
> be avoided, though I'm not sure what would be the right way to
> do it.

I suspect there are more blocks that should be scanned than blocks that
shouldn't, so maybe ignore the latter explicitly using
kmemleak_ignore(). This was
raised recently on x86_64 as well which has a memory hole for some
aperture - http://lkml.org/lkml/2009/8/13/237.
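If only a few regions are problematic, the combination could be
sketched as below (the DART symbol name here is hypothetical):

	/* register the lmb block as usual: scanned but never reported */
	kmemleak_alloc(__va(base), size, 0, GFP_NOWAIT);

	/* exclude regions that must not be touched from scanning */
	kmemleak_ignore(__va(dart_hole_base));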

-- 
Catalin



Re: [PATCH] Add kmemleak annotations to lmb.c

2009-08-14 Thread Catalin Marinas
On Fri, 2009-08-14 at 12:49 -0700, David Miller wrote:
> From: Benjamin Herrenschmidt 
> Date: Fri, 14 Aug 2009 17:56:40 +1000
> 
> > On Thu, 2009-08-13 at 16:40 +0100, Catalin Marinas wrote:
> >> On Thu, 2009-08-13 at 13:01 +1000, Michael Ellerman wrote:
> >> > We don't actually want kmemleak to track the lmb allocations, so we
> >> > pass min_count as 0. However telling kmemleak about lmb allocations
> >> > allows it to scan that memory for pointers to other memory that is
> >> > tracked by kmemleak, ie. slab allocations etc.
> >> 
> >> Looks alright to me (though I haven't tested it). You can add a
> >> Reviewed-by: Catalin Marinas 
> > 
> > Actually, Milton pointed out to me that we may not want to allow all
> > LMB chunks to be scanned by kmemleak; things like the DART hole
> > that's taken out of the linear mapping, for example, may need to
> > be avoided, though I'm not sure what would be the right way to
> > do it.
> 
> I think that annotating LMB for kmemleak may be more problems
> than it's worth.

BTW, are there many LMB allocations used for storing pointers to other
objects? If not, it may be worth just annotating those with
kmemleak_alloc() if you get false positives.

-- 
Catalin



Re: linux-next: manual merge of the powerpc tree with the arm64 tree

2019-09-02 Thread Catalin Marinas
On Mon, Sep 02, 2019 at 11:44:43AM +1000, Michael Ellerman wrote:
> Stephen Rothwell  writes:
> > Hi all,
> >
> > Today's linux-next merge of the powerpc tree got a conflict in:
> >
> >   arch/Kconfig
> >
> > between commit:
> >
> >   5cf896fb6be3 ("arm64: Add support for relocating the kernel with RELR 
> > relocations")
> >
> > from the arm64 tree and commit:
> >
> >   0c9c1d563975 ("x86, s390: Move ARCH_HAS_MEM_ENCRYPT definition to 
> > arch/Kconfig")
> >
> > from the powerpc tree.
> >
> > I fixed it up (see below) and can carry the fix as necessary. This
> > is now fixed as far as linux-next is concerned, but any non trivial
> > conflicts should be mentioned to your upstream maintainer when your tree
> > is submitted for merging.  You may also want to consider cooperating
> > with the maintainer of the conflicting tree to minimise any particularly
> > complex conflicts.
> 
> Thanks.
> 
> That conflict seems entirely trivial, but Catalin/Will if it bothers you
> I have the conflicting commit in a topic branch based on rc2 which you
> could merge to resolve it:

It's a trivial conflict, easy to resolve. I don't think it's worth
trying to avoid it (Linus normally doesn't mind such conflicts).

-- 
Catalin


Re: [PATCH RFC 0/5] ARM: Raspberry Pi 4 DMA support

2019-10-14 Thread Catalin Marinas
On Mon, Oct 14, 2019 at 08:31:02PM +0200, Nicolas Saenz Julienne wrote:
> the Raspberry Pi 4 offers up to 4GB of memory, of which only the first
> GB is DMA-capable device-wide. This forces us to use bounce buffers,
> which are currently not very well supported by ARM's custom DMA ops.
> Among other things, the current mechanism (see dmabounce.c) isn't
> suitable for high memory. Instead of fixing it, this series introduces a
> way of selecting dma-direct as the default DMA ops provider, which allows
> the Raspberry Pi to make use of swiotlb.

I presume these patches go on top of this series:

http://lkml.kernel.org/r/20190911182546.17094-1-nsaenzjulie...@suse.de

which I queued here:

https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/zone-dma

-- 
Catalin


Re: [PATCH RFC 0/5] ARM: Raspberry Pi 4 DMA support

2019-10-15 Thread Catalin Marinas
On Tue, Oct 15, 2019 at 09:48:22AM +0200, Nicolas Saenz Julienne wrote:
> A little off topic but I was wondering if you have a preferred way to refer to
> the arm architecture in a way that unambiguously excludes arm64 (for 
> example
> arm32 would work).

arm32 should be fine. Neither arm64 nor arm32 are officially endorsed
ARM Ltd names (officially the exception model is AArch32 while the
instruction set is one of A32/T32/T16).

-- 
Catalin


Re: [PATCH v1 0/8] Fix set_huge_pte_at() panic on arm64

2023-09-21 Thread Catalin Marinas
On Thu, Sep 21, 2023 at 05:35:54PM +0100, Ryan Roberts wrote:
> On 21/09/2023 17:30, Andrew Morton wrote:
> > On Thu, 21 Sep 2023 17:19:59 +0100 Ryan Roberts  
> > wrote:
> >> Ryan Roberts (8):
> >>   parisc: hugetlb: Convert set_huge_pte_at() to take vma
> >>   powerpc: hugetlb: Convert set_huge_pte_at() to take vma
> >>   riscv: hugetlb: Convert set_huge_pte_at() to take vma
> >>   s390: hugetlb: Convert set_huge_pte_at() to take vma
> >>   sparc: hugetlb: Convert set_huge_pte_at() to take vma
> >>   mm: hugetlb: Convert set_huge_pte_at() to take vma
> >>   arm64: hugetlb: Convert set_huge_pte_at() to take vma
> >>   arm64: hugetlb: Fix set_huge_pte_at() to work with all swap entries
> >>
> >>  arch/arm64/include/asm/hugetlb.h  |  2 +-
> >>  arch/arm64/mm/hugetlbpage.c   | 22 --
> >>  arch/parisc/include/asm/hugetlb.h |  2 +-
> >>  arch/parisc/mm/hugetlbpage.c  |  4 +--
> >>  .../include/asm/nohash/32/hugetlb-8xx.h   |  3 +-
> >>  arch/powerpc/mm/book3s64/hugetlbpage.c|  2 +-
> >>  arch/powerpc/mm/book3s64/radix_hugetlbpage.c  |  2 +-
> >>  arch/powerpc/mm/nohash/8xx.c  |  2 +-
> >>  arch/powerpc/mm/pgtable.c |  7 -
> >>  arch/riscv/include/asm/hugetlb.h  |  2 +-
> >>  arch/riscv/mm/hugetlbpage.c   |  3 +-
> >>  arch/s390/include/asm/hugetlb.h   |  8 +++--
> >>  arch/s390/mm/hugetlbpage.c|  8 -
> >>  arch/sparc/include/asm/hugetlb.h  |  8 +++--
> >>  arch/sparc/mm/hugetlbpage.c   |  8 -
> >>  include/asm-generic/hugetlb.h |  6 ++--
> >>  include/linux/hugetlb.h   |  6 ++--
> >>  mm/damon/vaddr.c  |  2 +-
> >>  mm/hugetlb.c  | 30 +--
> >>  mm/migrate.c  |  2 +-
> >>  mm/rmap.c | 10 +++
> >>  mm/vmalloc.c  |  5 +++-
> >>  22 files changed, 80 insertions(+), 64 deletions(-)
> > 
> > Looks scary but it's actually a fairly modest patchset.  It could
> > easily be all rolled into a single patch for ease of backporting. 
> > Maybe Greg has an opinion?
> 
> Yes, I thought about doing that; or perhaps 2 patches - one for the interface
> change across all arches and core code, and one for the actual bug fix?

I think this would make more sense, especially if we want to backport
it. The first patch would have no functional change, only an interface
change, followed by the arm64 fix.

-- 
Catalin


Re: [PATCH v2] arch: Reserve map_shadow_stack() syscall number for all architectures

2023-10-04 Thread Catalin Marinas
On Thu, Sep 14, 2023 at 06:58:03PM +, Sohil Mehta wrote:
> commit c35559f94ebc ("x86/shstk: Introduce map_shadow_stack syscall")
> recently added support for map_shadow_stack() but it is limited to x86
> only for now. There is a possibility that other architectures (namely,
> arm64 and RISC-V), that are implementing equivalent support for shadow
> stacks, might need to add support for it.
> 
> Independent of that, reserving arch-specific syscall numbers in the
> syscall tables of all architectures is good practice and would help
> avoid future conflicts. map_shadow_stack() is marked as a conditional
> syscall in sys_ni.c. Adding it to the syscall tables of other
> architectures is harmless and would return ENOSYS when exercised.
> 
> Note, map_shadow_stack() was assigned #453 during the merge process
> since #452 was taken by fchmodat2().
> 
> For Powerpc, map it to sys_ni_syscall() as is the norm for Powerpc
> syscall tables.
> 
> For Alpha, map_shadow_stack() takes up #563 as Alpha still diverges from
> the common syscall numbering system in the other architectures.
> 
> Link: 
> https://lore.kernel.org/lkml/20230515212255.ga562...@debug.ba.rivosinc.com/
> Link: 
> https://lore.kernel.org/lkml/b402b80b-a7c6-4ef0-b977-c0f5f582b...@sirena.org.uk/
> 
> Signed-off-by: Sohil Mehta 
> ---
> v2:
> - Skip syscall table changes to tools/. They will be handled separately by the
>   perf folks.
> - Map Powerpc to sys_ni_syscall (Rick Edgecombe)
> ---
>  arch/alpha/kernel/syscalls/syscall.tbl  | 1 +
>  arch/arm/tools/syscall.tbl  | 1 +
>  arch/arm64/include/asm/unistd.h     | 2 +-
>  arch/arm64/include/asm/unistd32.h   | 2 ++

For arm64 (compat):

Acked-by: Catalin Marinas 


Re: [PATCH v3 9/9] efi: move screen_info into efi init code

2023-10-10 Thread Catalin Marinas
On Mon, Oct 09, 2023 at 11:18:45PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> After the vga console no longer relies on global screen_info, there are
> only two remaining use cases:
> 
>  - on the x86 architecture, it is used for multiple boot methods
>(bzImage, EFI, Xen, kexec) to communicate the initial VGA or framebuffer
>settings to a number of device drivers.
> 
>  - on other architectures, it is only used as part of the EFI stub,
>and only for the three sysfb framebuffers (simpledrm, simplefb, efifb).
> 
> Remove the duplicate data structure definitions by moving it into the
> efi-init.c file that sets it up initially for the EFI case, leaving x86
> as an exception that retains its own definition for non-EFI boots.
> 
> The added #ifdefs here are optional; I added them to further limit the
> reach of screen_info to configurations that have at least one of the
> users enabled.
> 
> Reviewed-by: Ard Biesheuvel 
> Reviewed-by: Javier Martinez Canillas 
> Acked-by: Helge Deller 
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/arm/kernel/setup.c   |  4 
>  arch/arm64/kernel/efi.c   |  4 
>  arch/arm64/kernel/image-vars.h|  2 ++

It's more Ard's thing and he reviewed it already but if you need another
ack:

Acked-by: Catalin Marinas 


Re: get_user_pages() and EXEC_ONLY mapping.

2023-11-10 Thread Catalin Marinas
On Fri, Nov 10, 2023 at 08:19:23PM +0530, Aneesh Kumar K.V wrote:
> Some architectures can now support EXEC_ONLY mappings and I am wondering
> what get_user_pages() on those addresses should return. Earlier
> PROT_EXEC implied PROT_READ and pte_access_permitted() returned true for
> that. But arm64 does have this explicit comment that says
> 
>  /*
>  * p??_access_permitted() is true for valid user mappings (PTE_USER
>  * bit set, subject to the write permission check). For execute-only
>  * mappings, like PROT_EXEC with EPAN (both PTE_USER and PTE_UXN bits
>  * not set) must return false. PROT_NONE mappings do not have the
>  * PTE_VALID bit set.
>  */
> 
> Is that correct? We should be able to get struct page for PROT_EXEC
> mappings?

I don't remember why we ended up with this but, briefly looking at the
code, pte_access_permitted() is only used on the fast GUP path. On the slow
path, there is a check_vma_flags() call which returns -EFAULT if the vma
is not readable. So the pte_access_permitted() on the fast path matches
the semantics of the slow path.

If one wants the page structure, FOLL_FORCE ignores the read check (on
the slow path), though I think it still fails if VM_MAYREAD is not set.
Unless you have a real use-case where this is not sufficient, I'd leave
the behaviour as is on arm64 (and maybe update other architectures that
support exec-only to do the same).
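The slow-path logic referred to above is roughly the following
(simplified sketch, not the exact mm/gup.c code):

	/* inside check_vma_flags(): a plain read faults on an exec-only vma */
	if (!(vm_flags & VM_READ)) {
		if (!(gup_flags & FOLL_FORCE))
			return -EFAULT;
		/* even FOLL_FORCE needs the vma to be potentially readable */
		if (!(vm_flags & VM_MAYREAD))
			return -EFAULT;
	}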

-- 
Catalin


Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-06-29 Thread Catalin Marinas
On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote:
> From: Barry Song 
> 
> on x86, batched and deferred tlb shootdown has led to a 90%
> performance increase on tlb shootdown. on arm64, HW can do
> tlb shootdown without software IPI. But sync tlbi is still
> quite expensive.
[...]
>  .../features/vm/TLB/arch-support.txt  |  2 +-
>  arch/arm64/Kconfig|  1 +
>  arch/arm64/include/asm/tlbbatch.h | 12 
>  arch/arm64/include/asm/tlbflush.h | 33 -
>  arch/arm64/mm/flush.c | 69 +++
>  arch/x86/include/asm/tlbflush.h   |  5 +-
>  include/linux/mm_types_task.h |  4 +-
>  mm/rmap.c | 12 ++--

First of all, this patch needs to be split into some preparatory patches
introducing/renaming functions with no functional change for x86. Once
done, you can add the arm64-only changes.

Now, on the implementation, I had some comments on v7 but we didn't get
to a conclusion and the thread eventually died:

https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/

I know I said a command line argument is better than Kconfig or some
random number of CPUs heuristics but it would be even better if we don't
bother with any, just make this always on. Barry had some comments
around mprotect() being racy and that's why we have
flush_tlb_batched_pending() but I don't think it's needed (or, for
arm64, it can be a DSB since this patch issues the TLBIs but without the
DVM Sync). So we need to clarify this (see Barry's last email on the
above thread) before attempting new versions of this patchset. With
flush_tlb_batched_pending() removed (or DSB), I have a suspicion such
implementation would be faster on any SoC irrespective of the number of
CPUs.

-- 
Catalin


Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-06-29 Thread Catalin Marinas
On Thu, Jun 29, 2023 at 05:31:36PM +0100, Catalin Marinas wrote:
> On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote:
> > From: Barry Song 
> > 
> > on x86, batched and deferred tlb shootdown has led to a 90%
> > performance increase on tlb shootdown. on arm64, HW can do
> > tlb shootdown without software IPI. But sync tlbi is still
> > quite expensive.
> [...]
> >  .../features/vm/TLB/arch-support.txt  |  2 +-
> >  arch/arm64/Kconfig|  1 +
> >  arch/arm64/include/asm/tlbbatch.h | 12 
> >  arch/arm64/include/asm/tlbflush.h | 33 -
> >  arch/arm64/mm/flush.c | 69 +++
> >  arch/x86/include/asm/tlbflush.h   |  5 +-
> >  include/linux/mm_types_task.h |  4 +-
> >  mm/rmap.c | 12 ++--
> 
> First of all, this patch needs to be split in some preparatory patches
> introducing/renaming functions with no functional change for x86. Once
> done, you can add the arm64-only changes.
> 
> Now, on the implementation, I had some comments on v7 but we didn't get
> to a conclusion and the thread eventually died:
> 
> https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/
> 
> I know I said a command line argument is better than Kconfig or some
> random number of CPUs heuristics but it would be even better if we don't
> bother with any, just make this always on. Barry had some comments
> around mprotect() being racy and that's why we have
> flush_tlb_batched_pending() but I don't think it's needed (or, for
> arm64, it can be a DSB since this patch issues the TLBIs but without the
> DVM Sync). So we need to clarify this (see Barry's last email on the
> above thread) before attempting new versions of this patchset. With
> flush_tlb_batched_pending() removed (or DSB), I have a suspicion such
> implementation would be faster on any SoC irrespective of the number of
> CPUs.

I think I got the need for flush_tlb_batched_pending(). If
try_to_unmap() marks the pte !present and we have a pending TLBI,
change_pte_range() will skip the TLB maintenance altogether since it did
not change the pte. So we could be left with stale TLB entries after
mprotect() before TTU does the batch flushing.

We can have an arch-specific flush_tlb_batched_pending() that can be a
DSB only on arm64 and a full mm flush on x86.
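A sketch of that split (simplified; pending_tlb_flush() stands in for
the real pending-TLBI bookkeeping in flush_tlb_batched_pending()):

	/* generic code: defer the 'how' to the architecture */
	void flush_tlb_batched_pending(struct mm_struct *mm)
	{
		if (pending_tlb_flush(mm))	/* stand-in for the real check */
			arch_flush_tlb_batched_pending(mm);
	}

	/* arm64: the TLBIs have already been issued, only synchronise */
	static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
	{
		dsb(ish);
	}

	/* x86 would keep arch_flush_tlb_batched_pending() as a flush_tlb_mm() */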

-- 
Catalin


Re: [PATCH v10 4/4] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-07-16 Thread Catalin Marinas
On Mon, Jul 10, 2023 at 04:39:14PM +0800, Yicong Yang wrote:
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7856c3a3e35a..f0ce8208c57f 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -96,6 +96,7 @@ config ARM64
>   select ARCH_SUPPORTS_NUMA_BALANCING
>   select ARCH_SUPPORTS_PAGE_TABLE_CHECK
>   select ARCH_SUPPORTS_PER_VMA_LOCK
> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if EXPERT

I don't want EXPERT to turn on a feature that's not selectable by the
user. This would lead to different performance behaviour based on
EXPERT. Just select it unconditionally.

> diff --git a/arch/arm64/include/asm/tlbflush.h 
> b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..4bb9cec62e26 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -254,17 +254,23 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>   dsb(ish);
>  }
>  
> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> -  unsigned long uaddr)
> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> +unsigned long uaddr)
>  {
>   unsigned long addr;
>  
>   dsb(ishst);
> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> + addr = __TLBI_VADDR(uaddr, ASID(mm));
>   __tlbi(vale1is, addr);
>   __tlbi_user(vale1is, addr);
>  }
>  
> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +  unsigned long uaddr)
> +{
> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> +}
> +
>  static inline void flush_tlb_page(struct vm_area_struct *vma,
> unsigned long uaddr)
>  {
> @@ -272,6 +278,42 @@ static inline void flush_tlb_page(struct vm_area_struct 
> *vma,
>   dsb(ish);
>  }
>  
> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH

If it's selected unconditionally, we won't need this #ifdef here.

> +
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
> + /*
> +  * TLB flush deferral is not required on systems, which are affected 
> with

"affected by" and drop the comma before "which".

-- 
Catalin


Re: [PATCH v11 1/4] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

2023-07-21 Thread Catalin Marinas
On Mon, Jul 17, 2023 at 09:10:01PM +0800, Yicong Yang wrote:
> From: Anshuman Khandual 
> 
> The entire scheme of deferred TLB flush in reclaim path rests on the
> fact that the cost to refill TLB entries is less than flushing out
> individual entries by sending IPIs to remote CPUs. But architectures
> can have different ways to evaluate that. Hence, apart from checking
> TTU_BATCH_FLUSH in the TTU flags, the rest of the decision should be
> architecture specific.
> 
> Signed-off-by: Anshuman Khandual 
> [https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khand...@linux.vnet.ibm.com/]
> Signed-off-by: Yicong Yang 
> [Rebase and fix incorrect return value type]
> Reviewed-by: Kefeng Wang 
> Reviewed-by: Anshuman Khandual 
> Reviewed-by: Barry Song 
> Reviewed-by: Xin Hao 
> Tested-by: Punit Agrawal 

Reviewed-by: Catalin Marinas 


Re: [PATCH v11 2/4] mm/tlbbatch: Rename and extend some functions

2023-07-21 Thread Catalin Marinas
On Mon, Jul 17, 2023 at 09:10:02PM +0800, Yicong Yang wrote:
> From: Barry Song 
> 
> This patch does some preparatory work to extend batched TLB flush to
> arm64, including:
> - Extend set_tlb_ubc_flush_pending() and arch_tlbbatch_add_mm()
>   to accept an additional argument for the address; architectures
>   like arm64 may need this for the tlbi.
> - Rename arch_tlbbatch_add_mm() to arch_tlbbatch_add_pending()
>   to match its current function, since we don't need to handle
>   the mm on architectures like arm64 and add_mm is not a proper
>   name; add_pending makes sense for both, as on x86 we're pending
>   the TLB flush operations while on arm64 we're pending the
>   synchronisation operations.
> 
> This intends no functional changes on x86.
> 
> Cc: Anshuman Khandual 
> Cc: Jonathan Corbet 
> Cc: Nadav Amit 
> Cc: Mel Gorman 
> Tested-by: Yicong Yang 
> Tested-by: Xin Hao 
> Tested-by: Punit Agrawal 
> Signed-off-by: Barry Song 
> Signed-off-by: Yicong Yang 
> Reviewed-by: Kefeng Wang 
> Reviewed-by: Xin Hao 
> Reviewed-by: Anshuman Khandual 

Reviewed-by: Catalin Marinas 


Re: [PATCH v11 3/4] mm/tlbbatch: Introduce arch_flush_tlb_batched_pending()

2023-07-21 Thread Catalin Marinas
On Mon, Jul 17, 2023 at 09:10:03PM +0800, Yicong Yang wrote:
> From: Yicong Yang 
> 
> Currently we'll flush the mm in flush_tlb_batched_pending() to
> avoid a race between reclaim, which unmaps pages via batched TLB
> flush, and mprotect/munmap/etc. Other architectures like arm64 may
> only need a synchronisation barrier (dsb) here rather than
> a full mm flush. So add arch_flush_tlb_batched_pending() to
> allow an arch-specific implementation here. This intends no
> functional change on x86, which still does a full mm flush.
> 
> Signed-off-by: Yicong Yang 

Reviewed-by: Catalin Marinas 


Re: [PATCH v11 4/4] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-07-21 Thread Catalin Marinas
On Mon, Jul 17, 2023 at 09:10:04PM +0800, Yicong Yang wrote:
> +static inline void arch_tlbbatch_add_pending(struct 
> arch_tlbflush_unmap_batch *batch,
> +  struct mm_struct *mm,
> +  unsigned long uaddr)
> +{
> + __flush_tlb_page_nosync(mm, uaddr);
> +}
> +
> +static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
> +{
> + dsb(ish);
> +}
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch 
> *batch)
> +{
> + dsb(ish);
> +}

Nitpick: as an additional patch, I'd add some comment for these two
functions that the TLBI has already been issued and only a DSB is needed
to synchronise its effect on the other CPUs.
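Something along these lines (the wording is only illustrative):

	static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
	{
		/*
		 * The TLBIs were already issued in arch_tlbbatch_add_pending();
		 * the DSB only waits for them to complete on the other CPUs.
		 */
		dsb(ish);
	}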

Reviewed-by: Catalin Marinas 


Re: [PATCH v3 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs

2023-07-21 Thread Catalin Marinas
On Thu, Jul 20, 2023 at 06:39:25PM +1000, Alistair Popple wrote:
> diff --git a/arch/arm64/include/asm/tlbflush.h 
> b/arch/arm64/include/asm/tlbflush.h
> index 3456866..a99349d 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -252,6 +253,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>   __tlbi(aside1is, asid);
>   __tlbi_user(aside1is, asid);
>   dsb(ish);
> + mmu_notifier_invalidate_range(mm, 0, -1UL);
>  }
>  
>  static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> @@ -263,6 +265,8 @@ static inline void __flush_tlb_page_nosync(struct 
> mm_struct *mm,
>   addr = __TLBI_VADDR(uaddr, ASID(mm));
>   __tlbi(vale1is, addr);
>   __tlbi_user(vale1is, addr);
> + mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
> + (uaddr & PAGE_MASK) + 
> PAGE_SIZE);

Nitpick: we have PAGE_ALIGN() for this.
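One equivalent spelling (since uaddr's page offset is below PAGE_SIZE,
PAGE_ALIGN(uaddr + 1) equals (uaddr & PAGE_MASK) + PAGE_SIZE):

	mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
				      PAGE_ALIGN(uaddr + 1));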

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v3 5/5] mmu_notifiers: Rename invalidate_range notifier

2023-07-21 Thread Catalin Marinas
On Thu, Jul 20, 2023 at 06:39:27PM +1000, Alistair Popple wrote:
> diff --git a/arch/arm64/include/asm/tlbflush.h 
> b/arch/arm64/include/asm/tlbflush.h
> index a99349d..84a05a0 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -253,7 +253,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>   __tlbi(aside1is, asid);
>   __tlbi_user(aside1is, asid);
>   dsb(ish);
> - mmu_notifier_invalidate_range(mm, 0, -1UL);
> + mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
>  }
>  
>  static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> @@ -265,7 +265,7 @@ static inline void __flush_tlb_page_nosync(struct 
> mm_struct *mm,
>   addr = __TLBI_VADDR(uaddr, ASID(mm));
>   __tlbi(vale1is, addr);
>   __tlbi_user(vale1is, addr);
> - mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
> + mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
>   (uaddr & PAGE_MASK) + 
> PAGE_SIZE);
>  }
>  
> @@ -400,7 +400,7 @@ static inline void __flush_tlb_range(struct 
> vm_area_struct *vma,
>   scale++;
>   }
>   dsb(ish);
> - mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
>  }

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH 00/21] dma-mapping: unify support for cache flushes

2023-03-31 Thread Catalin Marinas
On Mon, Mar 27, 2023 at 02:12:56PM +0200, Arnd Bergmann wrote:
> Another difference that I do not address here is what cache invalidation
> does for partial cache lines. On arm32, arm64 and powerpc, a partial
> cache line always gets written back before invalidation in order to
> ensure that data before or after the buffer is not discarded. On all
> other architectures, the assumption is cache lines are never shared
> between DMA buffer and data that is accessed by the CPU.

I don't think sharing the DMA buffer with other data is safe even with
this clean+invalidate on the unaligned cache lines. Mapping the DMA
buffer as FROM_DEVICE or BIDIRECTIONAL can cause the shared cache line
to be evicted and overwrite the device-written data. This sharing only
works if
the CPU guarantees not to dirty the corresponding cache line.

I'm fine with removing this partial cache line hack from arm64 as it's
not safe anyway. We'll see if any driver stops working. If there's some
benign sharing (I wouldn't trust it), the cache cleaning prior to
mapping and invalidate on unmap would not lose any data.

-- 
Catalin


Re: [PATCH 18/21] ARM: drop SMP support for ARM11MPCore

2023-03-31 Thread Catalin Marinas
On Mon, Mar 27, 2023 at 02:13:14PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The cache management operations for noncoherent DMA on ARMv6 work
> in two different ways:
> 
>  * When CONFIG_DMA_CACHE_RWFO is set, speculative prefetches on in-flight
>DMA buffers lead to data corruption when the prefetched data is written
>back on top of data from the device.
> 
>  * When CONFIG_DMA_CACHE_RWFO is disabled, a cache flush on one CPU
>is not seen by the other core(s), leading to inconsistent contents
>across the system.
> 
> As a consequence, neither configuration is actually safe to use in a
> general-purpose kernel that is used on both MPCore systems and ARM1176
> with prefetching enabled.

As the author of this terrible hack (created under duress ;))

Acked-by: Catalin Marinas 

IIRC, RWFO works in combination with the cache operations. Because
the cache maintenance was not broadcast, we forced the cache
lines to migrate to a CPU via a write (for ownership) and did the
cache maintenance on that CPU (that was the FROM_DEVICE case). For the
TO_DEVICE case, reading on a CPU would cause dirty lines on another CPU
to be evicted (or migrated as dirty to the current CPU, IIRC), then the
cache maintenance would clean them to PoC on the local CPU.

But there's always a small window between read/write for ownership and
the actual cache maintenance which can cause a cache line to migrate to
other CPUs if they do speculative prefetches. At the time ARM11MPCore
was deemed safe-ish but I haven't followed what later implementations
actually did (luckily we fixed the architecture in ARMv7).
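In heavily simplified C, the FROM_DEVICE (invalidate) pattern described
above would look roughly like this; the real implementation is assembly
in arch/arm/mm/cache-v6.S and dcache_inv_line() is a hypothetical
stand-in for the per-line invalidate:

	for (addr = start; addr < end; addr += L1_CACHE_BYTES) {
		/* read+write for ownership migrates the line to this CPU */
		unsigned long tmp = READ_ONCE(*(unsigned long *)addr);
		WRITE_ONCE(*(unsigned long *)addr, tmp);
		/* the local invalidate now hits the migrated line */
		dcache_inv_line(addr);
	}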

-- 
Catalin


Re: [PATCH v3 02/14] arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER

2023-04-12 Thread Catalin Marinas
On Tue, Apr 04, 2023 at 06:50:01AM -0500, Justin Forbes wrote:
> On Tue, Apr 4, 2023 at 2:22 AM Mike Rapoport  wrote:
> > On Wed, Mar 29, 2023 at 10:55:37AM -0500, Justin Forbes wrote:
> > > On Sat, Mar 25, 2023 at 1:09 AM Mike Rapoport  wrote:
> > > >
> > > > From: "Mike Rapoport (IBM)" 
> > > >
> > > > It is not a good idea to change fundamental parameters of core memory
> > > > management. Having predefined ranges suggests that the values within
> > > > those ranges are sensible, but one has to *really* understand
> > > > implications of changing MAX_ORDER before actually amending it and
> > > > ranges don't help here.
> > > >
> > > > Drop ranges in definition of ARCH_FORCE_MAX_ORDER and make its prompt
> > > > visible only if EXPERT=y
> > >
> > > I do not like suddenly hiding this behind EXPERT for a couple of
> > > reasons.  Most importantly, it will silently change the config for
> > > users building with an old kernel config.  If a user has for instance
> > > "13" set and building with 4K pages, as is the current configuration
> > > for Fedora and RHEL aarch64 builds, an oldconfig build will now set it
> > > to 10 with no indication that it is doing so.  And while I think that
> > > 10 is a fine default for many aarch64 users, there are valid reasons
> > > for choosing other values. Putting this behind expert makes it much
> > > less obvious that this is an option.
> >
> > That's the idea of EXPERT, no?
> >
> > This option was intended to allow allocation of huge pages for
> > architectures that had PMD_ORDER > MAX_ORDER and not to allow the user
> > to select the size of the maximal physically contiguous allocation.
> >
> > Changes to MAX_ORDER fundamentally change the behaviour of core mm and
> > unless users *really* know what they are doing there is no reason to choose
> > non-default values so hiding this option behind EXPERT seems totally
> > appropriate to me.
> 
> It sounds nice in theory. In practice, EXPERT hides too much. When you
> flip expert, you expose over 175ish new config options which are
> hidden behind EXPERT.  You don't have to know what you are doing just
> with the MAX_ORDER, but a whole bunch more as well.  If everyone were
> already running 10, this might be less of a problem. At least Fedora
> and RHEL are running 13 for 4K pages on aarch64. This was not some
> accidental choice, we had to carry a patch to even allow it for a
> while.  If this does go in as is, we will likely just carry a patch to
> remove the "if EXPERT", but that is a bit of a disservice to users who
> might be trying to debug something else upstream, bisecting upstream
> kernels or testing a patch.  In those cases, people tend to use
> pristine upstream sources without distro patches to verify, and they
> tend to use their existing configs. With this change, their MAX_ORDER
> will drop to 10 from 13 silently.   That can look like a different
> issue enough to ruin a bisect or have them give bad feedback on a
> patch because it introduces a "regression" which is not a regression
> at all, but a config change they couldn't see.

If we remove EXPERT (as prior to this patch), I'd rather keep the ranges
and avoid having to explain to people why some random MAX_ORDER doesn't
build (keeping the range would also make sense for randconfig, not sure
we got to any conclusion there).

-- 
Catalin


Re: [PATCH v3 02/14] arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER

2023-04-19 Thread Catalin Marinas
On Tue, Apr 18, 2023 at 03:05:57PM -0700, Andrew Morton wrote:
> On Wed, 12 Apr 2023 18:27:08 +0100 Catalin Marinas  
> wrote:
> > > It sounds nice in theory. In practice, EXPERT hides too much. When you
> > > flip expert, you expose over 175ish new config options which are
> > > hidden behind EXPERT.  You don't have to know what you are doing just
> > > with the MAX_ORDER, but a whole bunch more as well.  If everyone were
> > > already running 10, this might be less of a problem. At least Fedora
> > > and RHEL are running 13 for 4K pages on aarch64. This was not some
> > > accidental choice, we had to carry a patch to even allow it for a
> > > while.  If this does go in as is, we will likely just carry a patch to
> > > remove the "if EXPERT", but that is a bit of a disservice to users who
> > > might be trying to debug something else upstream, bisecting upstream
> > > kernels or testing a patch.  In those cases, people tend to use
> > > pristine upstream sources without distro patches to verify, and they
> > > tend to use their existing configs. With this change, their MAX_ORDER
> > > will drop to 10 from 13 silently.   That can look like a different
> > > issue enough to ruin a bisect or have them give bad feedback on a
> > > patch because it introduces a "regression" which is not a regression
> > > at all, but a config change they couldn't see.
> > 
> > If we remove EXPERT (as prior to this patch), I'd rather keep the ranges
> > and avoid having to explain to people why some random MAX_ORDER doesn't
> > build (keeping the range would also make sense for randconfig, not sure
> > we got to any conclusion there).
> 
> Well this doesn't seem to have got anywhere.  I think I'll send the
> patchset into Linus for the next merge window as-is.  Please let's take
> a look at this Kconfig presentation issue during the following -rc
> cycle.

That's fine by me. I have a slight preference to drop EXPERT and keep
the ranges in, especially if it affects current distro kernels. Debian
seems to enable EXPERT already in their arm64 kernel config but I'm not
sure about the Fedora or other distro kernels. If they don't, we can
fix/revert this Kconfig entry once the merging window is closed.

-- 
Catalin


Re: [PATCH v3 02/14] arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER

2023-04-27 Thread Catalin Marinas
On Tue, Apr 25, 2023 at 11:09:58AM -0500, Justin Forbes wrote:
> On Tue, Apr 18, 2023 at 5:22 PM Andrew Morton  
> wrote:
> > On Wed, 12 Apr 2023 18:27:08 +0100 Catalin Marinas 
> >  wrote:
> > > > It sounds nice in theory. In practice, EXPERT hides too much. When you
> > > > flip expert, you expose over 175ish new config options which are
> > > > hidden behind EXPERT.  You don't have to know what you are doing just
> > > > with the MAX_ORDER, but a whole bunch more as well.  If everyone were
> > > > already running 10, this might be less of a problem. At least Fedora
> > > > and RHEL are running 13 for 4K pages on aarch64. This was not some
> > > > accidental choice, we had to carry a patch to even allow it for a
> > > > while.  If this does go in as is, we will likely just carry a patch to
> > > > remove the "if EXPERT", but that is a bit of a disservice to users who
> > > > might be trying to debug something else upstream, bisecting upstream
> > > > kernels or testing a patch.  In those cases, people tend to use
> > > > pristine upstream sources without distro patches to verify, and they
> > > > tend to use their existing configs. With this change, their MAX_ORDER
> > > > will drop to 10 from 13 silently.   That can look like a different
> > > > issue enough to ruin a bisect or have them give bad feedback on a
> > > > patch because it introduces a "regression" which is not a regression
> > > > at all, but a config change they couldn't see.
> > >
> > > If we remove EXPERT (as prior to this patch), I'd rather keep the ranges
> > > and avoid having to explain to people why some random MAX_ORDER doesn't
> > > build (keeping the range would also make sense for randconfig, not sure
> > > we got to any conclusion there).
> >
> > Well this doesn't seem to have got anywhere.  I think I'll send the
> > patchset into Linus for the next merge window as-is.  Please let's take
> > a look at this Kconfig presentation issue during the following -rc
> > cycle.
> 
> Well, I am very sorry to see this going in as is.  It will silently
> change the config for people building with oldconfig, and anyone not paying attention
> will not notice until an issue is hit where "it worked before, and my
> config hasn't changed".  If EXPERT is unset, there is no notification,
> just a changed behavior.  While it would be easy for me to carry a
> patch dropping the if EXPERT, it will not help any users building on
> upstream with our configs, whether for their own regular use, or while
> trying to debug other issues,  I expect it will result in a reasonable
> amount of frustration from users trying to do the right thing and
> bisect or test patches upstream.

As I said in a previous reply, I'm fine with reverting this commit if it
breaks existing configs. It's only that Andrew had already queued it in
his tree but we have time until the final 6.4 kernel is released.

That said, would you mind sending a patch reverting it (if removing
EXPERT, I'd like to keep the ranges)? ;)

Thanks.

-- 
Catalin


Re: [PATCH 13/14] thread_info: move function declarations to linux/thread_info.h

2023-05-24 Thread Catalin Marinas
On Wed, May 17, 2023 at 03:11:01PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> There are a few __weak functions in kernel/fork.c, which architectures
> can override. If there is no prototype, the compiler warns about them:
> 
> kernel/fork.c:164:13: error: no previous prototype for 'arch_release_task_struct' [-Werror=missing-prototypes]
> kernel/fork.c:991:20: error: no previous prototype for 'arch_task_cache_init' [-Werror=missing-prototypes]
> kernel/fork.c:1086:12: error: no previous prototype for 'arch_dup_task_struct' [-Werror=missing-prototypes]
> 
> There are already prototypes in a number of architecture specific headers
> that have addressed those warnings before, but it's much better to have
> these in a single place so the warning no longer shows up anywhere.
> 
> Signed-off-by: Arnd Bergmann 

For arm64:

Acked-by: Catalin Marinas 
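
For reference, the consolidated declarations amount to something like
this (a sketch; exact placement within linux/thread_info.h is per the
patch, and the names are the ones from the warnings above):

/* shared prototypes for the __weak hooks in kernel/fork.c */
struct task_struct;

void arch_task_cache_init(void);
int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
void arch_release_task_struct(struct task_struct *tsk);

Any architecture overriding one of these now picks up the prototype
automatically instead of declaring it in its own asm header.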


Re: [PATCH 02/23] arm64: allow pte_offset_map() to fail

2023-05-25 Thread Catalin Marinas
On Tue, May 09, 2023 at 09:43:47PM -0700, Hugh Dickins wrote:
> In rare transient cases, not yet made possible, pte_offset_map() and
> pte_offset_map_lock() may not find a page table: handle appropriately.
> 
> Signed-off-by: Hugh Dickins 

Acked-by: Catalin Marinas 
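
The caller-side pattern being introduced is, schematically (a sketch of
the idiom, not the literal arm64 hunk):

	pte_t *ptep = pte_offset_map(pmd, addr);

	if (!ptep)
		return;		/* page table gone under us: no mapping */
	/* ... examine READ_ONCE(*ptep) ... */
	pte_unmap(ptep);

i.e. a NULL return now means the page table was removed (or is being
removed) concurrently, and the caller bails out or retries.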


Re: [PATCH 03/23] arm64/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-05-25 Thread Catalin Marinas
On Tue, May 09, 2023 at 09:45:57PM -0700, Hugh Dickins wrote:
> pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
> that: to keep balance in future, use the recently added pte_alloc_huge()
> instead; with pte_offset_huge() a better name for pte_offset_kernel().
> 
> Signed-off-by: Hugh Dickins 

Acked-by: Catalin Marinas 
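
For reference, the new helpers are roughly (a sketch based on the
series' generic definitions):

static inline pte_t *pte_offset_huge(pmd_t *pmd, unsigned long addr)
{
	return pte_offset_kernel(pmd, addr);
}

static inline pte_t *pte_alloc_huge(struct mm_struct *mm, pmd_t *pmd,
				    unsigned long addr)
{
	/* unlike pte_alloc_map(), no pte_unmap() is owed afterwards */
	return pte_alloc(mm, pmd) ? NULL : pte_offset_huge(pmd, addr);
}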


Re: [PATCH] irq_work: consolidate arch_irq_work_raise prototypes

2023-05-25 Thread Catalin Marinas
On Tue, May 16, 2023 at 10:02:31PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The prototype was hidden on x86, which causes a warning:
> 
> kernel/irq_work.c:72:13: error: no previous prototype for 'arch_irq_work_raise' [-Werror=missing-prototypes]
> 
> Fix this by providing it in only one place that is always visible.
> 
> Signed-off-by: Arnd Bergmann 

Acked-by: Catalin Marinas 


Re: linux-next: build failure after merge of the mm tree

2023-06-13 Thread Catalin Marinas
Hi Stephen,

On Tue, Jun 13, 2023 at 04:21:19PM +1000, Stephen Rothwell wrote:
> After merging the mm tree, today's linux-next build (powerpc
> ppc44x_defconfig) failed like this:
> 
> In file included from arch/powerpc/include/asm/page.h:247,
>  from arch/powerpc/include/asm/thread_info.h:13,
>  from include/linux/thread_info.h:60,
>  from include/asm-generic/preempt.h:5,
>  from ./arch/powerpc/include/generated/asm/preempt.h:1,
>  from include/linux/preempt.h:78,
>  from include/linux/spinlock.h:56,
>  from include/linux/ipc.h:5,
>  from include/uapi/linux/sem.h:5,
>  from include/linux/sem.h:5,
>  from include/linux/compat.h:14,
>  from arch/powerpc/kernel/asm-offsets.c:12:
> arch/powerpc/include/asm/page_32.h:16: warning: "ARCH_DMA_MINALIGN" redefined
>16 | #define ARCH_DMA_MINALIGN   L1_CACHE_BYTES
>   | 
> In file included from include/linux/time.h:5,
>  from include/linux/compat.h:10:
> include/linux/cache.h:104: note: this is the location of the previous 
> definition
>   104 | #define ARCH_DMA_MINALIGN __alignof__(unsigned long long)
>   | 
> 
> (lots of these)
> 
> Caused by commit
> 
>   cc7335787e73 ("mm/slab: decouple ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN")
> 
> I have applied the following hack for today - we need something better.

I just posted this series fixing it for powerpc, microblaze and sh. I
did not add the #ifndef __powerpc64__ line since
CONFIG_NOT_COHERENT_CACHE should not be enabled for those builds.

https://lore.kernel.org/r/20230613155245.1228274-1-catalin.mari...@arm.com

> From: Stephen Rothwell 
> Date: Tue, 13 Jun 2023 16:07:16 +1000
> Subject: [PATCH] fix up for "mm/slab: decouple ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN"
> 
> Signed-off-by: Stephen Rothwell 
> ---
>  arch/powerpc/include/asm/cache.h | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/cache.h b/arch/powerpc/include/asm/cache.h
> index ae0a68a838e8..e9be1396dfd1 100644
> --- a/arch/powerpc/include/asm/cache.h
> +++ b/arch/powerpc/include/asm/cache.h
> @@ -142,5 +142,14 @@ static inline void iccci(void *addr)
>  }
>  
>  #endif /* !__ASSEMBLY__ */
> +
> +#ifndef __powerpc64__
> +#ifdef CONFIG_NOT_COHERENT_CACHE
> +#ifndef ARCH_DMA_MINALIGN
> +#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
> +#endif
> +#endif
> +#endif
> +
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_POWERPC_CACHE_H */

I think it should also remove the ARCH_DMA_MINALIGN from asm/page.h (as
I did in my series; sorry I did not cc you, only noticed now that you
reported it as well).
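
For context, the clash comes from the generic fallback; the intended
arrangement is roughly this (a sketch, guard structure approximate):

/* include/linux/cache.h */
#include <asm/cache.h>		/* arch may define ARCH_DMA_MINALIGN here */

#ifndef ARCH_DMA_MINALIGN
#define ARCH_DMA_MINALIGN	__alignof__(unsigned long long)
#endif

With the arch definition in asm/cache.h, linux/cache.h sees it before
falling back. With it in asm/page.h, the fallback is picked first and
the later page.h definition triggers the redefinition warnings above.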

-- 
Catalin


[PATCH 0/3] Move the ARCH_DMA_MINALIGN definition to asm/cache.h

2023-06-13 Thread Catalin Marinas
Hi,

The ARCH_KMALLOC_MINALIGN reduction series defines a generic
ARCH_DMA_MINALIGN in linux/cache.h:

https://lore.kernel.org/r/20230612153201.554742-2-catalin.mari...@arm.com/

Unfortunately, this causes a duplicate definition warning for
microblaze, powerpc (32-bit only) and sh as these architectures define
ARCH_DMA_MINALIGN in a different file than asm/cache.h. Move the macro
to asm/cache.h to avoid this issue and also bring them in line with the
other architectures.

Andrew, if the arch maintainers cc'ed are fine with such change, could
you please take these three patches together with the
ARCH_KMALLOC_MINALIGN series?

Thank you.

Catalin Marinas (3):
  powerpc: Move the ARCH_DMA_MINALIGN definition to asm/cache.h
  microblaze: Move the ARCH_{DMA,SLAB}_MINALIGN definitions to
asm/cache.h
  sh: Move the ARCH_DMA_MINALIGN definition to asm/cache.h

 arch/microblaze/include/asm/cache.h | 5 +
 arch/microblaze/include/asm/page.h  | 5 -
 arch/powerpc/include/asm/cache.h| 4 
 arch/powerpc/include/asm/page_32.h  | 4 
 arch/sh/include/asm/cache.h | 6 ++
 arch/sh/include/asm/page.h  | 6 --
 6 files changed, 15 insertions(+), 15 deletions(-)



[PATCH 1/3] powerpc: Move the ARCH_DMA_MINALIGN definition to asm/cache.h

2023-06-13 Thread Catalin Marinas
The powerpc architecture defines ARCH_DMA_MINALIGN in asm/page_32.h and
only if CONFIG_NOT_COHERENT_CACHE is enabled (32-bit platforms only).
Move this macro to asm/cache.h to allow a generic ARCH_DMA_MINALIGN
definition in linux/cache.h without redefine errors/warnings.

Signed-off-by: Catalin Marinas 
Reported-by: kernel test robot 
Closes: https://lore.kernel.org/oe-kbuild-all/202306131053.1ybvrrho-...@intel.com/
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/cache.h   | 4 
 arch/powerpc/include/asm/page_32.h | 4 
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/cache.h b/arch/powerpc/include/asm/cache.h
index ae0a68a838e8..69232231d270 100644
--- a/arch/powerpc/include/asm/cache.h
+++ b/arch/powerpc/include/asm/cache.h
@@ -33,6 +33,10 @@
 
 #define IFETCH_ALIGN_BYTES (1 << IFETCH_ALIGN_SHIFT)
 
+#ifdef CONFIG_NOT_COHERENT_CACHE
+#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
+#endif
+
 #if !defined(__ASSEMBLY__)
 #ifdef CONFIG_PPC64
 
diff --git a/arch/powerpc/include/asm/page_32.h b/arch/powerpc/include/asm/page_32.h
index 56f217606327..b9ac9e3a771c 100644
--- a/arch/powerpc/include/asm/page_32.h
+++ b/arch/powerpc/include/asm/page_32.h
@@ -12,10 +12,6 @@
 
 #define VM_DATA_DEFAULT_FLAGS  VM_DATA_DEFAULT_FLAGS32
 
-#ifdef CONFIG_NOT_COHERENT_CACHE
-#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
-#endif
-
 #if defined(CONFIG_PPC_256K_PAGES) || \
 (defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES))
 #define PTE_SHIFT  (PAGE_SHIFT - PTE_T_LOG2 - 2)   /* 1/4 of a page */


[PATCH 3/3] sh: Move the ARCH_DMA_MINALIGN definition to asm/cache.h

2023-06-13 Thread Catalin Marinas
The sh architecture defines ARCH_DMA_MINALIGN in asm/page.h. Move it to
asm/cache.h to allow a generic ARCH_DMA_MINALIGN definition in
linux/cache.h without redefine errors/warnings.

Signed-off-by: Catalin Marinas 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: John Paul Adrian Glaubitz 
Cc: linux...@vger.kernel.org
---
 arch/sh/include/asm/cache.h | 6 ++
 arch/sh/include/asm/page.h  | 6 --
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/sh/include/asm/cache.h b/arch/sh/include/asm/cache.h
index 32dfa6b82ec6..b38dbc975581 100644
--- a/arch/sh/include/asm/cache.h
+++ b/arch/sh/include/asm/cache.h
@@ -14,6 +14,12 @@
 
 #define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
 
+/*
+ * Some drivers need to perform DMA into kmalloc'ed buffers
+ * and so we have to increase the kmalloc minalign for this.
+ */
+#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
+
 #define __read_mostly __section(".data..read_mostly")
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sh/include/asm/page.h b/arch/sh/include/asm/page.h
index 09ac6c7faee0..62f4b9edcb98 100644
--- a/arch/sh/include/asm/page.h
+++ b/arch/sh/include/asm/page.h
@@ -174,10 +174,4 @@ typedef struct page *pgtable_t;
 #include 
 #include 
 
-/*
- * Some drivers need to perform DMA into kmalloc'ed buffers
- * and so we have to increase the kmalloc minalign for this.
- */
-#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
-
 #endif /* __ASM_SH_PAGE_H */


[PATCH 2/3] microblaze: Move the ARCH_{DMA,SLAB}_MINALIGN definitions to asm/cache.h

2023-06-13 Thread Catalin Marinas
The microblaze architecture defines ARCH_DMA_MINALIGN in asm/page.h.
Move it to asm/cache.h to allow a generic ARCH_DMA_MINALIGN definition
in linux/cache.h without redefine errors/warnings.

While at it, also move ARCH_SLAB_MINALIGN to asm/cache.h for
consistency.

Signed-off-by: Catalin Marinas 
Cc: Michal Simek 
---
 arch/microblaze/include/asm/cache.h | 5 +
 arch/microblaze/include/asm/page.h  | 5 -
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/microblaze/include/asm/cache.h b/arch/microblaze/include/asm/cache.h
index a149b3e711ec..1903988b9e23 100644
--- a/arch/microblaze/include/asm/cache.h
+++ b/arch/microblaze/include/asm/cache.h
@@ -18,4 +18,9 @@
 
 #define SMP_CACHE_BYTES  L1_CACHE_BYTES
 
+/* MS be sure that SLAB allocates aligned objects */
+#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
+
+#define ARCH_SLAB_MINALIGN L1_CACHE_BYTES
+
 #endif /* _ASM_MICROBLAZE_CACHE_H */
diff --git a/arch/microblaze/include/asm/page.h b/arch/microblaze/include/asm/page.h
index 7b9861bcd458..337f23eabc71 100644
--- a/arch/microblaze/include/asm/page.h
+++ b/arch/microblaze/include/asm/page.h
@@ -30,11 +30,6 @@
 
 #ifndef __ASSEMBLY__
 
-/* MS be sure that SLAB allocates aligned objects */
-#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
-
-#define ARCH_SLAB_MINALIGN L1_CACHE_BYTES
-
 /*
  * PAGE_OFFSET -- the first address of the first page of memory. With MMU
  * it is set to the kernel start address (aligned on a page boundary).


Re: [PATCH 0/3] Move the ARCH_DMA_MINALIGN definition to asm/cache.h

2023-06-13 Thread Catalin Marinas
On Tue, Jun 13, 2023 at 04:42:40PM +, Christophe Leroy wrote:
> 
> 
> On 13/06/2023 at 17:52, Catalin Marinas wrote:
> > Hi,
> > 
> > The ARCH_KMALLOC_MINALIGN reduction series defines a generic
> > ARCH_DMA_MINALIGN in linux/cache.h:
> > 
> > https://lore.kernel.org/r/20230612153201.554742-2-catalin.mari...@arm.com/
> > 
> > Unfortunately, this causes a duplicate definition warning for
> > microblaze, powerpc (32-bit only) and sh as these architectures define
> > ARCH_DMA_MINALIGN in a different file than asm/cache.h. Move the macro
> > to asm/cache.h to avoid this issue and also bring them in line with the
> > other architectures.
> 
> What about mips ?
> 
> arch/mips/include/asm/mach-generic/kmalloc.h:#define ARCH_DMA_MINALIGN 128
> arch/mips/include/asm/mach-ip32/kmalloc.h:#define ARCH_DMA_MINALIGN   32
> arch/mips/include/asm/mach-ip32/kmalloc.h:#define ARCH_DMA_MINALIGN   128
> arch/mips/include/asm/mach-n64/kmalloc.h:#define ARCH_DMA_MINALIGN L1_CACHE_BYTES
> arch/mips/include/asm/mach-tx49xx/kmalloc.h:#define ARCH_DMA_MINALIGN L1_CACHE_BYTES

Sorry, I should have mentioned it in the cover letter (discussed here -
https://lore.kernel.org/r/zihpaixb%2f0ve7...@arm.com/). These kmalloc.h
files are included in asm/cache.h, based on which machine is enabled, so
there's no problem for mips. It makes more sense to keep them in those
mach-*/kmalloc.h files instead of having lots of #ifdefs in cache.h.
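
That is, roughly (a sketch of the usual MIPS arrangement):

/* arch/mips/include/asm/cache.h */
#include <kmalloc.h>	/* resolved via the per-platform include path,
			 * e.g. -I arch/mips/include/asm/mach-ip32 */

so each machine's kmalloc.h provides its own ARCH_DMA_MINALIGN and only
one of them is ever visible in a given build.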

-- 
Catalin


Re: [PATCH v4 21/34] arm64: Convert various functions to use ptdescs

2023-06-14 Thread Catalin Marinas
On Mon, Jun 12, 2023 at 02:04:10PM -0700, Vishal Moola (Oracle) wrote:
> As part of the conversions to replace pgtable constructor/destructors with
> ptdesc equivalents, convert various page table functions to use ptdescs.
> 
> Signed-off-by: Vishal Moola (Oracle) 

Acked-by: Catalin Marinas 


Re: [PATCH] kernel: exit: cleanup release_thread()

2022-08-21 Thread Catalin Marinas
On Fri, Aug 19, 2022 at 09:44:06AM +0800, Kefeng Wang wrote:
> Only x86 has its own release_thread(); introduce a new weak
> release_thread() function to clean up the empty definitions in
> the other arches.
> 
> Signed-off-by: Kefeng Wang 
[...]
>  arch/arm64/include/asm/processor.h  | 3 ---
>  arch/arm64/kernel/process.c | 4 ----

Acked-by: Catalin Marinas 


Re: [PATCH] mm: remove kern_addr_valid() completely

2022-11-02 Thread Catalin Marinas
On Tue, Oct 18, 2022 at 03:40:14PM +0800, Kefeng Wang wrote:
> Most architectures (except arm64/x86/sparc) simply return 1 for
> kern_addr_valid(), which is only used in read_kcore(), and that in
> turn calls copy_from_kernel_nofault(), which already checks whether
> the address is a valid kernel address. So there is no need for
> kern_addr_valid(); let's remove it completely.
> 
> Signed-off-by: Kefeng Wang 

For arm64:

Acked-by: Catalin Marinas 
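
Schematically, the check that makes kern_addr_valid() redundant is (a
sketch with illustrative names, not the read_kcore() locals):

	if (copy_from_kernel_nofault(buf, (void *)addr, len)) {
		/* faulted: the address wasn't a readable kernel address */
		memset(buf, 0, len);
	}

copy_from_kernel_nofault() returns -EFAULT on an unreadable address, so
the per-arch page table walk in kern_addr_valid() adds nothing.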


Re: [PATCH v6 04/18] arm64/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:51AM +, Ryan Roberts wrote:
> Core-mm needs to be able to advance the pfn by an arbitrary amount, so
> override the new pte_advance_pfn() API to do so.
> 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 
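
For reference, the arm64 override is essentially (a sketch):

#define pte_advance_pfn pte_advance_pfn
static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
{
	/* rebuild the pte at pfn + nr, keeping the original attributes */
	return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
}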


Re: [PATCH v6 07/18] arm64/mm: Convert READ_ONCE(*ptep) to ptep_get(ptep)

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:54AM +, Ryan Roberts wrote:
> There are a number of places in the arch code that read a pte by using
> the READ_ONCE() macro. Refactor these call sites to instead use the
> ptep_get() helper, which itself is a READ_ONCE(). Generated code should
> be the same.
> 
> This will benefit us when we shortly introduce the transparent contpte
> support. In this case, ptep_get() will become more complex so we now
> have all the code abstracted through it.
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 
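
For reference, the helper in question is just (a sketch):

#define ptep_get ptep_get
static inline pte_t ptep_get(pte_t *ptep)
{
	return READ_ONCE(*ptep);
}

so replacing open-coded READ_ONCE(*ptep) with ptep_get(ptep) is purely
mechanical today; the indirection only pays off once ptep_get() grows
the contpte handling.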


Re: [PATCH v6 08/18] arm64/mm: Convert set_pte_at() to set_ptes(..., 1)

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:55AM +, Ryan Roberts wrote:
> Since set_ptes() was introduced, set_pte_at() has been implemented as a
> generic macro around set_ptes(..., 1). So this change should continue to
> generate the same code. However, making this change prepares us for the
> transparent contpte support. It means we can reroute set_ptes() to
> __set_ptes(). Since set_pte_at() is a generic macro, there will be no
> equivalent __set_pte_at() to reroute to.
> 
> Note that a couple of calls to set_pte_at() remain in the arch code.
> This is intentional, since those call sites are acting on behalf of
> core-mm and should continue to call into the public set_ptes() rather
> than the arch-private __set_ptes().
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 
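
The generic macro being relied on is (a sketch of the core-mm
definition):

#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

which is why only set_ptes() needs an arch override and there is no
__set_pte_at() to reroute to.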


Re: [PATCH v6 09/18] arm64/mm: Convert ptep_clear() to ptep_get_and_clear()

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:56AM +, Ryan Roberts wrote:
> ptep_clear() is a generic wrapper around the arch-implemented
> ptep_get_and_clear(). We are about to convert ptep_get_and_clear() into
> a public version and private version (__ptep_get_and_clear()) to support
> the transparent contpte work. We won't have a private version of
> ptep_clear() so let's convert it to directly call ptep_get_and_clear().
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 
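
After the change, the generic wrapper reads roughly (a sketch):

static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
			      pte_t *ptep)
{
	ptep_get_and_clear(mm, addr, ptep);
}

i.e. the returned pte is simply discarded, so routing through
ptep_get_and_clear() changes no behaviour.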


Re: [PATCH v6 10/18] arm64/mm: New ptep layer to manage contig bit

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:57AM +, Ryan Roberts wrote:
> Create a new layer for the in-table PTE manipulation APIs. For now, The
> existing API is prefixed with double underscore to become the
> arch-private API and the public API is just a simple wrapper that calls
> the private API.
> 
> The public API implementation will subsequently be used to transparently
> manipulate the contiguous bit where appropriate. But since there are
> already some contig-aware users (e.g. hugetlb, kernel mapper), we must
> first ensure those users use the private API directly so that the future
> contig-bit manipulations in the public API do not interfere with those
> existing uses.
> 
> The following APIs are treated this way:
> 
>  - ptep_get
>  - set_pte
>  - set_ptes
>  - pte_clear
>  - ptep_get_and_clear
>  - ptep_test_and_clear_young
>  - ptep_clear_flush_young
>  - ptep_set_wrprotect
>  - ptep_set_access_flags
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 
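
The pattern, shown for one of the listed APIs (a sketch; the
__set_ptes() body is the pre-existing arm64 implementation):

static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
			    pte_t *ptep, pte_t pte, unsigned int nr)
{
	/* public wrapper: contpte awareness hooks in here later */
	__set_ptes(mm, addr, ptep, pte, nr);
}

Contig-aware users such as hugetlb call __set_ptes() directly and stay
below the wrapper.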


Re: [PATCH v6 11/18] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:58AM +, Ryan Roberts wrote:
> Split __flush_tlb_range() into __flush_tlb_range_nosync() +
> __flush_tlb_range(), in the same way as the existing flush_tlb_page()
> arrangement. This allows calling __flush_tlb_range_nosync() to elide the
> trailing DSB. Forthcoming "contpte" code will take advantage of this
> when clearing the young bit from a contiguous range of ptes.
> 
> Ordering between dsb and mmu_notifier_arch_invalidate_secondary_tlbs()
> has changed, but now aligns with the ordering of __flush_tlb_page(). It
> has been discussed that __flush_tlb_page() may be wrong though.
> Regardless, both will be resolved separately if needed.
> 
> Reviewed-by: David Hildenbrand 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 
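
That is, roughly (a sketch; same argument list as the existing arm64
helper):

static inline void __flush_tlb_range(struct vm_area_struct *vma,
				     unsigned long start, unsigned long end,
				     unsigned long stride, bool last_level,
				     int tlb_level)
{
	__flush_tlb_range_nosync(vma, start, end, stride, last_level,
				 tlb_level);
	dsb(ish);	/* the trailing barrier callers may now elide */
}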


Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:59AM +, Ryan Roberts wrote:
>  arch/arm64/mm/contpte.c  | 285 +++

Nitpick: I think most symbols in contpte.c can be EXPORT_SYMBOL_GPL().
We don't expect them to be used by random out of tree modules. In fact,
do we expect them to end up in modules at all? Most seem to be called
from the core mm code.

> +#define ptep_get_lockless ptep_get_lockless
> +static inline pte_t ptep_get_lockless(pte_t *ptep)
> +{
> + pte_t pte = __ptep_get(ptep);
> +
> + if (likely(!pte_valid_cont(pte)))
> + return pte;
> +
> + return contpte_ptep_get_lockless(ptep);
> +}
[...]
> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> +{
> + /*
> +  * Gather access/dirty bits, which may be populated in any of the ptes
> +  * of the contig range. We may not be holding the PTL, so any contiguous
> +  * range may be unfolded/modified/refolded under our feet. Therefore we
> +  * ensure we read a _consistent_ contpte range by checking that all ptes
> +  * in the range are valid and have CONT_PTE set, that all pfns are
> +  * contiguous and that all pgprots are the same (ignoring access/dirty).
> +  * If we find a pte that is not consistent, then we must be racing with
> +  * an update so start again. If the target pte does not have CONT_PTE
> +  * set then that is considered consistent on its own because it is not
> +  * part of a contpte range.
> +*/

I can't get my head around this lockless API. Maybe it works fine (and
may have been discussed already) but we should document what the races
are, why it works, what the memory ordering requirements are. For
example, the generic (well, x86 PAE) ptep_get_lockless() only needs to
ensure that the low/high 32 bits of a pte are consistent and there are
some ordering rules on how these are updated.
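
Roughly, that flavour looks like this (a sketch of the low/high retry
loop; PAE pte field names assumed):

static inline pte_t ptep_get_lockless(pte_t *ptep)
{
	pte_t pte;

	do {
		pte.pte_low = ptep->pte_low;
		smp_rmb();
		pte.pte_high = ptep->pte_high;
		smp_rmb();
	} while (unlikely(pte.pte_low != ptep->pte_low));

	return pte;
}

The contpte loop is a much bigger consistency window than the two
halves of one word, hence the questions below.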

Does the arm64 implementation only need to be correct w.r.t. the
access/dirty bits? Since we can read orig_ptep atomically, I assume the
only other updates from unfolding would set the dirty/access bits.

> +
> + pgprot_t orig_prot;
> + unsigned long pfn;
> + pte_t orig_pte;
> + pgprot_t prot;
> + pte_t *ptep;
> + pte_t pte;
> + int i;
> +
> +retry:
> + orig_pte = __ptep_get(orig_ptep);
> +
> + if (!pte_valid_cont(orig_pte))
> + return orig_pte;
> +
> + orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
> + ptep = contpte_align_down(orig_ptep);
> + pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> + pte = __ptep_get(ptep);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));

We don't have any ordering guarantees in how the ptes in this range are
read or written in the contpte_set_ptes() and the fold/unfold functions.
We might not need them given all the other checks below but it's worth
adding a comment.

> +
> + if (!pte_valid_cont(pte) ||
> +pte_pfn(pte) != pfn ||
> +pgprot_val(prot) != pgprot_val(orig_prot))
> + goto retry;

I think this also needs some comment. I get the !pte_valid_cont() check
to attempt retrying when racing with unfolding. Are the other checks
needed to detect re-folding with different protection or pfn?

> +
> + if (pte_dirty(pte))
> + orig_pte = pte_mkdirty(orig_pte);
> +
> + if (pte_young(pte))
> + orig_pte = pte_mkyoung(orig_pte);
> + }

After writing the comments above, I think I figured out that the whole
point of this loop is to check that the ptes in the contig range are
still consistent and the only variation allowed is the dirty/young
state to be passed to the orig_pte returned. The original pte may have
been updated by the time this loop finishes but I don't think it
matters, it wouldn't be any different than reading a single pte and
returning it while it is being updated.

If you can make this easier to parse (in a few years time) with an
additional patch adding some more comments, that would be great. For
this patch:

Reviewed-by: Catalin Marinas 

-- 
Catalin


Re: [PATCH v6 13/18] arm64/mm: Implement new wrprotect_ptes() batch API

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:32:00AM +, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the fork performance
> regression introduced by the initial contpte commit. Subsequent patches
> will solve it entirely.
> 
> During fork(), any private memory in the parent must be write-protected.
> Previously this was done 1 PTE at a time. But the core-mm supports
> batched wrprotect via the new wrprotect_ptes() API. So let's implement
> that API and for fully covered contpte mappings, we no longer need to
> unfold the contpte. This has 2 benefits:
> 
>   - reduced unfolding, reduces the number of tlbis that must be issued.
>   - The memory remains contpte-mapped ("folded") in the parent, so it
> continues to benefit from the more efficient use of the TLB after
> the fork.
> 
> The optimization to wrprotect a whole contpte block without unfolding is
> possible thanks to the tightening of the Arm ARM in respect to the
> definition and behaviour when 'Misprogramming the Contiguous bit'. See
> section D21194 at https://developer.arm.com/documentation/102105/ja-07/
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 
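
For comparison, the core-mm fallback is just a loop (a sketch of the
generic definition), which is what arm64 previously amounted to one pte
at a time:

#ifndef wrprotect_ptes
static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
				  pte_t *ptep, unsigned int nr)
{
	for (;;) {
		ptep_set_wrprotect(mm, addr, ptep);
		if (--nr == 0)
			break;
		ptep++;
		addr += PAGE_SIZE;
	}
}
#endif

The arm64 override can skip the unfold entirely when the batch covers a
whole contpte block, since wrprotecting all the entries keeps them
consistent with each other.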


Re: [PATCH v6 14/18] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:32:01AM +, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the
> exit/munmap/dontneed performance regression introduced by the initial
> contpte commit. Subsequent patches will solve it entirely.
> 
> During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
> cleared. Previously this was done 1 PTE at a time. But the core-mm
> supports batched clear via the new [get_and_]clear_full_ptes() APIs. So
> let's implement those APIs and for fully covered contpte mappings, we no
> longer need to unfold the contpte. This significantly reduces unfolding
> operations, reducing the number of tlbis that must be issued.
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 16/18] arm64/mm: Implement pte_batch_hint()

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:32:03AM +, Ryan Roberts wrote:
> When core code iterates over a range of ptes and calls ptep_get() for
> each of them, if the range happens to cover contpte mappings, the number
> of pte reads becomes amplified by a factor of the number of PTEs in a
> contpte block. This is because for each call to ptep_get(), the
> implementation must read all of the ptes in the contpte block to which
> it belongs to gather the access and dirty bits.
> 
> This causes a hotspot for fork(), as well as operations that unmap
> memory such as munmap(), exit and madvise(MADV_DONTNEED). Fortunately we
> can fix this by implementing pte_batch_hint() which allows their
> iterators to skip getting the contpte tail ptes when gathering the batch
> of ptes to operate on. This results in the number of PTE reads returning
> to 1 per pte.
> 
> Acked-by: Mark Rutland 
> Reviewed-by: David Hildenbrand 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 
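
The hint is trivial generically and only interesting on arm64 (a
sketch; the arithmetic assumes 8-byte ptes in naturally aligned contpte
blocks):

/* generic: no batching knowledge */
#ifndef pte_batch_hint
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
	return 1;
}
#endif

/* arm64 override: entries left up to the end of this contpte block */
#define pte_batch_hint pte_batch_hint
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
	if (!pte_valid_cont(pte))
		return 1;

	return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
}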


Re: [PATCH v6 17/18] arm64/mm: __always_inline to improve fork() perf

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:32:04AM +, Ryan Roberts wrote:
> As set_ptes() and wrprotect_ptes() become a bit more complex, the
> compiler may choose not to inline them. But this is critical for fork()
> performance. So mark the functions, along with contpte_try_unfold()
> which is called by them, as __always_inline. This is worth ~1% on the
> fork() microbenchmark with order-0 folios (the common case).
> 
> Acked-by: Mark Rutland 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 18/18] arm64/mm: Automatically fold contpte mappings

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:32:05AM +, Ryan Roberts wrote:
> There are situations where a change to a single PTE could cause the
> contpte block in which it resides to become foldable (i.e. could be
> repainted with the contiguous bit). Such situations arise, for example,
> when user space temporarily changes protections, via mprotect, for
> individual pages, as can be the case for certain garbage collectors.
> 
> We would like to detect when such a PTE change occurs. However this can
> be expensive due to the amount of checking required. Therefore only
> perform the checks when an individual PTE is modified via mprotect
> (ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
> when we are setting the final PTE in a contpte-aligned block.
> 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-16 Thread Catalin Marinas
On Fri, Feb 16, 2024 at 12:53:43PM +, Ryan Roberts wrote:
> On 16/02/2024 12:25, Catalin Marinas wrote:
> > On Thu, Feb 15, 2024 at 10:31:59AM +, Ryan Roberts wrote:
> >>  arch/arm64/mm/contpte.c  | 285 +++
> > 
> > Nitpick: I think most symbols in contpte.c can be EXPORT_SYMBOL_GPL().
> > We don't expect them to be used by random out of tree modules. In fact,
> > do we expect them to end up in modules at all? Most seem to be called
> > from the core mm code.
> 
> The problem is that the contpte_* symbols are called from the ptep_* inline
> functions. So where those inlines are called from modules, we need to
> make sure the contpte_* symbols are available.
> 
> John Hubbard originally reported this problem against v1 and I enumerated all
> the drivers that call into the ptep_* inlines here:
> https://lore.kernel.org/linux-arm-kernel/b994ff89-1a1f-26ca-9479-b08c77f94...@arm.com/#t
> 
> So they definitely need to be exported. Perhaps we can tighten it to
> EXPORT_SYMBOL_GPL(), but I was being cautious as I didn't want to break
> anything out-of-tree. I'm not sure what the normal policy is? arm64
> seems to use ~equal
> amounts of both.

I don't think we are consistent here. For example set_pte_at() can't be
called from non-GPL modules because of __sync_icache_dcache. OTOH, such
driver is probably doing something dodgy. Same with
apply_to_page_range(), it's GPL-only (called from i915).

Let's see if others have any view over the next week or so, otherwise
I'd go for _GPL and relax it later if someone has a good use-case (can
be a patch on top adding _GPL).

> > If you can make this easier to parse (in a few years time) with an
> > additional patch adding some more comments, that would be great. For
> > this patch:
> 
> I already have a big block comment at the top, which was trying to explain it.
> Clearly not well enough though. I'll add more comments as a follow up
> patch when I get back from holiday.

I read that comment but it wasn't immediately obvious what the atomicity
requirements are - basically we require a single PTE to be atomically
read (which it is), the rest is the dirty/young state being added on
top. I guess a sentence along these lines would do.

Enjoy your holiday!

-- 
Catalin


Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-19 Thread Catalin Marinas
On Fri, Feb 16, 2024 at 12:53:43PM +, Ryan Roberts wrote:
> On 16/02/2024 12:25, Catalin Marinas wrote:
> > On Thu, Feb 15, 2024 at 10:31:59AM +, Ryan Roberts wrote:
> >> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> >> +{
> >> +  /*
> >> +   * Gather access/dirty bits, which may be populated in any of the ptes
> >> +   * of the contig range. We may not be holding the PTL, so any contiguous
> >> +   * range may be unfolded/modified/refolded under our feet. Therefore we
> >> +   * ensure we read a _consistent_ contpte range by checking that all ptes
> >> +   * in the range are valid and have CONT_PTE set, that all pfns are
> >> +   * contiguous and that all pgprots are the same (ignoring access/dirty).
> >> +   * If we find a pte that is not consistent, then we must be racing with
> >> +   * an update so start again. If the target pte does not have CONT_PTE
> >> +   * set then that is considered consistent on its own because it is not
> >> +   * part of a contpte range.
> >> +*/
[...]
> > After writing the comments above, I think I figured out that the whole
> > point of this loop is to check that the ptes in the contig range are
> > still consistent and the only variation allowed is the dirty/young
> > state to be passed to the orig_pte returned. The original pte may have
> > been updated by the time this loop finishes but I don't think it
> > matters, it wouldn't be any different than reading a single pte and
> > returning it while it is being updated.
> 
> Correct. The pte can be updated at any time, before after or during the reads.
> That was always the case. But now we have to cope with a whole contpte block
> being repainted while we are reading it. So we are just checking to make sure
> that all the ptes that we read from the contpte block are consistent with
> each other and therefore we can trust that the access/dirty bits we
> gathered are consistent.

I've been thinking a bit more about this - do any of the callers of
ptep_get_lockless() check the dirty/access bits? The only one that seems
to care is ptdump but in that case I'd rather see the raw bits for
debugging rather than propagating the dirty/access bits to the rest in
the contig range.

So with some clearer documentation on the requirements, I think we don't
need an arm64-specific ptep_get_lockless() (unless I missed something).

-- 
Catalin


  1   2   3   >