Re: [kernel,v2,1/2] powerpc/iommu: Stop using @current in mm_iommu_xxx

2016-07-22 Thread Nicholas Piggin
On Wed, 20 Jul 2016 14:34:30 +1000
Alexey Kardashevskiy  wrote:



>  static long tce_iommu_register_pages(struct tce_container *container,
> @@ -128,10 +129,17 @@ static long tce_iommu_register_pages(struct
> tce_container *container, ((vaddr + size) < vaddr))
>   return -EINVAL;
>  
> - ret = mm_iommu_get(vaddr, entries, &mem);
> + if (!container->mm) {
> + if (!current->mm)
> + return -ESRCH; /* process exited */

This shouldn't happen if we're a userspace process.

> +
> + atomic_inc(&current->mm->mm_count);
> + container->mm = current->mm;
> + }
> +
> + ret = mm_iommu_get(container->mm, vaddr, entries, &mem);

Is it possible for processes (different mm) to be using the same
container? 


> @@ -354,6 +362,8 @@ static void tce_iommu_release(void *iommu_data)
>   tce_iommu_free_table(tbl);
>   }
>  
> + if (container->mm)
> + mmdrop(container->mm);
>   tce_iommu_disable(container);
>   mutex_destroy(&container->lock);

I'm wondering why keep the mm around at all. There is a bit of
locked_vm accounting there (which maybe doesn't do the right thing
anyway, if it charges the current task's rlimit while the mm does
not belong to current).

The interesting cases are only the ones where a thread does
something with container->mm when current->mm != container->mm
(either a different process or a kernel thread). In what
situations does that happen?
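
For reference, a minimal sketch of the grab/drop pairing the patch sets
up (the helper names here are illustrative, not from the patch). The
point that matters for the questions above is that an mm_count
reference, dropped with mmdrop(), only keeps the struct mm_struct
allocated; it does not pin the address space the way an mm_users
reference (released with mmput()) does:

static int tce_container_pin_mm(struct tce_container *container)
{
	if (container->mm)
		return 0;
	if (!current->mm)
		return -ESRCH;	/* kernel thread, or the task has exited */

	atomic_inc(&current->mm->mm_count);	/* pins the struct, not the pages */
	container->mm = current->mm;
	return 0;
}

static void tce_container_unpin_mm(struct tce_container *container)
{
	if (container->mm) {
		mmdrop(container->mm);
		container->mm = NULL;
	}
}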

Thanks,
Nick


Re: [PATCH 3/3] mm/vmalloc: correct a few logic error in __insert_vmap_area()

2016-09-19 Thread Nicholas Piggin
On Tue, 20 Sep 2016 14:02:26 +0800
zijun_hu  wrote:

> From: zijun_hu 
> 
> correct a few logic error in __insert_vmap_area() since the else if
> condition is always true and meaningless
> 
> avoid endless loop under [un]mapping improper ranges whose boundary
> are not aligned to page
> 
> correct lazy_max_pages() return value if the number of online cpus
> is power of 2
> 
> improve performance for pcpu_get_vm_areas() via optimizing vmap_areas
> overlay checking algorithm and finding near vmap_areas by list_head
> other than rbtree
> 
> simplify /proc/vmallocinfo implementation via seq_file helpers
> for list_head
> 
> Signed-off-by: zijun_hu 
> Signed-off-by: zijun_hu 

Could you submit each of these changes as a separate patch? Would you
consider using capitalisation and punctuation in the changelog?

Did you measure any performance improvements, or do you have a workload
where vmalloc shows up in profiles?


> @@ -108,6 +108,9 @@ static void vunmap_page_range(unsigned long addr, 
> unsigned long end)
>   unsigned long next;
>  
>   BUG_ON(addr >= end);
> + WARN_ON(!PAGE_ALIGNED(addr | end));

I prefer to avoid mixing bitwise and arithmetic operations unless it's
necessary. Gcc should be able to optimise

WARN_ON(!PAGE_ALIGNED(addr) || !PAGE_ALIGNED(end))

> + addr = round_down(addr, PAGE_SIZE);

I don't know if it's really necessary to relax the API like this for
internal vmalloc.c functions. If garbage is detected here, it's likely
due to a bug, and I'm not sure that rounding it would solve the problem.

For API functions perhaps it's reasonable -- in such cases you should
consider using WARN_ON_ONCE() or similar.
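
As an illustration of that, an exported wrapper might warn once and fix
up the arguments while the internal helper stays strict (the wrapper
name below is hypothetical):

/* Hypothetical API-level entry point; vunmap_page_range() stays strict. */
void vm_unmap_user_range(unsigned long addr, unsigned long end)
{
	if (WARN_ON_ONCE(!PAGE_ALIGNED(addr) || !PAGE_ALIGNED(end))) {
		addr = round_down(addr, PAGE_SIZE);
		end = round_up(end, PAGE_SIZE);
	}
	vunmap_page_range(addr, end);
}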

Thanks,
Nick


[PATCH] percpu: improve generic percpu modify-return implementation

2016-09-21 Thread Nicholas Piggin
Some architectures require an additional load to find the address of
percpu pointers. In some implementations, the C aliasing rules do not
allow the result of that load to be kept over the store that modifies
the percpu variable, which causes additional loads.

Work around this by finding the pointer first, then operating on that.

It's also possible to mark things as restrict and those kind of games,
but that can require larger and arch specific changes.

On powerpc, __this_cpu_inc_return compiles to:

ld 10,48(13)
ldx 9,3,10
addi 9,9,1
stdx 9,3,10
ld 9,48(13)
ldx 3,9,3

With this patch it saves 2 loads:

ld 10,48(13)
ldx 9,3,10
addi 9,9,1
stdx 9,3,10

Signed-off-by: Nicholas Piggin 
---
 include/asm-generic/percpu.h | 54 +---
 1 file changed, 31 insertions(+), 23 deletions(-)

diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h
index 4d9f233..3fe18fe 100644
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -65,6 +65,11 @@ extern void setup_per_cpu_areas(void);
 #define PER_CPU_DEF_ATTRIBUTES
 #endif
 
+#define raw_cpu_generic_read(pcp)  \
+({ \
+   *raw_cpu_ptr(&(pcp));   \
+})
+
 #define raw_cpu_generic_to_op(pcp, val, op)\
 do {   \
*raw_cpu_ptr(&(pcp)) op val;\
@@ -72,34 +77,39 @@ do {
\
 
 #define raw_cpu_generic_add_return(pcp, val)   \
 ({ \
-   raw_cpu_add(pcp, val);  \
-   raw_cpu_read(pcp);  \
+   typeof(&(pcp)) __p = raw_cpu_ptr(&(pcp));   \
+   \
+   *__p += val;\
+   *__p;   \
 })
 
 #define raw_cpu_generic_xchg(pcp, nval)
\
 ({ \
+   typeof(&(pcp)) __p = raw_cpu_ptr(&(pcp));   \
typeof(pcp) __ret;  \
-   __ret = raw_cpu_read(pcp);  \
-   raw_cpu_write(pcp, nval);   \
+   __ret = *__p;   \
+   *__p = nval;\
__ret;  \
 })
 
 #define raw_cpu_generic_cmpxchg(pcp, oval, nval)   \
 ({ \
+   typeof(&(pcp)) __p = raw_cpu_ptr(&(pcp));   \
typeof(pcp) __ret;  \
-   __ret = raw_cpu_read(pcp);  \
+   __ret = *__p;   \
if (__ret == (oval))\
-   raw_cpu_write(pcp, nval);   \
+   *__p = nval;\
__ret;  \
 })
 
 #define raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) 
\
 ({ \
+   typeof(&(pcp1)) __p1 = raw_cpu_ptr(&(pcp1));\
+   typeof(&(pcp2)) __p2 = raw_cpu_ptr(&(pcp2));\
int __ret = 0;  \
-   if (raw_cpu_read(pcp1) == (oval1) &&\
-raw_cpu_read(pcp2)  == (oval2)) {  \
-   raw_cpu_write(pcp1, nval1); \
-   raw_cpu_write(pcp2, nval2); \
+   if (*__p1 == (oval1) && *__p2  == (oval2)) {\
+   *__p1 = nval1;  \
+   *__p2 = nval2;  \
__ret = 1;  \
}   \
(__ret);   

Re: [PATCH] percpu: improve generic percpu modify-return implementation

2016-09-21 Thread Nicholas Piggin
On Wed, 21 Sep 2016 18:51:37 +1000
Nicholas Piggin  wrote:

> Some architectures require an additional load to find the address of
> percpu pointers. In some implementations, the C aliasing rules do not
> allow the result of that load to be kept over the store that modifies
> the percpu variable, which causes additional loads.

Sorry I picked up an old patch here. This one should be better.

From d0cb9052d6f4c31d24f999b7b0cecb34681eee9b Mon Sep 17 00:00:00 2001
From: Nicholas Piggin 
Date: Wed, 21 Sep 2016 18:23:43 +1000
Subject: [PATCH] percpu: improve generic percpu modify-return implementations

Some architectures require an additional load to find the address of
percpu pointers. In some implementations, the C aliasing rules do not
allow the result of that load to be kept over the store that modifies
the percpu variable, which causes additional loads.

Work around this by finding the pointer first, then operating on that.

It's also possible to mark things as restrict and those kind of games,
but that can require larger and arch specific changes.

On powerpc, __this_cpu_inc_return compiles to:

ld 10,48(13)
ldx 9,3,10
addi 9,9,1
stdx 9,3,10
ld 9,48(13)
ldx 3,9,3

With this patch it compiles to:

ld 10,48(13)
ldx 9,3,10
addi 9,9,1
stdx 9,3,10

Signed-off-by: Nicholas Piggin 
---
 include/asm-generic/percpu.h | 53 +---
 1 file changed, 30 insertions(+), 23 deletions(-)

diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h
index 4d9f233..40e8870 100644
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -65,6 +65,11 @@ extern void setup_per_cpu_areas(void);
 #define PER_CPU_DEF_ATTRIBUTES
 #endif
 
+#define raw_cpu_generic_read(pcp)  \
+({ \
+   *raw_cpu_ptr(&(pcp));   \
+})
+
 #define raw_cpu_generic_to_op(pcp, val, op)\
 do {   \
*raw_cpu_ptr(&(pcp)) op val;\
@@ -72,34 +77,39 @@ do {
\
 
 #define raw_cpu_generic_add_return(pcp, val)   \
 ({ \
-   raw_cpu_add(pcp, val);  \
-   raw_cpu_read(pcp);  \
+   typeof(&(pcp)) __p = raw_cpu_ptr(&(pcp));   \
+   \
+   *__p += val;\
+   *__p;   \
 })
 
 #define raw_cpu_generic_xchg(pcp, nval)
\
 ({ \
+   typeof(&(pcp)) __p = raw_cpu_ptr(&(pcp));   \
typeof(pcp) __ret;  \
-   __ret = raw_cpu_read(pcp);  \
-   raw_cpu_write(pcp, nval);   \
+   __ret = *__p;   \
+   *__p = nval;\
__ret;  \
 })
 
 #define raw_cpu_generic_cmpxchg(pcp, oval, nval)   \
 ({ \
+   typeof(&(pcp)) __p = raw_cpu_ptr(&(pcp));   \
typeof(pcp) __ret;  \
-   __ret = raw_cpu_read(pcp);  \
+   __ret = *__p;   \
if (__ret == (oval))\
-   raw_cpu_write(pcp, nval);   \
+   *__p = nval;\
__ret;  \
 })
 
 #define raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) 
\
 ({ \
+   typeof(&(pcp1)) __p1 = raw_cpu_ptr(&(pcp1));\
+   typeof(&(pcp2)) __p2 = raw_cpu_ptr(&(pcp2));\
int __ret = 0;  \
-   if (raw_cpu_read(pcp1) == (oval1) &&\
-raw_cpu_read(pcp2)  == (oval2)) {  \
-   raw_cpu_write(

Re: [PATCH] percpu: improve generic percpu modify-return implementation

2016-09-21 Thread Nicholas Piggin
On Wed, 21 Sep 2016 15:16:25 -0500 (CDT)
Christoph Lameter  wrote:

> On Wed, 21 Sep 2016, Tejun Heo wrote:
> 
> > Hello, Nick.
> >
> > How have you been? :)
> >  
> 
> He is baack. Are we getting SL!B? ;-)
> 

Hey Christoph. Sure, why not.


Re: [PATCH] percpu: improve generic percpu modify-return implementation

2016-09-21 Thread Nicholas Piggin
On Wed, 21 Sep 2016 10:23:43 -0400
Tejun Heo  wrote:

> Hello, Nick.
> 
> How have you been? :)

Hey Tejun,

Well thank you, how about you?
 
> On Wed, Sep 21, 2016 at 08:57:11PM +1000, Nicholas Piggin wrote:
> > On Wed, 21 Sep 2016 18:51:37 +1000
> > Nicholas Piggin  wrote:
> >   
> > > Some architectures require an additional load to find the address of
> > > percpu pointers. In some implementations, the C aliasing rules do not
> > > allow the result of that load to be kept over the store that modifies
> > > the percpu variable, which causes additional loads.  
> > 
> > Sorry I picked up an old patch here. This one should be better.
> > 
> > From d0cb9052d6f4c31d24f999b7b0cecb34681eee9b Mon Sep 17 00:00:00 2001
> > From: Nicholas Piggin 
> > Date: Wed, 21 Sep 2016 18:23:43 +1000
> > Subject: [PATCH] percpu: improve generic percpu modify-return 
> > implementations
> > 
> > Some architectures require an additional load to find the address of
> > percpu pointers. In some implementations, the C aliasing rules do not
> > allow the result of that load to be kept over the store that modifies
> > the percpu variable, which causes additional loads.
> > 
> > Work around this by finding the pointer first, then operating on that.
> > 
> > It's also possible to mark things as restrict and those kind of games,
> > but that can require larger and arch specific changes.
> > 
> > On powerpc, __this_cpu_inc_return compiles to:
> > 
> > ld 10,48(13)
> > ldx 9,3,10
> > addi 9,9,1
> > stdx 9,3,10
> > ld 9,48(13)
> > ldx 3,9,3
> > 
> > With this patch it compiles to:
> > 
> > ld 10,48(13)
> > ldx 9,3,10
> > addi 9,9,1
> > stdx 9,3,10
> > 
> > Signed-off-by: Nicholas Piggin   
> 
> Patch looks good to me but seems QP encoded.  Can you please resend?
> 
> Thanks and it's great to see you again!
> 

Trying a new mail client, sorry. It *seems* to be working now, how's
this?

From d0cb9052d6f4c31d24f999b7b0cecb34681eee9b Mon Sep 17 00:00:00 2001
From: Nicholas Piggin 
Date: Wed, 21 Sep 2016 18:23:43 +1000
Subject: [PATCH] percpu: improve generic percpu modify-return implementations

Some architectures require an additional load to find the address of
percpu pointers. In some implementations, the C aliasing rules do not
allow the result of that load to be kept over the store that modifies
the percpu variable, which causes additional loads.

Work around this by finding the pointer first, then operating on that.

It's also possible to mark things as restrict and those kind of games,
but that can require larger and arch specific changes.

On powerpc, __this_cpu_inc_return compiles to:

ld 10,48(13)
ldx 9,3,10
addi 9,9,1
stdx 9,3,10
ld 9,48(13)
ldx 3,9,3

With this patch it compiles to:

ld 10,48(13)
ldx 9,3,10
addi 9,9,1
stdx 9,3,10

Signed-off-by: Nicholas Piggin 

To: Tejun Heo 
To: Christoph Lameter 
Cc: linux-kernel@vger.kernel.org
Cc: linux-a...@vger.kernel.org
---
 include/asm-generic/percpu.h | 53 +---
 1 file changed, 30 insertions(+), 23 deletions(-)

diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h
index 4d9f233..40e8870 100644
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -65,6 +65,11 @@ extern void setup_per_cpu_areas(void);
 #define PER_CPU_DEF_ATTRIBUTES
 #endif
 
+#define raw_cpu_generic_read(pcp)  \
+({ \
+   *raw_cpu_ptr(&(pcp));   \
+})
+
 #define raw_cpu_generic_to_op(pcp, val, op)\
 do {   \
*raw_cpu_ptr(&(pcp)) op val;\
@@ -72,34 +77,39 @@ do {
\
 
 #define raw_cpu_generic_add_return(pcp, val)   \
 ({ \
-   raw_cpu_add(pcp, val);  \
-   raw_cpu_read(pcp);  \
+   typeof(&(pcp)) __p = raw_cpu_ptr(&(pcp));   \
+   \
+   *__p += val;\
+   *__p;   \
 })
 
 #de

Re: [PATCH 3/5] mm/vmalloc.c: correct lazy_max_pages() return value

2016-09-22 Thread Nicholas Piggin
On Fri, 23 Sep 2016 00:30:20 +0800
zijun_hu  wrote:

> On 2016/9/22 20:37, Michal Hocko wrote:
> > On Thu 22-09-16 09:13:50, zijun_hu wrote:  
> >> On 09/22/2016 08:35 AM, David Rientjes wrote:  
> > [...]  
> >>> The intent is as it is implemented; with your change, lazy_max_pages() is 
> >>> potentially increased depending on the number of online cpus.  This is 
> >>> only a heuristic, changing it would need justification on why the new
> >>> value is better.  It is opposite to what the comment says: "to be 
> >>> conservative and not introduce a big latency on huge systems, so go with
> >>> a less aggressive log scale."  NACK to the patch.
> >>>  
> >> my change potentially make lazy_max_pages() decreased not increased, i 
> >> seems
> >> conform with the comment
> >>
> >> if the number of online CPUs is not power of 2, both have no any difference
> >> otherwise, my change remain power of 2 value, and the original code rounds 
> >> up
> >> to next power of 2 value, for instance
> >>
> >> my change : (32, 64] -> 64
> >> 32 -> 32, 64 -> 64
> >> the original code: [32, 63) -> 64
> >>32 -> 64, 64 -> 128  
> > 
> > You still completely failed to explain _why_ this is an improvement/fix
> > or why it matters. This all should be in the changelog.
> >   
> 
> Hi npiggin,
> could you give some comments for this patch since lazy_max_pages() is 
> introduced
> by you
> 
> my patch is based on the difference between fls() and get_count_order() mainly
> the difference between fls() and get_count_order() will be shown below
> more MM experts maybe help to decide which is more suitable
> 
> if parameter > 1, both have different return value only when parameter is
> power of two, for example
> 
> fls(32) = 6 VS get_count_order(32) = 5
> fls(33) = 6 VS get_count_order(33) = 6
> fls(63) = 6 VS get_count_order(63) = 6
> fls(64) = 7 VS get_count_order(64) = 6
> 
> @@ -594,7 +594,9 @@ static unsigned long lazy_max_pages(void) 
> { 
> unsigned int log; 
> 
> -log = fls(num_online_cpus()); 
> +log = num_online_cpus(); 
> +if (log > 1) 
> +log = (unsigned int)get_count_order(log); 
> 
> return log * (32UL * 1024 * 1024 / PAGE_SIZE); 
> } 
> 

To be honest, I don't think I chose it with a lot of analysis.
It will depend on the kernel usage patterns, the arch code,
and the CPU microarchitecture, all of which would have changed
significantly.

I wouldn't bother changing it unless you do some benchmarking
on different system sizes to see where the best performance is.
(If performance is equal, fewer lazy pages would be better.)

Good to see you taking a look at this vmalloc stuff. Don't be
discouraged if you run into some dead ends.

Thanks,
Nick


Re: [PATCH] fs/select: add vmalloc fallback for select(2)

2016-09-23 Thread Nicholas Piggin
On Fri, 23 Sep 2016 14:42:53 +0800
"Hillf Danton"  wrote:

> > 
> > The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> > with the number of fds passed. We had a customer report page allocation
> > failures of order-4 for this allocation. This is a costly order, so it might
> > easily fail, as the VM expects such allocation to have a lower-order 
> > fallback.
> > 
> > Such trivial fallback is vmalloc(), as the memory doesn't have to be
> > physically contiguous. Also the allocation is temporary for the duration of 
> > the
> > syscall, so it's unlikely to stress vmalloc too much.
> > 
> > Note that the poll(2) syscall seems to use a linked list of order-0 pages, 
> > so
> > it doesn't need this kind of fallback.

How about something like this? (untested)

Eric isn't wrong about vmalloc sucking :)

Thanks,
Nick


---
 fs/select.c | 57 +++--
 1 file changed, 43 insertions(+), 14 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 8ed9da5..3b4834c 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -555,6 +555,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set 
__user *outp,
void *bits;
int ret, max_fds;
unsigned int size;
+   size_t nr_bytes;
struct fdtable *fdt;
/* Allocate small arguments on the stack to save memory and be faster */
long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
@@ -576,21 +577,39 @@ int core_sys_select(int n, fd_set __user *inp, fd_set 
__user *outp,
 * since we used fdset we need to allocate memory in units of
 * long-words. 
 */
-   size = FDS_BYTES(n);
+   ret = -ENOMEM;
bits = stack_fds;
-   if (size > sizeof(stack_fds) / 6) {
-   /* Not enough space in on-stack array; must use kmalloc */
+   size = FDS_BYTES(n);
+   nr_bytes = 6 * size;
+
+   if (unlikely(nr_bytes > PAGE_SIZE)) {
+   /* Avoid multi-page allocation if possible */
ret = -ENOMEM;
-   bits = kmalloc(6 * size, GFP_KERNEL);
-   if (!bits)
-   goto out_nofds;
+   fds.in = kmalloc(size, GFP_KERNEL);
+   fds.out = kmalloc(size, GFP_KERNEL);
+   fds.ex = kmalloc(size, GFP_KERNEL);
+   fds.res_in = kmalloc(size, GFP_KERNEL);
+   fds.res_out = kmalloc(size, GFP_KERNEL);
+   fds.res_ex = kmalloc(size, GFP_KERNEL);
+
+   if (!(fds.in && fds.out && fds.ex &&
+   fds.res_in && fds.res_out && fds.res_ex))
+   goto out;
+   } else {
+   if (nr_bytes > sizeof(stack_fds)) {
+   /* Not enough space in on-stack array */
+   if (nr_bytes > PAGE_SIZE * 2)
+   bits = kmalloc(nr_bytes, GFP_KERNEL);
+   if (!bits)
+   goto out_nofds;
+   }
+   fds.in  = bits;
+   fds.out = bits +   size;
+   fds.ex  = bits + 2*size;
+   fds.res_in  = bits + 3*size;
+   fds.res_out = bits + 4*size;
+   fds.res_ex  = bits + 5*size;
}
-   fds.in  = bits;
-   fds.out = bits +   size;
-   fds.ex  = bits + 2*size;
-   fds.res_in  = bits + 3*size;
-   fds.res_out = bits + 4*size;
-   fds.res_ex  = bits + 5*size;
 
if ((ret = get_fd_set(n, inp, fds.in)) ||
(ret = get_fd_set(n, outp, fds.out)) ||
@@ -617,8 +636,18 @@ int core_sys_select(int n, fd_set __user *inp, fd_set 
__user *outp,
ret = -EFAULT;
 
 out:
-   if (bits != stack_fds)
-   kfree(bits);
+   if (unlikely(nr_bytes > PAGE_SIZE)) {
+   kfree(fds.in);
+   kfree(fds.out);
+   kfree(fds.ex);
+   kfree(fds.res_in);
+   kfree(fds.res_out);
+   kfree(fds.res_ex);
+   } else {
+   if (bits != stack_fds)
+   kfree(bits);
+   }
+
 out_nofds:
return ret;
 }
-- 
2.9.3



Re: [PATCH 3/5] mm/vmalloc.c: correct lazy_max_pages() return value

2016-09-23 Thread Nicholas Piggin
On Fri, 23 Sep 2016 13:00:35 +0800
zijun_hu  wrote:

> On 2016/9/23 11:30, Nicholas Piggin wrote:
> > On Fri, 23 Sep 2016 00:30:20 +0800
> > zijun_hu  wrote:
> >   
> >> On 2016/9/22 20:37, Michal Hocko wrote:  
> >>> On Thu 22-09-16 09:13:50, zijun_hu wrote:
> >>>> On 09/22/2016 08:35 AM, David Rientjes wrote:
> >>> [...]
> >>>>> The intent is as it is implemented; with your change, lazy_max_pages() 
> >>>>> is 
> >>>>> potentially increased depending on the number of online cpus.  This is 
> >>>>> only a heuristic, changing it would need justification on why the new
> >>>>> value is better.  It is opposite to what the comment says: "to be 
> >>>>> conservative and not introduce a big latency on huge systems, so go with
> >>>>> a less aggressive log scale."  NACK to the patch.
> >>>>>
> >>>> my change potentially make lazy_max_pages() decreased not increased, i 
> >>>> seems
> >>>> conform with the comment
> >>>>
> >>>> if the number of online CPUs is not power of 2, both have no any 
> >>>> difference
> >>>> otherwise, my change remain power of 2 value, and the original code 
> >>>> rounds up
> >>>> to next power of 2 value, for instance
> >>>>
> >>>> my change : (32, 64] -> 64
> >>>>   32 -> 32, 64 -> 64
> >>>> the original code: [32, 63) -> 64
> >>>>32 -> 64, 64 -> 128
> >>>
> >>> You still completely failed to explain _why_ this is an improvement/fix
> >>> or why it matters. This all should be in the changelog.
> >>> 
> >>
> >> Hi npiggin,
> >> could you give some comments for this patch since lazy_max_pages() is 
> >> introduced
> >> by you
> >>
> >> my patch is based on the difference between fls() and get_count_order() 
> >> mainly
> >> the difference between fls() and get_count_order() will be shown below
> >> more MM experts maybe help to decide which is more suitable
> >>
> >> if parameter > 1, both have different return value only when parameter is
> >> power of two, for example
> >>
> >> fls(32) = 6 VS get_count_order(32) = 5
> >> fls(33) = 6 VS get_count_order(33) = 6
> >> fls(63) = 6 VS get_count_order(63) = 6
> >> fls(64) = 7 VS get_count_order(64) = 6
> >>
> >> @@ -594,7 +594,9 @@ static unsigned long lazy_max_pages(void) 
> >> { 
> >> unsigned int log; 
> >>
> >> -log = fls(num_online_cpus()); 
> >> +log = num_online_cpus(); 
> >> +if (log > 1) 
> >> +log = (unsigned int)get_count_order(log); 
> >>
> >> return log * (32UL * 1024 * 1024 / PAGE_SIZE); 
> >> } 
> >>  
> > 
> > To be honest, I don't think I chose it with a lot of analysis.
> > It will depend on the kernel usage patterns, the arch code,
> > and the CPU microarchitecture, all of which would have changed
> > significantly.
> > 
> > I wouldn't bother changing it unless you do some benchmarking
> > on different system sizes to see where the best performance is.
> > (If performance is equal, fewer lazy pages would be better.)
> > 
> > Good to see you taking a look at this vmalloc stuff. Don't be
> > discouraged if you run into some dead ends.
> > 
> > Thanks,
> > Nick
> >   
> thanks for your reply
> please don't pay attention to this patch any more since i don't have
> condition to do many test and comparison
> 
> i just feel my change maybe be consistent with operation of rounding up
> to power of 2

You could be right about that, but in cases like this, existing code
probably trumps original comments or intention.

What I mean is that it's a heuristic anyway, so the current behaviour
is what has been used and tested for a long time. Therefore any change
would need to be justified by showing it's an improvement.

I'm sure the existing size is not perfect, but we don't know whether
your change would be an improvement in practice. Does that make sense?

Thanks,
Nick


Re: [PATCH] percpu: improve generic percpu modify-return implementation

2016-09-23 Thread Nicholas Piggin
On Thu, 22 Sep 2016 12:07:49 -0400
Tejun Heo  wrote:

> Hello,
> 
> On Thu, Sep 22, 2016 at 02:35:00PM +1000, Nicholas Piggin wrote:
> > Well thank you, how about you?  
> 
> Heh, can't complain.  Hope to see you around sometime.  It's been
> forever.

Yeah, it has been. Hopefully I'll see you around.


> > Trying a new mail client, sorry. It *seems* to be working now, how's
> > this?  
> 
> Hmm... Still encoded.

It looks to be a helpful surprise feature of claws. Any line starting with
the word "From" make it go q-p due to some mail servers treating that
differently. Sigh.

> 
> > From d0cb9052d6f4c31d24f999b7b0cecb34681eee9b Mon Sep 17 00:00:00 2001
> > From: Nicholas Piggin 
> > Date: Wed, 21 Sep 2016 18:23:43 +1000
> > Subject: [PATCH] percpu: improve generic percpu modify-return 
> > implementations
> > 
> > Some architectures require an additional load to find the address of
> > percpu pointers. In some implementations, the C aliasing rules do not
> > allow the result of that load to be kept over the store that modifies
> > the percpu variable, which causes additional loads.
> > 
> > Work around this by finding the pointer first, then operating on that.
> > 
> > It's also possible to mark things as restrict and those kind of games,
> > but that can require larger and arch specific changes.  
> 
> QP-decoded and applied to percpu/for-4.9.
> 
> Thanks.
> 

Thanks!


Re: [PATCH] fs/select: add vmalloc fallback for select(2)

2016-09-27 Thread Nicholas Piggin
On Tue, 27 Sep 2016 10:44:04 +0200
Vlastimil Babka  wrote:

> On 09/23/2016 06:47 PM, Jason Baron wrote:
> > Hi,
> >
> > On 09/23/2016 03:24 AM, Nicholas Piggin wrote:  
> >> On Fri, 23 Sep 2016 14:42:53 +0800
> >> "Hillf Danton"  wrote:
> >>  
> >>>>
> >>>> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size 
> >>>> grows
> >>>> with the number of fds passed. We had a customer report page allocation
> >>>> failures of order-4 for this allocation. This is a costly order, so it 
> >>>> might
> >>>> easily fail, as the VM expects such allocation to have a lower-order 
> >>>> fallback.
> >>>>
> >>>> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> >>>> physically contiguous. Also the allocation is temporary for the duration 
> >>>> of the
> >>>> syscall, so it's unlikely to stress vmalloc too much.
> >>>>
> >>>> Note that the poll(2) syscall seems to use a linked list of order-0 
> >>>> pages, so
> >>>> it doesn't need this kind of fallback.  
> >>
> >> How about something like this? (untested)  
> 
> This pushes the limit further, but might just delay the problem. Could be an 
> optimization on top if there's enough interest, though.

What's your customer doing with those selects? If they care at all about
performance, I doubt they want select to attempt order-4 allocations, fail,
then use vmalloc :)



Re: [PATCH] fs/select: add vmalloc fallback for select(2)

2016-09-27 Thread Nicholas Piggin
On Tue, 27 Sep 2016 11:37:24 +
David Laight  wrote:

> From: Nicholas Piggin
> > Sent: 27 September 2016 12:25
> > On Tue, 27 Sep 2016 10:44:04 +0200
> > Vlastimil Babka  wrote:
> >   
> > > On 09/23/2016 06:47 PM, Jason Baron wrote:  
> > > > Hi,
> > > >
> > > > On 09/23/2016 03:24 AM, Nicholas Piggin wrote:  
> > > >> On Fri, 23 Sep 2016 14:42:53 +0800
> > > >> "Hillf Danton"  wrote:
> > > >>  
> > > >>>>
> > > >>>> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where 
> > > >>>> size grows
> > > >>>> with the number of fds passed. We had a customer report page 
> > > >>>> allocation
> > > >>>> failures of order-4 for this allocation. This is a costly order, so 
> > > >>>> it might
> > > >>>> easily fail, as the VM expects such allocation to have a lower-order 
> > > >>>> fallback.
> > > >>>>
> > > >>>> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> > > >>>> physically contiguous. Also the allocation is temporary for the 
> > > >>>> duration of the
> > > >>>> syscall, so it's unlikely to stress vmalloc too much.
> > > >>>>
> > > >>>> Note that the poll(2) syscall seems to use a linked list of order-0 
> > > >>>> pages, so
> > > >>>> it doesn't need this kind of fallback.  
> > > >>
> > > >> How about something like this? (untested)  
> > >
> > > This pushes the limit further, but might just delay the problem. Could be 
> > > an
> > > optimization on top if there's enough interest, though.  
> > 
> > What's your customer doing with those selects? If they care at all about
> > performance, I doubt they want select to attempt order-4 allocations, fail,
> > then use vmalloc :)  
> 
> If they care about performance they shouldn't be passing select() lists that
> are anywhere near that large.
> If the number of actual fd is small - use poll().

Right. Presumably it's some old app they're still using, no?


Re: [PATCH v2 1/3] powerpc: Factor out common code in setup-common.c

2016-07-31 Thread Nicholas Piggin
On Thu, 28 Jul 2016 16:07:16 -0700
Andrey Smirnov  wrote:

> Factor out a small bit of common code in machine_restart(),
> machine_power_off() and machine_halt().
> 
> Signed-off-by: Andrey Smirnov 
> ---
> 
> No changes compared to v1.
> 
>  arch/powerpc/kernel/setup-common.c | 23 ++-
>  1 file changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/setup-common.c
> b/arch/powerpc/kernel/setup-common.c index 714b4ba..5cd3283 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -130,15 +130,22 @@ void machine_shutdown(void)
>   ppc_md.machine_shutdown();
>  }
>  
> +static void machine_hang(void)
> +{
> + pr_emerg("System Halted, OK to turn off power\n");
> + local_irq_disable();
> + while (1)
> + ;
> +}

What are the intended semantics of this function? A default power-off
handler when the platform supplies none? Would ppc_power_off()
be a good name? Should the smp_send_stop() call be moved here?

It seems like a reasonable cleanup though.

Thanks,
Nick


Re: [PATCH v2 2/3] powerpc: Call chained reset handlers during reset

2016-07-31 Thread Nicholas Piggin
On Thu, 28 Jul 2016 16:07:17 -0700
Andrey Smirnov  wrote:

> Call out to all restart handlers that were added via
> register_restart_handler() API when restarting the machine.
> 
> Signed-off-by: Andrey Smirnov 
> ---
> 
> No changes compared to v1
> 
>  arch/powerpc/kernel/setup-common.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/setup-common.c
> b/arch/powerpc/kernel/setup-common.c index 5cd3283..205d073 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -145,6 +145,10 @@ void machine_restart(char *cmd)
>   ppc_md.restart(cmd);
>  
>   smp_send_stop();
> +
> + do_kernel_restart(cmd);
> + mdelay(1000);
> +
>   machine_hang();
>  }
>  

Ah, I see why you don't move smp_send_stop(). Three other architectures
call do_kernel_restart(): arm and arm64 call it with local interrupts
disabled, and arm and mips insert the 1s delay. All call it after
smp_send_stop(). I don't see the harm in the delay. Should we call it
with local interrupts disabled too?
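
For concreteness, the ordering in question would look roughly like this
(an untested sketch on top of this patch, with the handlers run under
local_irq_disable() as arm/arm64 do):

void machine_restart(char *cmd)
{
	machine_shutdown();
	if (ppc_md.restart)
		ppc_md.restart(cmd);

	smp_send_stop();

	local_irq_disable();
	do_kernel_restart(cmd);	/* handlers from register_restart_handler() */
	mdelay(1000);		/* give a registered handler time to act */

	machine_hang();
}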



Re: [PATCH v2 3/3] powerpc: Convert fsl_rstcr_restart to a reset handler

2016-07-31 Thread Nicholas Piggin
On Thu, 28 Jul 2016 16:07:18 -0700
Andrey Smirnov  wrote:

> Convert fsl_rstcr_restart into a function to be registered with
> register_reset_handler().
> 
> Signed-off-by: Andrey Smirnov 
> ---
> 
> Changes since v1:
> 
>   - fsl_rstcr_restart is registered as a reset handler in
>   setup_rstcr, replacing per-board arch_initcall approach

Bear in mind I don't know much about the embedded or platform code!

The documentation for reset notifiers says that they are expected
to be registered from drivers, not arch code. That seems to only be
intended to mean that the standard ISA or platform reset would
normally be handled directly by the arch, whereas if you have an
arch-specific driver for reset hardware that just happens to live
under arch/, then fsl_rstcr_restart / mpc85xx_cds_restart would be a
valid use of a reset notifier.

So this change seems reasonable to me. One small question:


> +static int mpc85xx_cds_restart_register(void)
> +{
> + static struct notifier_block restart_handler;
> +
> + restart_handler.notifier_call = mpc85xx_cds_restart;
> + restart_handler.priority = 192;

Should there be a header with #define's for these priorities?

Thanks,
Nick


Re: Question on smp_mb__before_spinlock

2016-09-07 Thread Nicholas Piggin
On Mon, 5 Sep 2016 11:37:53 +0200
Peter Zijlstra  wrote:

> Hi all,
> 
> So recently I've had two separate issues that touched upon
> smp_mb__before_spinlock().
> 
> 
> Since its inception, our understanding of ACQUIRE, esp. as applied to
> spinlocks, has changed somewhat. Also, I wonder if, with a simple
> change, we cannot make it provide more.
> 
> The problem with the comment is that the STORE done by spin_lock isn't
> itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
> it and cross with any prior STORE, rendering the default WMB
> insufficient (pointed out by Alan).
> 
> Now, this is only really a problem on PowerPC and ARM64, the former of
> which already defined smp_mb__before_spinlock() as a smp_mb(), the
> latter does not, Will?
> 
> The second issue I wondered about is spinlock transitivity. All except
> powerpc have RCsc locks, and since Power already does a full mb, would
> it not make sense to put it _after_ the spin_lock(), which would provide
> the same guarantee, but also upgrades the section to RCsc.
> 
> That would make all schedule() calls fully transitive against one
> another.
> 
> 
> That is, would something like the below make sense?
> 
> (does not deal with mm_types.h and overlayfs using
> smp_mb__before_spnlock).
> 
> ---
>  arch/arm64/include/asm/barrier.h   |  2 ++
>  arch/powerpc/include/asm/barrier.h |  2 +-
>  include/linux/spinlock.h   | 41 
> +-
>  kernel/sched/core.c|  5 +++--
>  4 files changed, 38 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/barrier.h 
> b/arch/arm64/include/asm/barrier.h
> index 4eea7f618dce..d5cc8b58f942 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -104,6 +104,8 @@ do {  
> \
>   VAL;\
>  })
>  
> +#define smp_mb__after_spinlock() smp_mb()
> +
>  #include 
>  
>  #endif   /* __ASSEMBLY__ */
> diff --git a/arch/powerpc/include/asm/barrier.h 
> b/arch/powerpc/include/asm/barrier.h
> index c0deafc212b8..23d64d7196b7 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -74,7 +74,7 @@ do {
> \
>   ___p1;  \
>  })
>  
> -#define smp_mb__before_spinlock()   smp_mb()
> +#define smp_mb__after_spinlock()   smp_mb()
>  
>  #include 
>  
> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> index 47dd0cebd204..284616dad607 100644
> --- a/include/linux/spinlock.h
> +++ b/include/linux/spinlock.h
> @@ -118,16 +118,39 @@ do {
> \
>  #endif
>  
>  /*
> - * Despite its name it doesn't necessarily has to be a full barrier.
> - * It should only guarantee that a STORE before the critical section
> - * can not be reordered with LOADs and STOREs inside this section.
> - * spin_lock() is the one-way barrier, this LOAD can not escape out
> - * of the region. So the default implementation simply ensures that
> - * a STORE can not move into the critical section, smp_wmb() should
> - * serialize it with another STORE done by spin_lock().
> + * This barrier must provide two things:
> + *
> + *   - it must guarantee a STORE before the spin_lock() is ordered against a
> + * LOAD after it, see the comments at its two usage sites.
> + *
> + *   - it must ensure the critical section is RCsc.
> + *
> + * The latter is important for cases where we observe values written by other
> + * CPUs in spin-loops, without barriers, while being subject to scheduling.
> + *
> + * CPU0  CPU1CPU2
> + * 
> + *   for (;;) {
> + * if (READ_ONCE(X))
> + *   break;
> + *   }
> + * X=1
> + *   
> + *   
> + *   r = X;
> + *
> + * without transitivity it could be that CPU1 observes X!=0 breaks the loop,
> + * we get migrated and CPU2 sees X==0.
> + *
> + * Since most load-store architectures implement ACQUIRE with an smp_mb() 
> after
> + * the LL/SC loop, they need no further barriers. Similarly all our TSO
> + * architectures imply an smp_mb() for each atomic instruction and equally 
> don't
> + * need more.
> + *
> + * Architectures that can implement ACQUIRE better need to take care.
>   */
> -#ifndef smp_mb__before_spinlock
> -#define smp_mb__before_spinlock()smp_wmb()
> +#ifndef smp_mb__after_spinlock
> +#define smp_mb__after_spinlock() do { } while (0)
>  #endif

It seems okay, but why not make it a special sched-only function name
to prevent it being used in generic code?
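
Purely as an illustration, with a hypothetical sched-only name wrapping
the same barrier (same semantics as the smp_mb__after_spinlock() in the
patch, just not visible to generic code):

/* kernel/sched-private wrapper; hypothetical name */
#define smp_mb__after_sched_lock()	smp_mb__after_spinlock()

static inline void lock_rq_for_schedule(raw_spinlock_t *lock)
{
	raw_spin_lock(lock);
	smp_mb__after_sched_lock();	/* upgrade the ACQUIRE to a full barrier */
}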

I would not mind seeing responsibility for the switch barrier moved t

Re: Build failure in -next due to 'kbuild: allow archs to select link dead code/data elimination'

2016-09-12 Thread Nicholas Piggin
On Mon, 12 Sep 2016 20:17:30 -0700
Guenter Roeck  wrote:

> Hi Nicholas,
> 
> On 09/12/2016 07:00 PM, Nicholas Piggin wrote:
> > On Mon, 12 Sep 2016 15:24:43 -0700
> > Guenter Roeck  wrote:
> >  
> >> Hi,
> >>
> >> your commit 'kbuild: allow archs to select link dead code/data elimination'
> >> is causing the following build failure in -next when building 
> >> score:defconfig.
> >>
> >> arch/score/kernel/built-in.o: In function `work_resched':
> >> arch/score/kernel/entry.o:(.text+0xe84):
> >>relocation truncated to fit: R_SCORE_PC19 against `schedule'
> >>
> >> Reverting the commit fixes the problem.
> >>
> >> Please let me know if I can help tracking down the problem.
> >> In case you need a score toochain, you can find the one I use at
> >> http://server.roeck-us.net/toolchains/score.tgz.
> >>
> >> Thanks,
> >> Guenter  
> >
> > It's not supposed to have any real effect unless the
> > option is selected, but there are a few changes to linker
> > script which must be causing it.
> >
> > There are two changes to vmlinux.lds.h. One is to KEEP a
> > few input sections, that *should* be kept anyway. The other
> > is to bring in additional sections into their correct output
> > section.
> >
> > Could you try reverting those lines of vmlinux.lds.h that
> > change the latter, i.e.:
> >
> > -   *(.text.hot .text .text.fixup .text.unlikely)   \
> > +   *(.text.hot .text .text.fixup .text.unlikely .text.*)   \  
> 
> This is the culprit. After removing " .text.*" it builds fine.

Thanks for testing it. Some architectures have .text.x sections
not included here, I should have seen that. We should possibly
just revert that line and require implementing archs to do the
right thing.

[ Convention is to use an extra '.' so the suffix is not a C identifier,
i.e. we should use .text..x for these things. But that's not done
completely throughout the kernel, and it's a little invasive to do with
this series. I'll revisit this and clean things up after this round
gets merged. ]

Thanks,
Nick


Re: linux-next: manual merge of the kbuild tree with Linus' tree

2016-09-12 Thread Nicholas Piggin
On Tue, 13 Sep 2016 14:02:57 +1000
Stephen Rothwell  wrote:

> Hi Michal,
> 
> [For the new cc's, we are discussing the "thin archives" and "link dead
> code/data elimination" patches in the kbuild tree.]
> 
> On Tue, 13 Sep 2016 09:39:45 +1000 Stephen Rothwell  
> wrote:
> >
> > On Mon, 12 Sep 2016 11:03:08 +0200 Michal Marek  wrote:  
> > >
> > > On 2016-09-12 04:53, Nicholas Piggin wrote:
> > > > Question, what is the best way to merge dependent patches? Considering
> > > > they will need a good amount of architecture testing, I think they will
> > > > have to go via arch trees. But it also does not make sense to merge 
> > > > these
> > > > kbuild changes upstream first, without having tested them.  
> > > 
> > > I think it makes sense to merge the kbuild changes via kbuild.git, even
> > > if they are unused and untested. Any follow-up fixes required to enable
> > > the first architecture can go through the respective architecture tree.
> > > Does that sound OK?
> > 
> > And if you guarantee not to rebase the kbuild tree (or at least the
> > subset containing these patches), then each of the architecture trees
> > can just merge your tree (or a tag?) and then implement any necessary
> > arch dependent changes.  If fixes are necessary, they can also be merged
> > into the architecture trees.  
> 
> Except, of course, the kbuild tree still has the asm EXPORT_SYMBOL
> patches that produce warnings on PowerPC :-( (And I am still reverting
> the PowerPC specific one of those patches).
> 

I'm working on a better patch to fix that (and to whitelist powerpc's
relocation checks so it does not get blamed for such breakage), although
no guarantees about that yet.

However some of the enablement and subsequent patches I would like to
merge are quite architecture specific, and I would prefer them to go
via arch trees.

So I would like to see a kbuild branch with these 3 in it, if arch
maintainers (or specifically powerpc) would be willing to pull it in
their -next branches.

Thanks,
Nick



Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-12 Thread Nicholas Piggin
On Mon, 12 Sep 2016 21:06:49 -0700
Dan Williams  wrote:

> On Mon, Sep 12, 2016 at 6:31 PM, Nicholas Piggin  wrote:
> > On Mon, 12 Sep 2016 08:01:48 -0700  
> [..]
> > That said, a noop system call is on the order of 100 cycles nowadays,
> > so rushing to implement these APIs without seeing good numbers and
> > actual users ready to go seems premature. *This* is the real reason
> > not to implement new APIs yet.  
> 
> Yes, and harvesting the current crop of low hanging performance fruit
> in the filesystem-DAX I/O path remains on the todo list.
> 
> In the meantime we're pursuing this mm api, mincore+ or whatever we
> end up with, to allow userspace to distinguish memory address ranges
> that are backed by a filesystem requiring coordination of metadata
> updates + flushes for updates, vs something like device-dax that does
> not.

Yes, that's reasonable.

Do you need page/block granularity? Do you need a way to advise/request
the fs for a particular capability? Is it enough to request and check
success? Would the capability be likely to change, and if so, how would
you notify the app asynchronously?



Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-13 Thread Nicholas Piggin
On Tue, 13 Sep 2016 00:17:32 -0700
Christoph Hellwig  wrote:

> On Tue, Sep 13, 2016 at 11:53:11AM +1000, Nicholas Piggin wrote:
> > - Application mmaps a file, faults in block 0
> > - FS allocates block, creates mappings, syncs metadata, sets "no fsync"
> >   flag for that block, and completes the fault.
> > - Application writes some data to block 0, completes userspace flushes
> > 
> > * At this point, a crash must return with above data (or newer).
> > 
> > - Application starts writing more stuff into block 0
> > - Concurrently, fault in block 1
> > - FS starts to allocate, splits trees including mappings to block 0
> > 
> > * Crash
> > 
> > Is that right? How does your filesystem lose data before the sync
> > point?  
> 
> Witht all current file systems chances are your metadata hasn't been
> flushed out.  You could write all metadata synchronously from the

Yes, that's a possibility. Another would be an advise call to
request the capability for a given region.


> page fault handler, but that's basically asking for all kinds of
> deadlocks.

Such as?


> > If there is any huge complexity or unsolved problem, it is in XFS.
> > Conceptual problem is simple.  
> 
> Good to have you back and make all the hard thing simple :)

Thanks...? :)

I don't mean to say it's simple to add it to any filesystem or
that the vfs and mm don't need any changes at all.

If we can agree on something, it's that no new APIs should be added
without careful thought, justification, and users. I only suggest not to
dismiss it out of hand.

Thanks,
Nick


Re: linux-next: manual merge of the kbuild tree with Linus' tree

2016-09-13 Thread Nicholas Piggin
On Tue, 13 Sep 2016 09:48:03 +0200
Arnd Bergmann  wrote:

> On Tuesday, September 13, 2016 2:02:57 PM CEST Stephen Rothwell wrote:
> > [For the new cc's, we are discussing the "thin archives" and "link dead
> > code/data elimination" patches in the kbuild tree.]
> > 
> > On Tue, 13 Sep 2016 09:39:45 +1000 Stephen Rothwell  
> > wrote:  
> > >
> > > On Mon, 12 Sep 2016 11:03:08 +0200 Michal Marek  wrote:  
> > > >
> > > > On 2016-09-12 04:53, Nicholas Piggin wrote:
> > > > > Question, what is the best way to merge dependent patches? Considering
> > > > > they will need a good amount of architecture testing, I think they 
> > > > > will
> > > > > have to go via arch trees. But it also does not make sense to merge 
> > > > > these
> > > > > kbuild changes upstream first, without having tested them.  
> > > > 
> > > > I think it makes sense to merge the kbuild changes via kbuild.git, even
> > > > if they are unused and untested. Any follow-up fixes required to enable
> > > > the first architecture can go through the respective architecture tree.
> > > > Does that sound OK?
> > > 
> > > And if you guarantee not to rebase the kbuild tree (or at least the
> > > subset containing these patches), then each of the architecture trees
> > > can just merge your tree (or a tag?) and then implement any necessary
> > > arch dependent changes.  If fixes are necessary, they can also be merged
> > > into the architecture trees.  
> > 
> > Except, of course, the kbuild tree still has the asm EXPORT_SYMBOL
> > patches that produce warnings on PowerPC  (And I am still reverting
> > the PowerPC specific one of those patches).  
> 
> Is that really powerpc specific? I have the same problem on ARM
> and I don't see how any architecture would not have it.
> 
> I prototyped the patch below, which fixes it for me, but I have
> not dared submit that workaround because it's butt ugly.

No, it's not powerpc specific; it's just that the powerpc build dies
if there are unresolved relocations.

Interesting approach. I have something different that may rival
yours for ugliness, but maybe keeps the muck a bit more contained.
I was just about to submit it, but now I'll wait to see if there is
a preference between the approaches:

(Note this patch alone does not resolve all export symbols, each
arch next needs to add C prototypes for their .S exports)

 scripts/Makefile.build | 71 +-
 1 file changed, 65 insertions(+), 6 deletions(-)

diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 11602e5..1e89908 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -158,7 +158,8 @@ cmd_cpp_i_c   = $(CPP) $(c_flags) -o $@ $<
 $(obj)/%.i: $(src)/%.c FORCE
$(call if_changed_dep,cpp_i_c)
 
-cmd_gensymtypes =   \
+# These mirror gensymtypes_S and co below, keep them in synch.
+cmd_gensymtypes_c = \
 $(CPP) -D__GENKSYMS__ $(c_flags) $< |   \
 $(GENKSYMS) $(if $(1), -T $(2)) \
  $(patsubst y,-s _,$(CONFIG_HAVE_UNDERSCORE_SYMBOL_PREFIX)) \
@@ -168,7 +169,7 @@ cmd_gensymtypes =   
\
 quiet_cmd_cc_symtypes_c = SYM $(quiet_modtag) $@
 cmd_cc_symtypes_c = \
 set -e; \
-$(call cmd_gensymtypes,true,$@) >/dev/null; \
+$(call cmd_gensymtypes_c,true,$@) >/dev/null;   \
 test -s $@ || rm -f $@
 
 $(obj)/%.symtypes : $(src)/%.c FORCE
@@ -197,9 +198,10 @@ else
 #   the actual value of the checksum generated by genksyms
 
 cmd_cc_o_c = $(CC) $(c_flags) -c -o $(@D)/.tmp_$(@F) $<
-cmd_modversions =  
\
+
+cmd_modversions_c =
\
if $(OBJDUMP) -h $(@D)/.tmp_$(@F) | grep -q __ksymtab; then 
\
-   $(call cmd_gensymtypes,$(KBUILD_SYMTYPES),$(@:.o=.symtypes))
\
+   $(call cmd_gensymtypes_c,$(KBUILD_SYMTYPES),$(@:.o=.symtypes))  
\
> $(@D)/.tmp_$(@F:.o=.ver); 
\

\
$(LD) $(LDFLAGS) -r -o $@ $(@D)/.tmp_$(@F)  
\
@@ -267,13

Re: Build failure in -next due to 'kbuild: allow archs to select link dead code/data elimination'

2016-09-13 Thread Nicholas Piggin
On Tue, 13 Sep 2016 13:25:22 -0700
Guenter Roeck  wrote:

> On 09/12/2016 08:51 PM, Nicholas Piggin wrote:
> > On Mon, 12 Sep 2016 20:17:30 -0700
> > Guenter Roeck  wrote:
> >  
> >> Hi Nicholas,
> >>
> >> On 09/12/2016 07:00 PM, Nicholas Piggin wrote:  
> >>> On Mon, 12 Sep 2016 15:24:43 -0700
> >>> Guenter Roeck  wrote:
> >>>  
> >>>> Hi,
> >>>>
> >>>> your commit 'kbuild: allow archs to select link dead code/data 
> >>>> elimination'
> >>>> is causing the following build failure in -next when building 
> >>>> score:defconfig.
> >>>>
> >>>> arch/score/kernel/built-in.o: In function `work_resched':
> >>>> arch/score/kernel/entry.o:(.text+0xe84):
> >>>>  relocation truncated to fit: R_SCORE_PC19 against `schedule'
> >>>>
> >>>> Reverting the commit fixes the problem.
> >>>>
> >>>> Please let me know if I can help tracking down the problem.
> >>>> In case you need a score toolchain, you can find the one I use at
> >>>> http://server.roeck-us.net/toolchains/score.tgz.
> >>>>
> >>>> Thanks,
> >>>> Guenter  
> >>>
> >>> It's not supposed to have any real effect unless the
> >>> option is selected, but there are a few changes to linker
> >>> script which must be causing it.
> >>>
> >>> There are two changes to vmlinux.lds.h. One is to KEEP a
> >>> few input sections, that *should* be kept anyway. The other
> >>> is to bring in additional sections into their correct output
> >>> section.
> >>>
> >>> Could you try reverting those lines of vmlinux.lds.h that
> >>> change the latter, i.e.:
> >>>
> >>> - *(.text.hot .text .text.fixup .text.unlikely)   \
> >>> + *(.text.hot .text .text.fixup .text.unlikely .text.*)   \  
> >>
> >> This is the culprit. After removing " .text.*" it builds fine.  
> >
> > Thanks for testing it. Some architectures have .text.x sections
> > not included here, I should have seen that. We should possibly
> > just revert that line and require implementing archs to do the
> > right thing.
> >  
> I would call the build failure a regression, so it should either be
> reverted, or we'll need some other solution to fix the build failure.

Yes, it's definitely a regression and that part should be reverted.
To confirm: does the following patch fix your build?

Thanks,
Nick

commit 0ae28be83b4d6cd03ef5b481487d042f2b91954e
Author: Nicholas Piggin 
Date:   Wed Sep 14 12:24:03 2016 +1000

kbuild: -ffunction-sections fix for archs with conflicting sections

Enabling -ffunction-sections modified the generic linker script to
pull .text.* sections into regular TEXT_TEXT section, conflicting
with some architectures. Revert that change and require archs that
enable the option to ensure they have no conflicting section names,
and do the appropriate merging.

Signed-off-by: Nicholas Piggin 

diff --git a/arch/Kconfig b/arch/Kconfig
index 6d5b631..8248037 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -487,7 +487,9 @@ config LD_DEAD_CODE_DATA_ELIMINATION
  This requires that the arch annotates or otherwise protects
  its external entry points from being discarded. Linker scripts
  must also merge .text.*, .data.*, and .bss.* correctly into
- output sections.
+ output sections. Care must be taken not to pull in unrelated
+ sections (e.g., '.text.init'). Typically '.' in section names
+ is used to distinguish them from label names / C identifiers.
 
 config HAVE_CONTEXT_TRACKING
bool
diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index ad9d8f9..48dd44f 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -198,9 +198,9 @@
 
 /*
  * .data section
- * -fdata-sections generates .data.identifier which needs to be pulled in
- * with .data, but don't want to pull in .data..stuff which has its own
- * requirements. Same for bss.
+ * LD_DEAD_CODE_DATA_ELIMINATION option enables -fdata-sections generates
+ * .data.identifier which needs to be pulled in with .data, but don't want to
+ * pull in .data..stuff which has its own requirements. Same for bss.
  */
 #define DATA_DATA  \
*(.data .data.[0-9a-zA-Z_]*)\
@@ -434,10 +434,15 @@
}
 
 /* .text 

Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-14 Thread Nicholas Piggin
On Wed, 14 Sep 2016 17:39:02 +1000
Dave Chinner  wrote:

> On Tue, Sep 13, 2016 at 11:53:11AM +1000, Nicholas Piggin wrote:
> > On Tue, 13 Sep 2016 07:34:36 +1000
> > Dave Chinner  wrote:
> > But let me understand your example in the absence of that.
> > 
> > - Application mmaps a file, faults in block 0
> > - FS allocates block, creates mappings, syncs metadata, sets "no fsync"
> >   flag for that block, and completes the fault.
> > - Application writes some data to block 0, completes userspace flushes
> > 
> > * At this point, a crash must return with above data (or newer).
> > 
> > - Application starts writing more stuff into block 0
> > - Concurrently, fault in block 1
> > - FS starts to allocate, splits trees including mappings to block 0
> > 
> > * Crash
> > 
> > Is that right?  
> 
> No.
> 
> - app write faults block 0, fs allocates
> < time passes while app does stuff to block 0 mapping >
> - fs syncs journal, block 0 metadata now persistent
> < time passes while app does stuff to block 0 mapping >
> - app structure grows, faults block 1, fs allocates
> - app adds pointers to data in block 1 from block 0, does
>   userspace pmem data sync.
> 
> *crash*
> 
> > How does your filesystem lose data before the sync
> > point?  
> 
> After recovery, file has a data in block 0, but no block 1 because
> the allocation transaction for block 1 was not flushed to the
> journal. Data in block 0 points to data in block 1, but block 1 does
> not exist. IOWs, the application has corrupt data because it never
> issued a data synchronisation request to the filesystem
> 
> 
> 
> Ok, looking back over your example, you seem to be suggesting a new
> page fault behaviour is required from filesystems that has not been
> described or explained, and that behaviour is triggered
> (persistently) somehow from userspace. You've also suggested
> filesystems store a persistent per-block "no fsync" flag
> in their extent map as part of the implementation. Right?

This is what we're talking about. Of course a filesystem can't just
start supporting the feature without any changes.


> Reading between the lines, I'm guessing that the "no fsync" flag has
> very specific update semantics, constraints and requirements.  Can
> you outline how you expect this flag to be set and updated, how it's
> used consistently between different applications (e.g. cp of a file
> vs the app using the file), behavioural constraints it implies for
> page faults vs non-mmap access to the data in the block, how
> you'd expect filesystems to deal with things like a hole punch
> landing in the middle of an extent marked with "no fsync", etc?

Well, that's what's being discussed. An approach close to what I did is
to allow the app to request a "no sync" type of mmap. The filesystem will
invalidate all such mappings before it does buffered IO or hole punch,
and will sync metadata after allocating a new block but before returning
from a fault.

The app could query rather than request, but I found that request seemed
to work better. The filesystem might be working with apps that don't use
the feature, for example, and doesn't want to flush just in case anyone
ever queried in the past.
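
A rough C-style sketch of that fault-side sequence (all the fs_* helper
names below are placeholders, not an existing filesystem API):

static int fs_dax_nosync_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vma->vm_file);
	sector_t sector;
	int ret;

	ret = fs_allocate_block(inode, vmf->pgoff, &sector);	/* may allocate */
	if (ret)
		return VM_FAULT_SIGBUS;

	/* Make the new block mapping durable before the fault completes,
	 * so userspace cache flushes alone are enough afterwards. */
	ret = fs_sync_metadata_for_block(inode, vmf->pgoff);
	if (ret)
		return VM_FAULT_SIGBUS;

	return fs_dax_insert_mapping(vma, vmf, sector);		/* finish fault */
}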


> [snip]
> 
> > If there is any huge complexity or unsolved problem, it is in XFS.
> > Conceptual problem is simple.  
> 
> Play nice and be constructive, please?

So you agree that the persistent memory people who have come forward with some
requirements and ideas for an API should not be immediately shut down
with bogus handwaving.

Thanks,
Nick


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-14 Thread Nicholas Piggin
On Thu, 15 Sep 2016 12:31:33 +1000
Dave Chinner  wrote:

> On Wed, Sep 14, 2016 at 08:19:36PM +1000, Nicholas Piggin wrote:
> > On Wed, 14 Sep 2016 17:39:02 +1000
> > Dave Chinner  wrote:  
> > > Ok, looking back over your example, you seem to be suggesting a new
> > > page fault behaviour is required from filesystems that has not been
> > > described or explained, and that behaviour is triggered
> > > (persistently) somehow from userspace. You've also suggested
> > > filesystems store a persistent per-block "no fsync" flag
> > > in their extent map as part of the implementation. Right?  
> > 
> > This is what we're talking about. Of course a filesystem can't just
> > start supporting the feature without any changes.  
> 
> Sure, but one first has to describe the feature desired before all

The DAX people have been. They want to be able to get mappings
that can be synced without doing fsync. The *exact* extent of
those capabilities and what the API exactly looks like is up for
discussion.


> parties can discuss it. We need more than vague references and
> allusions from you to define the solution you are proposing.
> 
> Once everyone understands what is being describing, we might be able
> to work out how it can be implemented in a simple, generic manner
> rather than require every filesystem to change their on-disk
> formats. IOWs, we need you to describe /details/ of semantics,
> behaviour and data integrity constraints that are required, not
> describe an implementation of something we have no knwoledge about.

Well you said it was impossible already and Christoph told them
they were smoking crack :)

Anyway, there's a few questions. Implementation and API. Some
filesystems may never cope with it. Of course it should be as
generic as possible though.


> > > Reading between the lines, I'm guessing that the "no fsync" flag has
> > > very specific update semantics, constraints and requirements.  Can
> > > you outline how you expect this flag to be set and updated, how it's
> > > used consistently between different applications (e.g. cp of a file
> > > vs the app using the file), behavioural constraints it implies for
> > > page faults vs non-mmap access to the data in the block, how
> > > you'd expect filesystems to deal with things like a hole punch
> > > landing in the middle of an extent marked with "no fsync", etc?  
> > 
> > Well that's what's being discussed.  An approach close to what I did is
> > to allow the app request a "no sync" type of mmap.  
> 
> That's not an answer to the questions I asked about about the "no
> sync" flag you were proposing. You've redirected to the a different
> solution, one that 

A no-sync flag would do exactly the same thing in terms of consistency.
It would just do the no-sync sequence by default rather than when
asked for it. It's more of an API detail than an implementation one.

> 
> > Filesystem will
> > invalidate all such mappings before it does buffered IOs or hole punch,
> > and will sync metadata after allocating a new block but before returning
> > from a fault.  
> 
> ... requires synchronous metadata updates from page fault context,
> which we already know is not a good solution.  I'll quote one of
> Christoph's previous replies to save me the trouble:
> 
>   "You could write all metadata synchronously from the page
>   fault handler, but that's basically asking for all kinds of
>   deadlocks."
> So, let's redirect back to the "no sync" flag you were talking about
> - can you answer the questions I asked above? It would be especially
> important to highlight how the proposed feature would avoid requiring
> synchronous metadata updates in page fault contexts

Right. So what deadlocks are you concerned about?

There could be a scale of capabilities here, for different filesystems
that do things differently. 

Some filesystems could require fsync for metadata, but allow fdatasync
to be skipped. Users would need to have some knowledge of block size
or do preallocation and sync.

That might put more burden on libraries/applications if there are
concurrent operations, but that might be something they can deal with
-- fdatasync already requires some knowledge of concurrent operations
(or lack thereof).
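
To make that concrete, here is a minimal userspace sketch of the
"preallocate and sync once" model. size, off, and byte are placeholders,
and flush_to_persistence() is a stand-in for whatever userspace flush
primitive the PMEM library provides, not a real API:

    char *p;
    int fd = open("/mnt/dax/file", O_RDWR | O_CREAT, 0600);

    posix_fallocate(fd, 0, size);        /* allocate all blocks up front */
    fsync(fd);                           /* make the allocation metadata stable */

    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    p[off] = byte;                       /* store goes straight to pmem */
    flush_to_persistence(p + off, 1);    /* no fsync/fdatasync needed for the data */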


> > > [snip]
> > >   
> > > > If there is any huge complexity or unsolved problem, it is in XFS.
> > > > Conceptual problem is simple.
> > > 
> > > Play nice and be constructive, please?  
> > 
> > So you agree that the persistent memory people who have

Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-15 Thread Nicholas Piggin
On Thu, 15 Sep 2016 20:32:10 +1000
Dave Chinner  wrote:

> On Thu, Sep 15, 2016 at 01:49:45PM +1000, Nicholas Piggin wrote:
> > On Thu, 15 Sep 2016 12:31:33 +1000
> > Dave Chinner  wrote:
> >   
> > > On Wed, Sep 14, 2016 at 08:19:36PM +1000, Nicholas Piggin wrote:  
> > > > On Wed, 14 Sep 2016 17:39:02 +1000  
> > > Sure, but one first has to describe the feature desired before all  
> > 
> > The DAX people have been.  
> 
> H. the only "DAX people" I know of are kernel developers who
> have been working on implementing DAX in the kernel - Willy, Ross,
> Dan, Jan, Christoph, Kirill, myelf and a few others around the
> fringes.
> 
> > They want to be able to get mappings
> > that can be synced without doing fsync.  
> 
> Oh, ok, the Intel Userspace PMEM library requirement. I though you
> had something more that this - whatever extra problem the per-block
> no fsync flag would solve?

Only the PMEM really. I don't want to add more complexity than
required.

> > The *exact* extent of
> > those capabilities and what the API exactly looks like is up for
> > discussion.  
> 
> Yup.
> 
> > Well you said it was impossible already and Christoph told them
> > they were smoking crack :)  
> 
> I have not said that. I have said bad things about bad
> proposals and called the PMEM library model broken, but I most
> definitely have not said that solving the problem is impossible.
>
> > > That's not an answer to the questions I asked about about the "no
> > > sync" flag you were proposing. You've redirected to the a different
> > > solution, one that   
> > 
> > No sync flag would do the same thing exactly in terms of consistency.
> > It would just do the no-sync sequence by default rather than being
> > asked for it. More of an API detail than implementation.  
> 
> You still haven't described anything about what a per-block flag
> design is supposed to look like :/

For the API, or the implementation? I'm not quite sure what you mean
here. For the implementation, it's possible to carefully ensure metadata
is persistent when allocating blocks in the page fault path, but before
mapping the pages. Truncate, hole punch, and such things can be made to
work by invalidating all such mappings and holding them off until
you can cope with them again. Not necessarily for a filesystem with
*all* the capabilities of XFS -- I don't know -- but for a complete basic
one.
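
Roughly, the fault-path ordering I have in mind is something like this
(a sketch only; the fs_* helpers are hypothetical, not any real
filesystem's code):

    static int nosync_dax_fault(struct vm_fault *vmf)
    {
        sector_t blk;

        blk = fs_allocate_block(vmf);    /* may create new allocation metadata */
        fs_sync_metadata(vmf);           /* make that metadata persistent first */
        return fs_install_pfn(vmf, blk); /* only then map the page to userspace */
    }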

> > > > Filesystem will
> > > > invalidate all such mappings before it does buffered IOs or hole punch,
> > > > and will sync metadata after allocating a new block but before returning
> > > > from a fault.
> > > 
> > > ... requires synchronous metadata updates from page fault context,
> > > which we already know is not a good solution.  I'll quote one of
> > > Christoph's previous replies to save me the trouble:
> > > 
> > >   "You could write all metadata synchronously from the page
> > >   fault handler, but that's basically asking for all kinds of
> > >   deadlocks."
> > > So, let's redirect back to the "no sync" flag you were talking about
> > > - can you answer the questions I asked above? It would be especially
> > > important to highlight how the proposed feature would avoid requiring
> > > synchronous metadata updates in page fault contexts  
> > 
> > Right. So what deadlocks are you concerned about?  
> 
> It basically puts the entire journal checkpoint path under a page
> fault context. i.e. a whole new global locking context problem is

Yes, there are potentially some new lock orderings created if you
do that, depending on what locks the filesystem takes.

> created as this path can now be run both inside and outside the
> mmap_sem. Nothing ever good comes from running filesystem locking
> code both inside and outside the mmap_sem.

You mean that in some cases the journal checkpoint runs with mmap_sem
held, and in others without mmap_sem held? Not that mmap_sem is taken
inside the journal checkpoint. Then I don't really see why that's a
problem. I mean, performance could suffer a bit, but with fault
retry you can almost always do the syncing outside mmap_sem in
practice.
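
By fault retry I mean the existing VM_FAULT_RETRY mechanism, roughly like
this (a sketch; need_metadata_sync and sync_metadata() are hypothetical
stand-ins):

    if (need_metadata_sync && (vmf->flags & FAULT_FLAG_ALLOW_RETRY)) {
        up_read(&vma->vm_mm->mmap_sem);  /* drop mmap_sem before syncing */
        sync_metadata(inode);            /* journal checkpoint outside mmap_sem */
        return VM_FAULT_RETRY;           /* fault is simply retried afterwards */
    }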

Yes, I'll preemptively agree with you -- We don't want to add any
such burden if it is not needed and well justified.

> FWIW, We've never executed synchronous transactions inside page
> faults in XFS, and I think ext4 is in the same boat - it may be even
> worse because of the way it does ordered data dispatch through the
> journal. I don't really even want to think about

Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures

2016-08-04 Thread Nicholas Piggin
On Thu, 4 Aug 2016 22:31:39 +1000
Nicholas Piggin  wrote:
> On Thu, 04 Aug 2016 14:09:02 +0200
> Arnd Bergmann  wrote:
> > Nicolas Pitre has done some related work, adding him to Cc. IIRC we have
> > actually had multiple implementations of -ffunction-sections/--gc-sections
> > in the past that people have used in production, but none of them
> > ever made it upstream.  

After some googling around, it seems LTO has been difficult to
get in, and it was agreed this gc-sections work should be done first
anyway (LTO may indeed provide a superset of DCE, but
it's always going to be more costly and complicated). LTO would
have the same issue with liveness of entry points, which is
really the only thing you need to change in the kernel as far as I
can see.

I didn't really see what problems people were having with it
though, so maybe it's architecture specific or something I
haven't run into yet.

Thanks,
Nick


Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures

2016-08-05 Thread Nicholas Piggin
On Thu, 4 Aug 2016 12:06:41 -0500
Segher Boessenkool  wrote:

> On Thu, Aug 04, 2016 at 06:10:57PM +0200, Arnd Bergmann wrote:
> > On Thursday, August 4, 2016 9:47:13 PM CEST Nicholas Piggin wrote:
> >   
> > > + __used  \
> > > + __attribute__((section("___kentry" "+" #sym ",\"a\",@note #"), used)) \ 
> > >  
> > 
> > 
> > I've just started testing this, but the first problem I ran into
> > is that @ and # are special characters that have an architecture
> > specific meaning to the assembler. On ARM, you need "%note @" instead
> > of "@note #".  
> 
> That comment trick (I still feel guilty about it) causes more problems
> than it solves.  Please don't try to use it :-)

Yeah that's a funny hack. I don't think it's required though, but I'm just
running through some more tests.

I think I found an improvement with the thin archives as well -- we were
still building the symbol table after removing the "s" option (that only
avoids the index). "S" is required to not build the symbol table.

I'll send out an RFC on a slightly more polished patch series shortly.

Thanks,
Nick


Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures

2016-08-05 Thread Nicholas Piggin
On Fri, 05 Aug 2016 12:17:27 +0200
Arnd Bergmann  wrote:

> On Friday, August 5, 2016 6:41:08 PM CEST Nicholas Piggin wrote:
> > On Thu, 4 Aug 2016 12:06:41 -0500
> > Segher Boessenkool  wrote:
> >   
> > > On Thu, Aug 04, 2016 at 06:10:57PM +0200, Arnd Bergmann wrote:  
> > > > On Thursday, August 4, 2016 9:47:13 PM CEST Nicholas Piggin wrote:
> > > > 
> > > > > + __used  \
> > > > > + __attribute__((section("___kentry" "+" #sym ",\"a\",@note #"), 
> > > > > used)) \
> > > > 
> > > > 
> > > > I've just started testing this, but the first problem I ran into
> > > > is that @ and # are special characters that have an architecture
> > > > specific meaning to the assembler. On ARM, you need "%note @" instead
> > > > of "@note #".
> > > 
> > > That comment trick (I still feel guilty about it) causes more problems
> > > than it solves.  Please don't try to use it :-)  
> > 
> > Yeah that's a funny hack. I don't think it's required though, but I'm just
> > running through some more tests.
> > 
> > I think I found an improvement with the thin archives as well -- we were
> > still building symbol table after removing the s option (that only avoids
> > index). "S" is required to not build symbol table.
> > 
> > I'll send out an RFC on a slightly more polished patch series shortly.  
> 
> 
> I could not find Nico's patches, but based on the information in his
> presentation at
> 
> https://www.linuxplumbersconf.org/2015/ocw//system/presentations/3369/original/slides.html#(1)
> 
> I created a patch for ARM that mirrors what you have for powerpc, see
> below.

Great, thanks for jumping in. I posted another set which is a lot improved;
you should pick that up.


> I have successfully built normal-sized kernels with this (not tried
> running them). Unfortunately, the build time for "allyesconfig"
> kernel explodes, the final link time is now in the hours instead of
> minutes (no exact numbers unfortunately, it takes too long to
> reproduce),

That's because we need to coalesce the new sections properly into the
output file. binutils does not cope with a vast number of sections in the
final linked file; it spends all its time in hash lookups and then usually
explodes.


> and I also get link errors for the .text.fixup section
> for any users of __put_user() in really large kernels:
> net/batman-adv/batman-adv.o:(.text.fixup+0x4): relocation truncated to fit: 
> R_ARM_JUMP24 against `.text.batadv_log_read'

This may be fixed by fixing the linker script to bring in the new
sections properly (see new patchset).

If not, then if you can combine the sections rather than have them
consecutive in the output, e.g.,:

*(.text .text.fixup)

Rather than

*(.text)
*(.text.fixup)

Then the linker has more freedom to rearrange them. I realize it's
not that simple with ARM's .text.fixup, but maybe that helps you
get it to work.

Thanks,
Nick


Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures

2016-08-05 Thread Nicholas Piggin
On Fri, 05 Aug 2016 18:01:13 +0200
Arnd Bergmann  wrote:

> On Friday, August 5, 2016 10:26:25 PM CEST Nicholas Piggin wrote:
> > On Fri, 05 Aug 2016 12:17:27 +0200
> > Arnd Bergmann  wrote:  
> 
> > > and I also get link errors for the .text.fixup section
> > > for any users of __put_user() in really large kernels:
> > > net/batman-adv/batman-adv.o:(.text.fixup+0x4): relocation truncated to 
> > > fit: R_ARM_JUMP24 against `.text.batadv_log_read'  
> > 
> > This may be fixed by fixing the linker script to bring in the new
> > sections properly (see new patchset).
> > 
> > If not, then if you can combine the sections rather than have them
> > consecutive in the output, e.g.,:
> > 
> > *(.text .text.fixup)
> > 
> > Rather than
> > 
> > *(.text)
> > *(.text.fixup)
> > 
> > Then the linker has more freedom to rearrange them. I realize it's
> > not that simple with ARM's .text.fixup, but maybe that helps you
> > get it to work.  
> 
> This did the trick:
> 
> diff --git a/include/asm-generic/vmlinux.lds.h 
> b/include/asm-generic/vmlinux.lds.h
> index 0ec807d69f18..7a3ad269fa23 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -433,7 +433,7 @@
>   * during second ld run in second ld pass when generating System.map */
>  #define TEXT_TEXT\
>   ALIGN_FUNCTION();   \
> - *(.text.hot .text .text.fixup .text.unlikely)   \
> + *(.text.hot .text .text.* .text.fixup .text.unlikely)   \
>   *(.ref.text)\
>   MEM_KEEP(init.text) \
>   MEM_KEEP(exit.text) \
> 
> 
> It also got much faster again, the link time for an allyesconfig
> kernel is now 18 minutes instead of 10 hours, but it's still
> much worse than the 2 minutes I had earlier or the four minutes
> with the previous patch.

Are you using the patches I just sent? Either way, you also need
to do the same for data and bss sections as you are using
-fdata-sections too.
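i.e. the data and bss entries in include/asm-generic/vmlinux.lds.h need
the same kind of wildcard treatment as .text, something like this (a
sketch, not the exact hunk):

    *(.data .data.*)
    ...
    *(.bss .bss.*)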

I've found virtually no build time regression on powerpc or x86
when those are taken care of properly (the x86 numbers I sent had a typo:
it's not 5m20, it's 5m02).

Thanks,
Nick


Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures

2016-08-06 Thread Nicholas Piggin
On Fri, 05 Aug 2016 21:16:00 +0200
Arnd Bergmann  wrote:

> On Saturday, August 6, 2016 2:16:42 AM CEST Nicholas Piggin wrote:
> > > 
> > > diff --git a/include/asm-generic/vmlinux.lds.h 
> > > b/include/asm-generic/vmlinux.lds.h
> > > index 0ec807d69f18..7a3ad269fa23 100644
> > > --- a/include/asm-generic/vmlinux.lds.h
> > > +++ b/include/asm-generic/vmlinux.lds.h
> > > @@ -433,7 +433,7 @@
> > >   * during second ld run in second ld pass when generating System.map */
> > >  #define TEXT_TEXT\
> > >   ALIGN_FUNCTION();   \
> > > - *(.text.hot .text .text.fixup .text.unlikely)   \
> > > + *(.text.hot .text .text.* .text.fixup .text.unlikely)   \
> > >   *(.ref.text)\
> > >   MEM_KEEP(init.text) \
> > >   MEM_KEEP(exit.text) \
> > > 
> > > 
> > > It also got much faster again, the link time for an allyesconfig
> > > kernel is now 18 minutes instead of 10 hours, but it's still
> > > much worse than the 2 minutes I had earlier or the four minutes
> > > with the previous patch.  
> > 
> > Are you using the patches I just sent?  
> 
> Not yet, I was still busy with the older version, and trying to
> figure out exactly what went wrong in ld.bfd. FWIW, I first tried
> to see if the hash tables were just too small, but as it turned
> out that was not the problem. When I tried to change the default
> hash table sizes, making them bigger only made things slower.
> 
> I also found the --hash-size=xxx option, which has a significant
> impact on runtime speed. Interestingly again, using sizes less
> than the default made things faster in practice. If we can
> work out the optimum size for the kernel build, that might
> shave a few minutes off the total build time.
> 
> > Either way, you also need
> > to do the same for data and bss sections as you are using
> > -fdata-sections too.  
> 
> Right.
> 
> > I've found virtually no build time regression on powerpc or x86
> > when those are taken care of properly (x86 numbers I sent are typo,
> > it's not 5m20, it's 5m02).  
> 
> Interesting. I wonder if it's got something to do with the
> generation of the branch trampolines on ARM, as we have a lot
> of them on an allyesconfig.

Powerpc generates quite a few branch trampolines as well, so
I'm not sure if that would be the issue. Can you get a profile
of the link?

Are you linking with archives? Do your input archives have a
symbol index built?


> Is the 5m20 the total build time for the kernel, the time for
> rebuilding after a trivial change, or the time to call 'ld.bfd'
> once?

5m02 was the total time for x86 defconfig. With the powerpc
allyesconfig build, the final link:

$ time ld -EL -m elf64lppc -pie --emit-relocs --build-id --gc-sections -X -o 
vmlinux -T ./arch/powerpc/kernel/vmlinux.lds --whole-archive built-in.o 
.tmp_kallsyms2.o

real0m15.556s
user0m13.288s
sys 0m2.240s

$ ls -lh vmlinux
-rwxrwxr-x 1 npiggin npiggin 279M Aug  6 14:02 vmlinux

Without -pie --emit-relocs it's 11.8s and 150M but I'm using
emit-relocs for a post-link step.


> Are you using ld.bfd on x86 or ld.gold? For me ld.gold either
> works and is really fast, or it crashes, depending on the
> configuration. I also don't think it supports big-endian ARM
> (which is what allyesconfig ends up using).

ld.bfd on both. Gold crashed on powerpc and I didn't try it on x86.

Thanks,
Nick


Re: linux-next: build warnings after merge of the kbuild tree

2016-08-25 Thread Nicholas Piggin
On Mon, 22 Aug 2016 20:47:58 +1000
Nicholas Piggin  wrote:

> On Fri, 19 Aug 2016 20:44:55 +1000
> Nicholas Piggin  wrote:
> 
> > On Fri, 19 Aug 2016 10:37:00 +0200
> > Michal Marek  wrote:
> >   
> > > On 2016-08-19 07:09, Stephen Rothwell wrote:
> 
> [snip]
> 
> > > > 
> > > > I may be missing something, but genksyms generates the crc's off the
> > > > preprocessed C source code and we don't have any for the asm files ...  
> > > > 
> > > 
> > > Of course you are right. Which means that we are losing type information
> > > for these exports for CONFIG_MODVERSIONS purposes. I guess it's
> > > acceptable, since the asm functions are pretty basic and their
> > > signatures do not change.
> > 
> > I don't completely agree. It would be nice to have the functionality
> > still there.
> > 
> > What happens if you just run cmd_modversions on the as rule? It relies on
> > !defined(__ASSEMBLY__), but we're feeding the result to genksyms, not as.
> > It would require the header be included in the .S file and be protected for
> > asm builds.  
> 
> 
> This seems like it *could* be made to work, but there's a few problems.
> 
> - .h files are not made for asm consumption. It's a matter of manually
> adding the ifdef guards, which isn't terrible.
> 
> - .S files do not all include their .h where the C declaration is. This
> will also cause some churn, but it's doable and maybe not completely
> unreasonable.
> 
> - The genksyms parser barfs when it hits the assembly in the .S file. The
> best way to fix that seems to be to send just the #include and
> EXPORT_SYMBOL lines from the .S to the preprocessor. That's a bit of a
> rabbit hole too, with some .S files being included, etc.
> 
> I'm not sure what to do here. If nobody cares and we lose CRCs for .S
> exports, then okay, we can whitelist those relocs easily. If we don't want
> to lose the functionality, the above might work, but it's a bit intrusive
> and is going to require another cycle of prep patches to go through arch
> code first.
> 
> Or suggestions for alternative approach?

Here is a quick patch that I think should catch missing CRCs in an
architecture-independent way. If we merge something like this, we
can whitelist the symbols in arch/powerpc so people get steered to
the right place.

Powerpc seems to be the only arch really catching this, and only as a
side effect of a test run for CONFIG_RELOCATABLE kernels, which means
version failures have probably slipped through on other archs.

I'll clean it up, do some more testing, and submit it unless
anybody dislikes it or has a better way to do it.

Thanks,
Nick


diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index 4b8ffd3..1efc454 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -609,6 +609,7 @@ static void handle_modversions(struct module *mod, struct elf_info *info,
 {
unsigned int crc;
enum export export;
+   int is_crc = 0;
 
if ((!is_vmlinux(mod->name) || mod->is_dot_o) &&
strncmp(symname, "__ksymtab", 9) == 0)
@@ -618,6 +619,7 @@ static void handle_modversions(struct module *mod, struct elf_info *info,
 
/* CRC'd symbol */
if (strncmp(symname, CRC_PFX, strlen(CRC_PFX)) == 0) {
+   is_crc = 1;
crc = (unsigned int) sym->st_value;
sym_update_crc(symname + strlen(CRC_PFX), mod, crc,
export);
@@ -663,6 +665,10 @@ static void handle_modversions(struct module *mod, struct elf_info *info,
else
symname++;
 #endif
+   if (is_crc && !mod->is_dot_o) {
+   const char *e = is_vmlinux(mod->name) ?"":".ko";
+   warn("EXPORT symbol \"%s\" [%s%s] version generation failed, symbol will not be versioned.\n", symname + strlen(CRC_PFX), mod->name, e);
+   }
mod->unres = alloc_symbol(symname,
  ELF_ST_BIND(sym->st_info) == STB_WEAK,
  mod->unres);


Re: [BUG][next-20170619][347de24] PowerPC boot fails with Oops

2017-06-20 Thread Nicholas Piggin
On Tue, 20 Jun 2017 12:49:25 +0530
Abdul Haleem  wrote:

> Hi,
> 
> commit: 347de24 (powerpc/64s: implement arch-specific hardlockup
> watchdog)
> 
> linux-next fails to boot on PowerPC Bare-metal box.
> 
> Test: boot
> Machine type: Power 8 Bare-metal
> Kernel: 4.12.0-rc5-next-20170619
> gcc: version 4.8.5
> 
> 
> In file arch/powerpc/kernel/watchdog.c
> 
> void soft_nmi_interrupt(struct pt_regs *regs)
> {
> unsigned long flags;
> int cpu = raw_smp_processor_id();
> u64 tb;
> 
> if (!cpumask_test_cpu(cpu, &wd_cpus_enabled))
> return;
> 
> >>> nmi_enter();  

Thanks for the report.

This is due to the emergency stacks not zeroing preempt_count, so they get
garbage there, and that trips the BUG_ON(in_nmi()) check.

I don't think it's a bug in the proposed new powerpc watchdog (at least
I was able to reproduce your bug and fix it by fixing the stack init).
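
The fix I'm testing is roughly along these lines (hypothetical sketch,
the function name is invented): give each emergency stack a zeroed
thread_info at its base, like a normal kernel stack gets, so
preempt_count starts at 0 rather than whatever the allocator left there:

    static void emergency_stack_init_one(void *stack_base)
    {
        struct thread_info *ti = stack_base; /* thread_info sits at the stack base */

        memset(ti, 0, sizeof(*ti));          /* includes preempt_count = 0 */
    }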

Thanks,
Nick


Re: linux-next: build failure after merge of most trees

2017-06-21 Thread Nicholas Piggin
On Thu, 22 Jun 2017 15:24:41 +1000
Stephen Rothwell  wrote:

> Hi Dave,
> 
> After merging almost all the trees, today's linux-next build (sparc64
> defconfig) failed like this:
> 
> arch/sparc/lib/hweight.o: In function `__arch_hweight8':
> (.text+0x0): relocation truncated to fit: R_SPARC_WDISP19 against symbol 
> `__sw_hweight8' defined in .text section in lib/hweight.o
> arch/sparc/lib/hweight.o: In function `__arch_hweight16':
> (.text+0xc): relocation truncated to fit: R_SPARC_WDISP19 against symbol 
> `__sw_hweight16' defined in .text section in lib/hweight.o
> arch/sparc/lib/hweight.o: In function `__arch_hweight32':
> (.text+0x18): relocation truncated to fit: R_SPARC_WDISP19 against symbol 
> `__sw_hweight32' defined in .text section in lib/hweight.o
> arch/sparc/lib/hweight.o: In function `__arch_hweight64':
> (.text+0x24): relocation truncated to fit: R_SPARC_WDISP19 against symbol 
> `__sw_hweight64' defined in .text section in lib/hweight.o
> 
> I though it may have been caused by the thin archive changes, but
> disabling CONFIG_THIN_ARCHIVE did not fix it. I have cc'd Nick and
> Masahiro just in case.

It could be this

https://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git/commit/?h=thin-ar&id=ec2c9c20f0efab37ae31de44fe0617aa61283905

kbuild: handle libs-y archives separately from built-in.o archives

That touches the lib linking code regardless of CONFIG_THIN_ARCHIVE.
You should be able to revert it by itself (which will break a few
other archs, so you would also have to revert the default y patch
for thin archives to repair your tree if this is the cause).

I'll try to get around to it after I fix up some other arch breakage
caused by the series :)

Thanks,
Nick


Re: linux-next: build failure after merge of most trees

2017-06-22 Thread Nicholas Piggin
CC'ing Alan

On Thu, 22 Jun 2017 15:24:41 +1000
Stephen Rothwell  wrote:

> Hi Dave,
> 
> After merging almost all the trees, today's linux-next build (sparc64
> defconfig) failed like this:
> 
> arch/sparc/lib/hweight.o: In function `__arch_hweight8':
> (.text+0x0): relocation truncated to fit: R_SPARC_WDISP19 against symbol 
> `__sw_hweight8' defined in .text section in lib/hweight.o
> arch/sparc/lib/hweight.o: In function `__arch_hweight16':
> (.text+0xc): relocation truncated to fit: R_SPARC_WDISP19 against symbol 
> `__sw_hweight16' defined in .text section in lib/hweight.o
> arch/sparc/lib/hweight.o: In function `__arch_hweight32':
> (.text+0x18): relocation truncated to fit: R_SPARC_WDISP19 against symbol 
> `__sw_hweight32' defined in .text section in lib/hweight.o
> arch/sparc/lib/hweight.o: In function `__arch_hweight64':
> (.text+0x24): relocation truncated to fit: R_SPARC_WDISP19 against symbol 
> `__sw_hweight64' defined in .text section in lib/hweight.o

After a bit of digging, this is due to that thin archives patch, but only
because it changes the order of the linker inputs around slightly. You can
reproduce it with the upstream kernel:

This is the final link which succeeds:

sparc64-linux-gnu-ld -m elf64_sparc --build-id -o vmlinux -T 
./arch/sparc/kernel/vmlinux.lds arch/sparc/kernel/head_64.o init/built-in.o 
--start-group usr/built-in.o arch/sparc/built-in.o kernel/built-in.o 
certs/built-in.o mm/built-in.o fs/built-in.o ipc/built-in.o security/built-in.o 
crypto/built-in.o block/built-in.o lib/lib.a arch/sparc/prom/lib.a 
arch/sparc/lib/lib.a lib/built-in.o arch/sparc/prom/built-in.o 
arch/sparc/lib/built-in.o drivers/built-in.o sound/built-in.o 
firmware/built-in.o net/built-in.o virt/built-in.o --end-group .tmp_kallsyms2.o

Moving the lib.a files to the end:

sparc64-linux-gnu-ld -m elf64_sparc --build-id -o vmlinux -T 
./arch/sparc/kernel/vmlinux.lds arch/sparc/kernel/head_64.o init/built-in.o 
--start-group usr/built-in.o arch/sparc/built-in.o kernel/built-in.o 
certs/built-in.o mm/built-in.o fs/built-in.o ipc/built-in.o security/built-in.o 
crypto/built-in.o block/built-in.o lib/built-in.o arch/sparc/prom/built-in.o 
arch/sparc/lib/built-in.o drivers/built-in.o sound/built-in.o 
firmware/built-in.o net/built-in.o virt/built-in.o lib/lib.a 
arch/sparc/prom/lib.a arch/sparc/lib/lib.a --end-group .tmp_kallsyms2.o

arch/sparc/lib/lib.a(hweight.o): In function `__arch_hweight8':
(.text+0x0): relocation truncated to fit: R_SPARC_WDISP19 against symbol 
`__sw_hweight8' defined in .text section in lib/built-in.o

If we also move lib/built-in.o to the end, then the error goes away.

Is there any way for the linker to place the inputs to avoid unresolvable
relocations where possible?

A way to work around this is to make arch/sparc/lib/hweight.o an obj-y
rather than lib-y. That's a hack because it just serves to move the
input location, but not really any more of a hack than the current code
that also only works because of input locations...
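
i.e. something like this in arch/sparc/lib/Makefile (illustrative only --
the exact existing line may differ):

    -lib-$(CONFIG_SPARC64) += hweight.o
    +obj-$(CONFIG_SPARC64) += hweight.o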

Any thoughts?

Thanks,
Nick


Re: [PATCH] powerpc/watchdog: Convert timers to use timer_setup()

2017-10-16 Thread Nicholas Piggin
On Mon, 16 Oct 2017 16:47:10 -0700
Kees Cook  wrote:

> In preparation for unconditionally passing the struct timer_list pointer to
> all timer callbacks, switch to using the new timer_setup() and from_timer()
> to pass the timer pointer explicitly.
> 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: linuxppc-...@lists.ozlabs.org
> Signed-off-by: Kees Cook 

Looks fine to me. Is this intended to be merged via the powerpc tree
in the next merge window?

Acked-by: Nicholas Piggin 

> ---
>  arch/powerpc/kernel/watchdog.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
> index 15e209a37c2d..50797528b5e1 100644
> --- a/arch/powerpc/kernel/watchdog.c
> +++ b/arch/powerpc/kernel/watchdog.c
> @@ -263,9 +263,8 @@ static void wd_timer_reset(unsigned int cpu, struct 
> timer_list *t)
>   add_timer_on(t, cpu);
>  }
>  
> -static void wd_timer_fn(unsigned long data)
> +static void wd_timer_fn(struct timer_list *t)
>  {
> - struct timer_list *t = this_cpu_ptr(&wd_timer);
>   int cpu = smp_processor_id();
>  
>   watchdog_timer_interrupt(cpu);
> @@ -292,7 +291,7 @@ static void start_watchdog_timer_on(unsigned int cpu)
>  
>   per_cpu(wd_timer_tb, cpu) = get_tb();
>  
> - setup_pinned_timer(t, wd_timer_fn, 0);
> + timer_setup(t, wd_timer_fn, TIMER_PINNED);
>   wd_timer_reset(cpu, t);
>  }
>  



Re: [PATCH 4/4] watchdog: provide watchdog_reconfigure() for arch watchdogs

2017-05-25 Thread Nicholas Piggin
On Thu, 25 May 2017 10:08:33 -0400
Don Zickus  wrote:

> On Thu, May 25, 2017 at 06:28:56PM +1000, Nicholas Piggin wrote:
> > After reconfiguring watchdog sysctls etc., architecture specific
> > watchdogs may not get all their parameters updated.
> > 
> > watchdog_reconfigure() can be implemented to pull the new values
> > in and set the arch NMI watchdog.  
> 
> I understand the reason for this patch and I don't have any real objection
> on how it was implemented within the constraints of all the current logic.
> 
> I just wonder if the current logic should be adjusted to make the hardlockup
> detector, namely the perf implementation more separate so it can handle what
> you would like more cleanly.
> 
> The watchdog_nmi_reconfigure is sort of hackish, but it is hard to fault you
> based on how things are designed.  I am going to poke at it a little bit,
> but I will probably not find time to do much and accept what you have for
> now.

I actually agree with you. These patches are basically an initial bridge
to get us to decoupling hld-perf from hld-arch, but the code could
definitely use several more passes to clean things up.

One thing we want to be mindful of is that some watchdogs are very lightweight
and minimal, and some may not even want to call C code (at least from the NMI
and touch-watchdog paths). But having said that, it may not be a bad idea
to have implementations provide a watchdog driver struct with some of the
methods and reconfiguration they support (e.g., suspend/resume, stop/start
on CPUs, adjusting timeouts, etc.).
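
Something like this is the kind of thing I mean (purely illustrative,
nothing like it exists today):

    struct nmi_watchdog_ops {
        void (*touch)(void);             /* hook for touch_nmi_watchdog() */
        int  (*start)(unsigned int cpu); /* start/stop on CPU online/offline */
        void (*stop)(unsigned int cpu);
        void (*reconfigure)(void);       /* pick up new thresh/cpumask/enables */
        int  (*suspend)(void);           /* lockup_detector_suspend()/resume() */
        void (*resume)(void);
    };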

I didn't want to go the whole hog and over-engineer something that doesn't
work though, so I'm hoping we can get the powerpc watchdog in, and then
keep working on the APIs.

Let me know what you think after you poke at it though.

Thanks,
Nick


[PATCH v2] spin loop primitives for busy waiting

2017-05-28 Thread Nicholas Piggin
Current busy-wait loops are implemented by repeatedly calling cpu_relax(),
which gives the arch a low-latency hook to improve power usage and/or
SMT resource contention.

This poses some difficulties for powerpc, which has SMT priority setting
instructions (priorities determine how ifetch cycles are apportioned).
powerpc's cpu_relax() is implemented by setting a low priority then
setting normal priority. This has several problems:

 - Changing thread priority can have some execution cost and a potential
   impact on other threads in the core. It's inefficient to execute these
   instructions every time around a busy-wait loop.

 - Depending on implementation details, a `low ; medium` sequence may
   not have much if any effect. Some software with a similar pattern
   actually inserts a lot of nops in between, in order to cause a few fetch
   cycles with the low priority.

 - The busy-wait loop runs with regular priority. This might only be a few
   fetch cycles, but if there are several threads running such loops, they
   could cause a noticeable impact on a non-idle thread.

Implement spin_begin and spin_end primitives that can be used around
busy-wait loops and default to no-ops, and spin_cpu_relax, which defaults
to cpu_relax().

This will allow architectures to hook the entry and exit of busy-wait
loops, and will allow powerpc to set low SMT priority at entry, and
normal priority at exit.
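
For illustration (this is not part of the patch), a typical conversion of
a caller looks something like this, with `flag' standing in for whatever
condition is being polled:

    /* before */
    while (!READ_ONCE(flag))
        cpu_relax();

    /* after */
    spin_begin();
    while (!READ_ONCE(flag))
        spin_cpu_relax();
    spin_end();

    /* or equivalently */
    spin_until_cond(READ_ONCE(flag));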

Suggested-by: Linus Torvalds 
Signed-off-by: Nicholas Piggin 
---

Since last time:
- Fixed spin_do_cond with initial test as suggested by Linus.
- Renamed it to spin_until_cond, which reads a little better.

 include/linux/processor.h | 70 +++
 1 file changed, 70 insertions(+)
 create mode 100644 include/linux/processor.h

diff --git a/include/linux/processor.h b/include/linux/processor.h
new file mode 100644
index ..da0c5e56ca02
--- /dev/null
+++ b/include/linux/processor.h
@@ -0,0 +1,70 @@
+/* Misc low level processor primitives */
+#ifndef _LINUX_PROCESSOR_H
+#define _LINUX_PROCESSOR_H
+
+#include 
+
+/*
+ * spin_begin is used before beginning a busy-wait loop, and must be paired
+ * with spin_end when the loop is exited. spin_cpu_relax must be called
+ * within the loop.
+ *
+ * The loop body should be as small and fast as possible, on the order of
+ * tens of instructions/cycles as a guide. It should avoid calling
+ * cpu_relax, or any "spin" or sleep type of primitive including nested uses
+ * of these primitives. It should not lock or take any other resource.
+ * Violations of these guidelines will not cause a bug, but may cause
+ * sub-optimal performance.
+ *
+ * These loops are optimized to be used where wait times are expected to be
+ * less than the cost of a context switch (and associated overhead).
+ *
+ * Detection of resource owner and decision to spin or sleep or guest-yield
+ * (e.g., spin lock holder vcpu preempted, or mutex owner not on CPU) can be
+ * tested within the loop body.
+ */
+#ifndef spin_begin
+#define spin_begin()
+#endif
+
+#ifndef spin_cpu_relax
+#define spin_cpu_relax() cpu_relax()
+#endif
+
+/*
+ * spin_cpu_yield may be called to yield (undirected) to the hypervisor if
+ * necessary. This should be used if the wait is expected to take longer
+ * than context switch overhead, but we can't sleep or do a directed yield.
+ */
+#ifndef spin_cpu_yield
+#define spin_cpu_yield() cpu_relax_yield()
+#endif
+
+#ifndef spin_end
+#define spin_end()
+#endif
+
+/*
+ * spin_until_cond can be used to wait for a condition to become true. It
+ * may be expected that the condition will be true on the first iteration
+ * (no spinning), so that callers should not require a first "likely" test
+ * for the uncontended case before using this primitive.
+ *
+ * Usage and implementation guidelines are the same as for the spin_begin
+ * primitives, above.
+ */
+#ifndef spin_until_cond
+#define spin_until_cond(cond)  \
+do {   \
+   if (unlikely(!(cond))) {\
+   spin_begin();   \
+   do {\
+   spin_cpu_relax();   \
+   } while (!(cond));  \
+   spin_end(); \
+   }   \
+} while (0)
+
+#endif
+
+#endif /* _LINUX_PROCESSOR_H */
-- 
2.11.0



[PATCH 0/4][V3] Improve watchdog config for arch watchdogs

2017-05-29 Thread Nicholas Piggin
Since last time:

- Have the perf-based hardlockup detector use arch_touch_nmi_watchdog()
  rather than hld_touch_nmi_watchdog(). This changes direction slightly
  to make the perf-based hard lockup detector an alternative that an
  arch may select, rather than standalone. This better reflects how the
  code works in practice.

- Hopefully fixed the Kconfig options. There's still a bit of ugliness
  that will require another pass or two over interfaces and config
  scheme, but the idea is to make a minimal change to get the powerpc
  HLD in, which gives a reasonable starting point to improve things
  further.

Nicholas Piggin (4):
  watchdog: remove unused declaration
  watchdog: Introduce arch_touch_nmi_watchdog()
  watchdog: Split up config options
  watchdog: Provide watchdog_reconfigure() for arch watchdogs

 arch/blackfin/include/asm/nmi.h|   2 +
 arch/blackfin/kernel/nmi.c |   2 +-
 arch/mn10300/include/asm/nmi.h |   2 +
 arch/mn10300/kernel/mn10300-watchdog-low.S |   8 +-
 arch/mn10300/kernel/mn10300-watchdog.c |   2 +-
 arch/powerpc/kernel/setup_64.c |   2 +-
 arch/sparc/include/asm/nmi.h   |   1 +
 arch/sparc/kernel/nmi.c|   6 +-
 arch/x86/kernel/apic/hw_nmi.c  |   2 +-
 include/linux/nmi.h|  57 ---
 kernel/Makefile|   2 +-
 kernel/sysctl.c|  18 +-
 kernel/watchdog.c  | 263 +++--
 kernel/watchdog_hld.c  |  37 +---
 lib/Kconfig.debug  |  29 +++-
 15 files changed, 263 insertions(+), 170 deletions(-)

-- 
2.11.0



[PATCH 2/4] watchdog: Introduce arch_touch_nmi_watchdog()

2017-05-29 Thread Nicholas Piggin
For architectures that define HAVE_NMI_WATCHDOG, instead of having
them provide the complete touch_nmi_watchdog() function, just have
them provide arch_touch_nmi_watchdog().

This gives the generic code more flexibility in implementing this
function, and arch implementations don't miss out on touching the
softlockup watchdog or other generic details.

Signed-off-by: Nicholas Piggin 
---
 arch/blackfin/include/asm/nmi.h|  2 ++
 arch/blackfin/kernel/nmi.c |  2 +-
 arch/mn10300/include/asm/nmi.h |  2 ++
 arch/mn10300/kernel/mn10300-watchdog-low.S |  8 
 arch/mn10300/kernel/mn10300-watchdog.c |  2 +-
 arch/sparc/include/asm/nmi.h   |  1 +
 arch/sparc/kernel/nmi.c|  6 ++
 include/linux/nmi.h| 27 ---
 kernel/watchdog_hld.c  |  5 ++---
 9 files changed, 31 insertions(+), 24 deletions(-)

diff --git a/arch/blackfin/include/asm/nmi.h b/arch/blackfin/include/asm/nmi.h
index b9caac4fcfd8..107d23705f46 100644
--- a/arch/blackfin/include/asm/nmi.h
+++ b/arch/blackfin/include/asm/nmi.h
@@ -9,4 +9,6 @@
 
 #include 
 
+extern void arch_touch_nmi_watchdog(void);
+
 #endif
diff --git a/arch/blackfin/kernel/nmi.c b/arch/blackfin/kernel/nmi.c
index 633c37083e87..1e714329fe8a 100644
--- a/arch/blackfin/kernel/nmi.c
+++ b/arch/blackfin/kernel/nmi.c
@@ -190,7 +190,7 @@ static int __init init_nmi_wdt(void)
 }
 device_initcall(init_nmi_wdt);
 
-void touch_nmi_watchdog(void)
+void arch_touch_nmi_watchdog(void)
 {
atomic_set(&nmi_touched[smp_processor_id()], 1);
 }
diff --git a/arch/mn10300/include/asm/nmi.h b/arch/mn10300/include/asm/nmi.h
index f3671cbbc117..b05627597b1b 100644
--- a/arch/mn10300/include/asm/nmi.h
+++ b/arch/mn10300/include/asm/nmi.h
@@ -11,4 +11,6 @@
 #ifndef _ASM_NMI_H
 #define _ASM_NMI_H
 
+extern void arch_touch_nmi_watchdog(void);
+
 #endif /* _ASM_NMI_H */
diff --git a/arch/mn10300/kernel/mn10300-watchdog-low.S 
b/arch/mn10300/kernel/mn10300-watchdog-low.S
index f2f5c9cfaabd..34f8773de7d0 100644
--- a/arch/mn10300/kernel/mn10300-watchdog-low.S
+++ b/arch/mn10300/kernel/mn10300-watchdog-low.S
@@ -50,9 +50,9 @@ watchdog_handler:
 #   we can't inline it)
 #
 ###
-   .globl  touch_nmi_watchdog
-   .type   touch_nmi_watchdog,@function
-touch_nmi_watchdog:
+   .globl  arch_touch_nmi_watchdog
+   .type   arch_touch_nmi_watchdog,@function
+arch_touch_nmi_watchdog:
clr d0
clr d1
mov watchdog_alert_counter, a0
@@ -63,4 +63,4 @@ touch_nmi_watchdog:
lne
ret [],0
 
-   .size   touch_nmi_watchdog,.-touch_nmi_watchdog
+   .size   arch_touch_nmi_watchdog,.-arch_touch_nmi_watchdog
diff --git a/arch/mn10300/kernel/mn10300-watchdog.c 
b/arch/mn10300/kernel/mn10300-watchdog.c
index a2d8e6938d67..0d5641beadf5 100644
--- a/arch/mn10300/kernel/mn10300-watchdog.c
+++ b/arch/mn10300/kernel/mn10300-watchdog.c
@@ -31,7 +31,7 @@ static unsigned int watchdog;
 static unsigned int watchdog_hz = 1;
 unsigned int watchdog_alert_counter[NR_CPUS];
 
-EXPORT_SYMBOL(touch_nmi_watchdog);
+EXPORT_SYMBOL(arch_touch_nmi_watchdog);
 
 /*
  * the best way to detect whether a CPU has a 'hard lockup' problem
diff --git a/arch/sparc/include/asm/nmi.h b/arch/sparc/include/asm/nmi.h
index 26ad2b2607c6..284eac3ffaf2 100644
--- a/arch/sparc/include/asm/nmi.h
+++ b/arch/sparc/include/asm/nmi.h
@@ -7,6 +7,7 @@ void nmi_adjust_hz(unsigned int new_hz);
 
 extern atomic_t nmi_active;
 
+void arch_touch_nmi_watchdog(void);
 void start_nmi_watchdog(void *unused);
 void stop_nmi_watchdog(void *unused);
 
diff --git a/arch/sparc/kernel/nmi.c b/arch/sparc/kernel/nmi.c
index 95e73c63c99d..048ad783ea3f 100644
--- a/arch/sparc/kernel/nmi.c
+++ b/arch/sparc/kernel/nmi.c
@@ -51,7 +51,7 @@ static DEFINE_PER_CPU(unsigned int, last_irq_sum);
 static DEFINE_PER_CPU(long, alert_counter);
 static DEFINE_PER_CPU(int, nmi_touch);
 
-void touch_nmi_watchdog(void)
+void arch_touch_nmi_watchdog(void)
 {
if (atomic_read(&nmi_active)) {
int cpu;
@@ -61,10 +61,8 @@ void touch_nmi_watchdog(void)
per_cpu(nmi_touch, cpu) = 1;
}
}
-
-   touch_softlockup_watchdog();
 }
-EXPORT_SYMBOL(touch_nmi_watchdog);
+EXPORT_SYMBOL(arch_touch_nmi_watchdog);
 
 static void die_nmi(const char *str, struct pt_regs *regs, int do_panic)
 {
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 5e2e57536d98..bd387ef8bccd 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -6,6 +6,9 @@
 
 #include 
 #include 
+#if defined(CONFIG_HAVE_NMI_WATCHDOG)
+#include 
+#endif
 
 #ifdef CONFIG_LOCKUP_DETECTOR
 extern void touch_softlockup_watchdog_sched(void);
@@ -58,6 +61,18 @@ static inline void reset_hung_task_detector(void)
 #define NMI_WATCHDOG_ENABLED  (1 << 

[PATCH 1/4] watchdog: remove unused declaration

2017-05-29 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 include/linux/nmi.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index aa3cd0878270..5e2e57536d98 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -12,9 +12,6 @@ extern void touch_softlockup_watchdog_sched(void);
 extern void touch_softlockup_watchdog(void);
 extern void touch_softlockup_watchdog_sync(void);
 extern void touch_all_softlockup_watchdogs(void);
-extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
- void __user *buffer,
- size_t *lenp, loff_t *ppos);
 extern unsigned int  softlockup_panic;
 extern unsigned int  hardlockup_panic;
 void lockup_detector_init(void);
-- 
2.11.0



[PATCH 4/4] watchdog: Provide watchdog_reconfigure() for arch watchdogs

2017-05-29 Thread Nicholas Piggin
After reconfiguring watchdog sysctls etc., architecture specific
watchdogs may not get all their parameters updated.

watchdog_reconfigure() can be implemented to pull the new values
in and set the arch NMI watchdog.
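
An arch implementation might then look something like this (hypothetical
sketch; the arch_* helper names are invented):

    void watchdog_nmi_reconfigure(void)
    {
        int cpu;

        /* re-read watchdog_enabled, watchdog_thresh, watchdog_cpumask */
        arch_watchdog_calc_timeouts();

        for_each_cpu(cpu, &watchdog_cpumask)
            arch_watchdog_restart_on(cpu);
    }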

Signed-off-by: Nicholas Piggin 
---
 kernel/watchdog.c | 29 +
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index deb010505646..5397c637db2d 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -120,6 +120,11 @@ void __weak watchdog_nmi_disable(unsigned int cpu)
 {
 }
 
+void __weak watchdog_nmi_reconfigure(void)
+{
+}
+
+
 #ifdef CONFIG_SOFTLOCKUP_DETECTOR
 
 /* Helper for online, unparked cpus. */
@@ -597,6 +602,12 @@ static void watchdog_disable_all_cpus(void)
}
 }
 
+static int watchdog_update_cpus(void)
+{
+   return smpboot_update_cpumask_percpu_thread(
+   &watchdog_threads, &watchdog_cpumask);
+}
+
 #else /* SOFTLOCKUP */
 static int watchdog_park_threads(void)
 {
@@ -616,6 +627,11 @@ static void watchdog_disable_all_cpus(void)
 {
 }
 
+static int watchdog_update_cpus(void)
+{
+   return 0;
+}
+
 static void set_sample_period(void)
 {
 }
@@ -648,6 +664,8 @@ int lockup_detector_suspend(void)
watchdog_enabled = 0;
}
 
+   watchdog_nmi_reconfigure();
+
mutex_unlock(&watchdog_proc_mutex);
 
return ret;
@@ -668,6 +686,8 @@ void lockup_detector_resume(void)
if (watchdog_running && !watchdog_suspended)
watchdog_unpark_threads();
 
+   watchdog_nmi_reconfigure();
+
mutex_unlock(&watchdog_proc_mutex);
put_online_cpus();
 }
@@ -693,6 +713,8 @@ static int proc_watchdog_update(void)
else
watchdog_disable_all_cpus();
 
+   watchdog_nmi_reconfigure();
+
return err;
 
 }
@@ -878,12 +900,11 @@ int proc_watchdog_cpumask(struct ctl_table *table, int 
write,
 * a temporary cpumask, so we are likely not in a
 * position to do much else to make things better.
 */
-#ifdef CONFIG_SOFTLOCKUP_DETECTOR
-   if (smpboot_update_cpumask_percpu_thread(
-   &watchdog_threads, &watchdog_cpumask) != 0)
+   if (watchdog_update_cpus() != 0)
pr_err("cpumask update failed\n");
-#endif
}
+
+   watchdog_nmi_reconfigure();
}
 out:
mutex_unlock(&watchdog_proc_mutex);
-- 
2.11.0



[PATCH 3/4] watchdog: Split up config options

2017-05-29 Thread Nicholas Piggin
Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split
HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR.

LOCKUP_DETECTOR provides the boot, sysctl, and programming interfaces
for lockup detectors. An architecture that defines HAVE_NMI_WATCHDOG
need not use this if it has a very basic watchdog or uses its own
options and interfaces (e.g., sparc). touch_nmi_watchdog() will
continue to call its arch_touch_nmi_watchdog().

HARDLOCKUP_DETECTOR_PERF is the perf-based lockup detector.

HARDLOCKUP_DETECTOR is the framework for arch NMI_WATCHDOG hard lockup
detectors that conform to the LOCKUP_DETECTOR interfaces.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/setup_64.c |   2 +-
 arch/x86/kernel/apic/hw_nmi.c  |   2 +-
 include/linux/nmi.h|  31 --
 kernel/Makefile|   2 +-
 kernel/sysctl.c|  18 ++--
 kernel/watchdog.c  | 238 ++---
 kernel/watchdog_hld.c  |  32 --
 lib/Kconfig.debug  |  29 +++--
 8 files changed, 211 insertions(+), 143 deletions(-)

diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index f35ff9dea4fb..ab650905f75a 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -727,7 +727,7 @@ struct ppc_pci_io ppc_pci_io;
 EXPORT_SYMBOL(ppc_pci_io);
 #endif
 
-#ifdef CONFIG_HARDLOCKUP_DETECTOR
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
 u64 hw_nmi_get_sample_period(int watchdog_thresh)
 {
return ppc_proc_freq * watchdog_thresh;
diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
index c73c9fb281e1..d6f387780849 100644
--- a/arch/x86/kernel/apic/hw_nmi.c
+++ b/arch/x86/kernel/apic/hw_nmi.c
@@ -19,7 +19,7 @@
 #include 
 #include 
 
-#ifdef CONFIG_HARDLOCKUP_DETECTOR
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
 u64 hw_nmi_get_sample_period(int watchdog_thresh)
 {
return (u64)(cpu_khz) * 1000 * watchdog_thresh;
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index bd387ef8bccd..257e6d7a9e6a 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -11,13 +11,21 @@
 #endif
 
 #ifdef CONFIG_LOCKUP_DETECTOR
+void lockup_detector_init(void);
+#else
+static inline void lockup_detector_init(void)
+{
+}
+#endif
+
+#ifdef CONFIG_SOFTLOCKUP_DETECTOR
 extern void touch_softlockup_watchdog_sched(void);
 extern void touch_softlockup_watchdog(void);
 extern void touch_softlockup_watchdog_sync(void);
 extern void touch_all_softlockup_watchdogs(void);
 extern unsigned int  softlockup_panic;
-extern unsigned int  hardlockup_panic;
-void lockup_detector_init(void);
+extern int soft_watchdog_enabled;
+extern atomic_t watchdog_park_in_progress;
 #else
 static inline void touch_softlockup_watchdog_sched(void)
 {
@@ -31,9 +39,6 @@ static inline void touch_softlockup_watchdog_sync(void)
 static inline void touch_all_softlockup_watchdogs(void)
 {
 }
-static inline void lockup_detector_init(void)
-{
-}
 #endif
 
 #ifdef CONFIG_DETECT_HUNG_TASK
@@ -63,13 +68,16 @@ static inline void reset_hung_task_detector(void)
 
 #if defined(CONFIG_HARDLOCKUP_DETECTOR)
 extern void hardlockup_detector_disable(void);
+extern unsigned int hardlockup_panic;
 #else
 static inline void hardlockup_detector_disable(void) {}
 #endif
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HAVE_NMI_WATCHDOG)
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
 extern void arch_touch_nmi_watchdog(void);
-#else
+#endif
+
+#if !defined(CONFIG_HARDLOCKUP_DETECTOR) && !defined(CONFIG_HAVE_NMI_WATCHDOG)
 static inline void arch_touch_nmi_watchdog(void) {}
 #endif
 
@@ -141,15 +149,18 @@ static inline bool trigger_single_cpu_backtrace(int cpu)
 }
 #endif
 
-#ifdef CONFIG_LOCKUP_DETECTOR
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
 u64 hw_nmi_get_sample_period(int watchdog_thresh);
+#endif
+
+#ifdef CONFIG_LOCKUP_DETECTOR
 extern int nmi_watchdog_enabled;
-extern int soft_watchdog_enabled;
 extern int watchdog_user_enabled;
 extern int watchdog_thresh;
 extern unsigned long watchdog_enabled;
+extern struct cpumask watchdog_cpumask;
 extern unsigned long *watchdog_cpumask_bits;
-extern atomic_t watchdog_park_in_progress;
+extern int __read_mostly watchdog_suspended;
 #ifdef CONFIG_SMP
 extern int sysctl_softlockup_all_cpu_backtrace;
 extern int sysctl_hardlockup_all_cpu_backtrace;
diff --git a/kernel/Makefile b/kernel/Makefile
index 72aa080f91f0..4cb8e8b23c6e 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -82,7 +82,7 @@ obj-$(CONFIG_KPROBES) += kprobes.o
 obj-$(CONFIG_KGDB) += debug/
 obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
-obj-$(CONFIG_HARDLOCKUP_DETECTOR) += watchdog_hld.o
+obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4dfba1a76cc3..fa37faa89143 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@

Re: [PATCH 1/6] powernv:idle: Correctly initialize core_idle_state_ptr

2017-05-29 Thread Nicholas Piggin
On Tue, 16 May 2017 14:19:43 +0530
"Gautham R. Shenoy"  wrote:

> From: "Gautham R. Shenoy" 
> 
> The lower 8 bits of core_idle_state_ptr tracks the number of non-idle
> threads in the core. This is supposed to be initialized to bit-map
> corresponding to the threads_per_core. However, currently it is
> initialized to PNV_CORE_IDLE_THREAD_BITS (0xFF). This is correct for
> POWER8 which has 8 threads per core, but not for POWER9 which has 4
> threads per core.
> 
> As a result, on POWER9, core_idle_state_ptr gets initialized to
> 0xFF. In case when all the threads of the core are idle, the bits
> corresponding tracking the idle-threads are non-zero. As a result, the
> idle entry/exit code fails to save/restore per-core hypervisor state
> since it assumes that there are threads in the cores which are still
> active.
> 
> Fix this by correctly initializing the lower bits of the
> core_idle_state_ptr on the basis of threads_per_core.
> 
> Signed-off-by: Gautham R. Shenoy 

This looks good to me.

Until this patch series, we can't enable HV state loss idle modes
on POWER9 -- is that correct? And after your series, does it work?

Reviewed-by: Nicholas Piggin 


Re: [PATCH 2/6] powernv:idle: Decouple Timebase restore & Per-core SPRs restore

2017-05-29 Thread Nicholas Piggin
On Tue, 16 May 2017 14:19:44 +0530
"Gautham R. Shenoy"  wrote:

> From: "Gautham R. Shenoy" 
> 
> On POWER8, in case of
>-  nap: both timebase and hypervisor state is retained.
>-  fast-sleep: timebase is lost. But the hypervisor state is retained.
>-  winkle: timebase and hypervisor state is lost.
> 
> Hence, the current code for handling exit from a idle state assumes
> that if the timebase value is retained, then so is the hypervisor
> state. Thus, the current code doesn't restore per-core hypervisor
> state in such cases.
> 
> But that is no longer the case on POWER9 where we do have stop states
> in which timebase value is retained, but the hypervisor state is
> lost. So we have to ensure that the per-core hypervisor state gets
> restored in such cases.
> 
> Fix this by ensuring that even in the case when timebase is retained,
> we explicitly check if we are waking up from a deep stop that loses
> per-core hypervisor state (indicated by cr4 being eq or gt), and if
> this is the case, we restore the per-core hypervisor state.
> 
> Signed-off-by: Gautham R. Shenoy 
> ---
>  arch/powerpc/kernel/idle_book3s.S | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/idle_book3s.S 
> b/arch/powerpc/kernel/idle_book3s.S
> index 4898d67..afd029f 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -731,13 +731,14 @@ timebase_resync:
>* Use cr3 which indicates that we are waking up with atleast partial
>* hypervisor state loss to determine if TIMEBASE RESYNC is needed.
>*/
> - ble cr3,clear_lock
> + ble cr3,.Ltb_resynced
>   /* Time base re-sync */
>   bl  opal_resync_timebase;
>   /*
> -  * If waking up from sleep, per core state is not lost, skip to
> -  * clear_lock.
> +  * If waking up from sleep (POWER8), per core state
> +  * is not lost, skip to clear_lock.
>*/
> +.Ltb_resynced:
>   blt cr4,clear_lock
>  
>   /*

It's more that the timebase was not lost, rather than that it was resynced.
Can we just do a 'bgtl cr3,opal_resync_timebase'?

I guess the cr4 branch will never be true on POWER9... I think
pnv_wakeup_tb_loss is going to end up clearer if it is split
in two between ISA 207 and 300. But that can wait until
after POWER9 is working properly.

Reviewed-by: Nicholas Piggin 


Re: [PATCH 3/6] powernv:idle: Restore LPCR on wakeup from deep-stop

2017-05-29 Thread Nicholas Piggin
On Tue, 16 May 2017 14:19:45 +0530
"Gautham R. Shenoy"  wrote:

> From: "Gautham R. Shenoy" 
> 
> On wakeup from a deep stop state which is supposed to lose the
> hypervisor state, we don't restore the LPCR to the old value but set
> it to a "sane" value via cur_cpu_spec->cpu_restore().
> 
> The problem is that the "sane" value doesn't include UPRT and the HR
> bits which are required to run correctly in Radix mode.
> 
> Fix this on POWER9 onwards by restoring the LPCR value whatever it was
> before executing the stop instruction.
> 
> Signed-off-by: Gautham R. Shenoy 

Yes I think we need this. I have a plan to rework some of
that cpu_restore and early CPU init stuff, but for now we
need this.

Does the OCC restore LPCR properly and then we just trash it
with ->cpu_restore(), or is it always junk?

Reviewed-by: Nicholas Piggin 


> ---
>  arch/powerpc/kernel/idle_book3s.S | 13 ++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/idle_book3s.S 
> b/arch/powerpc/kernel/idle_book3s.S
> index afd029f..6c9920d 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -31,6 +31,7 @@
>   * registers for winkle support.
>   */
>  #define _SDR1GPR3
> +#define _PTCRGPR3
>  #define _RPR GPR4
>  #define _SPURR   GPR5
>  #define _PURRGPR6
> @@ -39,7 +40,7 @@
>  #define _AMORGPR9
>  #define _WORTGPR10
>  #define _WORCGPR11
> -#define _PTCRGPR12
> +#define _LPCRGPR12
>  
>  #define PSSCR_EC_ESL_MASK_SHIFTED  (PSSCR_EC | PSSCR_ESL) >> 16
>  
> @@ -55,12 +56,14 @@ save_sprs_to_stack:
>* here since any thread in the core might wake up first
>*/
>  BEGIN_FTR_SECTION
> - mfspr   r3,SPRN_PTCR
> - std r3,_PTCR(r1)
>   /*
>* Note - SDR1 is dropped in Power ISA v3. Hence not restoring
>* SDR1 here
>*/
> + mfspr   r3,SPRN_PTCR
> + std r3,_PTCR(r1)
> + mfspr   r3,SPRN_LPCR
> + std r3,_LPCR(r1)
>  FTR_SECTION_ELSE
>   mfspr   r3,SPRN_SDR1
>   std r3,_SDR1(r1)
> @@ -813,6 +816,10 @@ no_segments:
>   mtctr   r12
>   bctrl
>  
> +BEGIN_FTR_SECTION
> + ld  r4,_LPCR(r1)
> + mtspr   SPRN_LPCR,r4
> +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
>  hypervisor_state_restored:
>  
>   mtspr   SPRN_SRR1,r16



Re: [PATCH 5/6] powernv:idle: Use Requested Level for restoring state on P9 DD1

2017-05-29 Thread Nicholas Piggin
On Tue, 16 May 2017 14:19:47 +0530
"Gautham R. Shenoy"  wrote:

> From: "Gautham R. Shenoy" 
> 
> On Power9 DD1 due to a hardware bug the Power-Saving Level Status
> field (PLS) of the PSSCR for a thread waking up from a deep state can
> under-report if some other thread in the core is in a shallow stop
> state. The scenario in which this can manifest is as follows:
> 
>1) All the threads of the core are in deep stop.
>2) One of the threads is woken up. The PLS for this thread will
>   correctly reflect that it is waking up from deep stop.
>3) The thread that has woken up now executes a shallow stop.
>4) When some other thread in the core is woken, its PLS will reflect
>   the shallow stop state.
> 
> Thus, the subsequent thread for which the PLS is under-reporting the
> wakeup state will not restore the hypervisor resources.
> 
> Hence, on DD1 systems, use the Requested Level (RL) field as a
> workaround to restore the contents of the hypervisor resources on the
> wakeup from the stop state.
> 
> Signed-off-by: Gautham R. Shenoy 


Reviewed-by: Nicholas Piggin 


> ---
>  arch/powerpc/include/asm/paca.h   |  2 ++
>  arch/powerpc/kernel/asm-offsets.c |  1 +
>  arch/powerpc/kernel/idle_book3s.S | 13 -
>  3 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index 1c09f8f..77f60a0 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -177,6 +177,8 @@ struct paca_struct {
>* to the sibling threads' paca.
>*/
>   struct paca_struct **thread_sibling_pacas;
> + /* The PSSCR value that the kernel requested before going to stop */
> + u64 requested_psscr;
>  #endif
>  
>  #ifdef CONFIG_PPC_STD_MMU_64
> diff --git a/arch/powerpc/kernel/asm-offsets.c 
> b/arch/powerpc/kernel/asm-offsets.c
> index 709e234..e15c178 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -742,6 +742,7 @@ int main(void)
>   OFFSET(PACA_THREAD_MASK, paca_struct, thread_mask);
>   OFFSET(PACA_SUBCORE_SIBLING_MASK, paca_struct, subcore_sibling_mask);
>   OFFSET(PACA_SIBLING_PACA_PTRS, paca_struct, thread_sibling_pacas);
> + OFFSET(PACA_REQ_PSSCR, paca_struct, requested_psscr);
>  #endif
>  
>   DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
> diff --git a/arch/powerpc/kernel/idle_book3s.S 
> b/arch/powerpc/kernel/idle_book3s.S
> index 6c9920d..98a6d07 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -379,6 +379,7 @@ _GLOBAL(power9_idle_stop)
>   mfspr   r5,SPRN_PSSCR
>   andcr5,r5,r4
>   or  r3,r3,r5
> + std r3, PACA_REQ_PSSCR(r13)
>   mtspr   SPRN_PSSCR,r3
>   LOAD_REG_ADDR(r5,power_enter_stop)
>   li  r4,1
> @@ -498,12 +499,22 @@ pnv_restore_hyp_resource_arch300:
>   LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state)
>   ld  r4,ADDROFF(pnv_first_deep_stop_state)(r5)
>  
> - mfspr   r5,SPRN_PSSCR
> +BEGIN_FTR_SECTION_NESTED(71)
> + /*
> +  * Assume that we are waking up from the state
> +  * same as the Requested Level (RL) in the PSSCR
> +  * which are Bits 60-63
> +  */
> + ld  r5,PACA_REQ_PSSCR(r13)
> + rldicl  r5,r5,0,60
> +FTR_SECTION_ELSE_NESTED(71)
>   /*
>* 0-3 bits correspond to Power-Saving Level Status
>* which indicates the idle state we are waking up from
>*/
> + mfspr   r5, SPRN_PSSCR
>   rldicl  r5,r5,4,60
> +ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_POWER9_DD1, 71)
>   cmpdcr4,r5,r4
>   bge cr4,pnv_wakeup_tb_loss /* returns to caller */
>  



Re: [PATCH 6/6] cpuidle-powernv: Allow Deep stop states that don't stop time

2017-05-30 Thread Nicholas Piggin
On Tue, 16 May 2017 14:19:48 +0530
"Gautham R. Shenoy"  wrote:

> From: "Gautham R. Shenoy" 
> 
> The current code in the cpuidle-powernv initialization only allows deep
> stop states (indicated by OPAL_PM_STOP_INST_DEEP) which lose timebase
> (indicated by OPAL_PM_TIMEBASE_STOP). This assumption goes back to
> POWER8 time where deep states used to lose the timebase. However, on
> POWER9, we do have stop states that are deep (they lose hypervisor
> state) but retain the timebase.
> 
> Fix the initialization code in the cpuidle-powernv driver to allow
> such deep states.
> 
> Further, there is a bug in cpuidle-powernv driver with
> CONFIG_TICK_ONESHOT=n where we end up incrementing the nr_idle_states
> even if a platform idle state which loses time base was not added to
> the cpuidle table.
> 
> Fix this by ensuring that the nr_idle_states variable gets incremented
> only when the platform idle state was added to the cpuidle table.

Should this be a separate patch? Stable?

> 
> Signed-off-by: Gautham R. Shenoy 
> ---
>  drivers/cpuidle/cpuidle-powernv.c | 16 ++--
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-powernv.c 
> b/drivers/cpuidle/cpuidle-powernv.c
> index 12409a5..45eaf06 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -354,6 +354,7 @@ static int powernv_add_idle_states(void)
>  
>   for (i = 0; i < dt_idle_states; i++) {
>   unsigned int exit_latency, target_residency;
> + bool stops_timebase = false;
>   /*
>* If an idle state has exit latency beyond
>* POWERNV_THRESHOLD_LATENCY_NS then don't use it
> @@ -381,6 +382,9 @@ static int powernv_add_idle_states(void)
>   }
>   }
>  
> + if (flags[i] & OPAL_PM_TIMEBASE_STOP)
> + stops_timebase = true;
> +
>   /*
>* For nap and fastsleep, use default target_residency
>* values if f/w does not expose it.
> @@ -392,8 +396,7 @@ static int powernv_add_idle_states(void)
>   add_powernv_state(nr_idle_states, "Nap",
> CPUIDLE_FLAG_NONE, nap_loop,
> target_residency, exit_latency, 0, 0);
> - } else if ((flags[i] & OPAL_PM_STOP_INST_FAST) &&
> - !(flags[i] & OPAL_PM_TIMEBASE_STOP)) {
> + } else if (has_stop_states && !stops_timebase) {
>   add_powernv_state(nr_idle_states, names[i],
> CPUIDLE_FLAG_NONE, stop_loop,
> target_residency, exit_latency,
> @@ -405,8 +408,8 @@ static int powernv_add_idle_states(void)
>* within this config dependency check.
>*/
>  #ifdef CONFIG_TICK_ONESHOT
> - if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
> - flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
> + else if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
> +  flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {

Hmm, seems okay, but readability isn't the best with the ifdef and
mixing of POWER8 and POWER9 cases IMO.

Particularly with the nice regular POWER9 states, we're not doing much
logic in this loop besides checking for the timebase stop flag, right?
Would it be clearer if it was changed to something like this (untested
quick hack)?

---
 drivers/cpuidle/cpuidle-powernv.c | 76 +++
 1 file changed, 37 insertions(+), 39 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c 
b/drivers/cpuidle/cpuidle-powernv.c
index 12409a519cc5..77291389f9ac 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -353,7 +353,9 @@ static int powernv_add_idle_states(void)
}
 
for (i = 0; i < dt_idle_states; i++) {
+   unsigned int cpuidle_flags = CPUIDLE_FLAG_NONE;
unsigned int exit_latency, target_residency;
+
/*
 * If an idle state has exit latency beyond
 * POWERNV_THRESHOLD_LATENCY_NS then don't use it
@@ -371,6 +373,16 @@ static int powernv_add_idle_states(void)
else
target_residency = 0;
 
+   if (flags[i] & OPAL_PM_TIMEBASE_STOP) {
+   /*
+* All cpuidle states with CPUIDLE_FLAG_TIMER_STOP set
+* depend on CONFIG_TICK_ONESHOT.
+*/
+   if (!IS_ENABLED(CONFIG_TICK_ONESHOT))
+   continue;
+   cpuidle_flags = CPUIDLE_FLAG_TIMER_STOP;
+   }
+
if (has_stop_states) {
int err = validate_psscr_val_mask(&psscr_val[i],
  &psscr_mask[i],

Re: [PATCH 6/6] cpuidle-powernv: Allow Deep stop states that don't stop time

2017-05-30 Thread Nicholas Piggin
On Tue, 30 May 2017 16:20:55 +0530
Gautham R Shenoy  wrote:

> On Tue, May 30, 2017 at 05:13:57PM +1000, Nicholas Piggin wrote:
> > On Tue, 16 May 2017 14:19:48 +0530
> > "Gautham R. Shenoy"  wrote:
> >   
> > > From: "Gautham R. Shenoy" 
> > > 
> > > The current code in the cpuidle-powernv initialization only allows deep
> > > stop states (indicated by OPAL_PM_STOP_INST_DEEP) which lose timebase
> > > (indicated by OPAL_PM_TIMEBASE_STOP). This assumption goes back to
> > > POWER8 time where deep states used to lose the timebase. However, on
> > > POWER9, we do have stop states that are deep (they lose hypervisor
> > > state) but retain the timebase.
> > > 
> > > Fix the initialization code in the cpuidle-powernv driver to allow
> > > such deep states.
> > > 
> > > Further, there is a bug in cpuidle-powernv driver with
> > > CONFIG_TICK_ONESHOT=n where we end up incrementing the nr_idle_states
> > > even if a platform idle state which loses time base was not added to
> > > the cpuidle table.
> > > 
> > > Fix this by ensuring that the nr_idle_states variable gets incremented
> > > only when the platform idle state was added to the cpuidle table.  
> > 
> > Should this be a separate patch? Stable?  
> 
> Ok. Will send it out separately.

Looks like mpe has merged this in next now. I just wonder if this
particular bit would be relevant for POWER8 and therefore be a
stable candidate? All the POWER9 idle fixes may not be suitable for
stable.


> > > Signed-off-by: Gautham R. Shenoy 
> > > ---
> > >  drivers/cpuidle/cpuidle-powernv.c | 16 ++--
> > >  1 file changed, 10 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/drivers/cpuidle/cpuidle-powernv.c 
> > > b/drivers/cpuidle/cpuidle-powernv.c
> > > index 12409a5..45eaf06 100644
> > > --- a/drivers/cpuidle/cpuidle-powernv.c
> > > +++ b/drivers/cpuidle/cpuidle-powernv.c
> > > @@ -354,6 +354,7 @@ static int powernv_add_idle_states(void)
> > >  
> > >   for (i = 0; i < dt_idle_states; i++) {
> > >   unsigned int exit_latency, target_residency;
> > > + bool stops_timebase = false;
> > >   /*
> > >* If an idle state has exit latency beyond
> > >* POWERNV_THRESHOLD_LATENCY_NS then don't use it
> > > @@ -381,6 +382,9 @@ static int powernv_add_idle_states(void)
> > >   }
> > >   }
> > >  
> > > + if (flags[i] & OPAL_PM_TIMEBASE_STOP)
> > > + stops_timebase = true;
> > > +
> > >   /*
> > >* For nap and fastsleep, use default target_residency
> > >* values if f/w does not expose it.
> > > @@ -392,8 +396,7 @@ static int powernv_add_idle_states(void)
> > >   add_powernv_state(nr_idle_states, "Nap",
> > > CPUIDLE_FLAG_NONE, nap_loop,
> > > target_residency, exit_latency, 0, 0);
> > > - } else if ((flags[i] & OPAL_PM_STOP_INST_FAST) &&
> > > - !(flags[i] & OPAL_PM_TIMEBASE_STOP)) {
> > > + } else if (has_stop_states && !stops_timebase) {
> > >   add_powernv_state(nr_idle_states, names[i],
> > > CPUIDLE_FLAG_NONE, stop_loop,
> > > target_residency, exit_latency,
> > > @@ -405,8 +408,8 @@ static int powernv_add_idle_states(void)
> > >* within this config dependency check.
> > >*/
> > >  #ifdef CONFIG_TICK_ONESHOT
> > > - if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
> > > - flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
> > > + else if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
> > > +  flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {  
> > 
> > Hmm, seems okay, but readability isn't the best with the ifdef and
> > mixing of POWER8 and POWER9 cases IMO.
> > 
> > Particularly with the nice regular POWER9 states, we're not doing much
> > logic in this loop besides checking for the timebase stop flag, right?
> > Would it be clearer if it was changed to something like this (untested
> > quick hack)?  
> 
> Yes, this is very much doable. Some comments below.
> 
> > 
> > ---
> >  drivers/cpuidle

Re: powerpc-linux-gnu-ld: warning: orphan section `.data.rel.ro' from `arch/powerpc/kernel/head_44x.o' being placed in section `.data.rel.ro'.

2017-11-11 Thread Nicholas Piggin
On Fri, 10 Nov 2017 23:41:49 +0800
kbuild test robot  wrote:

> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> master
> head:   1c9dbd4615fd751e5e0b99807a3c7c8612e28e20
> commit: cb87481ee89dbd6609e227afbf64900fb4e5c930 kbuild: linker script do not 
> match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured
> date:   3 months ago
> config: powerpc-fsp2_defconfig (attached as .config)
> compiler: powerpc-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> git checkout cb87481ee89dbd6609e227afbf64900fb4e5c930
> # save the attached .config to linux build tree
> make.cross ARCH=powerpc 
> 
> All warnings (new ones prefixed by >>):
> 
> >> powerpc-linux-gnu-ld: warning: orphan section `.data.rel.ro' from 
> >> `arch/powerpc/kernel/head_44x.o' being placed in section `.data.rel.ro'.
> >> powerpc-linux-gnu-ld: warning: orphan section `.data.rel.ro' from 
> >> `arch/powerpc/kernel/head_44x.o' being placed in section `.data.rel.ro'.
> >> powerpc-linux-gnu-ld: warning: orphan section `.data.rel.ro' from 
> >> `arch/powerpc/kernel/head_44x.o' being placed in section `.data.rel.ro'.  

Okay, this is not really caused by the above patch; the problem was just
being hidden before. This should do the trick I think:
--
powerpc/32: Add .data.rel* sections explicitly

Match powerpc/64 and include .data.rel* input sections in the .data output
section explicitly.

This should solve the warning:

powerpc-linux-gnu-ld: warning: orphan section `.data.rel.ro' from 
`arch/powerpc/kernel/head_44x.o' being placed in section `.data.rel.ro'.

Reported-by: kbuild test robot 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/vmlinux.lds.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index 0494e1566ee2..51e4ec92ade1 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -264,6 +264,7 @@ SECTIONS
 #ifdef CONFIG_PPC32
.data : AT(ADDR(.data) - LOAD_OFFSET) {
DATA_DATA
+   *(.data.rel*)
*(.sdata)
*(.sdata2)
*(.got.plt) *(.got)
-- 
2.15.0



Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-11 Thread Nicholas Piggin
On Fri, 10 Nov 2017 12:08:35 +
David Laight  wrote:

> From: Matthew Wilcox
> > Sent: 09 November 2017 19:44
> > 
> > On Fri, Nov 10, 2017 at 04:15:26AM +1100, Nicholas Piggin wrote:  
> > > So these semantics are what we're going with? Anything that does mmap() is
> > > guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
> > > itself? Is intended to be cross-platform or just x86 and power specific?  
> > 
> > It is x86 and powerpc specific.  The arm64 people have apparently stumbled
> > across apps that expect to be able to use bit 48 for their own purposes.
> > And their address space is 48 bit by default.  Oops.  
> 
> (Do you mean 49bit?)

I think he meant bit 47, which makes sense because they were probably
ported from x86-64 with its 47 bit address space. That seems to be why
x86-64 5-level and powerpc decided to limit to a 47 bit address space
by default.

> 
> Aren't such apps just doomed to be broken?

Well, they're not portable, but they are not broken if the virtual
address space is limited.

> 
> ISTR there is something on (IIRC) sparc64 that does a 'match'
> on the high address bits to make it much harder to overrun
> one area into another.

I'm not sure about that but I think the problem would be the app
masking out bits from the pointer for its own use before ever
dereferencing it.
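
(Purely to illustrate the kind of pattern being discussed -- not code
from any real app, and the tag width and helper names are made up:)

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* App assumes user addresses fit in 47 bits and stashes a tag above. */
#define ADDR_BITS	47
#define ADDR_MASK	((1ULL << ADDR_BITS) - 1)

static void *pack(void *p, uint64_t tag)
{
	return (void *)(((uintptr_t)p & ADDR_MASK) | (tag << ADDR_BITS));
}

static void *unpack(void *p)
{
	/* Masks before dereferencing: breaks if the allocation really
	 * sat above the 47-bit boundary. */
	return (void *)((uintptr_t)p & ADDR_MASK);
}

int main(void)
{
	int *x = malloc(sizeof(*x));
	int *y = unpack(pack(x, 0x1a5));

	*y = 42;	/* faults if x had bits set above bit 46 */
	printf("%d\n", *y);
	free(x);
	return 0;
}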

Thanks,
Nick


Re: [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

2018-08-26 Thread Nicholas Piggin
On Fri, 24 Aug 2018 16:52:43 +0100
Will Deacon  wrote:

> From: Peter Zijlstra 
> 
> Some architectures require different TLB invalidation instructions
> depending on whether it is only the last-level of page table being
> changed, or whether there are also changes to the intermediate
> (directory) entries higher up the tree.
> 
> Add a new bit to the flags bitfield in struct mmu_gather so that the
> architecture code can operate accordingly if it's the intermediate
> levels being invalidated.
> 
> Signed-off-by: Peter Zijlstra 
> Signed-off-by: Will Deacon 

powerpc should be able to move right over to using this rather
than keeping the bit in need_flush_all.

powerpc may be able to use the unmap granule thing to improve
its page size dependent flushes, but it might prefer to go
a different way and track start-end for different page sizes.
I wonder how much of that stuff should go into generic code,
and whether it should instead go into a struct arch_mmu_gather.
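
To be concrete about the kind of thing I mean -- a rough sketch only,
where the config symbol and field names are invented for illustration:

/* Illustrative only: arch-private data embedded in the generic gather. */
struct arch_mmu_gather {
	/* one range per page size we can target with a precise flush */
	unsigned long	start_4k, end_4k;
	unsigned long	start_2m, end_2m;
	unsigned long	start_1g, end_1g;
};

struct mmu_gather {
	struct mm_struct	*mm;
	unsigned long		start;
	unsigned long		end;
	/* ... rest of the existing generic fields ... */
#ifdef CONFIG_ARCH_WANTS_MMU_GATHER_DATA	/* hypothetical opt-in */
	struct arch_mmu_gather	arch;	/* only ever touched by arch hooks */
#endif
};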

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-26 Thread Nicholas Piggin
On Fri, 24 Aug 2018 13:39:53 +0200
Peter Zijlstra  wrote:

> On Fri, Aug 24, 2018 at 01:32:14PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 24, 2018 at 10:47:17AM +0200, Peter Zijlstra wrote:  
> > > On Thu, Aug 23, 2018 at 02:39:59PM +0100, Will Deacon wrote:  
> > > > The only problem with this approach is that we've lost track of the 
> > > > granule
> > > > size by the point we get to the tlb_flush(), so we can't adjust the 
> > > > stride of
> > > > the TLB invalidations for huge mappings, which actually works nicely in 
> > > > the
> > > > synchronous case (e.g. we perform a single invalidation for a 2MB 
> > > > mapping,
> > > > rather than iterating over it at a 4k granule).
> > > > 
> > > > One thing we could do is switch to synchronous mode if we detect a 
> > > > change in
> > > > granule (i.e. treat it like a batch failure).  
> > > 
> > > We could use tlb_start_vma() to track that, I think. Shouldn't be too
> > > hard.  
> > 
> > Hurm.. look at commit:
> > 
> >   e77b0852b551 ("mm/mmu_gather: track page size with mmu gather and force 
> > flush if page size change")  
> 
> Ah, good, it seems that already got cleaned up a lot. But it all moved
> into the power code.. blergh.

I lost track of what the problem is here?

For powerpc, tlb_start_vma is not the right API to use for this because
it wants to deal with different page sizes within a vma.

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Nicholas Piggin
On Mon, 27 Aug 2018 09:47:01 +0200
Peter Zijlstra  wrote:

> On Mon, Aug 27, 2018 at 03:00:08PM +1000, Nicholas Piggin wrote:
> > On Fri, 24 Aug 2018 13:39:53 +0200
> > Peter Zijlstra  wrote:  
> > > On Fri, Aug 24, 2018 at 01:32:14PM +0200, Peter Zijlstra wrote:  
> 
> > > > Hurm.. look at commit:
> > > > 
> > > >   e77b0852b551 ("mm/mmu_gather: track page size with mmu gather and 
> > > > force flush if page size change")
> > > 
> > > Ah, good, it seems that already got cleaned up a lot. But it all moved
> > > into the power code.. blergh.  
> > 
> > I lost track of what the problem is here?  
> 
> Aside from the commit above being absolute crap (which did get fixed up,
> luckily) I would really like to get rid of all arch specific mmu_gather.
> 
> We can have opt-in bits to the generic code, but the endless back and
> forth between common and arch code is an utter pain in the arse.
> 
> And there's only like 4 architectures that still have a custom
> mmu_gather:
> 
>   - sh
>   - arm
>   - ia64
>   - s390
> 
> sh is trivial, arm seems doable, with a bit of luck we can do 'rm -rf
> arch/ia64' leaving us with s390.

Well I don't see a big problem in having an arch_mmu_gather field
or small bits. powerpc would actually like that rather than trying
to add things it wants into generic code (and it wants more than
just a few flags bits, ideally).

> After that everyone uses the common code and we can clean up.
> 
> > For powerpc, tlb_start_vma is not the right API to use for this because
> > it wants to deal with different page sizes within a vma.  
> 
> Yes.. I see that. tlb_remove_check_page_size_change() really is a rather
> ugly thing, it can cause loads of TLB flushes. Do you really _have_ to
> do that? The way ARM and x86 work is that using INVLPG in a 4K stride is
> still correct for huge pages, inefficient maybe, but so is flushing
> every other page because 'sparse' transparant-huge-pages.

It could do that. It requires a tlbie that matches the page size,
so it means 3 sizes. I think possibly even that would be better
than current code, but we could do better if we had a few specific
fields in there.

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Nicholas Piggin
On Mon, 27 Aug 2018 10:20:45 +0200
Peter Zijlstra  wrote:

> On Mon, Aug 27, 2018 at 06:09:50PM +1000, Benjamin Herrenschmidt wrote:
> 
> > Sadly our architecture requires a precise match between the page size
> > specified in the tlbie instruction and the entry in the TLB or it won't
> > be flushed.  
> 
> Argh.. OK I see. That is rather unfortunate and does seem to require
> something along the lines of tlb_remove_check_page_size_change().

Or we can do better with some more of our own data in mmu_gather,
but things that probably few or no other architectures want. I've
held off trying to put any crap in generic code because there's
other lower hanging fruit still, but I'd really rather just give
archs the ability to put their own data in there. I don't really
see a downside to it (divergence of course, but the existing
proliferation of code is much harder to follow than some data that
would be maintained and used purely by the arch, and beats having
to implement entirely your own mmu_gather).

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Nicholas Piggin
On Mon, 27 Aug 2018 18:09:50 +1000
Benjamin Herrenschmidt  wrote:

> On Mon, 2018-08-27 at 18:04 +1000, Nicholas Piggin wrote:
> > > Yes.. I see that. tlb_remove_check_page_size_change() really is a rather
> > > ugly thing, it can cause loads of TLB flushes. Do you really _have_ to
> > > do that? The way ARM and x86 work is that using INVLPG in a 4K stride is
> > > still correct for huge pages, inefficient maybe, but so is flushing
> > > every other page because 'sparse' transparant-huge-pages.  
> > 
> > It could do that. It requires a tlbie that matches the page size,
> > so it means 3 sizes. I think possibly even that would be better
> > than current code, but we could do better if we had a few specific
> > fields in there.  
> 
> More tlbies ? With the cost of the broadasts on the fabric ? I don't
> think so.. or I'm not understanding your point...

More tlbies are no good, but there will be some places where it works
out much better (and fewer tlbies). Worst possible case for current code
is a big unmap with lots of scattered page sizes. We _should_ get that
with just a single PID flush at the end, but what we will get today is
a bunch of PID and VA flushes.

I don't propose doing that though, I'd rather be explicit about
tracking start and end range of each page size. Still not "optimal"
but neither is existing single range for sparse mappings... anyway it
will need to be profiled, but my point is we don't really fit exactly
what x86/arm want.

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Nicholas Piggin
On Mon, 27 Aug 2018 09:36:50 -0400
Rik van Riel  wrote:

> On Mon, 2018-08-27 at 18:04 +1000, Nicholas Piggin wrote:
> 
> > It could do that. It requires a tlbie that matches the page size,
> > so it means 3 sizes. I think possibly even that would be better
> > than current code, but we could do better if we had a few specific
> > fields in there.  
> 
> Would it cause a noticeable overhead to keep track
> of which page sizes were removed, and to simply flush
> the whole TLB in the (unlikely?) event that multiple
> page sizes were removed in the same munmap?
> 
> Once the unmap is so large that multiple page sizes
> were covered, you may already be looking at so many
> individual flush operations that a full flush might
> be faster.

It will take some profiling and measuring. Unmapping a small number of
huge pages plus a small number of surrounding small pages may not be
uncommon if THP is working well. That could become a lot more expensive.

> 
> Is there a point on PPC where simply flushing the
> whole TLB, and having other things be reloaded later,
> is faster than flushing every individual page mapping
> that got unmapped?

There is. For local TLB flushes that point is well over 100 individual
invalidates though. We're generally better off flushing all page sizes
for that case.
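
(Roughly this shape of heuristic at flush time, purely as a sketch --
the threshold and helper names are placeholders, not the real code:)

/* Placeholder sketch, not the actual powerpc implementation. */
#define LOCAL_SINGLE_FLUSH_CEILING	100	/* illustrative only */

static void local_flush_range(unsigned long start, unsigned long end,
			      unsigned int page_shift)
{
	unsigned long nr = (end - start) >> page_shift;

	if (nr > LOCAL_SINGLE_FLUSH_CEILING) {
		local_flush_all();		/* hypothetical: whole local TLB */
	} else {
		unsigned long addr;

		for (addr = start; addr < end; addr += 1UL << page_shift)
			local_flush_one(addr, page_shift);	/* hypothetical */
	}
}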

Thanks,
Nick


[PATCH 1/3] mm/cow: don't bother write protectig already write-protected huge pages

2018-08-28 Thread Nicholas Piggin
This is the THP equivalent for 1b2de5d039c8 ("mm/cow: don't bother write
protecting already write-protected pages").

Explicit hugetlb pages don't get the same treatment because they don't
appear to have the right accessor functions.

Signed-off-by: Nicholas Piggin 
---
 mm/huge_memory.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9592cbd8530a..d9bae12978ef 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -973,8 +973,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 
-   pmdp_set_wrprotect(src_mm, addr, src_pmd);
-   pmd = pmd_mkold(pmd_wrprotect(pmd));
+   if (pmd_write(pmd)) {
+   pmdp_set_wrprotect(src_mm, addr, src_pmd);
+   pmd = pmd_wrprotect(pmd);
+   }
+   pmd = pmd_mkold(pmd);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 
ret = 0;
@@ -1064,8 +1067,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
/* No huge zero pud yet */
}
 
-   pudp_set_wrprotect(src_mm, addr, src_pud);
-   pud = pud_mkold(pud_wrprotect(pud));
+   if (pud_write(pud)) {
+   pudp_set_wrprotect(src_mm, addr, src_pud);
+   pud = pud_wrprotect(pud);
+   }
+   pud = pud_mkold(pud);
set_pud_at(dst_mm, addr, dst_pud, pud);
 
ret = 0;
-- 
2.18.0



[PATCH 0/3] mm: dirty/accessed pte optimisations

2018-08-28 Thread Nicholas Piggin
Here are some patches that didn't get much comment last time. It
looks like x86 might benefit too though, so that might get people
interested. 

I improved changelogs and added some comments, but no real logic
changes.

I hope I didn't get the x86 numbers wrong, they're more significant
than I expected so it could quite well be a problem with my test
(corrections welcome). Any data from other archs would be interesting
too.

Andrew perhaps if there aren't objections these could go in mm for
a while. 

Thanks,
Nick


Nicholas Piggin (3):
  mm/cow: don't bother write protectig already write-protected huge
pages
  mm/cow: optimise pte dirty/accessed bits handling in fork
  mm: optimise pte dirty/accessed bit setting by demand based pte
insertion

 mm/huge_memory.c | 24 +++-
 mm/memory.c  | 18 ++
 mm/vmscan.c  |  8 
 3 files changed, 33 insertions(+), 17 deletions(-)

-- 
2.18.0



[PATCH 2/3] mm/cow: optimise pte dirty/accessed bits handling in fork

2018-08-28 Thread Nicholas Piggin
fork clears dirty/accessed bits from new ptes in the child. This logic
has existed since mapped page reclaim was done by scanning ptes when
it may have been quite important. Today with physical based pte
scanning, there is less reason to clear these bits. Dirty bits are all
tested and cleared together and any dirty bit is the same as many
dirty bits. Any young bit is treated similarly to many young bits, but
not quite the same. A comment has been added where there is some
difference.

This eliminates a major source of faults powerpc/radix requires to set
dirty/accessed bits in ptes, speeding up a fork/exit microbenchmark by
about 5% on POWER9 (16600 -> 17500 fork/execs per second).

Skylake appears to have a micro-fault overhead too -- a test which
allocates 4GB anonymous memory, reads each page, then forks, and times
the child reading a byte from each page. The first pass over the pages
takes about 1000 cycles per page, the second pass takes about 27
cycles (TLB miss). With no additional minor faults measured due to
either child pass, and the page array well exceeding TLB capacity, the
large cost must be caused by micro-faults from setting the accessed
bit.
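
[Not part of the patch: an illustrative userspace sketch of the kind of
test described above; the sizes and timing method here are arbitrary.]

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define SIZE	(4UL << 30)	/* 4GB of anonymous memory */
#define PAGE	4096UL

int main(void)
{
	char *mem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	volatile char sink;
	unsigned long i;

	if (mem == MAP_FAILED)
		return 1;

	for (i = 0; i < SIZE; i += PAGE)	/* first pass: touch each page */
		sink = mem[i];

	if (fork() == 0) {
		struct timespec t0, t1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < SIZE; i += PAGE)	/* child pass */
			sink = mem[i];
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("child pass: %ld ns/page\n",
		       ((t1.tv_sec - t0.tv_sec) * 1000000000L +
			(t1.tv_nsec - t0.tv_nsec)) / (long)(SIZE / PAGE));
		_exit(0);
	}
	wait(NULL);
	return 0;
}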

Signed-off-by: Nicholas Piggin 
---
 mm/huge_memory.c |  2 --
 mm/memory.c  | 10 +-
 mm/vmscan.c  |  8 
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9bae12978ef..5fb1a43e12e0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -977,7 +977,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
pmdp_set_wrprotect(src_mm, addr, src_pmd);
pmd = pmd_wrprotect(pmd);
}
-   pmd = pmd_mkold(pmd);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 
ret = 0;
@@ -1071,7 +1070,6 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
pudp_set_wrprotect(src_mm, addr, src_pud);
pud = pud_wrprotect(pud);
}
-   pud = pud_mkold(pud);
set_pud_at(dst_mm, addr, dst_pud, pud);
 
ret = 0;
diff --git a/mm/memory.c b/mm/memory.c
index b616a69ad770..3d8bf8220bd0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1038,12 +1038,12 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct 
*src_mm,
}
 
/*
-* If it's a shared mapping, mark it clean in
-* the child
+* Child inherits dirty and young bits from parent. There is no
+* point clearing them because any cleaning or aging has to walk
+* all ptes anyway, and it will notice the bits set in the parent.
+* Leaving them set avoids stalls and even page faults on CPUs that
+* handle these bits in software.
 */
-   if (vm_flags & VM_SHARED)
-   pte = pte_mkclean(pte);
-   pte = pte_mkold(pte);
 
page = vm_normal_page(vma, addr, pte);
if (page) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7e7d25504651..52fe64af3d80 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1021,6 +1021,14 @@ static enum page_references page_check_references(struct 
page *page,
 * to look twice if a mapped file page is used more
 * than once.
 *
+* fork() will set referenced bits in child ptes despite
+* not having been accessed, to avoid micro-faults of
+* setting accessed bits. This heuristic is not perfectly
+* accurate in other ways -- multiple map/unmap in the
+* same time window would be treated as multiple references
+* despite same number of actual memory accesses made by
+* the program.
+*
 * Mark it and spare it for another trip around the
 * inactive list.  Another page table reference will
 * lead to its activation.
-- 
2.18.0



Re: [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

2018-08-28 Thread Nicholas Piggin
On Tue, 28 Aug 2018 15:46:38 +0200
Peter Zijlstra  wrote:

> On Mon, Aug 27, 2018 at 02:44:57PM +1000, Nicholas Piggin wrote:
> 
> > powerpc may be able to use the unmap granule thing to improve
> > its page size dependent flushes, but it might prefer to go
> > a different way and track start-end for different page sizes.  
> 
> I don't really see how tracking multiple ranges would help much with
> THP. The ranges would end up being almost the same if there is a good
> mix of page sizes.

That's assuming quite large unmaps. But a lot of the time they are
going to go to a full PID flush.

> 
> But something like:
> 
> void tlb_flush_one(struct mmu_gather *tlb, unsigned long addr)
> {
>   if (tlb->cleared_ptes && (addr << BITS_PER_LONG - PAGE_SHIFT))
>   tblie_pte(addr);
>   if (tlb->cleared_pmds && (addr << BITS_PER_LONG - PMD_SHIFT))
>   tlbie_pmd(addr);
>   if (tlb->cleared_puds && (addr << BITS_PER_LONG - PUD_SHIFT))
>   tlbie_pud(addr);
> }
> 
> void tlb_flush_range(struct mmu_gather *tlb)
> {
>   unsigned long stride = 1UL << tlb_get_unmap_shift(tlb);
>   unsigned long addr;
> 
>   for (addr = tlb->start; addr < tlb->end; addr += stride)
>   tlb_flush_one(tlb, addr);
> 
>   ptesync();
> }
> 
> Should workd I think. You'll only issue multiple TLBIEs on the
> boundaries, not every stride.

Yeah we already do basically that today in the flush_tlb_range path,
just without the precise test for which page sizes

if (hflush) {
hstart = (start + PMD_SIZE - 1) & PMD_MASK;
hend = end & PMD_MASK;
if (hstart == hend)
hflush = false;
}

if (gflush) {
gstart = (start + PUD_SIZE - 1) & PUD_MASK;
gend = end & PUD_MASK;
if (gstart == gend)
gflush = false;
}

asm volatile("ptesync": : :"memory");
if (local) {
__tlbiel_va_range(start, end, pid, page_size, 
mmu_virtual_psize);
if (hflush)
__tlbiel_va_range(hstart, hend, pid,
PMD_SIZE, MMU_PAGE_2M);
if (gflush)
__tlbiel_va_range(gstart, gend, pid,
PUD_SIZE, MMU_PAGE_1G);
asm volatile("ptesync": : :"memory");

Thing is I think it's the smallish range cases you want to optimize
for. And for those we'll probably do something even smarter (like keep
a bitmap of pages to flush) because we really want to keep tlbies off
the bus whereas that's less important for x86.
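
(Just to illustrate the shape of the bitmap idea -- entirely
hypothetical, the helper names are made up:)

/* Hypothetical sketch: remember which pages of a small window were
 * unmapped, then invalidate only the set bits at flush time. */
#define FLUSH_WINDOW_PAGES	64

struct flush_bitmap {
	unsigned long	base;	/* page-aligned start of the window */
	unsigned long	bits;	/* one bit per page in the window */
};

static void note_page(struct flush_bitmap *fb, unsigned long addr)
{
	unsigned long idx = (addr - fb->base) >> PAGE_SHIFT;

	if (idx < FLUSH_WINDOW_PAGES)
		fb->bits |= 1UL << idx;
	/* else: fall back to a range or full PID flush (not shown) */
}

static void flush_bits(struct flush_bitmap *fb)
{
	unsigned long bits = fb->bits;

	while (bits) {
		unsigned long idx = __builtin_ctzl(bits);

		tlbiel_va(fb->base + (idx << PAGE_SHIFT));	/* hypothetical helper */
		bits &= bits - 1;
	}
	fb->bits = 0;
}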

Still not really seeing a reason not to implement a struct
arch_mmu_gather. A little bit of data contained to the arch is nothing
compared with the multitude of hooks and divergence of code.

Thanks,
Nick


Re: VirtIO console hangs

2018-08-28 Thread Nicholas Piggin
On Tue, 28 Aug 2018 12:54:08 +
Matteo Croce  wrote:

> With kernel 4.19.0-rc1 virtio_console hangs very often.
> I can always trigger the bug by pasting some characters in the
> terminal window, the console will stop receiving keypresses, but I can
> still see output from the console.
> Strangely, logging in the VM via SSH and sending lots of data to hvc0,
> like 'dmesg >/dev/hvc0' will fix the issue until the next paste.
> 
> I did a git bisect and I've found that this is the offending commit,
> reverting it fixes it.
> 
> Cheers,
> 
> commit ec97eaad1383ab2500fcf9a07ade6044fbcc67f5
> Author: Nicholas Piggin 
> Date:   Tue May 1 00:55:54 2018 +1000
> 
> tty: hvc: hvc_poll() break hv read loop

Thanks for the report. I can't immediately see what the problem
is. Can you try get a stack trace of where it is stuck?

Perhaps try this patch if you have time (it's a bit of a shot
in the dark).

Thanks,
Nick

diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c
index 5414c4a87bea..ac45235832f1 100644
--- a/drivers/tty/hvc/hvc_console.c
+++ b/drivers/tty/hvc/hvc_console.c
@@ -635,6 +635,7 @@ static int __hvc_poll(struct hvc_struct *hp, bool may_sleep)
int read_total = 0;
int written_total = 0;
 
+again:
spin_lock_irqsave(&hp->lock, flags);
 
/* Push pending writes */
@@ -721,6 +722,10 @@ static int __hvc_poll(struct hvc_struct *hp, bool 
may_sleep)
poll_mask |= HVC_POLL_READ;
read_total = n;
 
+   spin_unlock_irqrestore(&hp->lock, flags);
+   cond_resched();
+   goto again;
+
  out:
/* Wakeup write queue if necessary */
if (hp->do_wakeup) {


Re: VirtIO console hangs

2018-08-28 Thread Nicholas Piggin
On Tue, 28 Aug 2018 15:00:14 +
Matteo Croce  wrote:

> On Tue, Aug 28, 2018 at 2:35 PM Nicholas Piggin  wrote:
> >
> > On Tue, 28 Aug 2018 12:54:08 +
> > Matteo Croce  wrote:
> >  
> > > With kernel 4.19.0-rc1 virtio_console hangs very often.
> > > I can always trigger the bug by pasting some characters in the
> > > terminal window, the console will stop receiving keypresses, but I can
> > > still see output from the console.
> > > Strangely, logging in the VM via SSH and sending lots of data to hvc0,
> > > like 'dmesg >/dev/hvc0' will fix the issue until the next paste.
> > >
> > > I did a git bisect and I've found that this is the offending commit,
> > > reverting it fixes it.
> > >
> > > Cheers,
> > >
> > > commit ec97eaad1383ab2500fcf9a07ade6044fbcc67f5
> > > Author: Nicholas Piggin 
> > > Date:   Tue May 1 00:55:54 2018 +1000
> > >
> > > tty: hvc: hvc_poll() break hv read loop  
> >
> > Thanks for the report. I can't immediately see what the problem
> > is. Can you try get a stack trace of where it is stuck?
> >  
> 
> I tried but didn't get one.
> 
> > Perhaps try this patch if you have time (it's a bit of a shot
> > in the dark).
> >  
> 
> Yes it seems to fix. Thanks!

Great, thanks for testing it so quickly.

I would have thought the return code should cause another poll to
trigger. I'll have to read the code again and try to find out what
I missed.

Thanks,
Nick


Re: [PATCH 2/4] mm/tlb: Remove tlb_remove_table() non-concurrent condition

2018-08-22 Thread Nicholas Piggin
On Wed, 22 Aug 2018 17:30:14 +0200
Peter Zijlstra  wrote:

> Will noted that only checking mm_users is incorrect; we should also
> check mm_count in order to cover CPUs that have a lazy reference to
> this mm (and could do speculative TLB operations).

Why is that incorrect?

This shortcut has nothing to do with no TLBs -- not sure about x86, but
other CPUs can certainly have remaining TLBs here, speculative
operations or not (even if they don't have an mm_count ref they can
have TLBs here).

So that leaves speculative operations. I don't see where the problem is
with those either -- this shortcut needs to ensure there are no other
*non speculative* operations. mm_users is correct for that.

If there is a speculation security problem here it should be carefully
documented otherwise it's going to be re-introduced...

I actually have a patch to extend this optimisation further that I'm
going to send out again today. It's nice to avoid the double handling
of the pages.

Thanks,
Nick

> 
> If removing this turns out to be a performance issue, we can
> re-instate a more complete check, but in tlb_table_flush() eliding the
> call_rcu_sched().
> 
> Cc: sta...@kernel.org
> Cc: Nicholas Piggin 
> Cc: David Miller 
> Cc: Will Deacon 
> Cc: Martin Schwidefsky 
> Cc: Michael Ellerman 
> Fixes: 267239116987 ("mm, powerpc: move the RCU page-table freeing into 
> generic code")
> Reported-by: Will Deacon 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  mm/memory.c |9 -
>  1 file changed, 9 deletions(-)
> 
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -375,15 +375,6 @@ void tlb_remove_table(struct mmu_gather
>  {
>   struct mmu_table_batch **batch = &tlb->batch;
>  
> - /*
> -  * When there's less then two users of this mm there cannot be a
> -  * concurrent page-table walk.
> -  */
> - if (atomic_read(&tlb->mm->mm_users) < 2) {
> - __tlb_remove_table(table);
> - return;
> - }
> -
>   if (*batch == NULL) {
>   *batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | 
> __GFP_NOWARN);
>   if (*batch == NULL) {
> 
> 



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Nicholas Piggin
On Wed, 22 Aug 2018 17:55:27 +0200
Peter Zijlstra  wrote:

> On Wed, Aug 22, 2018 at 05:30:15PM +0200, Peter Zijlstra wrote:
> > ARM
> > which later used this put an explicit TLB invalidate in their
> > __p*_free_tlb() functions, and PowerPC-radix followed that example.  
> 
> > +/*
> > + * If we want tlb_remove_table() to imply TLB invalidates.
> > + */
> > +static inline void tlb_table_invalidate(struct mmu_gather *tlb)
> > +{
> > +#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
> > +   /*
> > +* Invalidate page-table caches used by hardware walkers. Then we still
> > +* need to RCU-sched wait while freeing the pages because software
> > +* walkers can still be in-flight.
> > +*/
> > +   __tlb_flush_mmu_tlbonly(tlb);
> > +#endif
> > +}  
> 
> 
> Nick, Will is already looking at using this to remove the synchronous
> invalidation from __p*_free_tlb() for ARM, could you have a look to see
> if PowerPC-radix could benefit from that too?

powerpc/radix has no such issue, it already does this tracking.

We were discussing this a couple of months ago, I wasn't aware of ARM's
issue but I suggested x86 could go the same way as powerpc. Would have
been good to get some feedback on some of the proposed approaches there.
Because it's not just pwc tracking but if you do this you don't need
those silly hacks in generic code to expand the TLB address range
either.

So powerpc has no fundamental problem with making this stuff generic.
If you need a fix for x86 and ARM for this merge window though, I would
suggest just copying what powerpc already has. Next time we can
consider consolidating them all into generic code.

Thanks,
Nick


> 
> Basically, using a patch like the below, would give your tlb_flush()
> information on if tables were removed or not.
> 
> ---
> 
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -96,12 +96,22 @@ struct mmu_gather {
>  #endif
>   unsigned long   start;
>   unsigned long   end;
> - /* we are in the middle of an operation to clear
> -  * a full mm and can make some optimizations */
> - unsigned intfullmm : 1,
> - /* we have performed an operation which
> -  * requires a complete flush of the tlb */
> - need_flush_all : 1;
> + /*
> +  * we are in the middle of an operation to clear
> +  * a full mm and can make some optimizations
> +  */
> + unsigned intfullmm : 1;
> +
> + /*
> +  * we have performed an operation which
> +  * requires a complete flush of the tlb
> +  */
> + unsigned intneed_flush_all : 1;
> +
> + /*
> +  * we have removed page directories
> +  */
> + unsigned intfreed_tables : 1;
>  
>   struct mmu_gather_batch *active;
>   struct mmu_gather_batch local;
> @@ -136,6 +146,7 @@ static inline void __tlb_reset_range(str
>   tlb->start = TASK_SIZE;
>   tlb->end = 0;
>   }
> + tlb->freed_tables = 0;
>  }
>  
>  static inline void tlb_remove_page_size(struct mmu_gather *tlb,
> @@ -269,6 +280,7 @@ static inline void tlb_remove_check_page
>  #define pte_free_tlb(tlb, ptep, address) \
>   do {\
>   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
> + tlb->freed_tables = 1;  \
>   __pte_free_tlb(tlb, ptep, address); \
>   } while (0)
>  #endif
> @@ -276,7 +288,8 @@ static inline void tlb_remove_check_page
>  #ifndef pmd_free_tlb
>  #define pmd_free_tlb(tlb, pmdp, address) \
>   do {\
> - __tlb_adjust_range(tlb, address, PAGE_SIZE);\
> + __tlb_adjust_range(tlb, address, PAGE_SIZE);\
> + tlb->freed_tables = 1;  \
>   __pmd_free_tlb(tlb, pmdp, address); \
>   } while (0)
>  #endif
> @@ -286,6 +299,7 @@ static inline void tlb_remove_check_page
>  #define pud_free_tlb(tlb, pudp, address) \
>   do {\
>   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
> + tlb->freed_tables = 1;  \
>   __pud_free_tlb(tlb, pudp, address); \
>   } while (0)
>  #endif
> @@ -295,7 +309,8 @@ static inline void tlb_remove_check_page
>  #ifndef p4d_free_tlb
>  #define p4d_free_tlb(tlb, pudp, address) \
>   do {\
> - __tlb_adjust_range(tlb, address, PAGE_SIZE);\
> + __tlb_adjust_range(tlb, address, PAGE_SIZE);\
> + tlb->freed_tables = 1;  \
>   __p4d_free_tlb(tlb, pudp, address); \
>   } while (0)
>  #endif



Re: [PATCH 2/4] mm/tlb: Remove tlb_remove_table() non-concurrent condition

2018-08-22 Thread Nicholas Piggin
On Wed, 22 Aug 2018 20:35:16 -0700
Linus Torvalds  wrote:

> On Wed, Aug 22, 2018 at 8:31 PM Nicholas Piggin  wrote:
> >
> >
> > So that leaves speculative operations. I don't see where the problem is
> > with those either -- this shortcut needs to ensure there are no other
> > *non speculative* operations. mm_users is correct for that.  
> 
> No. Because mm_users doesn't contain any lazy tlb users.
> 
> And yes, those lazy tlbs are all kernel threads, but they can still
> speculatively load user addresses.

So?

If the arch does not shoot those all down after the user page tables
are removed then it's buggy regardless of this short cut.

The only real problem I could see would be if a page walk cache still
points to the freed table, then the table gets re-allocated and used
elsewhere, and meanwhile a speculative access tries to load an entry
from the page that is an invalid form of page table that might cause
a machine check or something. That would be (u)arch specific, but if
that's what we're concerned with here it's a different issue and needs
to be documented as such.

I'll have a look at powerpc and see if we can cope with it. If so, I'll
make it an arch specific opt-in short cut.

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Nicholas Piggin
On Wed, 22 Aug 2018 20:59:46 -0700
Linus Torvalds  wrote:

> On Wed, Aug 22, 2018 at 8:45 PM Nicholas Piggin  wrote:
> >
> > powerpc/radix has no such issue, it already does this tracking.  
> 
> Yeah, I now realize that this was why you wanted to add that hacky
> thing to the generic code, so that you can add the tlb_flush_pgtable()
> call.
> 
> I thought it was because powerpc had some special flush instruction
> for it, and the regular tlb flush didn't do it. But no.

powerpc/radix does have a special instruction for it, that is why I
posted the patch :)

> It was because
> the regular code had lost the tlb flush _entirely_, because powerpc
> didn't want it.

I think that was long before I started looking at the code.
powerpc/hash hardware has no idea about the page tables so yeah they
don't need it.

> 
> > We were discussing this a couple of months ago, I wasn't aware of ARM's
> > issue but I suggested x86 could go the same way as powerpc.  
> 
> The problem is that x86 _used_ to do this all correctly long long ago.
> 
> And then we switched over to the "generic" table flushing (which
> harkens back to the powerpc code).
> 
> Which actually turned out to be not generic at all, and did not flush
> the internal pages like x86 used to (back when x86 just used
> tlb_remove_page for everything).
> 
> So as a result, x86 had unintentionally lost the TLB flush we used to
> have, because tlb_remove_table() had lost the tlb flushing because of
> a powerpc quirk.
> 
> You then added it back as a hacky per-architecture hook (apparently
> having realized that you never did it at all), which didn't fix the

I think it was quite well understood and fixed here (a145abf12c9), but
again that was before I really started looking at it.

The hooks I added recently are for a different reason, and it's
actually the opposite problem -- to work around the hacky generic code
that x86 foisted on other archs.

> unintentional lack of flushing on x86.
> 
> So now we're going to do it right.  No more "oh, powerpc didn't need
> to flush because the hash tables weren't in the tlb at all" thing in
> the generic code that then others need to work around.

I don't really understand what issue you have with powerpc here.
powerpc hash has the page table flushing accessors which are just
no-ops, it's the generic code that fails to call them properly. Surely
there was no powerpc patch that removed those calls from generic code?

powerpc/radix yes it does some arch specific things to do its page walk
cache flushing, but it is a better design than the hacks x86 has in
generic code, surely. I thought you basically agreed and thought x86 /
generic code could move to that kind of model.

Thanks,
Nick


Re: [PATCH 2/4] mm/tlb: Remove tlb_remove_table() non-concurrent condition

2018-08-22 Thread Nicholas Piggin
On Wed, 22 Aug 2018 21:54:48 -0700
Linus Torvalds  wrote:

> On Wed, Aug 22, 2018 at 9:16 PM Nicholas Piggin  wrote:
> >
> > On Wed, 22 Aug 2018 20:35:16 -0700
> > Linus Torvalds  wrote:  
> > >
> > > And yes, those lazy tlbs are all kernel threads, but they can still
> > > speculatively load user addresses.  
> >
> > So?
> >
> > If the arch does not shoot those all down after the user page tables
> > are removed then it's buggy regardless of this short cut.  
> 
> Read the code.
> 
> That shortcut frees the pages *WITHOUT* doing that TLB flush.

I've read the code and I understand that. Hence I'm asking, because I
didn't think the changelog matched the change. But possibly it was
because I actually didn't read the changelog closely enough -- I guess
it does say the TLB operation is the problem. I may have got sidetracked
by the word speculative.

> It just
> does __tlb_remove_table(), which does *not* do that whole page
> queueing so that we can batch flush and then free later.
> 
> So the fast-case it's buggy, exactly because of the reason you state.

Okay sure, thanks for confirming it for me. I would ask for the changelog
to be expanded slightly, but maybe it's just that my reading
comprehension needs improvement...

> > The only real problem I could see would be if a page walk cache still
> > points to the freed table, then the table gets re-allocated and used
> > elsewhere, and meanwhile a speculative access tries to load an entry
> > from the page that is an invalid form of page table that might cause
> > a machine check or something. That would be (u)arch specific, but if
> > that's what we're concerned with here it's a different issue and needs
> > to be documented as such.  
> 
> We've *seen* that case, we had exactly that when we were being
> aggressive about trying to avoid flushes for the lazy mmu case
> entirely, because "we can just flush when we activate the lazy mm
> state instead".
> 
> The logic was actually solid from a TLB case - who cares what garbage
> TLB entries we might speculatively fill, when we're going to flush
> them before they can possibly be used.
> 
> It turns out that that logic is solid, but hits another problem: at
> least some AMD CPU's will cause machine checks when the TLB entries
> are inconsistent with the machine setup. That can't happen with a
> *good* page table, but when you speculatively walk already freed
> tables, you can get any garbage.

Yeah that does make sense.

> 
> I forget what the exact trigger was, but this was actually reported.
> So you can't free page directory pages without flushing the tlb first
> (to make that internal tlb node caches are flushed).
> 
> So the logic for freeing leaf pages and freeing middle nodes has to be
> exactly the same: you make the modification to the page table to
> remove the node/leaf, you queue up the memory for freeing, you flush
> the tlb, and you then free the page.
> 
> That's the ordering that tlb_remove_page() has always honored, but
> that tlb_remove_tabl() didn't.
> 
> It honored it for the *normal* case, which is why it took so long to
> notice that the TLB shootdown had been broken on x86 when it moved to
> the "generic" code. The *normal* case does this all right, and batches
> things up, and then when the batch fills up it does a
> tlb_table_flush() which does the TLB flush and schedules the actual
> freeing.
> 
> But there were two cases that *didn't* do that. The special "I'm the
> only thread" fast case, and the "oops I ran out of memory, so now I'll
> fake it, and just synchronize with page twalkers manually, and then do
> that special direct remove without flushing the tlb".
> 
> NOTE! Jann triggered that one by
>  (a) forcing the machine low on memory
>  (b) force-poisoning the page tables immediately after free
> 
> I suspect it's really hard to trigger under normal loads, exactly
> because the *normal* case gets it right. It's literally the "oops, I
> can't batch up because I ran out of memory" case that gets it wrong.
> 
> (And the special single-thread + lazy or use_mm() case, but that's
> going to be entirely impossible to trigger, because in practice it's
> just a single thread, and you won't ever hit the magic timing needed
> that frees the page in the single thread at exactly the same time that
> some optimistic lazy mm on another cpu happens to speculatively load
> that address).
> 
> So the "atomic_read(&mm_users)" case is likely entirely impossible to
> trigger any sane way. But because Jann found the problem with the 'ran
> out of memory" case, we started looking at the more theoretical cases
> that matched the same kind of "no tlb flush before free" pattern.

Thanks for giving that background. In that case I'm happy with this fix.

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Nicholas Piggin
On Wed, 22 Aug 2018 22:03:40 -0700
Linus Torvalds  wrote:

> On Wed, Aug 22, 2018 at 9:33 PM Nicholas Piggin  wrote:
> >
> > I think it was quite well understood and fixed here, a145abf12c9 but
> > again that was before I really started looking at it.  
> 
> You don't understand the problem.

More fundamentally, I think I didn't understand this fix; I think
powerpc/radix actually does have a bug here. a145abf12c9 was really
just a replacement for x86's hack of expanding the TLB invalidation
range when freeing page tables to capture the page walk cache
(powerpc/radix needs a different instruction so that didn't work for us).

But I hadn't really looked at this fix closely, rather at Peter's
follow-up post about making the powerpc page walk cache flushing design
a generic concept.

My point in this reply was more that my patches from the other month
weren't a blundering attempt to fix this bug without realising it; they
were purely about avoiding the x86 TLB range expanding hack (which won't
be needed if generic users all move over).

> 
> All the x86 people thought WE ALREADY DID THAT.
> 
> Because we had done this all correctly over a decade ago!
> 
> Nobody realized that it had been screwed up by the powerpc code, and

The powerpc/hash code is not screwed up though AFAIKS. You can't
take arch specific code and slap a "generic" label on it, least of all
the crazy powerpc/hash code, you of all people would agree with that :)

> the commit you point to was believed to be a new *powerpc* only issue,
> because the semantics on powerpc has changed because of the radix
> tree.
> 
> The semantics on x86 have never changed, they've always been the same.
> So why would the x86 people react to powerpc doing something that x86
> had already always done.
> 
> See?
> 
> Nobody cared one whit about commit a145abf12c9, because it just
> handles a new powerpc-specific case.
> 
> > I don't really understand what the issue you have with powerpc here.
> > powerpc hash has the page table flushing accessors which are just
> > no-ops, it's the generic code that fails to call them properly. Surely
> > there was no powerpc patch that removed those calls from generic code?  
> 
> Yes there was.
> 
> Look where the generic code *came* from.
> 
> It's powerpc code.
> 
> See commit 267239116987 ("mm, powerpc: move the RCU page-table freeing
> into generic code").
> 
> The powerpc code was made into the generic code, because the powerpc
> code had to handle all those special RCU freeing things etc that
> others didn't.
> 
> It's just that when x86 was then switched over to use the "generic"
> code, people didn't realize that the generic code didn't do the TLB
> invalidations for page tables, because they hadn't been needed on
> powerpc.

Sure, there was a minor bug in the port. Not that it was a closely
guarded secret that powerpc didn't flush page table pages, but it's a
relatively subtle issue in complex code. That happens.

> 
> So the powerpc code that was made generic, never really was. The new
> "generic" code had a powerpc-specific quirk.
> 
> That then very subtly broke x86, without the x86 people ever
> realizing. Because the old simple non-RCU x86 code had never had that
> issue, it just treated the leaf pages and the directory pages exactly
> the same.
> 
> See?
> 
> And THAT is why I talk about the powerpc code. Because what is
> "generic" code in 4.18 (and several releases before) oisn't actually
> generic.
> 
> And that's basically exactly the bug that the patches from PeterZ is
> fixing. Making the "tlb_remove_table()" code always flush the tlb, the
> way it should have when it was made generic.

It just sounded like you were blaming correct powerpc/hash code for
this. It's just a minor bug in making that code generic, not really
a big deal, right? Or are you saying powerpc devs or code could be doing
something better to play nicer with the rest of the archs?

Honestly trying to improve things here, and encouraged by x86 and ARM
looking to move over to a saner page walk cache tracking design and
sharing more code with powerpc/radix. I would help with reviewing
things or writing code or porting powerpc bits if I can.

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Nicholas Piggin
On Thu, 23 Aug 2018 15:21:30 +1000
Benjamin Herrenschmidt  wrote:

> On Wed, 2018-08-22 at 22:11 -0700, Linus Torvalds wrote:
> > On Wed, Aug 22, 2018 at 9:54 PM Benjamin Herrenschmidt  
> > wrote:  
> > > 
> > > 
> > > So we do need a different flush instruction for the page tables vs. the
> > > normal TLB pages.  
> > 
> > Right. ARM wants it too. x86 is odd in that a regular "invlpg" already
> > invalidates all the internal tlb cache nodes.
> > 
> > So the "new world order" is exactly that patch that PeterZ sent you, that 
> > adds a
> > 
> > +   unsigned intfreed_tables : 1;
> >   
> 
>  .../...
> 
> > So instead, when you get to the actual "tlb_flush(tlb)", you do
> > exactly that - flush the tlb. And the mmu_gather structure shows you
> > how much you need to flush. If you see that "freed_tables" is set,
> > then you know that you need to also do the special instruction to
> > flush the inner level caches. The range continues to show the page
> > range.  
> 
> Yup. That looks like a generic version of the "need_flush_all" flag we
> have, which is fine by us.
> 
> Just don't blame powerpc for all the historical crap :-)

And yes we very much want to remove the x86 hacks from generic code and
have them use the sane powerpc/radix page walk cache flushing model.
That would indeed allow us to stop overriding those macros and start
sharing more code with other archs. We can help write or review code to
make sure bugs don't creep in when moving it to a generic implementation.
It will be much more relevant to us now because radix is very similar to
x86.

The hack is not the powerpc override macros though, let's be clear
about that. Even x86 will be helped out by removing that crap, because
it won't have to do a full TLB flush caused by a massive flush range if
it frees zero to a small number of pages that happen to also free some
page tables.
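
As a rough sketch of what I mean, an arch tlb_flush() could consume the
new flag like this (placeholder helper names, not the real x86 or
powerpc functions):

static inline void example_tlb_flush(struct mmu_gather *tlb)
{
        if (!tlb->end)
                return;

        /* Ranged invalidation, sized by what was actually unmapped. */
        arch_flush_tlb_range(tlb->mm, tlb->start, tlb->end);

        /*
         * Pay for the page walk cache flush only when page table pages
         * were actually freed, rather than escalating to a full flush
         * just because the unmapped range happened to be large.
         */
        if (tlb->freed_tables)
                arch_flush_page_walk_cache(tlb->mm);
}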

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-23 Thread Nicholas Piggin
On Wed, 22 Aug 2018 17:30:15 +0200
Peter Zijlstra  wrote:

> Jann reported that x86 was missing required TLB invalidates when he
> hit the !*batch slow path in tlb_remove_table().
> 
> This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
> invalidates, the PowerPC-hash where this code originated and the
> Sparc-hash where this was subsequently used did not need that. ARM
> which later used this put an explicit TLB invalidate in their
> __p*_free_tlb() functions, and PowerPC-radix followed that example.

So this is interesting, I _think_ a145abf12c did fix this bug for
powerpc, but then it seems to have been re-broken by a46cc7a90f
because that one defers the flush back to tlb_flush time. There
was quite a lot of churn getting the radix MMU off the ground at
that point though so I'm not 100% sure.

But AFAIKS powerpc today has this same breakage, and this patch
should fix it.

I have a couple of patches that touch the same code I'll send, you
might have some opinions on them.

Thanks,
Nick



[RFC PATCH 2/2] mm: mmu_notifier fix for tlb_end_vma

2018-08-23 Thread Nicholas Piggin
The generic tlb_end_vma does not call invalidate_range mmu notifier,
and it resets the mmu_gather range, which means the notifier
won't be called on part of the range in case of an unmap that spans
multiple vmas.

ARM64 seems to be the only arch I could see that has notifiers and
uses the generic tlb_end_vma. I have not actually tested it.

Signed-off-by: Nicholas Piggin 
---
 include/asm-generic/tlb.h | 17 +
 mm/memory.c   | 10 --
 2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e811ef7b8350..79cb76b89c26 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -15,6 +15,7 @@
 #ifndef _ASM_GENERIC__TLB_H
 #define _ASM_GENERIC__TLB_H
 
+#include 
 #include 
 #include 
 #include 
@@ -138,6 +139,16 @@ static inline void __tlb_reset_range(struct mmu_gather 
*tlb)
}
 }
 
+static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
+{
+   if (!tlb->end)
+   return;
+
+   tlb_flush(tlb);
+   mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
+   __tlb_reset_range(tlb);
+}
+
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, int page_size)
 {
@@ -186,10 +197,8 @@ static inline void 
tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 
 #define __tlb_end_vma(tlb, vma)\
do {\
-   if (!tlb->fullmm && tlb->end) { \
-   tlb_flush(tlb); \
-   __tlb_reset_range(tlb); \
-   }   \
+   if (!tlb->fullmm)   \
+   tlb_flush_mmu_tlbonly(tlb); \
} while (0)
 
 #ifndef tlb_end_vma
diff --git a/mm/memory.c b/mm/memory.c
index 7c58310734eb..c4a7b6cb17c8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -238,16 +238,6 @@ void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct 
mm_struct *mm,
__tlb_reset_range(tlb);
 }
 
-static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
-{
-   if (!tlb->end)
-   return;
-
-   tlb_flush(tlb);
-   mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
-   __tlb_reset_range(tlb);
-}
-
 static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 {
struct mmu_gather_batch *batch;
-- 
2.17.0



[RFC PATCH 1/2] mm: move tlb_table_flush to tlb_flush_mmu_free

2018-08-23 Thread Nicholas Piggin
There is no need to call this from tlb_flush_mmu_tlbonly, it
logically belongs with tlb_flush_mmu_free. This allows some
code consolidation with a subsequent fix.

Signed-off-by: Nicholas Piggin 
---
 mm/memory.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 19f47d7b9b86..7c58310734eb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -245,9 +245,6 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 
tlb_flush(tlb);
mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
-   tlb_table_flush(tlb);
-#endif
__tlb_reset_range(tlb);
 }
 
@@ -255,6 +252,9 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 {
struct mmu_gather_batch *batch;
 
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+   tlb_table_flush(tlb);
+#endif
for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
free_pages_and_swap_cache(batch->pages, batch->nr);
batch->nr = 0;
-- 
2.17.0



[RFC PATCH 0/2] minor mmu_gather patches

2018-08-23 Thread Nicholas Piggin
These are split from some patches I posted a while back, I was going
to take a look and revive the series again after your fixes go in,
but having another look, it may be that your "[PATCH 3/4] mm/tlb,
x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE" becomes
easier after my patch 1.

And I'm not convinced patch 2 is not a real bug at least for ARM64,
so it may be possible to squeeze it in if it's reviewed very
carefully (I need to actually reproduce and trace it).

So not signed off by yet, but if you think it might be worth doing
these with your changes, it could be a slightly cleaner end result?

Thanks,
Nick


Nicholas Piggin (2):
  mm: move tlb_table_flush to tlb_flush_mmu_free
  mm: mmu_notifier fix for tlb_end_vma

 include/asm-generic/tlb.h | 17 +
 mm/memory.c   | 14 ++
 2 files changed, 15 insertions(+), 16 deletions(-)

-- 
2.17.0



Re: [RFC PATCH 0/2] minor mmu_gather patches

2018-08-23 Thread Nicholas Piggin
On Thu, 23 Aug 2018 12:15:37 -0700
Linus Torvalds  wrote:

> On Thu, Aug 23, 2018 at 1:47 AM Nicholas Piggin  wrote:
> >
> > These are split from some patches I posted a while back, I was going
> > to take a look and revive the series again after your fixes go in,
> > but having another look, it may be that your "[PATCH 3/4] mm/tlb,
> > x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE" becomes
> > easier after my patch 1.
> >
> > And I'm not convinced patch 2 is not a real bug at least for ARM64,
> > so it may be possible to squeeze it in if it's reviewed very
> > carefully (I need to actually reproduce and trace it).
> >
> > So not signed off by yet, but if you think it might be worth doing
> > these with your changes, it could be a slightly cleaner end result?  
> 
> Actually, you did have sign-offs, and yes, that patch 1/2 does
> actually clean up and simplify the HAVE_RCU_TABLE_INVALIDATE fix, so I
> decided to mix these in with PeterZ's series.
> 
> And since it turns out that patch doesn't apparently matter for
> correctness and doesn't need to be backported to stable, I put it at
> the end of the series together with the x86 cleanup patch to avoid the
> unnecessary RCU-delayed freeing entirely for the non-PV case.
> 
> So right now my "tlb-fixes" branch looks like this:
> 
> x86/mm/tlb: Revert the recent lazy TLB patches
>  *  mm: move tlb_table_flush to tlb_flush_mmu_free
>  *  mm/tlb: Remove tlb_remove_table() non-concurrent condition
>  *  mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
> mm: mmu_notifier fix for tlb_end_vma
> x86/mm: Only use tlb_remove_table() for paravirt
> 
> where the three starred patches are marked for stable.
> 
> The initial revert is for this merge window only, and the two last
> patches are cleanups and fixes but shouldn't matter for correctness in
> stable.
> 
> PeterZ - your "mm/tlb, x86/mm: Support invalidating TLB caches for
> RCU_TABLE_FREE" patch looks exactly the same, but it now no longer has
> the split of tlb_flush_mmu_tlbonly(), since with Nick's patch to move
> the call to tlb_table_flush(tlb) into tlb_flush_mmu_free, there's no
> need for the separate double-underscore version.
> 
> I hope nothing I did screwed things up. It all looks sane to me.
> Famous last words.
> 
> I'll do a few more test builds and boots, but I think I'm going to
> merge it in this cleaned-up and re-ordered form.

I think the end result looks okay, modulo my build screw up --
at least the generic code. Thanks for putting it together.

Thanks,
Nick


Re: [RFC PATCH 0/2] minor mmu_gather patches

2018-08-23 Thread Nicholas Piggin
On Fri, 24 Aug 2018 00:27:05 +0100
Will Deacon  wrote:

> Hi Linus,
> 
> On Thu, Aug 23, 2018 at 12:37:58PM -0700, Linus Torvalds wrote:
> > On Thu, Aug 23, 2018 at 12:15 PM Linus Torvalds
> >  wrote:  
> > >
> > > So right now my "tlb-fixes" branch looks like this:
> > > [..]
> > >
> > > I'll do a few more test builds and boots, but I think I'm going to
> > > merge it in this cleaned-up and re-ordered form.  
> > 
> > In the meantime, I decided to push out that branch in case anybody
> > wants to look at it.
> > 
> > I may rebase it if I - or anybody else - find anything bad there, so
> > consider it non-stable, but I think it's in its final shape modulo
> > issues.  
> 
> Unfortunately, that branch doesn't build for arm64 because of Nick's patch
> moving tlb_flush_mmu_tlbonly() into tlb.h (which I acked!). It's a static
> inline which calls tlb_flush(), which in our case is also a static inline
> but one that is defined in our asm/tlb.h after including asm-generic/tlb.h.
> 
> Ah, just noticed you've pushed this to master! Please could you take the
> arm64 patch below on top, in order to fix the build?

Sorry yeah I had the sign offs but should have made clear I hadn't fully
build tested them. It's reasonable for reviewers to assume basic
diligence was done and concentrate on the logic rather than build
trivialities. So that's my bad. Thanks for the fix.

Thanks,
Nick

> 
> Cheers,
> 
> Will
> 
> --->8  
> 
> From 48ea452db90a91ff2b62a94277daf565e715a126 Mon Sep 17 00:00:00 2001
> From: Will Deacon 
> Date: Fri, 24 Aug 2018 00:23:04 +0100
> Subject: [PATCH] arm64: tlb: Provide forward declaration of tlb_flush() before
>  including tlb.h
> 
> As of commit fd1102f0aade ("mm: mmu_notifier fix for tlb_end_vma"),
> asm-generic/tlb.h now calls tlb_flush() from a static inline function,
> so we need to make sure that it's declared before #including the
> asm-generic header in the arch header.
> 
> Signed-off-by: Will Deacon 
> ---
>  arch/arm64/include/asm/tlb.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
> index 0ad1cf233470..a3233167be60 100644
> --- a/arch/arm64/include/asm/tlb.h
> +++ b/arch/arm64/include/asm/tlb.h
> @@ -33,6 +33,8 @@ static inline void __tlb_remove_table(void *_table)
>  #define tlb_remove_entry(tlb, entry) tlb_remove_page(tlb, entry)
>  #endif /* CONFIG_HAVE_RCU_TABLE_FREE */
>  
> +static void tlb_flush(struct mmu_gather *tlb);
> +
>  #include 
>  
>  static inline void tlb_flush(struct mmu_gather *tlb)



Re: [PATCH] powernv:idle: Set NAPSTATELOST after recovering paca on P9 DD1

2017-05-12 Thread Nicholas Piggin
On Fri, 12 May 2017 14:52:06 +0530
"Gautham R. Shenoy"  wrote:

> From: "Gautham R. Shenoy" 
> 
> commit 17ed4c8f81da ("powerpc/powernv: Recover correct PACA on wakeup
> from a stop on P9 DD1") promises to set the NAPSTATELOST bit in paca
> after recovering the correct paca for the thread waking up from stop1
> on DD1, so that the GPRs can be correctly restored on the stop exit
> path. However, it loads the value 1 into r3, but stores the value in
> r0 into NAPSTATELOST(r13).
> 
> Fix this by correctly set the NAPSTATELOST bit in paca after
> recovering the paca on POWER9 DD1.
> 
> Fixes: commit 17ed4c8f81da ("powerpc/powernv: Recover correct PACA on wakeup
> from a stop on P9 DD1")
> 
> Signed-off-by: Gautham R. Shenoy 

This looks obviously correct to me.

Reviewed-by: Nicholas Piggin 

> ---
>  arch/powerpc/kernel/idle_book3s.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/idle_book3s.S 
> b/arch/powerpc/kernel/idle_book3s.S
> index 07d4e0a..4898d67 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -416,7 +416,7 @@ power9_dd1_recover_paca:
>* which needs to be restored from the stack.
>*/
>   li  r3, 1
> - stb r0,PACA_NAPSTATELOST(r13)
> + stb r3,PACA_NAPSTATELOST(r13)
>   blr
>  
>  /*



Re: rcu_process_callbacks irqsoff latency caused by taking spinlock with irqs disabled

2018-04-06 Thread Nicholas Piggin
On Thu, 5 Apr 2018 08:53:20 -0700
"Paul E. McKenney"  wrote:

> On Thu, Apr 05, 2018 at 10:45:12AM +1000, Nicholas Piggin wrote:
> > On Wed, 4 Apr 2018 17:13:58 -0700
> > "Paul E. McKenney"  wrote:
> >   
> > > On Thu, Apr 05, 2018 at 09:34:14AM +1000, Nicholas Piggin wrote:  
> > > > Hi Paul,
> > > > 
> > > > Just looking at latencies, and RCU showed up as one of the maximums.
> > > > This is a 2 socket system with (176 CPU threads). Just doing a
> > > > `make -j 352` kernel build. Got a max latency of 3ms. I don't think
> > > > that's anything to worry about really, but I wanted to check the
> > > > cause.
> > > 
> > > Well, that 3 milliseconds would cause serious problems for some 
> > > workloads...  
> > 
> > True.
> >   
> > > > # tracer: irqsoff
> > > > #
> > > > # irqsoff latency trace v1.1.5 on 4.16.0-01530-g43d1859f0994
> > > > # 
> > > > # latency: 3055 us, #19/19, CPU#135 | (M:server VP:0, KP:0, SP:0 HP:0 
> > > > #P:176)
> > > > #-
> > > > #| task: cc1-58689 (uid:1003 nice:0 policy:0 rt_prio:0)
> > > > #-
> > > > #  => started at: rcu_process_callbacks
> > > > #  => ended at:   _raw_spin_unlock_irqrestore
> > > > #
> > > > #
> > > > #  _--=> CPU#
> > > > # / _-=> irqs-off
> > > > #| / _=> need-resched
> > > > #|| / _---=> hardirq/softirq 
> > > > #||| / _--=> preempt-depth   
> > > > # / delay
> > > > #  cmd pid   | time  |   caller  
> > > > # \   /  |  \|   / 
> > > ><...>-58689 135d.s.0us : rcu_process_callbacks
> > > ><...>-58689 135d.s.1us : cpu_needs_another_gp 
> > > > <-rcu_process_callbacks
> > > ><...>-58689 135d.s.2us : rcu_segcblist_future_gp_needed 
> > > > <-cpu_needs_another_gp
> > > ><...>-58689 135d.s.3us#: _raw_spin_lock <-rcu_process_callbacks
> > > ><...>-58689 135d.s. 3047us : rcu_start_gp <-rcu_process_callbacks
> > > ><...>-58689 135d.s. 3048us : rcu_advance_cbs <-rcu_start_gp
> > > ><...>-58689 135d.s. 3049us : rcu_segcblist_pend_cbs <-rcu_advance_cbs
> > > ><...>-58689 135d.s. 3049us : rcu_segcblist_advance <-rcu_advance_cbs
> > > ><...>-58689 135d.s. 3050us : rcu_accelerate_cbs <-rcu_start_gp
> > > ><...>-58689 135d.s. 3050us : rcu_segcblist_pend_cbs 
> > > > <-rcu_accelerate_cbs
> > > ><...>-58689 135d.s. 3051us : rcu_segcblist_accelerate 
> > > > <-rcu_accelerate_cbs
> > > ><...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 
> > > > <-rcu_accelerate_cbs
> > > ><...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 
> > > > <-rcu_accelerate_cbs
> > > ><...>-58689 135d.s. 3053us : rcu_start_gp_advanced.isra.35 
> > > > <-rcu_start_gp
> > > ><...>-58689 135d.s. 3053us : cpu_needs_another_gp 
> > > > <-rcu_start_gp_advanced.isra.35
> > > ><...>-58689 135d.s. 3054us : _raw_spin_unlock_irqrestore 
> > > > <-rcu_process_callbacks
> > > ><...>-58689 135d.s. 3055us : _raw_spin_unlock_irqrestore
> > > ><...>-58689 135d.s. 3056us : trace_hardirqs_on 
> > > > <-_raw_spin_unlock_irqrestore
> > > ><...>-58689 135d.s. 3061us : 
> > > > 
> > > > So it's taking a contended lock with interrupts disabled:
> > > > 
> > > > static void
> > > > __rcu_process_callbacks(struct rcu_state *rsp)
> > > > {
> > > > unsigned long flags;
> > > > bool needwake;
> > > > struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
> > > > 
> > > > WARN_ON_ONCE(!rdp->beenonline);
> > > > 
> > > > /* Update RCU state based on any recent quiescent states. */
> > > > rcu_check_quiescent_state(rsp, rdp);
> > > > 
> > > >   

Re: [PATCH AUTOSEL for 3.18 059/101] x86/um: thin archives build fix

2018-04-08 Thread Nicholas Piggin
On Mon, 9 Apr 2018 00:41:22 +
Sasha Levin  wrote:

> From: Nicholas Piggin 
> 
> [ Upstream commit 827880ec260ba048f95fe646b96a205c394fa0f0 ]
> 
> The linker does not like vdso-syms.lds in input archive files.
> Make it an extra-y instead.

I wouldn't say these should be needed on kernels without the thin
archives build.

It shouldn't hurt, but no point risking stable breakage.

Thanks,
Nick


Occasionally losing the tick_sched_timer

2018-04-09 Thread Nicholas Piggin
We are seeing rare hard lockup watchdog timeouts: a CPU seems to have no
more timers scheduled, even though the hard and soft lockup watchdogs
should have their heartbeat timers queued, and probably many others too.

The reproducer we have is running a KVM workload. The lockup is in the
host kernel, quite rare but we may be able to slowly test things.

I have a sysrq+q snippet. CPU3 is the stuck one; you can see its tick has
stopped for a long time and it has no hrtimers active. I included CPU4 to
show what the other CPUs look like.

Thomas do you have any ideas on what we might look for, or if we can add
some BUG_ON()s to catch this at its source?

- CPU3 is sitting in its cpuidle loop (polling idle with all other idle
  states disabled).

- `taskset -c 3 ls` basically revived the CPU and got timers running again.

- May not be a new bug, because we have only enabled the hard lockup
  detector by default in the past few releases

- KVM is being used. This switches various registers like timebase and
  decrementer. Possibly that's involved, but we can't say the bug does
  not happen without KVM.

cpu: 3
 clock 0:
  .base:   df30f5ab
  .index:  0
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset: 0 nsecs
active timers:
 clock 1:
  .base:   520cc304
  .index:  1
  .resolution: 1 nsecs
  .get_time:   ktime_get_real
  .offset: 1523263049759155857 nsecs
active timers:
 clock 2:
  .base:   706e6277
  .index:  2
  .resolution: 1 nsecs
  .get_time:   ktime_get_boottime
  .offset: 0 nsecs
active timers:
 clock 3:
  .base:   e2ae2811
  .index:  3
  .resolution: 1 nsecs
  .get_time:   ktime_get_clocktai
  .offset: 1523263049759155857 nsecs
active timers:
 clock 4:
  .base:   c93e2f8e
  .index:  4
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset: 0 nsecs
active timers:
 clock 5:
  .base:   7b726c6a
  .index:  5
  .resolution: 1 nsecs
  .get_time:   ktime_get_real
  .offset: 1523263049759155857 nsecs
active timers:
 clock 6:
  .base:   f17c2d4f
  .index:  6
  .resolution: 1 nsecs
  .get_time:   ktime_get_boottime
  .offset: 0 nsecs
active timers:
 clock 7:
  .base:   6c57ef89
  .index:  7
  .resolution: 1 nsecs
  .get_time:   ktime_get_clocktai
  .offset: 1523263049759155857 nsecs
active timers:
  .expires_next   : 9223372036854775807 nsecs
  .hres_active: 1
  .nr_events  : 1446533
  .nr_retries : 1434
  .nr_hangs   : 0
  .max_hang_time  : 0
  .nohz_mode  : 2
  .last_tick  : 1776312000 nsecs
  .tick_stopped   : 1
  .idle_jiffies   : 4296713609
  .idle_calls : 2573133
  .idle_sleeps: 1957794
  .idle_entrytime : 1776312625 nsecs
  .idle_waketime  : 59550238131639 nsecs
  .idle_exittime  : 17763110009176 nsecs
  .idle_sleeptime : 17504617295679 nsecs
  .iowait_sleeptime: 719978688 nsecs
  .last_jiffies   : 4296713608
  .next_timer : 1776313000
  .idle_expires   : 1776313000 nsecs
jiffies: 4300892324

cpu: 4
 clock 0:
  .base:   07d8226b
  .index:  0
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset: 0 nsecs
active timers:
 #0: <a73e543a>, tick_sched_timer, S:01
 # expires at 5955295000-5955295000 nsecs [in 2685654802 to 2685654802 nsecs]
 #1: <9b4a3b88>, hrtimer_wakeup, S:01
 # expires at 59602585423025-59602642458243 nsecs [in 52321077827 to 52378113045 nsecs]
 clock 1:
  .base:   d2ae50c4
  .index:  1
  .resolution: 1 nsecs
  .get_time:   ktime_get_real
  .offset: 1523263049759155857 nsecs
active timers:
 clock 2:
  .base:   1a80e123
  .index:  2
  .resolution: 1 nsecs
  .get_time:   ktime_get_boottime
  .offset: 0 nsecs
active timers:
 clock 3:
  .base:   5c97ab69
  .index:  3
  .resolution: 1 nsecs
  .get_time:   ktime_get_clocktai
  .offset: 1523263049759155857 nsecs
active timers:
 clock 4:
  .base:   75ac8f03
  .index:  4
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset: 0 nsecs
active timers:
 clock 5:
  .base:   db06f6ce
  .index:  5
  .resolution: 1 nsecs
  .get_time:   ktime_get_real
  .offset: 1523263049759155857 nsecs
active timers:
 clock 6:
  .base:   fa63fbce
  .index:  6
  .resolution: 1 nsecs
  .get_time:   ktime_get_boottime
  .offset: 0 nsecs
active timers:
 clock 7:
  .base:   41de439c
  .index:  7
  .resolution: 1 nsecs
  .get_time:   ktime_get_clocktai
  .offset: 1523263049759155857 nsecs
active timers:
  .expires_next   : 5955295000 nsecs
  .hres_active: 1
  .nr_events  : 294282
  .nr_retries : 16138
  .nr_hangs   : 0
  .max_hang_time  : 0
  .nohz_mode  : 2
  .last_tick  : 5954562000 nsecs
  .tick_stopped   : 1
  .idle_jiffies   : 4300891859
  .idle_calls : 553259
  .idle_sleeps: 536396
  .idle_entrytime : 59547990019145 nsecs

Re: Occasionally losing the tick_sched_timer

2018-04-10 Thread Nicholas Piggin
On Tue, 10 Apr 2018 09:42:29 +0200 (CEST)
Thomas Gleixner  wrote:

> Nick,
> 
> On Tue, 10 Apr 2018, Nicholas Piggin wrote:
> > We are seeing rare hard lockup watchdog timeouts, a CPU seems to have no
> > more timers scheduled, despite hard and soft lockup watchdogs should have
> > their heart beat timers and probably many others.
> >
> > The reproducer we have is running a KVM workload. The lockup is in the
> > host kernel, quite rare but we may be able to slowly test things.
> > 
> > I have a sysrq+q snippet. CPU3 is the stuck one, you can see its tick has
> > stopped for a long time and no hrtimer active. Included CPU4 for what the
> > other CPUs look like.
> > 
> > Thomas do you have any ideas on what we might look for, or if we can add
> > some BUG_ON()s to catch this at its source?  
> 
> Not really. Tracing might be a more efficient tool that random BUG_ONs.

Sure, we could try that. Any suggestions? timer events?

> 
> > - CPU3 is sitting in its cpuidle loop (polling idle with all other idle
> >   states disabled).
> > 
> > - `taskset -c 3 ls` basically revived the CPU and got timers running again. 
> >  
> 
> Which is not surprising because that kicks the CPU out of idle and starts
> the tick timer again.

Yep.
 
> Does this restart the watchdog timers as well?

I think so, but now that you ask I'm not 100% sure we directly observed it.
We can check that next time it locks up.

> > cpu: 3
> >  clock 0:
> >   .base:   df30f5ab
> >   .index:  0
> >   .resolution: 1 nsecs
> >   .get_time:   ktime_get
> >   .offset: 0 nsecs
> > active timers:  
> 
> So in theory the soft lockup watchdog hrtimer should be queued here.
> 
> >   .expires_next   : 9223372036854775807 nsecs
> >   .hres_active: 1
> >   .nr_events  : 1446533
> >   .nr_retries : 1434
> >   .nr_hangs   : 0
> >   .max_hang_time  : 0
> >   .nohz_mode  : 2
> >   .last_tick  : 1776312000 nsecs
> >   .tick_stopped   : 1
> >   .idle_jiffies   : 4296713609
> >   .idle_calls : 2573133
> >   .idle_sleeps: 1957794  
> 
> >   .idle_waketime  : 59550238131639 nsecs
> >   .idle_sleeptime : 17504617295679 nsecs
> >   .iowait_sleeptime: 719978688 nsecs
> >   .last_jiffies   : 4296713608  
> 
> So this was the last time when the CPU came out of idle:
> 
> >   .idle_exittime  : 17763110009176 nsecs  
> 
> Here it went back into idle:
> 
> >   .idle_entrytime : 1776312625 nsecs  
> 
> And this was the next timer wheel timer due for expiry:
> 
> >   .next_timer : 1776313000
> >   .idle_expires   : 1776313000 nsecs  
> 
> which makes no sense whatsoever, but this might be stale information. Can't
> tell.

Wouldn't we expect to see that if there is a timer that was missed
somehow because the tick_sched_timer was not set?

> 
> > cpu: 4
> >  clock 0:
> >   .base:   07d8226b
> >   .index:  0
> >   .resolution: 1 nsecs
> >   .get_time:   ktime_get
> >   .offset: 0 nsecs
> > active timers: #0: <a73e543a>, tick_sched_timer, S:01
> >  # expires at 5955295000-5955295000 nsecs [in 
> > 2685654802 to 2685654802 nsecs]  
> 
> The tick timer is scheduled because the next timer wheel timer is due at
> 5955295000, which might be the hard watchdog timer
> 
> >  #1: <9b4a3b88>, hrtimer_wakeup, S:01
> >  # expires at 59602585423025-59602642458243 nsecs [in 
> > 52321077827 to 52378113045 nsecs]  
> 
> That might be the soft lockup hrtimer.
> 
> I'd try to gather more information about the chain of events via tracing
> and stop the tracer once the lockup detector hits.

Okay will do, thanks for taking a look.

Thanks,
Nick


Re: OK to merge via powerpc? (was Re: [PATCH 05/14] mm: make memblock_alloc_base_nid non-static)

2018-03-13 Thread Nicholas Piggin
On Tue, 13 Mar 2018 12:41:28 -0700
Andrew Morton  wrote:

> On Tue, 13 Mar 2018 23:06:35 +1100 Michael Ellerman  
> wrote:
> 
> > Anyone object to us merging the following patch via the powerpc tree?
> > 
> > Full series is here if anyone's interested:
> >   
> > http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=28377&state=*
> >   
> 
> Yup, please go ahead.
> 
> I assume the change to the memblock_alloc_range() declaration was an
> unrelated, unchangelogged cleanup.
> 

It is. I'm trying to get better at that. Michael might drop that bit if
he's not already sick of fixing up my patches...

Thanks,
Nick


Re: mm,tlb: revert 4647706ebeee?

2018-07-07 Thread Nicholas Piggin
On Fri, 06 Jul 2018 13:03:55 -0400
Rik van Riel  wrote:

> Hello,
> 
> It looks like last summer, there were 2 sets of patches
> in flight to fix the issue of simultaneous mprotect/madvise
> calls unmapping PTEs, and some pages not being flushed from
> the TLB before returning to userspace.
> 
> Minchan posted these patches:
> 56236a59556c ("mm: refactor TLB gathering API")
> 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
> 
> Around the same time, Mel posted:
> 4647706ebeee ("mm: always flush VMA ranges affected by zap_page_range")
> 
> They both appear to solve the same bug.
> 
> Only one of the two solutions is needed.
> 
> However, 4647706ebeee appears to introduce extra TLB
> flushes - one per VMA, instead of one over the entire
> range unmapped, and also extra flushes when there are
> no simultaneous unmappers of the same mm.
> 
> For that reason, it seems like we should revert
> 4647706ebeee and keep only Minchan's solution in
> the kernel.
> 
> Am I overlooking any reason why we should not revert
> 4647706ebeee?

Yes I think so. Discussed here recently:

https://marc.info/?l=linux-mm&m=152878780528037&w=2

Actually we realized that powerpc does not implement the mmu
gather flushing quite right so it needs a fix before this
revert. But I propose the revert for the next merge window.

Thanks,
Nick


Re: [PATCH 0/2] mm/fs: put_user_page() proposal

2018-07-09 Thread Nicholas Piggin
On Mon,  9 Jul 2018 01:05:52 -0700
john.hubb...@gmail.com wrote:

> From: John Hubbard 
> 
> Hi,
> 
> With respect to tracking get_user_pages*() pages with page->dma_pinned*
> fields [1], I spent a few days retrofitting most of the get_user_pages*()
> call sites, by adding calls to a new put_user_page() function, in place
> of put_page(), where appropriate. This will work, but it's a large effort.
> 
> Design note: I didn't see anything that hinted at a way to fix this
> problem, without actually changing all of the get_user_pages*() call sites,
> so I think it's reasonable to start with that.
> 
> Anyway, it's still incomplete, but because this is a large, tree-wide
> change (that will take some time and testing), I'd like to propose a plan,
> before spamming zillions of people with put_user_page() conversion patches.
> So I picked out the first two patches to show where this is going.
> 
> Proposed steps:
> 
> Step 1:
> 
> Start with the patches here, then continue with...dozens more.
> This will eventually convert all of the call sites to use put_user_page().
> This is easy in some places, but complex in others, such as:
> 
> -- drivers/gpu/drm/amd
> -- bio
> -- fuse
> -- cifs
> -- anything from:
>git grep  iov_iter_get_pages | cut -f1 -d ':' | sort | uniq
> 
> The easy ones can be grouped into a single patchset, perhaps, and the
> complex ones probably each need a patchset, in order to get the in-depth
> review they'll need.
> 
> Furthermore, some of these areas I hope to attract some help on, once
> this starts going.
> 
> Step 2:
> 
> In parallel, tidy up the core patchset that was discussed in [1], (version
> 2 has already been reviewed, so I know what to do), and get it perfected
> and reviewed. Don't apply it until step 1 is all done, though.
> 
> Step 3:
> 
> Activate refcounting of dma-pinned pages (essentially, patch #5, which is
> [1]), but don't use it yet. Place a few WARN_ON_ONCE calls to start
> mopping up any missed call sites.
> 
> Step 4:
> 
> After some soak time, actually connect it up (patch #6 of [1]) and start
> taking action based on the new page->dma_pinned* fields.

You can use my decade old patch!

https://lkml.org/lkml/2009/2/17/113

The problem with blocking in clear_page_dirty_for_io is that the fs is
holding the page lock (or locks) and possibly others too. If you
expect to have a bunch of long term references hanging around on the
page, then there will be hangs and deadlocks everywhere. And if you do
not have such long term references, then page lock (or some similar lock
bit) for the duration of the DMA should be about enough?

I think it has to be more fundamental to the filesystem. Filesystem
would get callbacks to register such long term dirtying on its files.
Then it can do locking, resource allocation, -ENOTSUPP, etc.
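
Purely to illustrate the shape I mean, something like the following
(hypothetical, none of these hooks exist today):

/* Hypothetical sketch only; invented hook names. */
struct longterm_pin_ops_example {
        /*
         * Called before gup takes long-term references on page cache
         * pages in this file range; the fs can take locks, reserve
         * resources, or return -ENOTSUPP.
         */
        int (*begin_longterm_dirty)(struct file *file, loff_t pos, size_t len);

        /* Called when the last long-term reference on the range is dropped. */
        void (*end_longterm_dirty)(struct file *file, loff_t pos, size_t len);
};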

Thanks,
Nick


Re: mm,tlb: revert 4647706ebeee?

2018-07-09 Thread Nicholas Piggin
On Mon, 9 Jul 2018 17:13:56 -0700
Andrew Morton  wrote:

> On Sun, 8 Jul 2018 01:25:38 +1000 Nicholas Piggin  wrote:
> 
> > On Fri, 06 Jul 2018 13:03:55 -0400
> > Rik van Riel  wrote:
> >   
> > > Hello,
> > > 
> > > It looks like last summer, there were 2 sets of patches
> > > in flight to fix the issue of simultaneous mprotect/madvise
> > > calls unmapping PTEs, and some pages not being flushed from
> > > the TLB before returning to userspace.
> > > 
> > > Minchan posted these patches:
> > > 56236a59556c ("mm: refactor TLB gathering API")
> > > 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
> > > 
> > > Around the same time, Mel posted:
> > > 4647706ebeee ("mm: always flush VMA ranges affected by zap_page_range")
> > > 
> > > They both appear to solve the same bug.
> > > 
> > > Only one of the two solutions is needed.
> > > 
> > > However, 4647706ebeee appears to introduce extra TLB
> > > flushes - one per VMA, instead of one over the entire
> > > range unmapped, and also extra flushes when there are
> > > no simultaneous unmappers of the same mm.
> > > 
> > > For that reason, it seems like we should revert
> > > 4647706ebeee and keep only Minchan's solution in
> > > the kernel.
> > > 
> > > Am I overlooking any reason why we should not revert
> > > 4647706ebeee?  
> > 
> > Yes I think so. Discussed here recently:
> > 
> > https://marc.info/?l=linux-mm&m=152878780528037&w=2  
> 
> Unclear if that was an ack ;)
>

Sure, I'm thinking Rik's mail is an ack for my patch :)

No, actually I think it's okay, but I was in the middle of testing
my series when Aneesh pointed out a bit was missing from powerpc,
so I had to go off and fix that; I think that's upstream now. So I
need to go back and re-test this revert.

Wouldn't hurt for other arch maintainers to have a look I guess
(cc linux-arch):

The problem powerpc had is that mmu_gather flushing will flush a
single page size based on the ptes it encounters when we zap. If
we hit a different page size, it flushes and switches to the new
size. If we have concurrent zaps on the same range, the other
thread may have cleared a large page pte so we won't see that and
will only do a small page flush for that range. Which means we can
return before the other thread invalidated our TLB for the large
pages in the range we wanted to flush.

I suspect most arches are probably okay, but if you make any TLB
flush choices based on the pte contents, then you could be exposed.
Except in the case of archs like sparc and powerpc/hash which do
the flushing in arch_leave_lazy_mmu_mode(), because that is called
under the same page table lock, so there can't be concurrent zap.
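
To illustrate the page size switching I mean, simplified from the real
tlb->page_size handling (treat this as a sketch, not the exact code):

static inline void example_note_page_size(struct mmu_gather *tlb,
                                          unsigned int page_size)
{
        if (tlb->page_size == 0) {
                tlb->page_size = page_size;
        } else if (tlb->page_size != page_size) {
                /* Flush what was gathered at the old size, then switch. */
                tlb_flush_mmu(tlb);
                tlb->page_size = page_size;
        }
}

The race is that a concurrent zap can clear the huge page ptes first, so
this thread only ever sees small page ptes, its final flush is sized for
small pages, and the huge TLB entries the other thread relied on us to
invalidate can survive after both threads return.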

A quick look through the archs doesn't show anything obvious, but
please take a look at your arch.

And I'll try to do a bit more testing.

Thanks,
Nick


Re: [PATCH tip/core/rcu 0/21] Contention reduction for v4.18

2018-05-13 Thread Nicholas Piggin
On Sun, 22 Apr 2018 20:02:58 -0700
"Paul E. McKenney"  wrote:

> Hello!
> 
> This series reduces lock contention on the root rcu_node structure,
> and is also the first precursor to TBD changes to consolidate the
> three RCU flavors (RCU-bh, RCU-preempt, and RCU-sched) into one.

Hi Paul,

I've been running your rcu/dev branch and haven't noticed any problems
yet. The irqsoff latency improvement is a little hard to measure
because the scheduler, but I've tried turning balancing parameters
right down and I'm yet to see any sign of RCU in traces (down to about
100us on a 176 CPU machine), so that's great.

(Not that RCU was ever the worst contributor to latency as I said, just
that I noticed those couple of traces where it showed up.)

Thanks very much for the fast response, sorry I've taken a while to
test.

Thanks,
Nick


Re: [PATCH v4 00/31] kconfig: move compiler capability tests to Kconfig

2018-05-17 Thread Nicholas Piggin
On Thu, 17 May 2018 15:16:39 +0900
Masahiro Yamada  wrote:

> [Introduction]
> 
> The motivation of this work is to move the compiler option tests to
> Kconfig from Makefile.  A number of kernel features require the
> compiler support.  Enabling such features blindly in Kconfig ends up
> with a lot of nasty build-time testing in Makefiles.  If a chosen
> feature turns out unsupported by the compiler, what the build system
> can do is either to disable it (silently!) or to forcibly break the
> build, despite Kconfig has let the user to enable it.  By moving the
> compiler capability tests to Kconfig, Kconfig entries will be visible
> only when they are available.
> 
> [Major Changes in V4]

Do you have a git tree for v4? I can test it with the powerpc patches.

The new scripting capability in kconfig has allowed us to already
improve the config process on powerpc:

https://marc.info/?l=linuxppc-embedded&m=152648110727868&w=2

I'm sure there's more clever things we can do with it but I haven't
had time to think about it yet. One thing that comes to mind is that
it might be nice to show the option as disabled; then the user could
upgrade their compiler to get the options they want.

Anyway v3 worked fine for me, the documentation is really nice, I
could implement the above patch without any problem despite being a
kbuild dummy. Thanks for the series, ack from me.

Thanks,
Nick


Re: [PATCH] powerpc/lib: Remove .balign inside string functions for PPC32

2018-05-17 Thread Nicholas Piggin
On Thu, 17 May 2018 12:04:13 +0200 (CEST)
Christophe Leroy  wrote:

> commit 87a156fb18fe1 ("Align hot loops of some string functions")
> degraded the performance of string functions by adding useless
> nops
> 
> A simple benchmark on an 8xx calling 10x a memchr() that
> matches the first byte runs in 41668 TB ticks before this patch
> and in 35986 TB ticks after this patch. So this gives an
> improvement of approx 10%
> 
> Another benchmark doing the same with a memchr() matching the 128th
> byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
> after this patch, so regardless on the number of loops, removing
> those useless nops improves the test by 5683 TB ticks.
> 
> Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
> Signed-off-by: Christophe Leroy 
> ---
>  Was sent already as part of a serie optimising string functions.
>  Resending on itself as it is independent of the other changes in the
> serie
> 
>  arch/powerpc/lib/string.S | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
> index a787776822d8..a026d8fa8a99 100644
> --- a/arch/powerpc/lib/string.S
> +++ b/arch/powerpc/lib/string.S
> @@ -23,7 +23,9 @@ _GLOBAL(strncpy)
>   mtctr   r5
>   addir6,r3,-1
>   addir4,r4,-1
> +#ifdef CONFIG_PPC64
>   .balign 16
> +#endif
>  1:   lbzur0,1(r4)
>   cmpwi   0,r0,0
>   stbur0,1(r6)

The ifdefs are a bit ugly, but you can't argue with the numbers. These
alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
the ifetch performance when you have such a loop (although there is
always a tradeoff for a single iteration).

Would it make sense to define that for 32-bit as well, and you could use
it here instead of the ifdefs? Small CPUs could just use 0.
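
Roughly what I have in mind for asm/cache.h, as a sketch (the 64-bit
definitions already exist, the 32-bit values here are only illustrative):

#ifdef CONFIG_PPC64
#define IFETCH_ALIGN_SHIFT      4       /* existing 64-bit style value */
#elif defined(CONFIG_PPC_8xx)
#define IFETCH_ALIGN_SHIFT      0       /* small core: no padding */
#else
#define IFETCH_ALIGN_SHIFT      2       /* illustrative 32-bit default */
#endif
#define IFETCH_ALIGN_BYTES      (1 << IFETCH_ALIGN_SHIFT)

Then string.S could use ".balign IFETCH_ALIGN_BYTES" unconditionally,
and a shift of 0 makes it a no-op on the small cores.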

Thanks,
Nick


Re: [PATCH v6] powerpc/mm: Only read faulting instruction when necessary in do_page_fault()

2018-05-17 Thread Nicholas Piggin
On Thu, 17 May 2018 12:59:29 +0200 (CEST)
Christophe Leroy  wrote:

> Commit a7a9dcd882a67 ("powerpc: Avoid taking a data miss on every
> userspace instruction miss") has shown that limiting the read of
> faulting instruction to likely cases improves performance.
> 
> This patch goes further into this direction by limiting the read
> of the faulting instruction to the only cases where it is definitly
> needed.
> 
> On an MPC885, with the same benchmark app as in the commit referred
> above, we see a reduction of 4000 dTLB misses (approx 3%):
> 
> Before the patch:
>  Performance counter stats for './fault 500' (10 runs):
> 
>  720495838  cpu-cycles( +-  0.04% )
> 141769  dTLB-load-misses  ( +-  0.02% )
>  52722  iTLB-load-misses  ( +-  0.01% )
>  19611  faults( +-  0.02% )
> 
>5.750535176 seconds time elapsed   ( +-  0.16% )
> 
> With the patch:
>  Performance counter stats for './fault 500' (10 runs):
> 
>  717669123  cpu-cycles( +-  0.02% )
> 137344  dTLB-load-misses  ( +-  0.03% )
>  52731  iTLB-load-misses  ( +-  0.01% )
>  19614  faults( +-  0.03% )
> 
>5.728423115 seconds time elapsed   ( +-  0.14% )
> 
> The proper work of the huge stack expansion was tested with the
> following app:
> 
> int main(int argc, char **argv)
> {
>   char buf[1024 * 1025];
> 
>   sprintf(buf, "Hello world !\n");
>   printf(buf);
> 
>   exit(0);
> }
> 
> Signed-off-by: Christophe Leroy 
> ---
>  v6: Rebased on latest powerpc/merge branch ; Using __get_user_inatomic() 
> instead of get_user() in order
>  to move it inside the semaphored area. That removes all the complexity 
> of the patch.
> 
>  v5: Reworked to fit after Benh do_fault improvement and rebased on top of 
> powerpc/merge (65152902e43fef)
> 
>  v4: Rebased on top of powerpc/next (f718d426d7e42e) and doing access_ok() 
> verification before __get_user_xxx()
> 
>  v3: Do a first try with pagefault disabled before releasing the semaphore
> 
>  v2: Changes 'if (cond1) if (cond2)' by 'if (cond1 && cond2)'
> 
>  arch/powerpc/mm/fault.c | 28 ++--
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index c01d627e687a..a7d5cc76a8ce 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -72,8 +72,18 @@ static inline bool notify_page_fault(struct pt_regs *regs)
>  static bool store_updates_sp(struct pt_regs *regs)
>  {
>   unsigned int inst;
> + int ret;
>  
> - if (get_user(inst, (unsigned int __user *)regs->nip))
> + /*
> +  * Using get_user_in_atomic() as reading code around nip can result in
> +  * fault, which may cause a deadlock when called with mmap_sem held,
> +  * however since we are reading the instruction that generated the DSI
> +  * we are handling, the page is necessarily already present.
> +  */
> + pagefault_disable();
> + ret = __get_user_inatomic(inst, (unsigned int __user *)regs->nip);
> + pagefault_enable();
> + if (ret)
>   return false;

Problem is that the page can be removed from page tables between
taking the fault and reading the address here.

That case would be so rare that it should be fine to do a big hammer
fix like dropping the mmap_sem, doing a fault_in_pages_readable(), and
then restarting from taking the mmap_sem again.
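
Something along these lines is the shape I mean, as an untested sketch
of the fragment inside __do_page_fault() (it assumes the retry label
re-takes mmap_sem and redoes the vma lookup):

        pagefault_disable();
        ret = __get_user_inatomic(inst, nip);
        pagefault_enable();
        if (unlikely(ret)) {
                /*
                 * Very rare: the text page went away under us.  Drop
                 * mmap_sem, fault the instruction page back in, and
                 * restart from taking mmap_sem again.
                 */
                up_read(&mm->mmap_sem);
                if (fault_in_pages_readable((const char __user *)nip,
                                            sizeof(inst)))
                        return bad_area_nosemaphore(regs, address);
                goto retry;
        }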

Thanks,
Nick


Re: [Skiboot] [PATCH] opal/hmi: Wakeup the cpu before reading core_fir

2018-08-20 Thread Nicholas Piggin
On Mon, 20 Aug 2018 19:36:05 +0530
Vaibhav Jain  wrote:

> When stop state 5 is enabled, reading the core_fir during an HMI can
> result in a xscom read error with xscom_read() returning the
> OPAL_XSCOM_PARTIAL_GOOD error code and core_fir value of all FFs. At
> present this return error code is not handled in decode_core_fir()
> hence the invalid core_fir value is sent to the kernel where it
> interprets it as a FATAL hmi causing a system check-stop.
> 
> This can be prevented by forcing the core to wake-up before
> reading the core_fir. Hence this patch wraps the call to
> read_core_fir() within calls to dctl_set_special_wakeup() and
> dctl_clear_special_wakeup().

Would it be feasible to enumerate the ranges of scoms that require
special wakeup and check for those in xscom_read/write, and warn if
special wakeup was not set?
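
Something like the following is what I'm imagining, purely as a sketch:
the range values are made up and core_has_special_wakeup() is an assumed
helper, not an existing skiboot function.

struct spwkup_scom_range {
        uint64_t start;
        uint64_t end;
};

static const struct spwkup_scom_range spwkup_scom_ranges[] = {
        { 0x20010A40ull, 0x20010A5Full },       /* illustrative range only */
};

static void xscom_warn_if_no_spwkup(struct cpu_thread *cpu, uint64_t pcb_addr)
{
        unsigned int i;

        for (i = 0; i < ARRAY_SIZE(spwkup_scom_ranges); i++) {
                if (pcb_addr >= spwkup_scom_ranges[i].start &&
                    pcb_addr <= spwkup_scom_ranges[i].end &&
                    !core_has_special_wakeup(cpu))      /* assumed helper */
                        prlog(PR_WARNING,
                              "XSCOM: 0x%llx accessed without special wakeup\n",
                              (unsigned long long)pcb_addr);
        }
}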

Thanks,
Nick

> 
> Suggested-by: Michael Neuling 
> Signed-off-by: Vaibhav Jain 
> Signed-off-by: Mahesh J Salgaonkar 
> ---
>  core/hmi.c | 25 +
>  1 file changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/core/hmi.c b/core/hmi.c
> index 1f665a2f..67c520a0 100644
> --- a/core/hmi.c
> +++ b/core/hmi.c
> @@ -379,7 +379,7 @@ static bool decode_core_fir(struct cpu_thread *cpu,
>  {
>   uint64_t core_fir;
>   uint32_t core_id;
> - int i;
> + int i, swkup_rc = OPAL_UNSUPPORTED;
>   bool found = false;
>   int64_t ret;
>   const char *loc;
> @@ -390,14 +390,15 @@ static bool decode_core_fir(struct cpu_thread *cpu,
>  
>   core_id = pir_to_core_id(cpu->pir);
>  
> + /* Force the core to wakeup, otherwise reading core_fir is unrealiable
> +  * if stop-state 5 is enabled.
> +  */
> + swkup_rc = dctl_set_special_wakeup(cpu);
> +
>   /* Get CORE FIR register value. */
>   ret = read_core_fir(cpu->chip_id, core_id, &core_fir);
>  
> - if (ret == OPAL_HARDWARE) {
> - prerror("XSCOM error reading CORE FIR\n");
> - /* If the FIR can't be read, we should checkstop. */
> - return true;
> - } else if (ret == OPAL_WRONG_STATE) {
> + if (ret == OPAL_WRONG_STATE) {
>   /*
>* CPU is asleep, so it probably didn't cause the checkstop.
>* If no other HMI cause is found a "catchall" checkstop
> @@ -407,11 +408,16 @@ static bool decode_core_fir(struct cpu_thread *cpu,
>   prlog(PR_DEBUG,
> "FIR read failed, chip %d core %d asleep\n",
> cpu->chip_id, core_id);
> - return false;
> + goto out;
> + } else if (ret != OPAL_SUCCESS) {
> + prerror("XSCOM error reading CORE FIR\n");
> + /* If the FIR can't be read, we should checkstop. */
> + found = true;
> + goto out;
>   }
>  
>   if (!core_fir)
> - return false;
> + goto out;
>  
>   loc = chip_loc_code(cpu->chip_id);
>   prlog(PR_INFO, "[Loc: %s]: CHIP ID: %x, CORE ID: %x, FIR: %016llx\n",
> @@ -426,6 +432,9 @@ static bool decode_core_fir(struct cpu_thread *cpu,
>   |= xstop_bits[i].reason;
>   }
>   }
> +out:
> + if (!swkup_rc)
> + dctl_clear_special_wakeup(cpu);
>   return found;
>  }
>  



Re: [RFC PATCH v4 08/17] powerpc/ftrace: Move ftrace stub used for init text before _einittext

2024-07-14 Thread Nicholas Piggin
On Sun Jul 14, 2024 at 6:27 PM AEST, Naveen N Rao wrote:
> Move the ftrace stub used to cover inittext before _einittext so that it
> is within kernel text, as seen through core_kernel_text(). This is
> required for a subsequent change to ftrace.

Hmm, is there a reason it was outside einittext anyway?

Does it do anything else? Other than symbols, on some 32-bit platforms
it looks like it could change some of the initial mapping/pinning. Maybe
they just get lucky and always map it before the change anyway?

It looks like the right thing to do even without the subsequent ftrace
change though.

Thanks,
Nick

>
> Signed-off-by: Naveen N Rao 
> ---
>  arch/powerpc/kernel/vmlinux.lds.S | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
> b/arch/powerpc/kernel/vmlinux.lds.S
> index f420df7888a7..0aef9959f2cd 100644
> --- a/arch/powerpc/kernel/vmlinux.lds.S
> +++ b/arch/powerpc/kernel/vmlinux.lds.S
> @@ -267,14 +267,13 @@ SECTIONS
>   .init.text : AT(ADDR(.init.text) - LOAD_OFFSET) {
>   _sinittext = .;
>   INIT_TEXT
> -
> + *(.tramp.ftrace.init);
>   /*
>*.init.text might be RO so we must ensure this section ends on
>* a page boundary.
>*/
>   . = ALIGN(PAGE_SIZE);
>   _einittext = .;
> - *(.tramp.ftrace.init);
>   } :text
>  
>   /* .exit.text is discarded at runtime, not link time,




Re: [RFC PATCH v4 11/17] kbuild: Add generic hook for architectures to use before the final vmlinux link

2024-07-15 Thread Nicholas Piggin
On Sun Jul 14, 2024 at 6:27 PM AEST, Naveen N Rao wrote:
> On powerpc, we would like to be able to make a pass on vmlinux.o and
> generate a new object file to be linked into vmlinux. Add a generic pass
> in Makefile.vmlinux that architectures can use for this purpose.
>
> Architectures need to select CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX and must
> provide arch//tools/Makefile with .arch.vmlinux.o target, which
> will be invoked prior to the final vmlinux link step.

Maybe POSTLINK should move to something more like this, with an explicit
config option too, rather than just picking up Makefile.postlink...

>
> Signed-off-by: Naveen N Rao 
> ---
>  arch/Kconfig |  6 ++
>  scripts/Makefile.vmlinux |  8 
>  scripts/link-vmlinux.sh  | 11 ---
>  3 files changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 975dd22a2dbd..ef868ff8156a 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1643,4 +1643,10 @@ config CC_HAS_SANE_FUNCTION_ALIGNMENT
>  config ARCH_NEED_CMPXCHG_1_EMU
>   bool
>  
> +config ARCH_WANTS_PRE_LINK_VMLINUX
> + def_bool n
> + help
> +   An architecture can select this if it provides 
> arch//tools/Makefile
> +   with .arch.vmlinux.o target to be linked into vmlinux.

Someone bikeshedded me before for putting a comment for a
non-user-selectable option in 'help'. Even though heaps of options here
are like that, apparently they preferred a # comment above the option
for developer comments. I personally think this looks nicer, but do
Kconfig maintainers prefer #?

> +
>  endmenu
> diff --git a/scripts/Makefile.vmlinux b/scripts/Makefile.vmlinux
> index 49946cb96844..6410e0be7f52 100644
> --- a/scripts/Makefile.vmlinux
> +++ b/scripts/Makefile.vmlinux
> @@ -22,6 +22,14 @@ targets += .vmlinux.export.o
>  vmlinux: .vmlinux.export.o
>  endif
>  
> +ifdef CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX
> +targets += .arch.vmlinux.o
> +.arch.vmlinux.o: vmlinux.o FORCE
> + $(Q)$(MAKE) $(build)=arch/$(SRCARCH)/tools .arch.vmlinux.o
> +
> +vmlinux: .arch.vmlinux.o
> +endif

Does .vmlinux.arch.o follow convention better? I guess the btf does
not. So, nevermind.

Could this just be done entirely in link-vmlinux.sh like kallsyms and
btf?

> +
>  ARCH_POSTLINK := $(wildcard $(srctree)/arch/$(SRCARCH)/Makefile.postlink)
>  
>  # Final link of vmlinux with optional arch pass after final link
> diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
> index 518c70b8db50..aafaed1412ea 100755
> --- a/scripts/link-vmlinux.sh
> +++ b/scripts/link-vmlinux.sh
> @@ -122,7 +122,7 @@ gen_btf()
>   return 1
>   fi
>  
> - vmlinux_link ${1}
> + vmlinux_link ${1} ${arch_vmlinux_o}
>  
>   info "BTF" ${2}
>   LLVM_OBJCOPY="${OBJCOPY}" ${PAHOLE} -J ${PAHOLE_FLAGS} ${1}

BTF generation needs the prelink .o?

> @@ -178,7 +178,7 @@ kallsyms_step()
>   kallsymso=${kallsyms_vmlinux}.o
>   kallsyms_S=${kallsyms_vmlinux}.S
>  
> - vmlinux_link ${kallsyms_vmlinux} "${kallsymso_prev}" 
> ${btf_vmlinux_bin_o}
> + vmlinux_link ${kallsyms_vmlinux} "${kallsymso_prev}" 
> ${btf_vmlinux_bin_o} ${arch_vmlinux_o}
>   mksysmap ${kallsyms_vmlinux} ${kallsyms_vmlinux}.syms
>   kallsyms ${kallsyms_vmlinux}.syms ${kallsyms_S}
>  

> @@ -223,6 +223,11 @@ fi
>  
>  ${MAKE} -f "${srctree}/scripts/Makefile.build" obj=init 
> init/version-timestamp.o
>  
> +arch_vmlinux_o=""
> +if is_enabled CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX; then
> + arch_vmlinux_o=.arch.vmlinux.o
> +fi
> +
>  btf_vmlinux_bin_o=""
>  if is_enabled CONFIG_DEBUG_INFO_BTF; then
>   btf_vmlinux_bin_o=.btf.vmlinux.bin.o
> @@ -273,7 +278,7 @@ if is_enabled CONFIG_KALLSYMS; then
>   fi
>  fi
>  
> -vmlinux_link vmlinux "${kallsymso}" ${btf_vmlinux_bin_o}
> +vmlinux_link vmlinux "${kallsymso}" ${btf_vmlinux_bin_o} ${arch_vmlinux_o}
>  
>  # fill in BTF IDs
>  if is_enabled CONFIG_DEBUG_INFO_BTF && is_enabled CONFIG_BPF; then

I guess it looks okay, similar to btf although I'm not a kbuild expert.

Thanks,
Nick



Re: [RFC PATCH v4 12/17] powerpc64/ftrace: Move ftrace sequence out of line

2024-07-15 Thread Nicholas Piggin
On Sun Jul 14, 2024 at 6:27 PM AEST, Naveen N Rao wrote:
> Function profile sequence on powerpc includes two instructions at the
> beginning of each function:
>   mflrr0
>   bl  ftrace_caller
>
> The call to ftrace_caller() gets nop'ed out during kernel boot and is
> patched in when ftrace is enabled.
>
> Given the sequence, we cannot return from ftrace_caller with 'blr' as we
> need to keep LR and r0 intact. This results in link stack (return
> address predictor) imbalance when ftrace is enabled. To address that, we
> would like to use a three instruction sequence:
>   mflrr0
>   bl  ftrace_caller
>   mtlrr0
>
> Further more, to support DYNAMIC_FTRACE_WITH_CALL_OPS, we need to
> reserve two instruction slots before the function. This results in a
> total of five instruction slots to be reserved for ftrace use on each
> function that is traced.
>
> Move the function profile sequence out-of-line to minimize its impact.
> To do this, we reserve a single nop at function entry using
> -fpatchable-function-entry=1 and add a pass on vmlinux.o to determine
> the total number of functions that can be traced. This is then used to
> generate a .S file reserving the appropriate amount of space for use as
> ftrace stubs, which is built and linked into vmlinux.

These are all going into .tramp.ftrace.text AFAIKS? Should that then be
moved after some of the other text in the linker script, if it could
get quite large? Sched, lock, and other things should be closer to the
rest of the text and hot code.

Thanks,
Nick



Re: [RFC PATCH v4 13/17] powerpc64/ftrace: Support .text larger than 32MB with out-of-line stubs

2024-07-15 Thread Nicholas Piggin
On Sun Jul 14, 2024 at 6:27 PM AEST, Naveen N Rao wrote:
> We are restricted to a .text size of ~32MB when using out-of-line
> function profile sequence. Allow this to be extended up to the previous
> limit of ~64MB by reserving space in the middle of .text.
>
> A new config option CONFIG_PPC_FTRACE_OUT_OF_LINE_NUM_RESERVE is
> introduced to specify the number of function stubs that are reserved in
> .text. On boot, ftrace utilizes stubs from this area first before using
> the stub area at the end of .text.

[snip]

> diff --git a/arch/powerpc/kernel/trace/ftrace_entry.S 
> b/arch/powerpc/kernel/trace/ftrace_entry.S
> index 71f6a63cd861..86dbaa87532a 100644
> --- a/arch/powerpc/kernel/trace/ftrace_entry.S
> +++ b/arch/powerpc/kernel/trace/ftrace_entry.S
> @@ -374,6 +374,14 @@ _GLOBAL(return_to_handler)
>   blr
>  #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
>  
> +#ifdef CONFIG_PPC_FTRACE_OUT_OF_LINE
> +SYM_DATA(ftrace_ool_stub_text_count, .long 
> CONFIG_PPC_FTRACE_OUT_OF_LINE_NUM_RESERVE)
> +
> +SYM_CODE_START(ftrace_ool_stub_text)
> + .space CONFIG_PPC_FTRACE_OUT_OF_LINE_NUM_RESERVE * FTRACE_OOL_STUB_SIZE
> +SYM_CODE_END(ftrace_ool_stub_text)
> +#endif
> +
>  .pushsection ".tramp.ftrace.text","aw",@progbits;
>  .globl ftrace_tramp_text
>  ftrace_tramp_text:

How are you ensuring these new stubs get to the middle of kernel text? I
guess you just put it in regular .text and hope the linker puts it
in a good place?

Thanks,
Nick



Re: [PATCH v7 2/3] powerpc/mm: Only read faulting instruction when necessary in do_page_fault()

2018-05-22 Thread Nicholas Piggin
On Tue, 22 May 2018 16:02:56 +0200 (CEST)
Christophe Leroy  wrote:

> Commit a7a9dcd882a67 ("powerpc: Avoid taking a data miss on every
> userspace instruction miss") has shown that limiting the read of
> faulting instruction to likely cases improves performance.
> 
> This patch goes further into this direction by limiting the read
> of the faulting instruction to the only cases where it is likely
> needed.
> 
> On an MPC885, with the same benchmark app as in the commit referred
> above, we see a reduction of about 3900 dTLB misses (approx 3%):
> 
> Before the patch:
>  Performance counter stats for './fault 500' (10 runs):
> 
>  683033312  cpu-cycles
> ( +-  0.03% )
> 134538  dTLB-load-misses  
> ( +-  0.03% )
>  46099  iTLB-load-misses  
> ( +-  0.02% )
>  19681  faults
> ( +-  0.02% )
> 
>5.389747878 seconds time elapsed   
>( +-  0.06% )
> 
> With the patch:
> 
>  Performance counter stats for './fault 500' (10 runs):
> 
>  682112862  cpu-cycles
> ( +-  0.03% )
> 130619  dTLB-load-misses  
> ( +-  0.03% )
>  46073  iTLB-load-misses  
> ( +-  0.05% )
>  19681  faults
> ( +-  0.01% )
> 
>5.381342641 seconds time elapsed   
>( +-  0.07% )
> 
> The proper work of the huge stack expansion was tested with the
> following app:
> 
> int main(int argc, char **argv)
> {
>   char buf[1024 * 1025];
> 
>   sprintf(buf, "Hello world !\n");
>   printf(buf);
> 
>   exit(0);
> }
> 
> Signed-off-by: Christophe Leroy 
> ---
>  v7: Following comment from Nicholas on v6 on possibility of the page getting 
> removed from the pagetables
>  between the fault and the read, I have reworked the patch in order to do 
> the get_user() in
>  __do_page_fault() directly in order to reduce complexity compared to 
> version v5

This is looking better, thanks.

> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index fcbb34431da2..dc64b8e06477 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -450,9 +450,6 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
> long address,
>* can result in fault, which will cause a deadlock when called with
>* mmap_sem held
>*/
> - if (is_write && is_user)
> - get_user(inst, (unsigned int __user *)regs->nip);
> -
>   if (is_user)
>   flags |= FAULT_FLAG_USER;
>   if (is_write)
> @@ -498,6 +495,26 @@ static int __do_page_fault(struct pt_regs *regs, 
> unsigned long address,
>   if (unlikely(!(vma->vm_flags & VM_GROWSDOWN)))
>   return bad_area(regs, address);
>  
> + if (unlikely(is_write && is_user && address + 0x10 < vma->vm_end &&
> +  !inst)) {
> + unsigned int __user *nip = (unsigned int __user *)regs->nip;
> +
> + if (likely(access_ok(VERIFY_READ, nip, sizeof(inst {
> + int res;
> +
> + pagefault_disable();
> + res = __get_user_inatomic(inst, nip);
> + pagefault_enable();
> + if (unlikely(res)) {
> + up_read(&mm->mmap_sem);
> + res = __get_user(inst, nip);
> + if (!res && inst)
> + goto retry;

You're handling the error here, but the previous code did not?

> + return bad_area_nosemaphore(regs, address);
> + }
> + }
> + }

Would it be nicer to move all this up into bad_stack_expansion()?
It would need a way to handle the retry and insn, but I think it
would still look better.

Thanks,
Nick


Re: [PATCH v7 2/3] powerpc/mm: Only read faulting instruction when necessary in do_page_fault()

2018-05-22 Thread Nicholas Piggin
On Tue, 22 May 2018 16:50:55 +0200
Christophe LEROY  wrote:

> Le 22/05/2018 à 16:38, Nicholas Piggin a écrit :
> > On Tue, 22 May 2018 16:02:56 +0200 (CEST)
> > Christophe Leroy  wrote:
> >   
> >> Commit a7a9dcd882a67 ("powerpc: Avoid taking a data miss on every
> >> userspace instruction miss") has shown that limiting the read of
> >> faulting instruction to likely cases improves performance.
> >>
> >> This patch goes further into this direction by limiting the read
> >> of the faulting instruction to the only cases where it is likely
> >> needed.
> >>
> >> On an MPC885, with the same benchmark app as in the commit referred
> >> above, we see a reduction of about 3900 dTLB misses (approx 3%):
> >>
> >> Before the patch:
> >>   Performance counter stats for './fault 500' (10 runs):
> >>
> >>   683033312  cpu-cycles
> >> ( +-  0.03% )
> >>  134538  dTLB-load-misses  
> >> ( +-  0.03% )
> >>   46099  iTLB-load-misses  
> >> ( +-  0.02% )
> >>   19681  faults
> >> ( +-  0.02% )
> >>
> >> 5.389747878 seconds time elapsed   
> >>( +-  0.06% )
> >>
> >> With the patch:
> >>
> >>   Performance counter stats for './fault 500' (10 runs):
> >>
> >>   682112862  cpu-cycles
> >> ( +-  0.03% )
> >>  130619  dTLB-load-misses  
> >> ( +-  0.03% )
> >>   46073  iTLB-load-misses  
> >> ( +-  0.05% )
> >>   19681  faults
> >> ( +-  0.01% )
> >>
> >> 5.381342641 seconds time elapsed   
> >>( +-  0.07% )
> >>
> >> The proper work of the huge stack expansion was tested with the
> >> following app:
> >>
> >> int main(int argc, char **argv)
> >> {
> >>char buf[1024 * 1025];
> >>
> >>sprintf(buf, "Hello world !\n");
> >>printf(buf);
> >>
> >>exit(0);
> >> }
> >>
> >> Signed-off-by: Christophe Leroy 
> >> ---
> >>   v7: Following comment from Nicholas on v6 on possibility of the page 
> >> getting removed from the pagetables
> >>   between the fault and the read, I have reworked the patch in order 
> >> to do the get_user() in
> >>   __do_page_fault() directly in order to reduce complexity compared to 
> >> version v5  
> > 
> > This is looking better, thanks.
> >   
> >> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> >> index fcbb34431da2..dc64b8e06477 100644
> >> --- a/arch/powerpc/mm/fault.c
> >> +++ b/arch/powerpc/mm/fault.c
> >> @@ -450,9 +450,6 @@ static int __do_page_fault(struct pt_regs *regs, 
> >> unsigned long address,
> >> * can result in fault, which will cause a deadlock when called with
> >> * mmap_sem held
> >> */
> >> -  if (is_write && is_user)
> >> -  get_user(inst, (unsigned int __user *)regs->nip);
> >> -
> >>if (is_user)
> >>flags |= FAULT_FLAG_USER;
> >>if (is_write)
> >> @@ -498,6 +495,26 @@ static int __do_page_fault(struct pt_regs *regs, 
> >> unsigned long address,
> >>if (unlikely(!(vma->vm_flags & VM_GROWSDOWN)))
> >>return bad_area(regs, address);
> >>   
> >> +  if (unlikely(is_write && is_user && address + 0x10 < vma->vm_end &&
> >> +   !inst)) {
> >> +  unsigned int __user *nip = (unsigned int __user *)regs->nip;
> >> +
> >> +  if (likely(access_ok(VERIFY_READ, nip, sizeof(inst)))) {
> >> +  int res;
> >> +
> >> +  pagefault_disable();
> >> +  res = __get_user_inatomic(inst, nip);
> >> +  pagefault_enable();
> >> +  if (unlikely(res))

Re: [PATCH v8] powerpc/mm: Only read faulting instruction when necessary in do_page_fault()

2018-05-23 Thread Nicholas Piggin
On Wed, 23 May 2018 09:01:19 +0200 (CEST)
Christophe Leroy  wrote:

> Commit a7a9dcd882a67 ("powerpc: Avoid taking a data miss on every
> userspace instruction miss") has shown that limiting the read of
> faulting instruction to likely cases improves performance.
> 
> This patch goes further into this direction by limiting the read
> of the faulting instruction to the only cases where it is likely
> needed.
> 
> On an MPC885, with the same benchmark app as in the commit referred
> above, we see a reduction of about 3900 dTLB misses (approx 3%):
> 
> Before the patch:
>  Performance counter stats for './fault 500' (10 runs):
> 
>  683033312  cpu-cycles
> ( +-  0.03% )
> 134538  dTLB-load-misses  
> ( +-  0.03% )
>  46099  iTLB-load-misses  
> ( +-  0.02% )
>  19681  faults
> ( +-  0.02% )
> 
>5.389747878 seconds time elapsed   
>( +-  0.06% )
> 
> With the patch:
> 
>  Performance counter stats for './fault 500' (10 runs):
> 
>  682112862  cpu-cycles
> ( +-  0.03% )
> 130619  dTLB-load-misses  
> ( +-  0.03% )
>  46073  iTLB-load-misses  
> ( +-  0.05% )
>  19681  faults
> ( +-  0.01% )
> 
>5.381342641 seconds time elapsed   
>( +-  0.07% )
> 
> The proper work of the huge stack expansion was tested with the
> following app:
> 
> int main(int argc, char **argv)
> {
>   char buf[1024 * 1025];
> 
>   sprintf(buf, "Hello world !\n");
>   printf(buf);
> 
>   exit(0);
> }
> 
> Signed-off-by: Christophe Leroy 
> ---
>  v8: Back to a single patch as it now makes no sense to split the first part 
> in two. The third patch has no
>  dependencies on the ones before, so it will be resent independently. 
> As suggested by Nicholas, the
>  patch now does the get_user() stuff inside bad_stack_expansion(), that's 
> a mid way between v5 and v7.
> 
>  v7: Following comment from Nicholas on v6 on possibility of the page getting 
> removed from the pagetables
>  between the fault and the read, I have reworked the patch in order to do 
> the get_user() in
>  __do_page_fault() directly in order to reduce complexity compared to 
> version v5
> 
>  v6: Rebased on latest powerpc/merge branch ; Using __get_user_inatomic() 
> instead of get_user() in order
>  to move it inside the semaphored area. That removes all the complexity 
> of the patch.
> 
>  v5: Reworked to fit after Benh do_fault improvement and rebased on top of 
> powerpc/merge (65152902e43fef)
> 
>  v4: Rebased on top of powerpc/next (f718d426d7e42e) and doing access_ok() 
> verification before __get_user_xxx()
> 
>  v3: Do a first try with pagefault disabled before releasing the semaphore
> 
>  v2: Changes 'if (cond1) if (cond2)' by 'if (cond1 && cond2)'
> 
>  arch/powerpc/mm/fault.c | 63 
> +++--
>  1 file changed, 45 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 0c99f9b45e8f..7f9363879f4a 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -66,15 +66,11 @@ static inline bool notify_page_fault(struct pt_regs *regs)
>  }
>  
>  /*
> - * Check whether the instruction at regs->nip is a store using
> + * Check whether the instruction inst is a store using
>   * an update addressing form which will update r1.
>   */
> -static bool store_updates_sp(struct pt_regs *regs)
> +static bool store_updates_sp(unsigned int inst)
>  {
> - unsigned int inst;
> -
> - if (get_user(inst, (unsigned int __user *)regs->nip))
> - return false;
>   /* check for 1 in the rA field */
>   if (((inst >> 16) & 0x1f) != 1)
>   return false;
> @@ -233,9 +229,10 @@ static bool bad_kernel_fault(bool is_exec, unsigned long 
> error_code,
>   return is_exec || (address >= TASK_SIZE);
>  }
>  
> -static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address,
> - struct vm_area_struct *vma,
> - bool store_update_sp)
> +/* Return value is true if bad (sem. released), false if good, -1 for retry 
> */
> +static int bad_stack_expansion(struct pt_regs *regs, unsigned long address,
> + struct vm_area_struct *vma, unsigned int flags,
> + bool is_retry)
>  {
>   /*
>* N.B. The POWER/Open ABI allows programs to access up to
> @@ -247,10 +244,15 @@ static bool bad_stack_exp
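
As a side note on the store_updates_sp() test quoted above: the rA field of a
PowerPC D-form store sits in bits 16..20 when counting from the least
significant bit, and the stack pointer only moves when rA is r1.  A small
stand-alone illustration of just that check (the real function also verifies
that the opcode is an update-form store, which the truncated quote does not
show):

	#include <stdio.h>
	#include <stdbool.h>
	#include <stdint.h>

	/* Same extraction as ((inst >> 16) & 0x1f) != 1 in the patch. */
	static bool updates_r1(uint32_t inst)
	{
		return ((inst >> 16) & 0x1f) == 1;
	}

	int main(void)
	{
		uint32_t stwu_r1 = 0x9421fff0;	/* stwu r1,-16(r1): typical prologue */
		uint32_t stwu_r9 = 0x95e9fff0;	/* stwu r15,-16(r9): rA is r9, not r1 */

		printf("%#010x updates r1: %d\n", stwu_r1, updates_r1(stwu_r1));
		printf("%#010x updates r1: %d\n", stwu_r9, updates_r1(stwu_r9));
		return 0;
	}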

Re: [PATCH v8] powerpc/mm: Only read faulting instruction when necessary in do_page_fault()

2018-05-23 Thread Nicholas Piggin
On Wed, 23 May 2018 09:31:33 +0200
Christophe LEROY  wrote:

> Le 23/05/2018 à 09:17, Nicholas Piggin a écrit :
> > On Wed, 23 May 2018 09:01:19 +0200 (CEST)
> > Christophe Leroy  wrote:
> >   

> >> @@ -264,8 +266,30 @@ static bool bad_stack_expansion(struct pt_regs *regs, 
> >> unsigned long address,
> >> * between the last mapped region and the stack will
> >> * expand the stack rather than segfaulting.
> >> */
> >> -  if (address + 2048 < uregs->gpr[1] && !store_update_sp)
> >> -  return true;
> >> +  if (address + 2048 >= uregs->gpr[1])
> >> +  return false;
> >> +  if (is_retry)
> >> +  return false;
> >> +
> >> +  if ((flags & FAULT_FLAG_WRITE) && (flags & FAULT_FLAG_USER) &&
> >> +  access_ok(VERIFY_READ, nip, sizeof(inst))) {
> >> +  int res;
> >> +
> >> +  pagefault_disable();
> >> +  res = __get_user_inatomic(inst, nip);
> >> +  pagefault_enable();
> >> +  if (res) {
> >> +  up_read(&mm->mmap_sem);
> >> +  res = __get_user(inst, nip);
> >> +  if (!res && store_updates_sp(inst))
> >> +  return -1;
> >> +  return true;
> >> +  }
> >> +  if (store_updates_sp(inst))
> >> +  return false;
> >> +  }
> >> +  up_read(&mm->mmap_sem);  
> > 
> > Starting to look pretty good... I think probably I prefer the mmap_sem
> > drop going into the caller so we don't don't drop in the child function.  
> 
> Yes I can do that. I thought it was ok as the drop is already done in 
> child functions like bad_area(), bad_access(), ...

That's true, all exit functions though. I think it may end up being a
bit nicer with the up_read in the caller, but see what you think.

> > I thought the retry logic was a little bit complex too, what do you
> > think of using fault_in_pages_readable and just doing a full retry to
> > avoid some of this complexity?  
> 
> Yes, let's try that way, although fault_in_pages_readable() is nothing 
> more than a get_user().
> Should we take any precaution to avoid retrying forever, or is it just 
> not worth it?

generic_perform_write(), the core of the data copying for the write(2)
syscall, does this retry, so I think it's okay... Although I think I
wrote that so maybe that's a circular justification.

I think if we end up thrashing on this type of loop for a long time,
the system will already be basically dead.
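
Roughly, the shape that suggestion gives the tail of bad_stack_expansion()
(a sketch based on the v8 hunk quoted above plus the comments here, not
necessarily the code that ended up in v9):

	pagefault_disable();
	res = __get_user_inatomic(inst, nip);
	pagefault_enable();
	if (unlikely(res)) {
		/* Instruction page not resident: drop the lock, try to
		 * fault the page in, and ask the caller to restart the
		 * whole fault once. */
		up_read(&mm->mmap_sem);
		if (!is_retry &&
		    !fault_in_pages_readable((const char __user *)nip, sizeof(inst)))
			return -1;	/* caller sets is_retry and retries */
		return true;		/* still unreadable, or already retried once */
	}
	if (store_updates_sp(inst))
		return false;		/* legitimate large stack expansion */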


> >>/* The stack is being expanded, check if it's valid */
> >> -  if (unlikely(bad_stack_expansion(regs, address, vma, store_update_sp)))
> >> -  return bad_area(regs, address);
> >> +  is_bad = bad_stack_expansion(regs, address, vma, flags, is_retry);
> >> +  if (unlikely(is_bad == -1)) {
> >> +  is_retry = true;
> >> +  goto retry;
> >> +  }
> >> +  if (unlikely(is_bad))
> >> +  return bad_area_nosemaphore(regs, address);  
> > 
> > Suggest making the return so that you can do a single unlikely test for
> > the retry or bad case, and then distinguish the retry in there. Code
> > generation should be better.  
> 
> Ok. I'll try and come up with v9 this morning.

Thanks,
Nick
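
One way to read the "single unlikely test" suggestion at the call site, still
following the v8 convention that bad_stack_expansion() has already dropped
mmap_sem on the bad and retry paths (illustrative only, not the merged code):

	is_bad = bad_stack_expansion(regs, address, vma, flags, is_retry);
	if (unlikely(is_bad)) {
		/* Both rare outcomes hide behind one branch; the common
		 * "expansion is fine" path stays untouched. */
		if (is_bad == -1) {
			is_retry = true;
			goto retry;	/* restart now that the page is readable */
		}
		return bad_area_nosemaphore(regs, address);
	}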


Re: [PATCH v9] powerpc/mm: Only read faulting instruction when necessary in do_page_fault()

2018-05-23 Thread Nicholas Piggin
On Wed, 23 May 2018 10:53:22 +0200 (CEST)
Christophe Leroy  wrote:

> Commit a7a9dcd882a67 ("powerpc: Avoid taking a data miss on every
> userspace instruction miss") has shown that limiting the read of
> faulting instruction to likely cases improves performance.
> 
> This patch goes further into this direction by limiting the read
> of the faulting instruction to the only cases where it is likely
> needed.
> 
> On an MPC885, with the same benchmark app as in the commit referred
> above, we see a reduction of about 3900 dTLB misses (approx 3%):
> 
> Before the patch:
>  Performance counter stats for './fault 500' (10 runs):
> 
>  683033312  cpu-cycles
> ( +-  0.03% )
> 134538  dTLB-load-misses  
> ( +-  0.03% )
>  46099  iTLB-load-misses  
> ( +-  0.02% )
>  19681  faults
> ( +-  0.02% )
> 
>5.389747878 seconds time elapsed   
>( +-  0.06% )
> 
> With the patch:
> 
>  Performance counter stats for './fault 500' (10 runs):
> 
>  682112862  cpu-cycles
> ( +-  0.03% )
> 130619  dTLB-load-misses  
> ( +-  0.03% )
>  46073  iTLB-load-misses  
> ( +-  0.05% )
>  19681  faults
> ( +-  0.01% )
> 
>5.381342641 seconds time elapsed   
>( +-  0.07% )
> 
> The proper work of the huge stack expansion was tested with the
> following app:
> 
> int main(int argc, char **argv)
> {
>   char buf[1024 * 1025];
> 
>   sprintf(buf, "Hello world !\n");
>   printf(buf);
> 
>   exit(0);
> }
> 
> Signed-off-by: Christophe Leroy 

Reviewed-by: Nicholas Piggin 

Thanks,
Nick


Re: [RFC REBASED 5/5] powerpc/mm/slice: use the dynamic high slice size to limit bitmap operations

2018-02-27 Thread Nicholas Piggin
On Tue, 27 Feb 2018 18:11:07 +0530
"Aneesh Kumar K.V"  wrote:

> Nicholas Piggin  writes:
> 
> > On Tue, 27 Feb 2018 14:31:07 +0530
> > "Aneesh Kumar K.V"  wrote:
> >  
> >> Christophe Leroy  writes:
> >>   
> >> > The number of high slices a process might use now depends on its
> >> > address space size, and what allocation address it has requested.
> >> >
> >> > This patch uses that limit throughout call chains where possible,
> >> > rather than use the fixed SLICE_NUM_HIGH for bitmap operations.
> >> > This saves some cost for processes that don't use very large address
> >> > spaces.
> >> 
> >> I haven't really looked at the final code. One of the issue we had was
> >> with the below scenario.
> >> 
> >> mmap(addr, len) where addr < 128TB and addr+len > 128TB  We want to make
> >> sure we build the mask such that we don't find the addr available.  
> >
> > We should run it through the mmap regression tests. I *think* we moved
> > all of that logic from the slice code to get_unmapped_area before going
> > in to slices. I may have missed something though, it would be good to
> > have more eyes on it.
> >  
> 
> mmap(-1,...) failed with the test. Something like below fix it
> 
> @@ -756,7 +770,7 @@ void slice_set_user_psize(struct mm_struct *mm, unsigned 
> int psize)
> mm->context.low_slices_psize = lpsizes;
>  
> hpsizes = mm->context.high_slices_psize;
> -   high_slices = GET_HIGH_SLICE_INDEX(mm->context.slb_addr_limit);
> +   high_slices = SLICE_NUM_HIGH;
> for (i = 0; i < high_slices; i++) {
> mask_index = i & 0x1;
> index = i >> 1;
> 
> I guess for everything in the mm_context_t, we should compute it till
> SLICE_NUM_HIGH. The reason for failure was, even though we recompute the
> slice mask cached in mm_context on slb_addr_limit, it was still derived
> from the high_slices_psizes, which was computed with a lower value.

Okay thanks for catching that Aneesh. I guess that's a slow path so it
should be okay. Christophe if you're taking care of the series can you
fold it in? Otherwise I'll do that after yours gets merged.

Thanks,
Nick
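
For context, the loop being discussed packs one 4-bit page-size id per high
slice, two slices per byte, roughly like the sketch below (the psize
comparison done by the real slice_set_user_psize() is left out).  Iterating
only up to GET_HIGH_SLICE_INDEX(slb_addr_limit) leaves the entries above the
current limit stale, which is what broke the mmap(-1, ...) test once the
limit was later raised:

	for (i = 0; i < SLICE_NUM_HIGH; i++) {
		int mask_index = i & 0x1;	/* low or high nibble */
		int index = i >> 1;		/* byte in high_slices_psize[] */

		hpsizes[index] &= ~(0xf << (mask_index * 4));
		hpsizes[index] |= psize << (mask_index * 4);
	}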
