Re: [v2 PATCH] arm64: replace read_lock to rcu lock in call_break_hook

2015-10-05 Thread Shi, Yang

On 10/1/2015 3:15 PM, Shi, Yang wrote:

On 10/1/2015 2:27 PM, Paul E. McKenney wrote:

On Thu, Oct 01, 2015 at 01:53:51PM -0700, Shi, Yang wrote:

On 10/1/2015 10:08 AM, Steven Rostedt wrote:

On Thu,  1 Oct 2015 09:37:37 -0700
Yang Shi  wrote:


BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
in_atomic(): 0, irqs_disabled(): 128, pid: 342, name: perf
1 lock held by perf/342:
  #0:  (break_hook_lock){+.+...}, at: [] call_break_hook+0x34/0xd0
irq event stamp: 62224
hardirqs last  enabled at (62223): [] __call_rcu.constprop.59+0x104/0x270
hardirqs last disabled at (62224): [] vprintk_emit+0x68/0x640
softirqs last  enabled at (0): [] copy_process.part.8+0x428/0x17f8
softirqs last disabled at (0): [<  (null)>]   (null)
CPU: 0 PID: 342 Comm: perf Not tainted 4.1.6-rt5 #4
Hardware name: linux,dummy-virt (DT)
Call trace:
[] dump_backtrace+0x0/0x128
[] show_stack+0x20/0x30
[] dump_stack+0x7c/0xa0
[] ___might_sleep+0x174/0x260
[] __rt_spin_lock+0x28/0x40
[] rt_read_lock+0x60/0x80
[] call_break_hook+0x30/0xd0
[] brk_handler+0x30/0x98
[] do_debug_exception+0x50/0xb8
Exception stack(0xffc00514fe30 to 0xffc00514ff50)
fe20:   c1594680 007f
fe40:   92063940 007f 0550dcd8 ffc0  
fe60: 0514fe70 ffc0 000be1f8 ffc0 0514feb0 ffc0 0008948c ffc0
fe80: 0004  0514fed0 ffc0   9282a948 007f
fea0:   9282b708 007f c1592820 007f 00083914 ffc0
fec0:   0010  0064  0001 
fee0: 005101e0  c1594680 007f c1594740 007f ffd8 ff80
ff00:     c1594770 007f c1594770 007f
ff20: 00665e10  7f7f7f7f 7f7f7f7f 01010101 01010101  
ff40: 928e4cc0 007f 91ff11e8 007f

call_break_hook is called in atomic context (hard irq disabled), so replace
the sleepable lock with an RCU lock and convert the relevant list operations
to their RCU versions.

Signed-off-by: Yang Shi 
---
v1 -> v2:
Replace the list operations with their RCU versions.

  arch/arm64/kernel/debug-monitors.c | 10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kernel/debug-monitors.c
b/arch/arm64/kernel/debug-monitors.c
index cebf786..cf0e4fc 100644
--- a/arch/arm64/kernel/debug-monitors.c
+++ b/arch/arm64/kernel/debug-monitors.c
@@ -276,14 +276,14 @@ static DEFINE_RWLOCK(break_hook_lock);
  void register_break_hook(struct break_hook *hook)
  {
  write_lock(&break_hook_lock);
-list_add(&hook->node, &break_hook);
+list_add_rcu(&hook->node, &break_hook);
  write_unlock(&break_hook_lock);
  }

  void unregister_break_hook(struct break_hook *hook)
  {
  write_lock(&break_hook_lock);
-list_del(&hook->node);
+list_del_rcu(&hook->node);
  write_unlock(&break_hook_lock);
  }


Shouldn't there be a synchronize_rcu() somewhere?


So far kgdb is the only user of unregister_break_hook in the mainline
kernel.

I just read Documentation/RCU/checklist.txt, which says:

Note that synchronize_rcu() -only- guarantees to wait until
all currently executing rcu_read_lock()-protected RCU read-side
critical sections complete.

For kgdb, the unregister is only called in kgdb_arch_exit by
kgdb_unregister_io_module, which runs when the kgdb module is removed
with rmmod.

The breakpoint handler is executed synchronously, so it sounds like it
should not be a problem to omit synchronize_rcu().


OK, I will bite...  What does "synchronously" mean here?  Unless you
have somehow guaranteed that all current readers in call_break_hook()
are done between the time you call unregister_break_hook() to remove a
given break_hook structure and the time you call register_break_hook()
to add that same structure back in, you have a problem.


For the kgdb use case, this might be guaranteed.

Generally, the kgdb module is loaded, then register_break_hook() is
called. Then you connect to kgdb from the host or via kdb, set a
breakpoint, wait for the breakpoint to be hit, and run some commands to
debug. When debugging is finished, rmmod of the kgdb module calls
unregister_break_hook().

It sounds like the current readers in call_break_hook() must have
completed during that time, otherwise I would not be able to continue
debugging once the breakpoint is hit.



What you have now only protects against invoking register_break_hook()
on newly allocated and initialized break_hook structure.  But the only
calls to register_break_hook() that I see in v4.2 use compile-time
initialized structures.  So the only failure from using non-RCU list
primitives would be due to the list_head's ->next pointer initialization.
This could momentarily make the list appear to have only the new element,
but not the old element.

Unless you do a series of register_break_hook() and unregister_break_hook()
calls, in which case a previously deleted s

Re: [PATCH] arm64: replace read_lock to rcu lock in call_break_hook

2015-10-01 Thread Shi, Yang

On 10/1/2015 8:07 AM, Catalin Marinas wrote:

On Wed, Sep 30, 2015 at 03:59:04PM -0700, Yang Shi wrote:

diff --git a/arch/arm64/kernel/debug-monitors.c 
b/arch/arm64/kernel/debug-monitors.c
index cebf786..eb520d0 100644
--- a/arch/arm64/kernel/debug-monitors.c
+++ b/arch/arm64/kernel/debug-monitors.c
@@ -292,11 +292,11 @@ static int call_break_hook(struct pt_regs *regs, unsigned 
int esr)
struct break_hook *hook;
int (*fn)(struct pt_regs *regs, unsigned int esr) = NULL;

-   read_lock(&break_hook_lock);
+   rcu_read_lock();
list_for_each_entry(hook, &break_hook, node)
if ((esr & hook->esr_mask) == hook->esr_val)
fn = hook->fn;
-   read_unlock(&break_hook_lock);
+   rcu_read_unlock();

return fn ? fn(regs, esr) : DBG_HOOK_ERROR;
  }


That's not enough, you also need list_(add|del)_rcu where the list is
modified, together with list_for_each_entry_rcu() here.


Thanks for pointing it out. I'm going to prepare v2 with list_(add|del)_rcu.

Yang







Re: [v2 PATCH] arm64: replace read_lock to rcu lock in call_break_hook

2015-10-01 Thread Shi, Yang

On 10/1/2015 10:08 AM, Steven Rostedt wrote:

On Thu,  1 Oct 2015 09:37:37 -0700
Yang Shi  wrote:


BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
in_atomic(): 0, irqs_disabled(): 128, pid: 342, name: perf
1 lock held by perf/342:
  #0:  (break_hook_lock){+.+...}, at: [] call_break_hook+0x34/0xd0
irq event stamp: 62224
hardirqs last  enabled at (62223): [] __call_rcu.constprop.59+0x104/0x270
hardirqs last disabled at (62224): [] vprintk_emit+0x68/0x640
softirqs last  enabled at (0): [] copy_process.part.8+0x428/0x17f8
softirqs last disabled at (0): [<  (null)>]   (null)
CPU: 0 PID: 342 Comm: perf Not tainted 4.1.6-rt5 #4
Hardware name: linux,dummy-virt (DT)
Call trace:
[] dump_backtrace+0x0/0x128
[] show_stack+0x20/0x30
[] dump_stack+0x7c/0xa0
[] ___might_sleep+0x174/0x260
[] __rt_spin_lock+0x28/0x40
[] rt_read_lock+0x60/0x80
[] call_break_hook+0x30/0xd0
[] brk_handler+0x30/0x98
[] do_debug_exception+0x50/0xb8
Exception stack(0xffc00514fe30 to 0xffc00514ff50)
fe20:   c1594680 007f
fe40:   92063940 007f 0550dcd8 ffc0  
fe60: 0514fe70 ffc0 000be1f8 ffc0 0514feb0 ffc0 0008948c ffc0
fe80: 0004  0514fed0 ffc0   9282a948 007f
fea0:   9282b708 007f c1592820 007f 00083914 ffc0
fec0:   0010  0064  0001 
fee0: 005101e0  c1594680 007f c1594740 007f ffd8 ff80
ff00:     c1594770 007f c1594770 007f
ff20: 00665e10  7f7f7f7f 7f7f7f7f 01010101 01010101  
ff40: 928e4cc0 007f 91ff11e8 007f

call_break_hook is called in atomic context (hard irq disabled), so replace
the sleepable lock with an RCU lock and convert the relevant list operations
to their RCU versions.

Signed-off-by: Yang Shi 
---
v1 -> v2:
Replace the list operations with their RCU versions.

  arch/arm64/kernel/debug-monitors.c | 10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kernel/debug-monitors.c 
b/arch/arm64/kernel/debug-monitors.c
index cebf786..cf0e4fc 100644
--- a/arch/arm64/kernel/debug-monitors.c
+++ b/arch/arm64/kernel/debug-monitors.c
@@ -276,14 +276,14 @@ static DEFINE_RWLOCK(break_hook_lock);
  void register_break_hook(struct break_hook *hook)
  {
write_lock(&break_hook_lock);
-   list_add(&hook->node, &break_hook);
+   list_add_rcu(&hook->node, &break_hook);
write_unlock(&break_hook_lock);
  }

  void unregister_break_hook(struct break_hook *hook)
  {
write_lock(&break_hook_lock);
-   list_del(&hook->node);
+   list_del_rcu(&hook->node);
write_unlock(&break_hook_lock);
  }


Shouldn't there be a synchronize_rcu() somewhere?


So far kgdb is the only user of unregister_break_hook in the mainline kernel.

I just read Documentation/RCU/checklist.txt, which says:

Note that synchronize_rcu() -only- guarantees to wait until
all currently executing rcu_read_lock()-protected RCU read-side critical 
sections complete.


For kgdb, the unregister is only called in kgdb_arch_exit by
kgdb_unregister_io_module, which runs when the kgdb module is removed with rmmod.


The breakpoint handler is executed synchronously, so it sounds like it
should not be a problem to omit synchronize_rcu().


Yang


-- Steve



@@ -292,11 +292,11 @@ static int call_break_hook(struct pt_regs *regs, unsigned 
int esr)
struct break_hook *hook;
int (*fn)(struct pt_regs *regs, unsigned int esr) = NULL;

-   read_lock(&break_hook_lock);
-   list_for_each_entry(hook, &break_hook, node)
+   rcu_read_lock();
+   list_for_each_entry_rcu(hook, &break_hook, node)
if ((esr & hook->esr_mask) == hook->esr_val)
fn = hook->fn;
-   read_unlock(&break_hook_lock);
+   rcu_read_unlock();

return fn ? fn(regs, esr) : DBG_HOOK_ERROR;
  }
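
For readers following along: here is a hypothetical example of how a
break_hook user plugs into this list. The struct layout and the
DBG_HOOK_HANDLED return value match the arm64 code of this era; the
handler, module, and ESR value below are made up purely for illustration.

#include <linux/module.h>
#include <linux/printk.h>
#include <asm/debug-monitors.h>

/* Hypothetical handler: runs from call_break_hook() when the masked
 * ESR of a BRK exception matches esr_val below. */
static int example_brk_handler(struct pt_regs *regs, unsigned int esr)
{
	pr_info("example break hook hit, esr=0x%x\n", esr);
	return DBG_HOOK_HANDLED;
}

static struct break_hook example_break_hook = {
	.esr_mask = 0xffffffff,		/* compare all ESR bits */
	.esr_val  = 0xf2000001,		/* made-up BRK immediate encoding */
	.fn	  = example_brk_handler,
};

static int __init example_init(void)
{
	register_break_hook(&example_break_hook);
	return 0;
}

static void __exit example_exit(void)
{
	/* This is the call whose grace-period semantics are being
	 * debated in this thread. */
	unregister_break_hook(&example_break_hook);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");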






Re: [v2 PATCH] arm64: replace read_lock to rcu lock in call_break_hook

2015-10-01 Thread Shi, Yang

On 10/1/2015 2:27 PM, Paul E. McKenney wrote:

On Thu, Oct 01, 2015 at 01:53:51PM -0700, Shi, Yang wrote:

On 10/1/2015 10:08 AM, Steven Rostedt wrote:

On Thu,  1 Oct 2015 09:37:37 -0700
Yang Shi  wrote:


BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
in_atomic(): 0, irqs_disabled(): 128, pid: 342, name: perf
1 lock held by perf/342:
  #0:  (break_hook_lock){+.+...}, at: [] call_break_hook+0x34/0xd0
irq event stamp: 62224
hardirqs last  enabled at (62223): [] __call_rcu.constprop.59+0x104/0x270
hardirqs last disabled at (62224): [] vprintk_emit+0x68/0x640
softirqs last  enabled at (0): [] copy_process.part.8+0x428/0x17f8
softirqs last disabled at (0): [<  (null)>]   (null)
CPU: 0 PID: 342 Comm: perf Not tainted 4.1.6-rt5 #4
Hardware name: linux,dummy-virt (DT)
Call trace:
[] dump_backtrace+0x0/0x128
[] show_stack+0x20/0x30
[] dump_stack+0x7c/0xa0
[] ___might_sleep+0x174/0x260
[] __rt_spin_lock+0x28/0x40
[] rt_read_lock+0x60/0x80
[] call_break_hook+0x30/0xd0
[] brk_handler+0x30/0x98
[] do_debug_exception+0x50/0xb8
Exception stack(0xffc00514fe30 to 0xffc00514ff50)
fe20:   c1594680 007f
fe40:   92063940 007f 0550dcd8 ffc0  
fe60: 0514fe70 ffc0 000be1f8 ffc0 0514feb0 ffc0 0008948c ffc0
fe80: 0004  0514fed0 ffc0   9282a948 007f
fea0:   9282b708 007f c1592820 007f 00083914 ffc0
fec0:   0010  0064  0001 
fee0: 005101e0  c1594680 007f c1594740 007f ffd8 ff80
ff00:     c1594770 007f c1594770 007f
ff20: 00665e10  7f7f7f7f 7f7f7f7f 01010101 01010101  
ff40: 928e4cc0 007f 91ff11e8 007f

call_break_hook is called in atomic context (hard irq disabled), so replace
the sleepable lock with an RCU lock and convert the relevant list operations
to their RCU versions.

Signed-off-by: Yang Shi 
---
v1 -> v2:
Replace the list operations with their RCU versions.

  arch/arm64/kernel/debug-monitors.c | 10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kernel/debug-monitors.c 
b/arch/arm64/kernel/debug-monitors.c
index cebf786..cf0e4fc 100644
--- a/arch/arm64/kernel/debug-monitors.c
+++ b/arch/arm64/kernel/debug-monitors.c
@@ -276,14 +276,14 @@ static DEFINE_RWLOCK(break_hook_lock);
  void register_break_hook(struct break_hook *hook)
  {
write_lock(&break_hook_lock);
-   list_add(&hook->node, &break_hook);
+   list_add_rcu(&hook->node, &break_hook);
write_unlock(&break_hook_lock);
  }

  void unregister_break_hook(struct break_hook *hook)
  {
write_lock(&break_hook_lock);
-   list_del(&hook->node);
+   list_del_rcu(&hook->node);
write_unlock(&break_hook_lock);
  }


Shouldn't there be a synchronize_rcu() somewhere?


So far kgdb is the only user of unregister_break_hook in the mainline kernel.

I just read Documentation/RCU/checklist.txt, which says:

Note that synchronize_rcu() -only- guarantees to wait until
all currently executing rcu_read_lock()-protected RCU read-side
critical sections complete.

For kgdb, the unregister is only called in kgdb_arch_exit by
kgdb_unregister_io_module, which runs when the kgdb module is removed with rmmod.

The breakpoint handler is executed synchronously, so it sounds like it
should not be a problem to omit synchronize_rcu().


OK, I will bite...  What does "synchronously" mean here?  Unless you
have somehow guaranteed that all current readers in call_break_hook()
are done between the time you call unregister_break_hook() to remove a
given break_hook structure and the time you call register_break_hook()
to add that same structure back in, you have a problem.


For the kgdb use case, this might be guaranteed.

Generally, the kgdb module is loaded, then register_break_hook() is
called. Then you connect to kgdb from the host or via kdb, set a
breakpoint, wait for the breakpoint to be hit, and run some commands to
debug. When debugging is finished, rmmod of the kgdb module calls
unregister_break_hook().

It sounds like the current readers in call_break_hook() must have
completed during that time, otherwise I would not be able to continue
debugging once the breakpoint is hit.




What you have now only protects against invoking register_break_hook()
on newly allocated and initialized break_hook structure.  But the only
calls to register_break_hook() that I see in v4.2 use compile-time
initialized structures.  So the only failure from using non-RCU list
primitives would be due to the list_head's ->next pointer initialization.
This could momentarily make the list appear to have only the new element,
but not the old element.

Unless you do a series of register_break_hook() and unregister_break_hook()
calls, in which case a previously deleted structure could 
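
To make Steven's and Paul's point concrete, here is a minimal sketch
(assuming the v2 patch above) of what the unregister path would look like
with a grace period added, so that a hook structure cannot be freed or
immediately re-registered while call_break_hook() readers may still be
traversing it:

void unregister_break_hook(struct break_hook *hook)
{
	write_lock(&break_hook_lock);
	list_del_rcu(&hook->node);
	write_unlock(&break_hook_lock);
	/* Wait for all rcu_read_lock() readers in call_break_hook()
	 * to finish before the caller may free or reuse the hook. */
	synchronize_rcu();
}

Note that synchronize_rcu() may block, which is fine here because
unregister_break_hook() runs in process context (e.g. module unload),
unlike call_break_hook() itself.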

Re: [PATCH] perf: change samples type to unsigned long long

2015-10-02 Thread Shi, Yang

On 10/2/2015 12:10 PM, Arnaldo Carvalho de Melo wrote:

Em Fri, Oct 02, 2015 at 04:08:38PM -0300, Arnaldo Carvalho de Melo escreveu:

Em Tue, Sep 29, 2015 at 02:49:43PM -0700, Yang Shi escreveu:

When running "perf record -e", the number of samples shown is wrong on some
32-bit systems, e.g. powerpc and arm.

For example, run the below commands on 32-bit powerpc:

perf probe -x /lib/libc.so.6 malloc
perf record -e probe_libc:malloc -a ls
perf.data
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.036 MB perf.data (13829241621624967218 
samples) ]

Actually, "perf script" just shows 21 samples. The number of samples is also
absurd since samples is long type, but it is printed as PRIu64.

Build test is run on x86-64, x86, aarch64, arm, mips, ppc and ppc64.


Sure?

   AR   /tmp/build/perf/libperf.a
builtin-record.c: In function ‘__cmd_record’:
builtin-record.c:689:12: error: comparison between signed and unsigned
integer expressions [-Werror=sign-compare]


You missed something important, I am fixing this and applying your
patch:


Thanks a lot. Yes, it was wrong; however, my compiler only throws a
warning instead of an error, and the warning was hidden by my OE build
environment.


Regards,
Yang



diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 974065f8ce80..24ace2f318c1 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -667,7 +667,7 @@ static int __cmd_record(struct record *rec, int argc, const 
char **argv)

auxtrace_snapshot_enabled = 1;
for (;;) {
-   int hits = rec->samples;
+   unsigned long long hits = rec->samples;

if (record__mmap_read_all(rec) < 0) {
auxtrace_snapshot_enabled = 0;
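
As a standalone illustration (not perf code) of why a narrow signed type
produces absurd counts: when a 32-bit long with the sign bit set is
widened to the 64-bit unsigned type that PRIu64 prints, it sign-extends
into a huge number. A minimal, self-contained demo:

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
	/* Hypothetical stand-in for rec->samples on a 32-bit target,
	 * where 'long' is 32 bits and the value has gone negative. */
	long samples32 = -1L;

	/* Widening to u64 sign-extends, so the printed count is absurd: */
	printf("%" PRIu64 " samples\n", (uint64_t)samples32);
	/* prints 18446744073709551615 */

	/* With the fixed 64-bit type the count prints as expected: */
	unsigned long long samples64 = 21;
	printf("%llu samples\n", samples64);
	return 0;
}

The actual bug also involved passing a narrower value through a 64-bit
printf conversion, which is undefined behavior; either way the number
shown bears no relation to the real sample count.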





Re: [PATCH] arm64: Kconfig: make SCHED_MC and SCHED_SMT depend on SMP

2016-04-14 Thread Shi, Yang

On 4/14/2016 1:47 AM, Will Deacon wrote:

On Wed, Apr 13, 2016 at 05:54:12PM -0700, Yang Shi wrote:

SCHED_MC and SCHED_SMT are pointless when SMP is disabled.
Although SMP is rarely disabled for ARM64, it looks more consistent to have
such a dependency in Kconfig.


You can't disable CONFIG_SMP for arm64 -- we select it unconditionally
in the kconfig.


Thanks Will. I didn't realize ARM64 selects SMP unconditionally, so it
looks like the patch is pointless.


A follow-up question: I know ARM64 has no UP implementation now, so it
makes sense to select SMP unconditionally; however, it might be more
flexible to make SMP configurable like x86 and other architectures do.
It might also help to find more bugs when building with !SMP.


Thanks,
Yang



Will





Re: [PATCH] locktorture: make verbose writable and control stats print

2016-04-15 Thread Shi, Yang

On 4/15/2016 4:26 PM, Paul E. McKenney wrote:

On Fri, Apr 15, 2016 at 01:28:11PM -0700, Yang Shi wrote:

When the locktorture test is built into the kernel image, it keeps printing
stats information even though no lock type is specified.

There is already a verbose parameter to control printing, but it is read-only,
so it can't be changed at runtime. Make verbose read-write and have it control
stats printing.

Signed-off-by: Yang Shi 


Interesting change!

But just out of curiosity, when you boot with locktorture built in,
do you specify the shutdown_secs boot parameter?  If so, another


No, just use the default value, which is 0 for shutdown_secs.


approach would be to shutdown immediately upon detecting an error
during initialization.


In my case, it looks like there is no error involved.



If not, I would like to know more about your use case.


In my test, I just built the locktorture test into the kernel instead of
as a module to check how it behaves; there was no specific purpose.


It sounds like that is not the normal way to use it.

Thanks,
Yang



Thanx, Paul


---
  include/linux/torture.h  |  2 ++
  kernel/locking/locktorture.c | 11 +++
  2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/torture.h b/include/linux/torture.h
index 7759fc3..86d6e54 100644
--- a/include/linux/torture.h
+++ b/include/linux/torture.h
@@ -48,6 +48,8 @@
do { if (verbose) pr_alert("%s" TORTURE_FLAG " %s\n", torture_type, s); 
} while (0)
  #define VERBOSE_TOROUT_ERRSTRING(s) \
do { if (verbose) pr_alert("%s" TORTURE_FLAG "!!! %s\n", torture_type, 
s); } while (0)
+#define VERBOSE_STRING(s) \
+   do { if (verbose) pr_alert("%s", s); } while (0)

  /* Definitions for online/offline exerciser. */
  int torture_onoff_init(long ooholdoff, long oointerval);
diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index 8ef1919..d9838a3 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -55,8 +55,11 @@ torture_param(int, shutdown_secs, 0, "Shutdown time (j), <= zero 
to disable.");
  torture_param(int, stat_interval, 60,
 "Number of seconds between stats printk()s");
  torture_param(int, stutter, 5, "Number of jiffies to run/halt test, 
0=disable");
-torture_param(bool, verbose, true,
-"Enable verbose debugging printk()s");
+
+static bool verbose = true;
+module_param(verbose, bool, 0644);
+MODULE_PARM_DESC(verbose,
+"Enable verbose debugging printk()s");

  static char *torture_type = "spin_lock";
  module_param(torture_type, charp, 0444);
@@ -693,7 +696,7 @@ static void lock_torture_stats_print(void)
}

__torture_print_stats(buf, cxt.lwsa, true);
-   pr_alert("%s", buf);
+   VERBOSE_STRING(buf);
kfree(buf);

if (cxt.cur_ops->readlock) {
@@ -705,7 +708,7 @@ static void lock_torture_stats_print(void)
}

__torture_print_stats(buf, cxt.lrsa, false);
-   pr_alert("%s", buf);
+   VERBOSE_STRING(buf);
kfree(buf);
}
  }
--
2.0.2







Re: [PATCH] locktorture: make verbose writable and control stats print

2016-04-15 Thread Shi, Yang

On 4/15/2016 5:09 PM, Paul E. McKenney wrote:

On Fri, Apr 15, 2016 at 04:45:32PM -0700, Shi, Yang wrote:

On 4/15/2016 4:26 PM, Paul E. McKenney wrote:

On Fri, Apr 15, 2016 at 01:28:11PM -0700, Yang Shi wrote:

When the locktorture test is built into the kernel image, it keeps printing
stats information even though no lock type is specified.

There is already a verbose parameter to control printing, but it is read-only,
so it can't be changed at runtime. Make verbose read-write and have it control
stats printing.

Signed-off-by: Yang Shi 


Interesting change!

But just out of curiosity, when you boot with locktorture built in,
do you specify the shutdown_secs boot parameter?  If so, another


No, just use the default value, which is 0 for shutdown_secs.


approach would be to shutdown immediately upon detecting an error
during initialization.


In my case, it looks like there is no error involved.


You said that there is no lock type specified, but that should mean that
the default ("spin_lock") is chosen.  If so, I would expect it to just


Yes, spin_lock is chosen by default.


do the test, at least if locktorture.torture_runnable has been set.


But the default value of torture_runnable is 0, and it is a read-only
parameter too. This prevents the torture test from running if it is built
into the kernel instead of as a module.


Actually, I'm confused about why there is no LOCK_TORTURE_TEST_RUNNABLE
Kconfig option like the one RCU torture has.


Thanks,
Yang



Either way, the usual way to make locktorture shut up would be to boot
with locktorture.stat_interval=0.


If not, I would like to know more about your use case.


In my test, I just built the locktorture test into the kernel instead of
as a module to check how it behaves; there was no specific purpose.

It sounds like that is not the normal way to use it.


Agreed, I do believe that this is a case of "working as designed".

Thanx, Paul





Re: [BUG linux-next] Kernel panic found with linux-next-20160414

2016-04-28 Thread Shi, Yang

On 4/27/2016 1:14 AM, Hugh Dickins wrote:

On Wed, 20 Apr 2016, Shi, Yang wrote:

On 4/20/2016 1:01 AM, Hugh Dickins wrote:

On Tue, 19 Apr 2016, Shi, Yang wrote:

Hi folks,

When I ran ltp on linux-next-20160414 on my ARM64 machine, I got the
below
kernel panic:

Unable to handle kernel paging request at virtual address
ffc007846000
pgd = ffc01e21d000
[ffc007846000] *pgd=, *pud=
Internal error: Oops: 9647 [#11] PREEMPT SMP
Modules linked in: loop
CPU: 7 PID: 274 Comm: systemd-journal Tainted: G  D
4.6.0-rc3-next-20160414-WR8.0.0.0_standard+ #9
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
task: ffc01e3fcf80 ti: ffc01ea8c000 task.ti: ffc01ea8c000
PC is at copy_page+0x38/0x120
LR is at migrate_page_copy+0x604/0x1660
pc : [] lr : [] pstate: 2145
sp : ffc01ea8ecd0
x29: ffc01ea8ecd0 x28: 
x27: 1ff7b80240f8 x26: ffc018196f20
x25: ffbdc01e1180 x24: ffbdc01e1180
x23:  x22: ffc01e3fcf80
x21: ffc00481f000 x20: ff900a31d000
x19: ffbdc01207c0 x18: 0f00
x17:  x16: 
x15:  x14: 
x13:  x12: 
x11:  x10: 
x9 :  x8 : 
x7 :  x6 : 
x5 :  x4 : 
x3 :  x2 : 
x1 : ffc00481f080 x0 : ffc007846000

Call trace:
Exception stack(0xffc021fc2ed0 to 0xffc021fc2ff0)
2ec0:   ffbdc00887c0 ff900a31d000
2ee0: ffc021fc30f0 ff9008ff2318 2145 0025
2f00: ffbdc025a280 ffc020adc4c0 41b58ab3 ff900a085fd0
2f20: ff9008200658   ffbdc00887c0
2f40: ff900b0f1320 ffc021fc3078 41b58ab3 ff900a0864f8
2f60: ff9008210010 ffc021fb8960 ff900867bacc 1ff8043f712d
2f80: ffc021fc2fb0 ff9008210564 ffc021fc3070 ffc021fb8940
2fa0: 08221f78 ff900862f9c8 ffc021fc2fe0 ff9008215dc8
2fc0: 1ff8043f8602 ffc021fc ffc00968a000 ffc00221f080
2fe0: f9407e11d1f0 d61f02209103e210
[] copy_page+0x38/0x120
[] migrate_page+0x74/0x98
[] nfs_migrate_page+0x58/0x80
[] move_to_new_page+0x15c/0x4d8
[] migrate_pages+0x7c8/0x11f0
[] compact_zone+0xdfc/0x2570
[] compact_zone_order+0xe0/0x170
[] try_to_compact_pages+0x2e8/0x8f8
[] __alloc_pages_direct_compact+0x100/0x540
[] __alloc_pages_nodemask+0xc40/0x1c58
[] khugepaged+0x468/0x19c8
[] kthread+0x248/0x2c0
[] ret_from_fork+0x10/0x40
Code: d281f012 91020021 f1020252 d503201f (a8000c02)


I did some initial investigation and found it is caused by DEBUG_PAGEALLOC
and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT. Mainline 4.6-rc3 works well.

It should not be arch specific although I caught it on ARM64. I suspect
this might be caused by Hugh's huge tmpfs patches.


Thanks for testing.  It might be caused by my patches, but I don't think
that's very likely.  This is page migration for compaction, in the service
of anon THP's khugepaged; and I wonder if you were even exercising huge
tmpfs when running LTP here (it certainly can be done: I like to mount a
huge tmpfs on /opt/ltp and install there, with shmem_huge 2 so any other
tmpfs mounts are also huge).


Some further investigation shows I get the panic even when I don't have
tmpfs mounted with huge=1 and haven't set shmem_huge to 2.



There are compaction changes in linux-next too, but I don't see any
reason why they'd cause this.  I don't know arm64 traces enough to know
whether it's the source page or the destination page for the copy, but
it looks as if it has been freed (and DEBUG_PAGEALLOC unmapped) before
reaching migration's copy.
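
For context on the mechanism Hugh refers to: with CONFIG_DEBUG_PAGEALLOC,
the page allocator unmaps pages from the kernel linear mapping when they
are freed and maps them back on allocation, so a stale access faults
immediately instead of silently corrupting memory. A rough sketch of the
idea (not the exact kernel code; kernel_map_pages() is the real hook the
allocator uses):

/* Freed pages get unmapped, so a later use-after-free access (such as
 * copy_page() on a stale destination) takes a paging fault right away. */
static void debug_pagealloc_unmap_on_free(struct page *page, int numpages)
{
	kernel_map_pages(page, numpages, 0);	/* unmap */
}

static void debug_pagealloc_map_on_alloc(struct page *page, int numpages)
{
	kernel_map_pages(page, numpages, 1);	/* map back */
}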


The fault address is passed in x0, which is dest in the implementation of
copy_page, so it is the destination page.



Needs more debugging, I'm afraid: is it reproducible?


Yes, as long as I enable those two PAGEALLOC debug options, I can get the
panic once I run LTP. But it is not caused directly by any specific LTP
test case; the panic happens randomly while LTP is running.


Your ping on the crash in release_freepages() reminded me to take another
look at this one.  And found that I only needed to enable DEBUG_PAGEALLOC
and run LTP to get it on x86_64 too, as you suspected.

It's another of those compaction errors, in mmotm and linux-next of a
week or two ago, whose patch has since been withdrawn (but huge tmpfs
has also been withdrawn for now, so you're right to stick with the
older linux-next for testing it).


Yes, I saw the discussion at LSF/MM 2016, and the patches are gone from my
latest linux-next update. I will stick to 20160420 for the huge tmpfs
testing.




I believe the patch below fixes it; but I've not done full diligence
on 

Re: [PATCH] mm: move huge_pmd_set_accessed out of huge_memory.c

2016-04-29 Thread Shi, Yang

On 4/22/2016 2:48 AM, Kirill A. Shutemov wrote:

On Thu, Apr 21, 2016 at 03:56:07PM -0700, Shi, Yang wrote:

On 4/21/2016 12:30 AM, Kirill A. Shutemov wrote:

On Wed, Apr 20, 2016 at 02:00:11PM -0700, Shi, Yang wrote:

Hi folks,

I didn't realize that pmd_* functions are protected by
CONFIG_TRANSPARENT_HUGEPAGE on most architectures before I made this
change.

Before I fix all the affected architectures' code, I want to check whether
you guys think this change is worthwhile.

Thanks,
Yang

On 4/20/2016 11:24 AM, Yang Shi wrote:

huge_pmd_set_accessed is only called by __handle_mm_fault from memory.c,
so move the definition to memory.c and make it static, like create_huge_pmd
and wp_huge_pmd.

Signed-off-by: Yang Shi 


On the pte side we have the same functionality open-coded. Should we do
the same for pmd? Or change the pte side the same way?


Sorry, I don't quite understand you. Do you mean pte_* functions?


See handle_pte_fault(), we do the same for pte there what
huge_pmd_set_accessed() does for pmd.
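
For reference, the pte-side logic in handle_pte_fault() that Kirill points
to looks roughly like this, shown here as what an abstracted helper might
look like. The helper name is hypothetical and the body paraphrases that
era's mm/memory.c rather than quoting it exactly:

static void pte_set_accessed(struct vm_area_struct *vma,
			     unsigned long address, pte_t *pte,
			     pte_t entry, unsigned int flags)
{
	/* Mark the pte young; if the access flags actually changed,
	 * update the MMU cache, mirroring huge_pmd_set_accessed(). */
	entry = pte_mkyoung(entry);
	if (ptep_set_access_flags(vma, address, pte, entry,
				  flags & FAULT_FLAG_WRITE))
		update_mmu_cache(vma, address, pte);
}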


Thanks for directing to this code.



I think we should be consistent here: either both are abstructed into
functions or both open-coded.


I suppose functions sound better. However, do_wp_page() has to be called
with the pte lock acquired, so the abstracted function would have to
handle that.


Thanks,
Yang








[BUG] Null pointer dereference when freeing pages on 4.6-rc6

2016-05-05 Thread Shi, Yang

Hi folks,

When I enable the below kernel configs on 4.6-rc6, I came across a null
pointer dereference issue during the boot stage.


CONFIG_SPARSEMEM
CONFIG_DEFERRED_STRUCT_PAGE_INIT
CONFIG_DEBUG_PAGEALLOC
CONFIG_PAGE_EXTENSION
CONFIG_DEBUG_VM


The splat is:

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [] page_is_buddy+0x7b/0xe0
PGD 0
Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 3 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc6 #8
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS 
S5500.86B.01.10.0025.030220091519 03/02/2009

task: 88017c1d0040 ti: 88017c1d4000 task.ti: 88017c1d4000
RIP: 0010:[]  [] page_is_buddy+0x7b/0xe0
RSP: :88017c1d7bf0  EFLAGS: 00010046
RAX:  RBX: ea0019810040 RCX: 
RDX:  RSI: ea0019810040 RDI: 00660401
RBP: 88017c1d7c08 R08: 0001 R09: 
R10: 01af R11: 0001 R12: ea001981
R13:  R14: 0009 R15: ea001981
FS:  () GS:88066cc4() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 02406000 CR4: 06e0
Stack:
 1981  88066cfe6080 88017c1d7c70
 8118bfea 000a 1600 0001
 0401 ea0019810040 0400 88066cc5aca8
Call Trace:
 [] __free_one_page+0x23a/0x450
 [] free_pcppages_bulk+0x136/0x360
 [] free_hot_cold_page+0x168/0x1b0
 [] __free_pages+0x5c/0x90
 [] __free_pages_boot_core.isra.70+0x11a/0x14d
 [] deferred_free_range+0x50/0x62
 [] deferred_init_memmap+0x220/0x3c3
 [] ? deferred_free_range+0x62/0x62
 [] kthread+0xf8/0x110
 [] ret_from_fork+0x22/0x40
 [] ? kthread_create_on_node+0x200/0x200
Code: 75 7b 48 89 d8 8b 40 1c 85 c0 74 50 48 c7 c6 38 bd 0d 82 48 89 df 
e8 25 e2 02 00 0f 0b 48 89 f7 89 55 ec e8 18 cb 07 00 8b 55 ec <48> 8b 
00 a8 02 74 9d 3b 53 30 75 98 49 8b 14 24 48 8b 03 48 c1

RIP  [] page_is_buddy+0x7b/0xe0
 RSP 
CR2: 
---[ end trace e0c05a86b43d97f9 ]---
note: pgdatinit1[106] exited with preempt_count 1


I changed page_is_buddy and __free_one_page to non-inline to get a more
accurate stack trace.



Then I did some investigation by printing the addresses of page and
buddy; please see the log below:


@@__free_one_page:715: page is at ea0005f05c00 buddy is at 
ea0005f05c80, order is 1
@@__free_one_page:715: page is at ea0005f05c00 buddy is at 
ea0005f05d00, order is 2
@@__free_one_page:715: page is at ea0005f05c00 buddy is at 
ea0005f05e00, order is 3
@@__free_one_page:715: page is at ea001981 buddy is at 
ea0019810040, order is 0


call trace splat

@@__free_one_page:715: page is at ea0005f05bc0 buddy is at 
ea0005f05b80, order is 0
@@__free_one_page:715: page is at ea0005f05b80 buddy is at 
ea0005f05bc0, order is 0
@@__free_one_page:715: page is at ea0005f05b80 buddy is at 
ea0005f05b00, order is 1
@@__free_one_page:715: page is at ea0005f05b40 buddy is at 
ea0005f05b00, order is 0
@@__free_one_page:715: page is at ea0005f05b00 buddy is at 
ea0005f05b40, order is 0


It shows that just before the call trace splat, the page address jumped to
ea001981 from ea0005f05xxx; I am not sure why this is happening.
Any hint is appreciated.


And, reading the code leads me to the below call path:

page_is_buddy()
--> page_is_guard()
--> lookup_page_ext()

Then lookup_page_ext() just returns null due to the below code:

#if defined(CONFIG_DEBUG_VM) || defined(CONFIG_PAGE_POISONING)
/*
 * The sanity checks the page allocator does upon freeing a
 * page can reach here before the page_ext arrays are
 * allocated when feeding a range of pages to the allocator
 * for the first time during bootup or memory hotplug.
 *
 * This check is also necessary for ensuring page poisoning
 * works as expected when enabled
 */
if (!section->page_ext)
return NULL;
#endif

So, according to the comment, it looks like there should be a WARN or BUG
if it returns NULL? Also, almost no code checks whether the returned
pointer is NULL after lookup_page_ext() is called.
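
A minimal sketch of the defensive pattern implied here for
lookup_page_ext() callers; this is the shape the later fixes in this
digest take, and the helper name is hypothetical:

static inline bool page_is_young_checked(struct page *page)
{
	struct page_ext *page_ext = lookup_page_ext(page);

	/* The page_ext arrays may not be allocated yet during early
	 * boot or memory hotplug; treat a missing entry as "flag not
	 * set" rather than dereferencing NULL. */
	if (unlikely(!page_ext))
		return false;

	return test_bit(PAGE_EXT_YOUNG, &page_ext->flags);
}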


Thanks,
Yang


Re: [PATCH] mm: slab: remove ZONE_DMA_FLAG

2016-05-09 Thread Shi, Yang

On 5/5/2016 4:49 AM, Michal Hocko wrote:

On Wed 04-05-16 10:01:37, Yang Shi wrote:

Now we have the IS_ENABLED helper to check whether a Kconfig option is
enabled or not, so ZONE_DMA_FLAG no longer sounds useful.

And the use of ZONE_DMA_FLAG in slab looks pointless according to the
comment [1] from Johannes Weiner, so remove it; ORing the passed-in flags
with the cache gfp flags is already done in kmem_getpages().

[1] https://lkml.org/lkml/2014/9/25/553
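
For context, IS_ENABLED() is the kernel macro that lets C code test a
Kconfig symbol at compile time, which is what makes a dedicated
ZONE_DMA_FLAG redundant. A sketch of the pattern (hypothetical usage,
not a quote from the patch):

/* IS_ENABLED(CONFIG_FOO) evaluates to 1 if CONFIG_FOO is 'y' or 'm'
 * and 0 otherwise, so the compiler can drop the dead branch. */
static gfp_t example_pick_flags(gfp_t flags)
{
	if (IS_ENABLED(CONFIG_ZONE_DMA))
		flags |= GFP_DMA;
	return flags;
}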


I haven't checked the patch but I have a formal suggestion.
lkml.org tends to break and forget, please use
http://lkml.kernel.org/r/$msg-id instead. In this case
http://lkml.kernel.org/r/20140925185047.ga21...@cmpxchg.org


Thanks for the suggestion. I will use the msg-id in later posts.

Regards,
Yang



Thanks!





Re: [v2 PATCH] mm: move page_ext_init after all struct pages are initialized

2016-05-20 Thread Shi, Yang

On 5/20/2016 6:16 AM, Michal Hocko wrote:

On Thu 19-05-16 15:13:26, Yang Shi wrote:
[...]

diff --git a/init/main.c b/init/main.c
index b3c6e36..2075faf 100644
--- a/init/main.c
+++ b/init/main.c
@@ -606,7 +606,6 @@ asmlinkage __visible void __init start_kernel(void)
initrd_start = 0;
}
 #endif
-   page_ext_init();
debug_objects_mem_init();
kmemleak_init();
setup_per_cpu_pageset();
@@ -1004,6 +1003,8 @@ static noinline void __init kernel_init_freeable(void)
sched_init_smp();

page_alloc_init_late();
+   /* Initialize page ext after all struct pages are initialized */
+   page_ext_init();

do_basic_setup();


I might be missing something but don't we have the same problem with
CONFIG_FLATMEM? page_ext_init_flatmem is called way earlier. Or
CONFIG_DEFERRED_STRUCT_PAGE_INIT is never enabled for CONFIG_FLATMEM?


Yes, CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on MEMORY_HOTPLUG which 
depends on SPARSEMEM. So, this config is not valid for FLATMEM at all.


Thanks,
Yang







Re: [PATCH] mm: page_is_guard return false when page_ext arrays are not allocated yet

2016-05-20 Thread Shi, Yang

On 5/19/2016 7:40 PM, Joonsoo Kim wrote:

2016-05-20 2:18 GMT+09:00 Shi, Yang :

On 5/18/2016 5:28 PM, Joonsoo Kim wrote:


Vlastiml, thanks for ccing me on original bug report.

On Wed, May 18, 2016 at 03:23:45PM -0700, Yang Shi wrote:


When enabling the below kernel configs:

CONFIG_DEFERRED_STRUCT_PAGE_INIT
CONFIG_DEBUG_PAGEALLOC
CONFIG_PAGE_EXTENSION
CONFIG_DEBUG_VM

kernel bootup may fail due to the following oops:

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [] free_pcppages_bulk+0x2d2/0x8d0
PGD 0
Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS
S5500.86B.01.10.0025.030220091519 03/02/2009
task: 88017c080040 ti: 88017c084000 task.ti: 88017c084000
RIP: 0010:[]  []
free_pcppages_bulk+0x2d2/0x8d0
RSP: :88017c087c48  EFLAGS: 00010046
RAX:  RBX:  RCX: 0001
RDX: 0980 RSI: 0080 RDI: 00660401
RBP: 88017c087cd0 R08: 0401 R09: 0009
R10: 88017c080040 R11: 000a R12: 0400
R13: ea001981 R14: ea0019810040 R15: 88066cfe6080
FS:  () GS:88066cd4()
knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 02406000 CR4: 06e0
Stack:
 88066cd5bbd8 88066cfe6640  
 001f001f 88066cd5bbe8 ea001981 8118f53e
 0009 0401 000a 0001
Call Trace:
 [] free_hot_cold_page+0x192/0x1d0
 [] __free_pages+0x5c/0x90
 [] __free_pages_boot_core+0x11a/0x14e
 [] deferred_free_range+0x50/0x62
 [] deferred_init_memmap+0x220/0x3c3
 [] ? setup_per_cpu_pageset+0x35/0x35
 [] kthread+0xf8/0x110
 [] ret_from_fork+0x22/0x40
 [] ? kthread_create_on_node+0x200/0x200
Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8
4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44
8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
RIP  [] free_pcppages_bulk+0x2d2/0x8d0
 RSP 
CR2: 

The problem is lookup_page_ext() returns NULL then page_is_guard() tried
to
access it in page freeing.

page_is_guard() depends on PAGE_EXT_DEBUG_GUARD bit of page extension
flag, but
freeing page might reach here before the page_ext arrays are allocated
when
feeding a range of pages to the allocator for the first time during
bootup or
memory hotplug.



Patch itself looks fine to me because I also found that this kind of
problem happens during memory hotplug. So, we need to fix more sites,
all callers of lookup_page_ext().



Yes, I agree. I will come up with a patch or a couple of patches to check
the return value of lookup_page_ext().



But, I'd like to know how your problem occurs during bootup.
debug_guardpage_enabled() is turned to 'enable' after page_ext is
initialized. Before that, page_is_guard() unconditionally returns
false so I think that the problem what you mentioned can't happen.

Could you check that when debug_guardpage_enabled() returns 'enable'
and init_section_page_ext() is called?



I think the problem is that I have CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled,
which defers some struct page initialization to the "pgdatinitX" kernel
threads in page_alloc_init_late(). But page_ext_init() is called before that,
so debug_guardpage_enabled() returns true while the page extension is not yet
allocated for the struct pages initialized by "pgdatinitX".


No. After page_ext_init(), it is ensured that all page extension is initialized.


It sounds like page_ext_init() should be called after page_alloc_init_late(),
or it should simply be incompatible with CONFIG_DEFERRED_STRUCT_PAGE_INIT.

I will try moving the init call around.


We need to investigate more. I guess the problem is introduced by
CONFIG_DEFERRED_STRUCT_PAGE_INIT. It makes pfn_to_nid() invalid until
page_alloc_init_late() is done. That is a big side-effect. If there is a
pfn walker that uses pfn_to_nid() between memmap_init_zone() and
page_alloc_init_late(), it also has the same problem. So, we need to
think about how to fix it more carefully.


Thanks for the analysis. I think you are correct, since pfn_to_nid()
depends on the memmap, which is not fully set up until
page_alloc_init_late() is done.


So, for such a use case early_pfn_to_nid() should be used.



Anyway, to make sure that my assumption is true, could you confirm that
the below change fixes your problem?


Yes, it does.



Thanks.

->8--
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 2d864e6..cac5dc9 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -391,7 +391,7 @@ void __init page_ext_init(void)
 * -pfn---
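
The archived diff above is cut off; the idea Joonsoo describes is to
resolve the node in page_ext_init()'s pfn walk with early_pfn_to_nid(),
which works before page_alloc_init_late() completes, instead of
pfn_to_nid(). Roughly (a sketch, not the exact patch):

	/* Inside page_ext_init()'s loop over pfns: */
	if (early_pfn_to_nid(pfn) != nid)	/* was: pfn_to_nid(pfn) */
		continue;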

Re: [v2 PATCH] mm: move page_ext_init after all struct pages are initialized

2016-05-23 Thread Shi, Yang

On 5/23/2016 12:31 AM, Michal Hocko wrote:

On Fri 20-05-16 08:41:09, Shi, Yang wrote:

On 5/20/2016 6:16 AM, Michal Hocko wrote:

On Thu 19-05-16 15:13:26, Yang Shi wrote:
[...]

diff --git a/init/main.c b/init/main.c
index b3c6e36..2075faf 100644
--- a/init/main.c
+++ b/init/main.c
@@ -606,7 +606,6 @@ asmlinkage __visible void __init start_kernel(void)
initrd_start = 0;
}
 #endif
-   page_ext_init();
debug_objects_mem_init();
kmemleak_init();
setup_per_cpu_pageset();
@@ -1004,6 +1003,8 @@ static noinline void __init kernel_init_freeable(void)
sched_init_smp();

page_alloc_init_late();
+   /* Initialize page ext after all struct pages are initialized */
+   page_ext_init();

do_basic_setup();


I might be missing something but don't we have the same problem with
CONFIG_FLATMEM? page_ext_init_flatmem is called way earlier. Or
CONFIG_DEFERRED_STRUCT_PAGE_INIT is never enabled for CONFIG_FLATMEM?


Yes, CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on MEMORY_HOTPLUG which
depends on SPARSEMEM. So, this config is not valid for FLATMEM at all.


Well
config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
depends on SPARSEMEM || X86_64_ACPI_NUMA
depends on ARCH_ENABLE_MEMORY_HOTPLUG

I wasn't really sure about X86_64_ACPI_NUMA dependency branch which
depends on X86_64 && NUMA && ACPI && PCI and that didn't sound like
SPARSEMEM only. If the FLATMEM shouldn't exist with


Actually, FLATMEM depends on !NUMA.


CONFIG_DEFERRED_STRUCT_PAGE_INIT can we make that explicit please?


Sure, it makes the condition clearer and more readable.

Thanks,
Yang







Re: [PATCH] mm: make CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on !FLATMEM explicitly

2016-05-23 Thread Shi, Yang

On 5/23/2016 11:22 AM, Michal Hocko wrote:

On Mon 23-05-16 09:54:31, Yang Shi wrote:

Per the suggestion from Michal Hocko [1], CONFIG_DEFERRED_STRUCT_PAGE_INIT
should be incompatible with FLATMEM, make this explicitly in Kconfig.


I guess the changelog could benefit from some clarification. What
do you think about the following:

"
DEFERRED_STRUCT_PAGE_INIT requires some ordering wrt other
initialization operations, e.g. page_ext_init has to happen after the
whole memmap is initialized properly. For SPARSEMEM this requires to
wait for page_alloc_init_late. Other memory models (e.g. flatmem) might
have different initialization layouts (page_ext_init_flatmem). Currently
DEFERRED_STRUCT_PAGE_INIT depends on MEMORY_HOTPLUG which in turn
depends on SPARSEMEM || X86_64_ACPI_NUMA
depends on ARCH_ENABLE_MEMORY_HOTPLUG

and X86_64_ACPI_NUMA depends on NUMA which in turn disable FLATMEM
memory model:
config ARCH_FLATMEM_ENABLE
def_bool y
depends on X86_32 && !NUMA

so FLATMEM is ruled out via dependency maze. Be explicit and disable
FLATMEM for DEFERRED_STRUCT_PAGE_INIT so that we do not reintroduce
subtle initialization bugs
"


Thanks a lot. It sounds way better. Will address in V2.

Yang





[1] http://lkml.kernel.org/r/20160523073157.gd2...@dhcp22.suse.cz

Signed-off-by: Yang Shi 
---
 mm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 2664c11..22fa818 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -649,6 +649,7 @@ config DEFERRED_STRUCT_PAGE_INIT
default n
depends on ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
depends on MEMORY_HOTPLUG
+   depends on !FLATMEM
help
  Ordinarily all struct pages are initialised during early boot in a
  single thread. On very large machines this can take a considerable
--
2.0.2






Re: [PATCH] mm: fix build problems from lookup_page_ext

2016-05-24 Thread Shi, Yang

Hi Arnd,

Thanks a lot for the patch. My bad, sorry for the inconvenience. I
overlooked that the specific page_idle change is for 32-bit only.

Also, my host compiler is old (still 4.8), so it might not catch the
warning. I will update my compiler.


Regards,
Yang


On 5/24/2016 3:08 AM, Arnd Bergmann wrote:

A patch for lookup_page_ext introduced several build errors and
warnings, e.g.

mm/page_owner.c: In function '__set_page_owner':
mm/page_owner.c:71:2: error: ISO C90 forbids mixed declarations and code 
[-Werror=declaration-after-statement]
include/linux/page_idle.h: In function 'set_page_young':
include/linux/page_idle.h:62:3: error: expected ')' before 'return'

This fixes all of them. Please fold into the original patch.

Signed-off-by: Arnd Bergmann 
Fixes: 38c4fffbad3c ("mm: check the return value of lookup_page_ext for all call 
sites")

diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 569c3a180625..fec40271339f 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -48,7 +48,7 @@ static inline bool page_is_young(struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);

-   if (unlikely(!page_ext)
+   if (unlikely(!page_ext))
return false;

return test_bit(PAGE_EXT_YOUNG, &page_ext->flags);
@@ -58,7 +58,7 @@ static inline void set_page_young(struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);

-   if (unlikely(!page_ext)
+   if (unlikely(!page_ext))
return;

set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
@@ -68,7 +68,7 @@ static inline bool test_and_clear_page_young(struct page 
*page)
 {
struct page_ext *page_ext = lookup_page_ext(page);

-   if (unlikely(!page_ext)
+   if (unlikely(!page_ext))
return false;

return test_and_clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
@@ -78,7 +78,7 @@ static inline bool page_is_idle(struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);

-   if (unlikely(!page_ext)
+   if (unlikely(!page_ext))
return false;

return test_bit(PAGE_EXT_IDLE, &page_ext->flags);
@@ -88,7 +88,7 @@ static inline void set_page_idle(struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);

-   if (unlikely(!page_ext)
+   if (unlikely(!page_ext))
return;

set_bit(PAGE_EXT_IDLE, &page_ext->flags);
@@ -98,7 +98,7 @@ static inline void clear_page_idle(struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);

-   if (unlikely(!page_ext)
+   if (unlikely(!page_ext))
return;

clear_bit(PAGE_EXT_IDLE, &page_ext->flags);
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 902e39813295..c6cda3e36212 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -65,9 +65,6 @@ void __set_page_owner(struct page *page, unsigned int order, 
gfp_t gfp_mask)
 {
struct page_ext *page_ext = lookup_page_ext(page);

-   if (unlikely(!page_ext))
-   return;
-
struct stack_trace trace = {
.nr_entries = 0,
.max_entries = ARRAY_SIZE(page_ext->trace_entries),
@@ -75,6 +72,9 @@ void __set_page_owner(struct page *page, unsigned int order, 
gfp_t gfp_mask)
.skip = 3,
};

+   if (unlikely(!page_ext))
+   return;
+
save_stack_trace(&trace);

page_ext->order = order;
@@ -111,12 +111,11 @@ void __copy_page_owner(struct page *oldpage, struct page 
*newpage)
 {
struct page_ext *old_ext = lookup_page_ext(oldpage);
struct page_ext *new_ext = lookup_page_ext(newpage);
+   int i;

if (unlikely(!old_ext || !new_ext))
return;

-   int i;
-
new_ext->order = old_ext->order;
new_ext->gfp_mask = old_ext->gfp_mask;
new_ext->nr_entries = old_ext->nr_entries;
@@ -204,11 +203,6 @@ err:
 void __dump_page_owner(struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);
-   if (unlikely(!page_ext)) {
-   pr_alert("There is not page extension available.\n");
-   return;
-   }
-
struct stack_trace trace = {
.nr_entries = page_ext->nr_entries,
.entries = &page_ext->trace_entries[0],
@@ -216,6 +210,11 @@ void __dump_page_owner(struct page *page)
gfp_t gfp_mask = page_ext->gfp_mask;
int mt = gfpflags_to_migratetype(gfp_mask);

+   if (unlikely(!page_ext)) {
+   pr_alert("There is not page extension available.\n");
+   return;
+   }
+
if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags)) {
pr_alert("page_owner info is not active (free page?)\n");
return;





Re: [PATCH] mm: page_is_guard return false when page_ext arrays are not allocated yet

2016-05-24 Thread Shi, Yang

On 5/23/2016 10:37 PM, Joonsoo Kim wrote:

On Fri, May 20, 2016 at 10:00:06AM -0700, Shi, Yang wrote:

On 5/19/2016 7:40 PM, Joonsoo Kim wrote:

2016-05-20 2:18 GMT+09:00 Shi, Yang :

On 5/18/2016 5:28 PM, Joonsoo Kim wrote:


Vlastiml, thanks for ccing me on original bug report.

On Wed, May 18, 2016 at 03:23:45PM -0700, Yang Shi wrote:


When enabling the below kernel configs:

CONFIG_DEFERRED_STRUCT_PAGE_INIT
CONFIG_DEBUG_PAGEALLOC
CONFIG_PAGE_EXTENSION
CONFIG_DEBUG_VM

kernel bootup may fail due to the following oops:

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [] free_pcppages_bulk+0x2d2/0x8d0
PGD 0
Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS
S5500.86B.01.10.0025.030220091519 03/02/2009
task: 88017c080040 ti: 88017c084000 task.ti: 88017c084000
RIP: 0010:[]  []
free_pcppages_bulk+0x2d2/0x8d0
RSP: :88017c087c48  EFLAGS: 00010046
RAX:  RBX:  RCX: 0001
RDX: 0980 RSI: 0080 RDI: 00660401
RBP: 88017c087cd0 R08: 0401 R09: 0009
R10: 88017c080040 R11: 000a R12: 0400
R13: ea001981 R14: ea0019810040 R15: 88066cfe6080
FS:  () GS:88066cd4()
knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 02406000 CR4: 06e0
Stack:
88066cd5bbd8 88066cfe6640  
001f001f 88066cd5bbe8 ea001981 8118f53e
0009 0401 000a 0001
Call Trace:
[] free_hot_cold_page+0x192/0x1d0
[] __free_pages+0x5c/0x90
[] __free_pages_boot_core+0x11a/0x14e
[] deferred_free_range+0x50/0x62
[] deferred_init_memmap+0x220/0x3c3
[] ? setup_per_cpu_pageset+0x35/0x35
[] kthread+0xf8/0x110
[] ret_from_fork+0x22/0x40
[] ? kthread_create_on_node+0x200/0x200
Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8
4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44
8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
RIP  [] free_pcppages_bulk+0x2d2/0x8d0
RSP 
CR2: 

The problem is lookup_page_ext() returns NULL then page_is_guard() tried
to
access it in page freeing.

page_is_guard() depends on PAGE_EXT_DEBUG_GUARD bit of page extension
flag, but
freeing page might reach here before the page_ext arrays are allocated
when
feeding a range of pages to the allocator for the first time during
bootup or
memory hotplug.



Patch itself looks fine to me because I also found that this kind of
problem happens during memory hotplug. So, we need to fix more sites,
all callers of lookup_page_ext().



Yes, I agree. I will come up with a patch or a couple of patches to check
the return value of lookup_page_ext().



But, I'd like to know how your problem occurs during bootup.
debug_guardpage_enabled() is turned to 'enable' after page_ext is
initialized. Before that, page_is_guard() unconditionally returns
false so I think that the problem what you mentioned can't happen.

Could you check that when debug_guardpage_enabled() returns 'enable'
and init_section_page_ext() is called?



I think the problem is that I have CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled,
which defers some struct page initialization to the "pgdatinitX" kernel
threads in page_alloc_init_late(). But page_ext_init() is called before that,
so debug_guardpage_enabled() returns true while the page extension is not yet
allocated for the struct pages initialized by "pgdatinitX".


No. After page_ext_init(), it is ensured that all page extension is initialized.


It sounds like page_ext_init() should be called after page_alloc_init_late(),
or it should simply be incompatible with CONFIG_DEFERRED_STRUCT_PAGE_INIT.

I will try moving the init call around.


We need to investigate more. I guess the problem is introduced by
CONFIG_DEFERRED_STRUCT_PAGE_INIT. It makes pfn_to_nid() invalid until
page_alloc_init_late() is done. That is a big side-effect. If there is a
pfn walker that uses pfn_to_nid() between memmap_init_zone() and
page_alloc_init_late(), it also has the same problem. So, we need to
think about how to fix it more carefully.


Thanks for the analysis. I think you are correct, since pfn_to_nid()
depends on the memmap, which is not fully set up until
page_alloc_init_late() is done.

So, for such a use case early_pfn_to_nid() should be used.



Anyway, to make sure that my assumption is true, could you confirm that
the below change fixes your problem?


Yes, it does.



Thanks.

->8--
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 2d864e6..cac5dc9 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -391,7 +391,7 @@ void __init pag

Re: [BUG] Null pointer dereference when freeing pages on 4.6-rc6

2016-05-18 Thread Shi, Yang

On 5/16/2016 5:44 AM, Vlastimil Babka wrote:

[+CC Joonsoo based on git blame]

On 05/05/2016 11:13 PM, Shi, Yang wrote:

Hi folks,

When I enable the below kernel configs on 4.6-rc6, I came across a null
pointer dereference issue during the boot stage.

CONFIG_SPARSEMEM
CONFIG_DEFERRED_STRUCT_PAGE_INIT
CONFIG_DEBUG_PAGEALLOC
CONFIG_PAGE_EXTENSION
CONFIG_DEBUG_VM


The splat is:

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [] page_is_buddy+0x7b/0xe0
PGD 0
Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 3 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc6 #8
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS
S5500.86B.01.10.0025.030220091519 03/02/2009
task: 88017c1d0040 ti: 88017c1d4000 task.ti: 88017c1d4000
RIP: 0010:[]  []
page_is_buddy+0x7b/0xe0
RSP: :88017c1d7bf0  EFLAGS: 00010046
RAX:  RBX: ea0019810040 RCX: 
RDX:  RSI: ea0019810040 RDI: 00660401
RBP: 88017c1d7c08 R08: 0001 R09: 
R10: 01af R11: 0001 R12: ea001981
R13:  R14: 0009 R15: ea001981
FS:  () GS:88066cc4()
knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 02406000 CR4: 06e0
Stack:
   1981  88066cfe6080 88017c1d7c70
   8118bfea 000a 1600 0001
   0401 ea0019810040 0400 88066cc5aca8
Call Trace:
   [] __free_one_page+0x23a/0x450
   [] free_pcppages_bulk+0x136/0x360
   [] free_hot_cold_page+0x168/0x1b0
   [] __free_pages+0x5c/0x90
   [] __free_pages_boot_core.isra.70+0x11a/0x14d
   [] deferred_free_range+0x50/0x62
   [] deferred_init_memmap+0x220/0x3c3
   [] ? deferred_free_range+0x62/0x62
   [] kthread+0xf8/0x110
   [] ret_from_fork+0x22/0x40
   [] ? kthread_create_on_node+0x200/0x200
Code: 75 7b 48 89 d8 8b 40 1c 85 c0 74 50 48 c7 c6 38 bd 0d 82 48 89 df
e8 25 e2 02 00 0f 0b 48 89 f7 89 55 ec e8 18 cb 07 00 8b 55 ec <48> 8b
00 a8 02 74 9d 3b 53 30 75 98 49 8b 14 24 48 8b 03 48 c1
RIP  [] page_is_buddy+0x7b/0xe0
   RSP 
CR2: 
---[ end trace e0c05a86b43d97f9 ]---
note: pgdatinit1[106] exited with preempt_count 1


I changed page_is_buddy and __free_one_page to non-inline to get a more
accurate stack trace.


Then I did some investigation by printing the addresses of page and
buddy; please see the log below:

@@__free_one_page:715: page is at ea0005f05c00 buddy is at
ea0005f05c80, order is 1
@@__free_one_page:715: page is at ea0005f05c00 buddy is at
ea0005f05d00, order is 2
@@__free_one_page:715: page is at ea0005f05c00 buddy is at
ea0005f05e00, order is 3
@@__free_one_page:715: page is at ea001981 buddy is at
ea0019810040, order is 0

call trace splat

@@__free_one_page:715: page is at ea0005f05bc0 buddy is at
ea0005f05b80, order is 0
@@__free_one_page:715: page is at ea0005f05b80 buddy is at
ea0005f05bc0, order is 0
@@__free_one_page:715: page is at ea0005f05b80 buddy is at
ea0005f05b00, order is 1
@@__free_one_page:715: page is at ea0005f05b40 buddy is at
ea0005f05b00, order is 0
@@__free_one_page:715: page is at ea0005f05b00 buddy is at
ea0005f05b40, order is 0

It shows that just before the call trace splat, the page address jumped to
ea001981 from ea0005f05xxx; I am not sure why this is happening.
Any hint is appreciated.


I think the page address didn't jump, it was that the previous call of
__free_one_page() finished at order=3 and now another one was invoked
that started at order=0 page ea001981 and immediately hit the
splat.


And, reading the code leads me to the below call path:

page_is_buddy()
--> page_is_guard()
--> lookup_page_ext()


From a quick decodecode I guess that's indeed the path that's crashed.
You could try verifying with e.g. addr2line on your vmlinux?


I tracked down there via gdb.




Then lookup_page_ext() just returns null due to the below code:

#if defined(CONFIG_DEBUG_VM) || defined(CONFIG_PAGE_POISONING)
  /*
   * The sanity checks the page allocator does upon freeing a
   * page can reach here before the page_ext arrays are
   * allocated when feeding a range of pages to the allocator
   * for the first time during bootup or memory hotplug.
   *
   * This check is also necessary for ensuring page poisoning
   * works as expected when enabled
   */
  if (!section->page_ext)
  return NULL;
#endif

So, according to the comment, it looks like there should be a WARN or BUG
if it returns NULL?


The comment reads like it's unlikely, but possible, so WARN/BUG doesn't
sound right.


And, almost no code

Re: [PATCH] mm: page_is_guard return false when page_ext arrays are not allocated yet

2016-05-19 Thread Shi, Yang

On 5/18/2016 5:28 PM, Joonsoo Kim wrote:

Vlastiml, thanks for ccing me on original bug report.

On Wed, May 18, 2016 at 03:23:45PM -0700, Yang Shi wrote:

When enabling the below kernel configs:

CONFIG_DEFERRED_STRUCT_PAGE_INIT
CONFIG_DEBUG_PAGEALLOC
CONFIG_PAGE_EXTENSION
CONFIG_DEBUG_VM

kernel bootup may fail due to the following oops:

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [] free_pcppages_bulk+0x2d2/0x8d0
PGD 0
Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS 
S5500.86B.01.10.0025.030220091519 03/02/2009
task: 88017c080040 ti: 88017c084000 task.ti: 88017c084000
RIP: 0010:[]  [] 
free_pcppages_bulk+0x2d2/0x8d0
RSP: :88017c087c48  EFLAGS: 00010046
RAX:  RBX:  RCX: 0001
RDX: 0980 RSI: 0080 RDI: 00660401
RBP: 88017c087cd0 R08: 0401 R09: 0009
R10: 88017c080040 R11: 000a R12: 0400
R13: ea001981 R14: ea0019810040 R15: 88066cfe6080
FS:  () GS:88066cd4() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 02406000 CR4: 06e0
Stack:
 88066cd5bbd8 88066cfe6640  
 001f001f 88066cd5bbe8 ea001981 8118f53e
 0009 0401 000a 0001
Call Trace:
 [] free_hot_cold_page+0x192/0x1d0
 [] __free_pages+0x5c/0x90
 [] __free_pages_boot_core+0x11a/0x14e
 [] deferred_free_range+0x50/0x62
 [] deferred_init_memmap+0x220/0x3c3
 [] ? setup_per_cpu_pageset+0x35/0x35
 [] kthread+0xf8/0x110
 [] ret_from_fork+0x22/0x40
 [] ? kthread_create_on_node+0x200/0x200
Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 
44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44 8b 5d c8 4c 8b 
45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
RIP  [] free_pcppages_bulk+0x2d2/0x8d0
 RSP 
CR2: 

The problem is that lookup_page_ext() returns NULL, and then page_is_guard() tries to
access it during page freeing.

page_is_guard() depends on the PAGE_EXT_DEBUG_GUARD bit of the page extension flags, but
page freeing might reach here before the page_ext arrays are allocated, when
feeding a range of pages to the allocator for the first time during bootup or
memory hotplug.


The patch itself looks fine to me, because I also found that this kind of
problem happens during memory hotplug. So, we need to fix more sites:
all callers of lookup_page_ext().


Yes, I agree. I will come up with a patch or a couple of patches to 
check the return value of lookup_page_ext().




But, I'd like to know how your problem occurs during bootup.
debug_guardpage_enabled() is turned to 'enabled' only after page_ext is
initialized. Before that, page_is_guard() unconditionally returns
false, so I think the problem you mentioned can't happen.

Could you check when debug_guardpage_enabled() starts returning 'enabled'
and when init_section_page_ext() is called?


I think the problem is that I have CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled, 
which defers some struct page initialization to the "pgdatinitX" kernel 
threads in page_alloc_init_late(). But page_ext_init() is called before it.


So debug_guardpage_enabled() returns true, but the page extension 
is not allocated yet for the struct pages initialized by "pgdatinitX".


It sounds like page_ext_init() should be called after page_alloc_init_late(), 
or it should just be incompatible with CONFIG_DEFERRED_STRUCT_PAGE_INIT.


I will try to move the init call around.
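
Roughly, the reorder in init/main.c would look like the below sketch (in
kernel_init_freeable(); illustrative, not the final patch):

	...
	page_alloc_init_late();
	/*
	 * Initialize page ext only after all struct pages are
	 * initialized, so the deferred pages get a valid extension too.
	 */
	page_ext_init();
	...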

Thanks,
Yang



And, the above comment may be stale, because it comes from when memcg used
this struct page extension functionality. Now memcg doesn't use it and
there have been some changes in this area, so I'm not sure that is still true.

Thanks.





Re: [PATCH] mm: move page_ext_init after all struct pages are initialized

2016-05-19 Thread Shi, Yang

On 5/19/2016 3:30 PM, Andrew Morton wrote:

On Thu, 19 May 2016 14:29:05 -0700 Yang Shi  wrote:


When DEFERRED_STRUCT_PAGE_INIT is enabled, only a subset of the memmap is
initialized at boot; the rest is initialized in parallel by starting a one-off
"pgdatinitX" kernel thread for each node X.

If page_ext_init is called before that, some pages will not have a valid extension,
so move page_ext_init() after it.



When fixing a bug, please fully describe the end-user impact
of that bug


The kernel ran into the below oops, which is the same as the oops reported 
in 
http://ozlabs.org/~akpm/mmots/broken-out/mm-page_is_guard-return-false-when-page_ext-arrays-are-not-allocated-yet.patch.


BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [] free_pcppages_bulk+0x2d2/0x8d0
PGD 0
Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS 
S5500.86B.01.10.0025.030220091519 03/02/2009

task: 88017c080040 ti: 88017c084000 task.ti: 88017c084000
RIP: 0010:[]  [] 
free_pcppages_bulk+0x2d2/0x8d0

RSP: :88017c087c48  EFLAGS: 00010046
RAX:  RBX:  RCX: 0001
RDX: 0980 RSI: 0080 RDI: 00660401
RBP: 88017c087cd0 R08: 0401 R09: 0009
R10: 88017c080040 R11: 000a R12: 0400
R13: ea001981 R14: ea0019810040 R15: 88066cfe6080
FS:  () GS:88066cd4() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 02406000 CR4: 06e0
Stack:
 88066cd5bbd8 88066cfe6640  
 001f001f 88066cd5bbe8 ea001981 8118f53e
 0009 0401 000a 0001
Call Trace:
 [] free_hot_cold_page+0x192/0x1d0
 [] __free_pages+0x5c/0x90
 [] __free_pages_boot_core+0x11a/0x14e
 [] deferred_free_range+0x50/0x62
 [] deferred_init_memmap+0x220/0x3c3
 [] ? setup_per_cpu_pageset+0x35/0x35
 [] kthread+0xf8/0x110
 [] ret_from_fork+0x22/0x40
 [] ? kthread_create_on_node+0x200/0x200
Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 
4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 
00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff

RIP  [] free_pcppages_bulk+0x2d2/0x8d0
 RSP 
CR2: 


I will add the oops info into the commit log in V2.

Thanks,
Yang







Re: [PATCH] mm: move page_ext_init after all struct pages are initialized

2016-05-19 Thread Shi, Yang

On 5/19/2016 4:21 PM, Andrew Morton wrote:

On Thu, 19 May 2016 15:35:15 -0700 "Shi, Yang"  wrote:


On 5/19/2016 3:30 PM, Andrew Morton wrote:

On Thu, 19 May 2016 14:29:05 -0700 Yang Shi  wrote:


When DEFERRED_STRUCT_PAGE_INIT is enabled, only a subset of the memmap is
initialized at boot; the rest is initialized in parallel by starting a one-off
"pgdatinitX" kernel thread for each node X.

If page_ext_init is called before that, some pages will not have a valid extension,
so move page_ext_init() after it.



When fixing a bug, please fully describe the end-user impact
of that bug


The kernel ran into the below oops, which is the same as the oops reported
in
http://ozlabs.org/~akpm/mmots/broken-out/mm-page_is_guard-return-false-when-page_ext-arrays-are-not-allocated-yet.patch.


So this patch makes
mm-page_is_guard-return-false-when-page_ext-arrays-are-not-allocated-yet.patch
obsolete?


Actually, no. Checking the return value of lookup_page_ext() is still 
needed. But the commit log needs to be amended, since that bootup oops 
won't happen anymore with this patch applied.


Thanks,
Yang







Re: [PATCH] Fixing compilation error from file include/linux/page_idle.h

2016-05-25 Thread Shi, Yang

On 5/25/2016 3:11 AM, shakilk1...@gmail.com wrote:

From: shakil khan 


Thanks for the patch; this issue has already been fixed by Arnd.

Yang



---
 include/linux/page_idle.h | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 569c3a1..fec4027 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -48,7 +48,7 @@ static inline bool page_is_young(struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);

-   if (unlikely(!page_ext)
+   if (unlikely(!page_ext))
return false;

return test_bit(PAGE_EXT_YOUNG, &page_ext->flags);
@@ -58,7 +58,7 @@ static inline void set_page_young(struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);

-   if (unlikely(!page_ext)
+   if (unlikely(!page_ext))
return;

set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
@@ -68,7 +68,7 @@ static inline bool test_and_clear_page_young(struct page 
*page)
 {
struct page_ext *page_ext = lookup_page_ext(page);

-   if (unlikely(!page_ext)
+   if (unlikely(!page_ext))
return false;

return test_and_clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
@@ -78,7 +78,7 @@ static inline bool page_is_idle(struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);

-   if (unlikely(!page_ext)
+   if (unlikely(!page_ext))
return false;

return test_bit(PAGE_EXT_IDLE, &page_ext->flags);
@@ -88,7 +88,7 @@ static inline void set_page_idle(struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);

-   if (unlikely(!page_ext)
+   if (unlikely(!page_ext))
return;

set_bit(PAGE_EXT_IDLE, &page_ext->flags);
@@ -98,7 +98,7 @@ static inline void clear_page_idle(struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);

-   if (unlikely(!page_ext)
+   if (unlikely(!page_ext))
return;

clear_bit(PAGE_EXT_IDLE, &page_ext->flags);





Re: [PATCH] mm: use early_pfn_to_nid in register_page_bootmem_info_node

2016-05-25 Thread Shi, Yang

On 5/25/2016 3:23 PM, Andrew Morton wrote:

On Wed, 25 May 2016 14:00:07 -0700 Yang Shi  wrote:


register_page_bootmem_info_node() is invoked in mem_init(), so it will be
called before page_alloc_init_late() if CONFIG_DEFERRED_STRUCT_PAGE_INIT
is enabled. But pfn_to_nid() depends on the memmap, which won't be fully set up
until page_alloc_init_late() is done, so replace pfn_to_nid() with
early_pfn_to_nid().


What are the runtime effects of this fix?


I didn't experience any problem without the fix. While working on the 
page_ext_init() fix (replacing with early_pfn_to_nid()), I added a printk 
before each pfn_to_nid() call to check which ones might be called before 
page_alloc_init_late(), and this one was caught.


From the code perspective, it doesn't look right, since 
register_page_bootmem_info_section() may miss some pfns when 
CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, just like the problem 
that happened in page_ext_init().
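
The change itself would be tiny; a sketch of the idea (the exact hunk may
differ):

	/* register section info */
	for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
		/*
		 * early_pfn_to_nid() resolves the node from the early
		 * node map instead of the (possibly still uninitialized)
		 * struct page, so it is safe before page_alloc_init_late().
		 */
		if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node))
			register_page_bootmem_info_section(pfn);
	}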


Thanks,
Yang








Re: [PATCH] mm: check the return value of lookup_page_ext for all call sites

2016-05-26 Thread Shi, Yang

On 5/25/2016 5:37 PM, Minchan Kim wrote:

On Tue, May 24, 2016 at 11:58:11AM +0900, Minchan Kim wrote:

On Mon, May 23, 2016 at 10:16:08AM -0700, Yang Shi wrote:

Per the discussion with Joonsoo Kim [1], we need to check the return value of
lookup_page_ext() for all call sites, since it might return NULL in some cases,
although it is unlikely, e.g. memory hotplug.

Tested with ltp with "page_owner=0".

[1] http://lkml.kernel.org/r/20160519002809.GA10245@js1304-P5Q-DELUXE

Signed-off-by: Yang Shi 


I didn't read the code in detail to see how the page_ext memory space is
allocated in boot code and memory hotplug, but to me it's not good
to check for NULL whenever we call lookup_page_ext.

A more dangerous thing is that page_ext is now used by optional features (i.e., not
critical for system stability), but if we want to use page_ext as
another important tool for the system in the future,
it could be a serious problem.

Can we put some page_ext hooks into memory-hotplug to guarantee
that the page_ext memory space is allocated together with the memmap space at the
same time? IOW, once any PFN walker finds a page is valid, page_ext
is valid too, so lookup_page_ext never returns NULL on a valid page
by design.

I hope we consider this direction, too.


Yang, Could you think about this?


Thanks a lot for the suggestion. Sorry for the late reply, I was busy 
preparing patches. I do agree this is a direction we should look into, 
but I haven't had time to think about it more deeply. I hope Joonsoo could 
chime in too, since he is the original author of page extension.




Moreover, your patch was broken, I think.
It doesn't work with !CONFIG_DEBUG_VM && !CONFIG_PAGE_POISONING, because
lookup_page_ext doesn't return NULL in that case.


Actually, I think the #ifdef should be removed if lookup_page_ext() can 
possibly return NULL. It doesn't make sense to return NULL only 
when DEBUG_VM is enabled; it should return NULL no matter which debug 
config is selected. If Joonsoo agrees with me, I'm going to come up with 
a patch to fix it.


Regards,
Yang





Thanks.


---
 include/linux/page_idle.h | 43 ---
 mm/page_alloc.c   |  6 ++
 mm/page_owner.c   | 27 +++
 mm/page_poison.c  |  8 +++-
 mm/vmstat.c   |  2 ++
 5 files changed, 78 insertions(+), 8 deletions(-)

diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index bf268fa..8f5d4ad 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -46,33 +46,62 @@ extern struct page_ext_operations page_idle_ops;

 static inline bool page_is_young(struct page *page)
 {
-   return test_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+   struct page_ext *page_ext;
+   page_ext = lookup_page_ext(page);
+   if (unlikely(!page_ext)
+   return false;
+
+   return test_bit(PAGE_EXT_YOUNG, &page_ext->flags);
 }

 static inline void set_page_young(struct page *page)
 {
-   set_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+   struct page_ext *page_ext;
+   page_ext = lookup_page_ext(page);
+   if (unlikely(!page_ext)
+   return;
+
+   set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
 }

 static inline bool test_and_clear_page_young(struct page *page)
 {
-   return test_and_clear_bit(PAGE_EXT_YOUNG,
- &lookup_page_ext(page)->flags);
+   struct page_ext *page_ext;
+   page_ext = lookup_page_ext(page);
+   if (unlikely(!page_ext)
+   return false;
+
+   return test_and_clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
 }

 static inline bool page_is_idle(struct page *page)
 {
-   return test_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+   struct page_ext *page_ext;
+   page_ext = lookup_page_ext(page);
+   if (unlikely(!page_ext)
+   return false;
+
+   return test_bit(PAGE_EXT_IDLE, &page_ext->flags);
 }

 static inline void set_page_idle(struct page *page)
 {
-   set_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+   struct page_ext *page_ext;
+   page_ext = lookup_page_ext(page);
+   if (unlikely(!page_ext)
+   return;
+
+   set_bit(PAGE_EXT_IDLE, &page_ext->flags);
 }

 static inline void clear_page_idle(struct page *page)
 {
-   clear_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+   struct page_ext *page_ext;
+   page_ext = lookup_page_ext(page);
+   if (unlikely(!page_ext)
+   return;
+
+   clear_bit(PAGE_EXT_IDLE, &page_ext->flags);
 }
 #endif /* CONFIG_64BIT */

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f8f3bfc..d27e8b9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -656,6 +656,9 @@ static inline void set_page_guard(struct zone *zone, struct 
page *page,
return;

page_ext = lookup_page_ext(page);
+   if (unlikely(!page_ext))
+   return;
+
__set_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flag

Re: [PATCH] arm64: kasan: instrument user memory access API

2016-05-27 Thread Shi, Yang

On 5/27/2016 4:02 AM, Andrey Ryabinin wrote:



On 05/26/2016 09:43 PM, Yang Shi wrote:

The upstream commit 1771c6e1a567ea0ba20a4ffe68a1419fd8ef
("x86/kasan: instrument user memory access API") added KASAN instrumentation to
the x86 user memory access API, so add such instrumentation to ARM64 too.

Tested by test_kasan module.

Signed-off-by: Yang Shi 
---
 arch/arm64/include/asm/uaccess.h | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)


Please, cover __copy_from_user() and __copy_to_user() too.
Unlike x86, your patch doesn't instrument these two.


I should have elaborated on this in my review. Yes, I did think about it, but 
unlike x86, __copy_to/from_user are implemented in asm code on ARM64. If 
I add kasan_check_read/write into them, I have to move the registers 
around to prepare the parameters for the kasan calls, then restore them 
after the call, for example the below code for __copy_to_user:


mov x9, x0
mov x10, x1
mov x11, x2
mov x0, x10
mov x1, x11
bl  kasan_check_read
mov x0, x9
mov x1, x10


So, I'm wondering whether it is worth it or not, since __copy_to/from_user are 
just called in a couple of places, i.e. sctp, a couple of drivers, etc., 
and are not used too much. Actually, I think some of them could be replaced 
by __copy_to/from_user_inatomic.


Any idea is appreciated.

Thanks,
Yang







Re: [PATCH] arm64: kasan: instrument user memory access API

2016-05-27 Thread Shi, Yang

On 5/27/2016 10:46 AM, Mark Rutland wrote:

On Fri, May 27, 2016 at 09:34:03AM -0700, Shi, Yang wrote:

On 5/27/2016 4:02 AM, Andrey Ryabinin wrote:



On 05/26/2016 09:43 PM, Yang Shi wrote:

The upstream commit 1771c6e1a567ea0ba20a4ffe68a1419fd8ef
("x86/kasan: instrument user memory access API") added KASAN instrumentation to
the x86 user memory access API, so add such instrumentation to ARM64 too.

Tested by test_kasan module.

Signed-off-by: Yang Shi 
---
arch/arm64/include/asm/uaccess.h | 18 --
1 file changed, 16 insertions(+), 2 deletions(-)


Please, cover __copy_from_user() and __copy_to_user() too.
Unlike x86, your patch doesn't instrument these two.


Argh, I missed those when reviewing. My bad.


I should have elaborated on this in my review. Yes, I did think about it,
but unlike x86, __copy_to/from_user are implemented in asm code on
ARM64. If I add kasan_check_read/write into them, I have to move the
registers around to prepare the parameters for the kasan calls, then
restore them after the call, for example the below code for
__copy_to_user:

mov x9, x0
mov x10, x1
mov x11, x2
mov x0, x10
mov x1, x11
bl  kasan_check_read
mov x0, x9
mov x1, x10


There's no need to alter the assembly.

Rename the functions (e.g. have __arch_raw_copy_from_user), and add
static inline wrappers in uaccess.h that do the kasan calls before
calling the assembly functions.

That gives the compiler the freedom to do the right thing, and avoids
horrible ifdeffery in the assembly code.


Thanks for the suggestion, will address in v2.
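
Presumably something along the lines of the below sketch (the
__arch_raw_copy_from_user name is illustrative, not necessarily what v2 will
use):

/* asm entry point, renamed from __copy_from_user */
extern unsigned long __arch_raw_copy_from_user(void *to,
		const void __user *from, unsigned long n);

static inline unsigned long __copy_from_user(void *to,
		const void __user *from, unsigned long n)
{
	kasan_check_write(to, n);	/* 'to' is kernel memory */
	return __arch_raw_copy_from_user(to, from, n);
}

(and the mirror image with kasan_check_read() on the source buffer for
__copy_to_user).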

Yang




So, I'm wondering whether it is worth it or not, since __copy_to/from_user
are just called in a couple of places, i.e. sctp, a couple of
drivers, etc., and are not used too much.


[mark@leverpostej:~/src/linux]% git grep -w __copy_to_user -- ^arch | wc -l
63
[mark@leverpostej:~/src/linux]% git grep -w __copy_from_user -- ^arch | wc -l
47

That's a reasonable number of callsites.

If we're going to bother adding this, it should be complete. So please
do update __copy_from_user and __copy_to_user.


Actually, I think some of them
could be replaced by __copy_to/from_user_inatomic.


Given the number of existing callers outside of arch code, I think we'll
get far more traction reworking the arm64 parts for now.

Thanks,
Mark.





Re: [PATCH] mm: check the return value of lookup_page_ext for all call sites

2016-05-27 Thread Shi, Yang

On 5/27/2016 1:11 AM, Minchan Kim wrote:

On Fri, May 27, 2016 at 03:08:39PM +0900, Joonsoo Kim wrote:

On Fri, May 27, 2016 at 02:14:32PM +0900, Minchan Kim wrote:

On Thu, May 26, 2016 at 04:15:28PM -0700, Shi, Yang wrote:

On 5/25/2016 5:37 PM, Minchan Kim wrote:

On Tue, May 24, 2016 at 11:58:11AM +0900, Minchan Kim wrote:

On Mon, May 23, 2016 at 10:16:08AM -0700, Yang Shi wrote:

Per the discussion with Joonsoo Kim [1], we need to check the return value of
lookup_page_ext() for all call sites, since it might return NULL in some cases,
although it is unlikely, e.g. memory hotplug.

Tested with ltp with "page_owner=0".

[1] http://lkml.kernel.org/r/20160519002809.GA10245@js1304-P5Q-DELUXE

Signed-off-by: Yang Shi 


I didn't read the code in detail to see how the page_ext memory space is
allocated in boot code and memory hotplug, but to me it's not good
to check for NULL whenever we call lookup_page_ext.

A more dangerous thing is that page_ext is now used by optional features (i.e., not
critical for system stability), but if we want to use page_ext as
another important tool for the system in the future,
it could be a serious problem.


Hello, Minchan.


Hi Joonsoo,



I wonder how pages that aren't managed by the kernel yet would cause a serious
problem. Until onlining, these pages are out of our scope. Any
information about them would be useless until they are actually
activated. I guess that returning NULL for those pages will not hurt
any functionality. Do you have any possible scenario where this causes a
serious problem?


I don't have any specific usecase now. That's why I said "in future".
And I don't want to argue whether there is a possible scenario or not
to make the feature useful, but if you want, I could write a novel.
One example that pops into my mind: xen, hv, and even memory_hotplug itself
might want to use page_ext to extend their functionality to hook
guest pages.

My opinion is that page_ext is an extension of struct page, so it would
be better to allow any operation on struct page without any limitation,
if we can do it. Whether it's useful or useless depends on the random
usecase, and we don't need to limit it that way from the beginning.

However, the current design allows deferred page_ext population, so any user
of page_ext should keep that in mind and should either have a fallback plan
or not use page_ext for those cases. If we decide to go this way through
discussion, at least we should make such a limitation clearer somewhere
while we are at it, maybe around the page_ext_operations->need comment.

My comment's point is that we should at least consider that approach. It's
worth discussing the pros and cons, what's best, and what makes us
hesitate if we can't do it.
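
For context, the ops structure referred to above is roughly the following
(from include/linux/page_ext.h of this era; quoted from memory, so double-check
the tree):

struct page_ext_operations {
	bool (*need)(void);	/* does this feature need page_ext at all? */
	void (*init)(void);
};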



And, allocating such memory space doesn't come for free. If someone
just adds the memory device and doesn't online it, this memory will be

Here goes several questions.
Cced hotplug guys

1.
If someone just adds the memory device without onlining it, can kernel code
return pfn_valid == true on the offlined pages?

2.
If so, it means the memmap for offline memory is already populated somewhere.
Where is the memmap allocated? Part of the offlined memory space, or other memory?

3. Could we allocate page_ext in part of the offline memory space so that
it doesn't consume online memory?


wasted. I don't know if there is such a usecase, but it's a possible
scenario.






Can we put some page_ext hooks into memory-hotplug to guarantee
that the page_ext memory space is allocated together with the memmap space at the
same time? IOW, once any PFN walker finds a page is valid, page_ext
is valid too, so lookup_page_ext never returns NULL on a valid page
by design.

I hope we consider this direction, too.


Yang, Could you think about this?


Thanks a lot for the suggestion. Sorry for the late reply, I was
busy preparing patches. I do agree this is a direction we should
look into, but I haven't had time to think about it more deeply. I hope
Joonsoo could chime in too, since he is the original author of page
extension.



Moreover, your patch was broken, I think.
It doesn't work with !CONFIG_DEBUG_VM && !CONFIG_PAGE_POISONING, because
lookup_page_ext doesn't return NULL in that case.


Actually, I think the #ifdef should be removed if lookup_page_ext()
can possibly return NULL. It doesn't make sense to return NULL
only when DEBUG_VM is enabled; it should return NULL no matter which
debug config is selected. If Joonsoo agrees with me, I'm going to
come up with a patch to fix it.


Agreed but let's wait for Minchan's response.


If we go this way, how do we guard against this race?


Thanks for pointing this out. It sounds reasonable. However, this should 
only be possible on 32 bit, since just the 32-bit version of 
page_is_idle() calls lookup_page_ext(); it doesn't do so on 64 bit.


And, such a race condition should exist regardless of whether DEBUG_VM is 
enabled or not, right?



Re: [PATCH] mm: check the return value of lookup_page_ext for all call sites

2016-05-27 Thread Shi, Yang

On 5/27/2016 1:02 PM, Andrew Morton wrote:

On Thu, 26 May 2016 16:15:28 -0700 "Shi, Yang"  wrote:



I hope we consider this direction, too.


Yang, Could you think about this?


Thanks a lot for the suggestion. Sorry for the late reply, I was busy on
preparing patches. I do agree this is a direction we should look into,
but I haven't got time to think about it deeper. I hope Joonsoo could
chime in too since he is the original author for page extension.



Even, your patch was broken, I think.
It doesn't work with !CONFIG_DEBUG_VM && !CONFIG_PAGE_POISONING because
lookup_page_ext doesn't return NULL in that case.


Actually, I think the #ifdef should be removed if lookup_page_ext() is
possible to return NULL. It sounds not make sense returning NULL only
when DEBUG_VM is enabled. It should return NULL no matter what debug
config is selected. If Joonsoo agrees with me I'm going to come up with
a patch to fix it.



I've lost the plot here.  What is the status of this patch?

Latest version:


Yes, this is the latest version. We are discussing some future 
optimization.


And, Minchan Kim pointed out a possible race condition which exists even 
before this patch. I proposed a quick fix; as long as they are happy with 
the fix, I will post it to the mailing list.


Thanks,
Yang



From: Yang Shi 
Subject: mm: check the return value of lookup_page_ext for all call sites

Per the discussion with Joonsoo Kim [1], we need to check the return value of
lookup_page_ext() for all call sites, since it might return NULL in some
cases, although it is unlikely, e.g. memory hotplug.

Tested with ltp with "page_owner=0".

[1] http://lkml.kernel.org/r/20160519002809.GA10245@js1304-P5Q-DELUXE

[a...@linux-foundation.org: fix build-breaking typos]
[a...@arndb.de: fix build problems from lookup_page_ext]
  Link: http://lkml.kernel.org/r/6285269.2CksypHdYp@wuerfel
[a...@linux-foundation.org: coding-style fixes]
Link: 
http://lkml.kernel.org/r/1464023768-31025-1-git-send-email-yang@linaro.org
Signed-off-by: Yang Shi 
Signed-off-by: Arnd Bergmann 
Cc: Joonsoo Kim 
Signed-off-by: Andrew Morton 
---

 include/linux/page_idle.h |   43 ++--
 mm/page_alloc.c   |6 +
 mm/page_owner.c   |   26 +
 mm/page_poison.c  |8 +-
 mm/vmstat.c   |2 +
 5 files changed, 77 insertions(+), 8 deletions(-)

diff -puN 
include/linux/page_idle.h~mm-check-the-return-value-of-lookup_page_ext-for-all-call-sites
 include/linux/page_idle.h
--- 
a/include/linux/page_idle.h~mm-check-the-return-value-of-lookup_page_ext-for-all-call-sites
+++ a/include/linux/page_idle.h
@@ -46,33 +46,62 @@ extern struct page_ext_operations page_i

 static inline bool page_is_young(struct page *page)
 {
-   return test_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+   struct page_ext *page_ext = lookup_page_ext(page);
+
+   if (unlikely(!page_ext))
+   return false;
+
+   return test_bit(PAGE_EXT_YOUNG, &page_ext->flags);
 }

 static inline void set_page_young(struct page *page)
 {
-   set_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+   struct page_ext *page_ext = lookup_page_ext(page);
+
+   if (unlikely(!page_ext))
+   return;
+
+   set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
 }

 static inline bool test_and_clear_page_young(struct page *page)
 {
-   return test_and_clear_bit(PAGE_EXT_YOUNG,
- &lookup_page_ext(page)->flags);
+   struct page_ext *page_ext = lookup_page_ext(page);
+
+   if (unlikely(!page_ext))
+   return false;
+
+   return test_and_clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
 }

 static inline bool page_is_idle(struct page *page)
 {
-   return test_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+   struct page_ext *page_ext = lookup_page_ext(page);
+
+   if (unlikely(!page_ext))
+   return false;
+
+   return test_bit(PAGE_EXT_IDLE, &page_ext->flags);
 }

 static inline void set_page_idle(struct page *page)
 {
-   set_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+   struct page_ext *page_ext = lookup_page_ext(page);
+
+   if (unlikely(!page_ext))
+   return;
+
+   set_bit(PAGE_EXT_IDLE, &page_ext->flags);
 }

 static inline void clear_page_idle(struct page *page)
 {
-   clear_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+   struct page_ext *page_ext = lookup_page_ext(page);
+
+   if (unlikely(!page_ext))
+   return;
+
+   clear_bit(PAGE_EXT_IDLE, &page_ext->flags);
 }
 #endif /* CONFIG_64BIT */

diff -puN 
mm/page_alloc.c~mm-check-the-return-value-of-lookup_page_ext-for-all-call-sites 
mm/page_alloc.c
--- 
a/mm/page_alloc.c~mm-check-the-return-value-of-lookup_page_ext-for-al

Re: [BUG linux-next] kernel NULL pointer dereference on linux-next-20160420

2016-04-26 Thread Shi, Yang

Did some preliminary investigation on this one.

The problem is that the cc->freepages list is empty but cc->nr_freepages >
0; it looks like the list and nr_freepages got out of sync somewhere.


Any hint is appreciated.
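
Meanwhile, a throwaway debug helper like the below sketch (illustrative only,
not a proposed patch) might help catch where the two go out of sync, called
before and after the suspect paths:

static void check_freepages_count(struct compact_control *cc)
{
	unsigned long n = 0;
	struct page *page;

	/* walk the freelist and compare against the cached count */
	list_for_each_entry(page, &cc->freepages, lru)
		n++;
	WARN(n != (unsigned long)cc->nr_freepages,
	     "list=%lu nr_freepages=%lu\n",
	     n, (unsigned long)cc->nr_freepages);
}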

Thanks,
Yang


On 4/21/2016 5:38 PM, Shi, Yang wrote:

Hi folks,

I did the below test with huge tmpfs on linux-next-20160420:

# mount -t tmpfs -o huge=1 tmpfs /mnt
# cd /mnt
Then clone the linux kernel.

Then I got the below bug; the same test works well on non-huge tmpfs.

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [] release_freepages+0x18/0xa0
PGD 0
Oops:  [#1] PREEMPT SMP
Modules linked in:
CPU: 6 PID: 110 Comm: kcompactd0 Not tainted
4.6.0-rc4-next-20160420-WR7.0.0.0_standard #4
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS
S5500.86B.01.10.0025.030220091519 03/02/2009
task: 880361708040 ti: 880361704000 task.ti: 880361704000
RIP: 0010:[]  []
release_freepages+0x18/0xa0
RSP: 0018:880361707cf8  EFLAGS: 00010282
RAX:  RBX: 88036ffde7c0 RCX: 0009
RDX: 1bf1 RSI: 000f RDI: 880361707dd0
RBP: 880361707d20 R08: 0007 R09: 1600
R10: 88036ffde7c0 R11:  R12: 
R13: 880361707dd0 R14: 880361707dc0 R15: 880361707db0
FS:  () GS:880363cc()
knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 02206000 CR4: 06e0
Stack:
  88036ffde7c0  1a00 880361707dc0
  880361707db0 880361707da0 8119f13d 81196239
   880361708040 0001 0010
Call Trace:
  [] compact_zone+0x55d/0x9f0
  [] ? fragmentation_index+0x19/0x70
  [] kcompactd_do_work+0x10f/0x230
  [] kcompactd+0x90/0x1e0
  [] ? wait_woken+0xa0/0xa0
  [] ? kcompactd_do_work+0x230/0x230
  [] kthread+0xdd/0x100
  [] ret_from_fork+0x22/0x40
  [] ? kthread_create_on_node+0x180/0x180
Code: c1 fa 06 31 f6 e8 a9 9b fd ff eb 98 0f 1f 80 00 00 00 00 66 66 66
66 90 55 48 89 e5 41 57 41 56 41 55 49 89 fd 41 54 53 48 8b 07 <48> 8b
10 48 8d 78 e0 49 39 c5 4c 8d 62 e0 74 70 49 be 00 00 00
RIP  [] release_freepages+0x18/0xa0
  RSP 
CR2: 
---[ end trace 855da7e142f7311f ]---




Re: [BUG] set_pte_at: racy dirty state clearing warning

2016-04-27 Thread Shi, Yang

On 4/21/2016 1:49 AM, Catalin Marinas wrote:

On Wed, Apr 20, 2016 at 04:01:39PM -0700, Shi, Yang wrote:

When I enable memory comact via

# echo 1 > /proc/sys/vm/compact_memory

I got the below WARNING:

set_pte_at: racy dirty state clearing: 0x006899371bd3 ->
0x006899371fd3
[ cut here ]
WARNING: CPU: 5 PID: 294 at ./arch/arm64/include/asm/pgtable.h:227
ptep_set_access_flags+0x138/0x1b8
Modules linked in:


Do you have this patch applied:

http://article.gmane.org/gmane.linux.ports.arm.kernel/492239

It's also queued into -next as commit 66dbd6e61a52.


No, but I just applied it, and it works.

Thanks,
Yang




My kernel has ARM64_HW_AFDBM enabled, but LS2085 is not ARMv8.1.

The code shows it just checks whether ARM64_HW_AFDBM is enabled, but
doesn't check whether the CPU really has that capability.

So, it might be better to have the capability checked at runtime?


The warnings are there to spot any incorrect uses of the pte accessors
even before you run on AF/DBM-capable hardware.





Re: [PATCH net-next] bpf: arm64: remove callee-save registers use for tmp registers

2016-05-16 Thread Shi, Yang

On 5/16/2016 4:45 PM, Z Lim wrote:

Hi Yang,

On Mon, May 16, 2016 at 4:09 PM, Yang Shi  wrote:

In the current implementation of the ARM64 eBPF JIT, R23 and R24 are used as
tmp registers, which are callee-saved registers. This leads to a variable size
of the JIT prologue and epilogue. The latest blinding constant change prefers a
constant size for the prologue and epilogue. The AAPCS reserves R9 ~ R15 as temp
registers, which need not be saved/restored during a function call. So, replace
R23 and R24 with R10 and R11, and remove the tmp_used flag.

CC: Zi Shen Lim 
CC: Daniel Borkmann 
Signed-off-by: Yang Shi 
---


Couple suggestions, but otherwise:
Acked-by: Zi Shen Lim 

1. Update the diagram. I think it should now be:

-* BPF fp register => -80:+-+ <= (BPF_FP)
+* BPF fp register => -64:+-+ <= (BPF_FP)


Nice catch. I forgot the stack diagram.



2. Add a comment in commit log along the lines of: this is an
optimization saving 2 instructions per jited BPF program.


Sure, will address in V2.
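
For reference, the core of the change is just the tmp register remap in the
bpf2a64 map, roughly (sketch; the exact hunk in V2 may differ):

	/* temporary registers for internal BPF JIT */
-	[TMP_REG_1] = A64_R(23),
-	[TMP_REG_2] = A64_R(24),
+	[TMP_REG_1] = A64_R(10),
+	[TMP_REG_2] = A64_R(11),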

Thanks,
Yang



Thanks :)

z


Apply on top of Daniel's blinding constant patchset.

 arch/arm64/net/bpf_jit_comp.c | 32 
 1 file changed, 4 insertions(+), 28 deletions(-)





Re: [PATCH 1/2] arm64: bpf: add 'store immediate' instruction

2015-11-12 Thread Shi, Yang

On 11/11/2015 4:39 AM, Will Deacon wrote:

On Wed, Nov 11, 2015 at 12:12:56PM +, Will Deacon wrote:

On Tue, Nov 10, 2015 at 06:45:39PM -0800, Z Lim wrote:

On Tue, Nov 10, 2015 at 2:41 PM, Yang Shi  wrote:

aarch64 doesn't have a native store-immediate instruction; such an operation


Actually, aarch64 does have "STR (immediate)". For arm64 JIT, we can
consider using it as an optimization.


Yes, I'd definitely like to see that in preference to moving via a
temporary register.


Wait a second, we're both talking rubbish here :) The STR (immediate)
form is referring to the addressing mode, whereas this patch wants to
store an immediate value to memory, which does need moving to a register
first.


Yes, the immediate means an immediate offset for the addressing index; it 
doesn't mean storing an immediate to memory.


I don't think any load-store architecture has a store-immediate instruction.
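
To illustrate the distinction (illustrative snippet):

	// "STR (immediate)": the immediate is an address offset, not the data
	str	w1, [x0, #8]	// *(u32 *)(x0 + 8) = w1

	// storing an immediate value still needs a temporary register
	mov	w1, #42
	str	w1, [x0, #8]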

Thanks,
Yang



So the original patch is fine.

Will





Re: [PATCH 2/2] arm64: bpf: make BPF prologue and epilogue align with ARM64 AAPCS

2015-11-13 Thread Shi, Yang

On 11/12/2015 7:28 PM, Z Lim wrote:

On Thu, Nov 12, 2015 at 1:57 PM, Yang Shi  wrote:


Save and restore FP/LR in BPF prog prologue and epilogue, save SP to FP
in prologue in order to get the correct stack backtrace.

However, the ARM64 JIT used FP (x29) as the eBPF fp register. FP is subject to
change during function calls, so it may cause the BPF prog stack base address
to change too.

Use x25 to replace FP as the BPF stack base register (fp). Since x25 is a
callee-saved register, it will stay intact during function calls.
It is initialized in the BPF prog prologue each time the BPF prog starts to run.
When the BPF prog exits, it can just be tossed.

So, the BPF stack layout looks like:

  high
  original A64_SP =>   0:+-+ BPF prologue
 | | FP/LR and callee saved registers
  BPF fp register => -64:+-+
 | |
 | ... | BPF prog stack
 | |
 | |
  current A64_SP/FP =>   +-+
 | |
 | ... | Function call stack
 | |
 +-+
   low



Yang, for stack unwinding to work, shouldn't it be something like the following?


Yes, thanks for catching this. v3 will be posted soon.

Yang



   | LR |
A64_FP => | FP |
   | .. |
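
In emit terms, that corresponds to something like the below (sketch, reusing
the JIT's existing helpers):

	/* save FP/LR, then make FP point at the new frame record */
	emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
	emit(A64_MOV(1, A64_FP, A64_SP), ctx);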





Re: [PATCH] arm64: reenable interrupt when handling ptrace breakpoint

2015-12-16 Thread Shi, Yang

On 12/16/2015 3:13 AM, Will Deacon wrote:

On Tue, Dec 15, 2015 at 04:18:08PM -0800, Yang Shi wrote:

The kernel just sends out a SIGTRAP signal when handling a ptrace breakpoint in
the debug exception, so it sounds safe to have interrupts enabled if they are not
disabled by the parent process.


Is this actually fixing an issue you're seeing, or did you just spot this?
Given that force_sig_info disable interrupts, I don't think this is really
worth doing.


I should have made more comments in the first place, sorry for the inconvenience.

I did run into some problems on -rt kernel with CRIU restore, please see 
the below kernel bug log:


BUG: sleeping function called from invalid context at 
/kernel-source/kernel/locking/rtmutex.c:917

in_atomic(): 0, irqs_disabled(): 128, pid: 551, name: test.sh
CPU: 5 PID: 551 Comm: test.sh Not tainted 4.1.13-rt13 #7
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Call trace:
[] dump_backtrace+0x0/0x128
[] show_stack+0x24/0x30
[] dump_stack+0x80/0xa0
[] ___might_sleep+0x128/0x1a0
[] rt_spin_lock+0x2c/0x40
[] force_sig_info+0xcc/0x210
[] brk_handler.part.2+0x6c/0x80
[] brk_handler+0xd8/0xe8
[] do_debug_exception+0x58/0xb8
Exception stack(0x80834b6e3e30 to 0x80834b6e3f50)
3e20:   004e6000 

3e40:   0044  d280  0007 

3e60: 0021    011a  00de 

3e80: 4ab9ef50 8083 00086324 8000 e587f780  000839b0 
8000
3ea0:   004e6000    0044 

3ec0: 6000  0015  aa3e6000  d280 

3ee0: 0007  0021     

3f00:     00de  0008 

3f20: 004eff00  004e4ff0  004f0490  e587f730 


3f40: 0046f508  0028 

It is because force_sig_info() calls spin_lock_irqsave(), which could sleep 
on an -rt kernel with irqs disabled.


However, it just happens in brk_handler in my test. But I saw that 
single_step has the same code path, so I expanded the change to single step too.


Since this is rt related, cc to rt mailing list too.




Signed-off-by: Yang Shi 
---
  arch/arm64/kernel/debug-monitors.c | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/arch/arm64/kernel/debug-monitors.c 
b/arch/arm64/kernel/debug-monitors.c
index 8aee3ae..90d70e4 100644
--- a/arch/arm64/kernel/debug-monitors.c
+++ b/arch/arm64/kernel/debug-monitors.c
@@ -239,6 +239,9 @@ static int single_step_handler(unsigned long addr, unsigned 
int esr,
return 0;

if (user_mode(regs)) {
+   if (interrupts_enabled(regs))
+   local_irq_enable();
+


My worry here is that we take an interrupt and, on the return path,
decide to reschedule due to CONFIG_PREEMPT. If we somehow end up back
in the debugger, I'm concerned that it could remove the breakpoint and
then later see an unexpected SIGTRAP from the child.

Having said that, I've failed to construct a non-racy scenario in which
that can happen, but I'm just really uncomfortable making this change
unless there's a real problem being solved.


The patch is inspired by similar code in other architectures, e.g. 
x86 and powerpc, which have a hardware debug exception to handle breakpoint 
and single step like arm64. And they have interrupts enabled in both 
breakpoint and single step. So, I supposed arm64 could do the same thing.


For the preempt case, it might be possible, but it sounds like an 
exception handling problem to me. Preemption should not be allowed in the 
debug exception (the current arm64 kernel allows it), and in the interrupt 
return path the code should check whether debug is on or not. If debug is on, 
preemption should just be skipped. Or we could disable preemption in the 
debug exception.


I also checked the handling in x86 and powerpc; they go different ways.

1. x86
Disables preemption in the IST exception, since it uses a per-CPU stack.

2. powerpc
Checks whether debug is on in the interrupt return path. Powerpc has DBCR0_IDM 
to indicate whether the processor is in debug mode.


For ARM64, I can't find such a bit. So, I may consider having preemption 
disabled.


Thanks,
Yang



Will





Re: [PATCH] writeback: initialize m_dirty to avoid compile warning

2015-11-18 Thread Shi, Yang

On 11/18/2015 1:53 AM, Arnd Bergmann wrote:

On Tuesday 17 November 2015 15:38:55 Andrew Morton wrote:

On Fri, 13 Nov 2015 10:26:41 -0800 Yang Shi  wrote:


When building the kernel with gcc 5.2, the below warning is raised:

mm/page-writeback.c: In function 'balance_dirty_pages.isra.10':
mm/page-writeback.c:1545:17: warning: 'm_dirty' may be used uninitialized in 
this function [-Wmaybe-uninitialized]
unsigned long m_dirty, m_thresh, m_bg_thresh;

m_dirty, m_thresh and m_bg_thresh are initialized in the "if (mdtc)" block,
so if mdtc is NULL, they won't be initialized before being used.
Initialize m_dirty to zero, and also initialize m_thresh and m_bg_thresh to
keep consistency.

They are used later in the if condition:
!mdtc || m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh)

If mdtc is NULL, dirty_freerun_ceiling will not be called at all, so the
initialization will not change any behavior other than just silencing the
compile warning.


Geeze I hate that warning.  gcc really could be a bit smarter about it
and this is such a case.


--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1542,7 +1542,7 @@ static void balance_dirty_pages(struct address_space 
*mapping,
   for (;;) {
   unsigned long now = jiffies;
   unsigned long dirty, thresh, bg_thresh;
- unsigned long m_dirty, m_thresh, m_bg_thresh;
+ unsigned long m_dirty = 0, m_thresh = 0, m_bg_thresh = 0;

   /*
* Unstable writes are a feature of certain networked


Adding runtime overhead to suppress a compile-time warning is Just
Wrong.

With gcc-4.4.4 the above patch actually reduces page-writeback.o's
.text by 36 bytes, lol.  With gcc-4.8.4 the patch saves 19 bytes.  No
idea what's going on there...


I've done tons of build tests and never got the warning for the variables
other than m_dirty, and that one also just with very few configurations
(e.g. ARM omap2plus_defconfig).


Yes, I just got the warning for m_dirty too. I just initialized m_thresh 
and m_bg_thresh to keep consistency (not sure if it is necessary). And 
I'm a little bit confused about why gcc just reports m_dirty but ignores the others.
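
The pattern gcc trips over reduces to something like this (illustrative; the
field name and threshold are placeholders):

static void example(struct dirty_throttle_control *mdtc)
{
	unsigned long m_dirty;	/* set only when mdtc != NULL */

	if (mdtc)
		m_dirty = mdtc->dirty;

	/*
	 * Safe: short-circuit evaluation means m_dirty is read only
	 * when mdtc != NULL, but gcc cannot always prove that the
	 * assignment branch dominates this use.
	 */
	if (!mdtc || m_dirty <= 100)
		;	/* throttle */
}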




How about initializing only m_dirty but not the others?


Fine to me.

Thanks,
Yang



Arnd





Re: [PATCH] writeback: initialize m_dirty to avoid compile warning

2015-11-18 Thread Shi, Yang

On 11/18/2015 10:11 AM, Tejun Heo wrote:

Hello,

On Tue, Nov 17, 2015 at 03:38:55PM -0800, Andrew Morton wrote:

--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1542,7 +1542,7 @@ static void balance_dirty_pages(struct address_space 
*mapping,
for (;;) {
unsigned long now = jiffies;
unsigned long dirty, thresh, bg_thresh;
-   unsigned long m_dirty, m_thresh, m_bg_thresh;
+   unsigned long m_dirty = 0, m_thresh = 0, m_bg_thresh = 0;

/*
 * Unstable writes are a feature of certain networked


Adding runtime overhead to suppress a compile-time warning is Just
Wrong.

With gcc-4.4.4 the above patch actually reduces page-writeback.o's
.text by 36 bytes, lol.  With gcc-4.8.4 the patch saves 19 bytes.  No
idea what's going on there...


And initializing locals in the above fashion can hide real bugs -
looky:


This was the main reason the code was structured the way it is.  If
cgroup writeback is not enabled, any derefs of mdtc variables should
trigger warnings.  Ugh... I don't know.  Compiler really should be
able to tell this much.


Thanks for the explanation. It sounds like a compiler problem.

If you think it is still good to silence the compile warning, maybe we 
could just assign it an insane value as Andrew suggested, maybe 
0xdeadbeef.


Thanks,
Yang



Thanks.





Re: [PATCH] writeback: initialize m_dirty to avoid compile warning

2015-11-18 Thread Shi, Yang

On 11/18/2015 10:33 AM, Tejun Heo wrote:

Hello,

On Wed, Nov 18, 2015 at 10:27:32AM -0800, Shi, Yang wrote:

This was the main reason the code was structured the way it is.  If
cgroup writeback is not enabled, any derefs of mdtc variables should
trigger warnings.  Ugh... I don't know.  Compiler really should be
able to tell this much.


Thanks for the explanation. It sounds like a compiler problem.

If you think it is still good to silence the compile warning, maybe we could


If this is gonna be a problem with new gcc versions, I don't think we
have any other options. :(


just assign it an insane value as Andrew suggested, maybe
0xdeadbeef.


I'd just keep it at zero.  Whatever we do, the effect is gonna be
difficult to track down - it's not gonna blow up in an obvious way.
Can you please add a comment tho explaining that this is to work
around compiler deficiency?


Sure.

Other than this, in v2 I will just initialize m_dirty, since the compiler 
only reports it as uninitialized.


Thanks,
Yang



Thanks.





Re: [PATCH] arm64: restore bogomips information in /proc/cpuinfo

2015-11-18 Thread Shi, Yang

On 11/18/2015 10:47 AM, Will Deacon wrote:

Hello,

On Wed, Nov 18, 2015 at 10:15:05AM -0800, Yang Shi wrote:

As Pavel Machek reported [1], some userspace applications depend on
the bogomips shown by /proc/cpuinfo.

Although there is much less legacy impact on aarch64 than on arm, it does
break libvirt.

Basically, this patch reverts commit 326b16db9f69fd0d279be873c6c00f88c0a4aad5
("arm64: delay: don't bother reporting bogomips in /proc/cpuinfo"), but with
some tweaks due to context changes.

[1] https://lkml.org/lkml/2015/1/4/132


I lost this argument last time around, so I won't re-tread that path this
time around. I do, however, have some comments on the patch.



Signed-off-by: Yang Shi 
---
  arch/arm64/kernel/cpuinfo.c | 5 +
  arch/arm64/kernel/smp.c | 7 ++-
  2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index 706679d..8d4ba77 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -30,6 +30,7 @@
  #include 
  #include 
  #include 
+#include 

  /*
   * In case the boot CPU is hotpluggable, we record its initial state and
@@ -112,6 +113,10 @@ static int c_show(struct seq_file *m, void *v)
 */
seq_printf(m, "processor\t: %d\n", i);

+   seq_printf(m, "BogoMIPS\t: %lu.%02lu\n\n",


This double newline makes /proc/cpuinfo look really odd. Can we just
have one, please?


+  loops_per_jiffy / (50UL/HZ),
+  loops_per_jiffy / (5000UL/HZ) % 100);
+
/*
 * Dump out the common processor features in a single line.
 * Userspace should read the hwcaps with getauxval(AT_HWCAP)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index b1adc51..1bed772 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -326,7 +326,12 @@ static void __init hyp_mode_check(void)

  void __init smp_cpus_done(unsigned int max_cpus)
  {
-   pr_info("SMP: Total of %d processors activated.\n", num_online_cpus());
+   unsigned long bogosum = loops_per_jiffy * num_online_cpus();
+
+   pr_info("SMP: Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
+   num_online_cpus(), bogosum / (50/HZ),
+   (bogosum / (5000/HZ)) % 100);


Can we drop this hunk? I don't see a pressing need to print this in
dmesg.


Sure, will solve them in v2.

Thanks,
Yang



With those two changes:

   Acked-by: Will Deacon 

I guess this needs Cc'ing to stable, too.

Thanks,

Will





Re: [PATCH] writeback: initialize m_dirty to avoid compile warning

2015-11-18 Thread Shi, Yang

On 11/18/2015 10:55 AM, Andrew Morton wrote:

On Wed, 18 Nov 2015 10:39:23 -0800 "Shi, Yang"  wrote:


On 11/18/2015 10:33 AM, Tejun Heo wrote:

Hello,

On Wed, Nov 18, 2015 at 10:27:32AM -0800, Shi, Yang wrote:

This was the main reason the code was structured the way it is.  If
cgroup writeback is not enabled, any derefs of mdtc variables should
trigger warnings.  Ugh... I don't know.  Compiler really should be
able to tell this much.


Thanks for the explanation. It sounds like a compiler problem.

If you think it is still good to cease the compile warning, maybe we could


If this is gonna be a problem with new gcc versions, I don't think we
have any other options. :(


just assign it to an insane value as what Andrew suggested, maybe
0xdeadbeef.


I'd just keep it at zero.  Whatever we do, the effect is gonna be
difficult to track down - it's not gonna blow up in an obvious way.
Can you please add a comment tho explaining that this is to work
around compiler deficiency?


Sure.

Other than this, in v2, I will just initialize m_dirty since compiler
just reports it is uninitialized.


gcc-4.4.4 and gcc-4.8.4 warn about all three variables.


It sounds like 5.x is smarter :-)



--- 
a/mm/page-writeback.c~writeback-initialize-m_dirty-to-avoid-compile-warning-fix
+++ a/mm/page-writeback.c
@@ -1542,7 +1542,9 @@ static void balance_dirty_pages(struct a
for (;;) {
unsigned long now = jiffies;
unsigned long dirty, thresh, bg_thresh;
-   unsigned long m_dirty = 0, m_thresh = 0, m_bg_thresh = 0;
+   unsigned long m_dirty = 0;  /* stop bogus uninit warnings */
+   unsigned long m_thresh = 0;
+   unsigned long m_bg_thresh = 0;


Still need v2?

Thanks,
Yang



/*
 * Unstable writes are a feature of certain networked
_





Re: [PATCH] arm64: bpf: fix buffer pointer

2015-11-18 Thread Shi, Yang

On 11/18/2015 12:56 AM, Zi Shen Lim wrote:

During code review, I noticed we were passing a bad buffer pointer
to bpf_load_pointer helper function called by jitted code.

Point to the buffer allocated by JIT, so we don't silently corrupt
other parts of the stack.

Signed-off-by: Zi Shen Lim 
---
  arch/arm64/net/bpf_jit_comp.c | 27 +--
  1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index d6a53ef..7cf032b 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -139,6 +139,12 @@ static inline int epilogue_offset(const struct jit_ctx 
*ctx)
  /* Stack must be multiples of 16B */
  #define STACK_ALIGN(sz) (((sz) + 15) & ~15)

+#define _STACK_SIZE \
+   (MAX_BPF_STACK \
++ 4 /* extra for skb_copy_bits buffer */)
+
+#define STACK_SIZE STACK_ALIGN(_STACK_SIZE)
+
  static void build_prologue(struct jit_ctx *ctx)
  {
const u8 r6 = bpf2a64[BPF_REG_6];
@@ -150,10 +156,6 @@ static void build_prologue(struct jit_ctx *ctx)
const u8 rx = bpf2a64[BPF_REG_X];
const u8 tmp1 = bpf2a64[TMP_REG_1];
const u8 tmp2 = bpf2a64[TMP_REG_2];
-   int stack_size = MAX_BPF_STACK;
-
-   stack_size += 4; /* extra for skb_copy_bits buffer */
-   stack_size = STACK_ALIGN(stack_size);

/*
 * BPF prog stack layout
@@ -165,12 +167,13 @@ static void build_prologue(struct jit_ctx *ctx)
 *| ... | callee saved registers
 *+-+
 *| | x25/x26
-* BPF fp register => -80:+-+
+* BPF fp register => -80:+-+ <= (BPF_FP)
 *| |
 *| ... | BPF prog stack
 *| |
-*| |
-* current A64_SP =>  +-+
+*+-+ <= (BPF_FP - MAX_BPF_STACK)
+*|RSVD | JIT scratchpad
+* current A64_SP =>  +-+ <= (BPF_FP - STACK_SIZE)
 *| |
 *| ... | Function call stack
 *| |
@@ -196,7 +199,7 @@ static void build_prologue(struct jit_ctx *ctx)
emit(A64_MOV(1, fp, A64_SP), ctx);

/* Set up function call stack */
-   emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);
+   emit(A64_SUB_I(1, A64_SP, A64_SP, STACK_SIZE), ctx);

/* Clear registers A and X */
emit_a64_mov_i64(ra, 0, ctx);
@@ -213,13 +216,9 @@ static void build_epilogue(struct jit_ctx *ctx)
const u8 fp = bpf2a64[BPF_REG_FP];
const u8 tmp1 = bpf2a64[TMP_REG_1];
const u8 tmp2 = bpf2a64[TMP_REG_2];
-   int stack_size = MAX_BPF_STACK;
-
-   stack_size += 4; /* extra for skb_copy_bits buffer */
-   stack_size = STACK_ALIGN(stack_size);

/* We're done with BPF stack */
-   emit(A64_ADD_I(1, A64_SP, A64_SP, stack_size), ctx);
+   emit(A64_ADD_I(1, A64_SP, A64_SP, STACK_SIZE), ctx);

/* Restore fs (x25) and x26 */
emit(A64_POP(fp, A64_R(26), A64_SP), ctx);
@@ -658,7 +657,7 @@ emit_cond_jmp:
return -EINVAL;
}
emit_a64_mov_i64(r3, size, ctx);
-   emit(A64_ADD_I(1, r4, fp, MAX_BPF_STACK), ctx);
+   emit(A64_SUB_I(1, r4, fp, STACK_SIZE), ctx);


Shouldn't it sub MAX_BPF_STACK?

If you sub STACK_SIZE here, the buffer pointer will point to the bottom of 
the reserved area.


Your stack layout change also shows this:

+*+-+ <= (BPF_FP - MAX_BPF_STACK)
+*|RSVD | JIT scratchpad
+* current A64_SP =>  +-+ <= (BPF_FP - STACK_SIZE)

Thanks,
Yang



emit_a64_mov_i64(r5, (unsigned long)bpf_load_pointer, ctx);
emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
emit(A64_MOV(1, A64_FP, A64_SP), ctx);





Re: [PATCH] arm64: bpf: fix buffer pointer

2015-11-18 Thread Shi, Yang

On 11/18/2015 1:41 PM, Z Lim wrote:

On Wed, Nov 18, 2015 at 1:07 PM, Shi, Yang  wrote:

On 11/18/2015 12:56 AM, Zi Shen Lim wrote:

 emit_a64_mov_i64(r3, size, ctx);
-   emit(A64_ADD_I(1, r4, fp, MAX_BPF_STACK), ctx);
+   emit(A64_SUB_I(1, r4, fp, STACK_SIZE), ctx);



Shouldn't it sub MAX_BPF_STACK?


No, if it's at (BPF_FP - MAX_BPF_STACK), we'll be writing into the BPF
stack area, which should only be used by the BPF program.


If you sub STACK_SIZE here, the buffer pointer will point to the bottom of the
reserved area.


Yes, that's the idea. The buffer is allocated in here. Right now we're
using this "reserved" space for this buffer only.


OK, I see. The buffer grows from low to high.

Thanks for the elaboration.

Acked-by: Yang Shi 

Yang





You stack layout change also shows this:

+*+-+ <= (BPF_FP - MAX_BPF_STACK)
+*|RSVD | JIT scratchpad
+* current A64_SP =>  +-+ <= (BPF_FP - STACK_SIZE)


Yes, this diagram reflects the code and intention.


Thanks for reviewing, we definitely need more of these :)





Re: [V2 PATCH] arm64: restore bogomips information in /proc/cpuinfo

2015-11-19 Thread Shi, Yang

On 11/19/2015 9:59 AM, Catalin Marinas wrote:

On Wed, Nov 18, 2015 at 10:48:55AM -0800, Yang Shi wrote:

As Pavel Machek reported [1], some userspace applications depend on
the bogomips shown by /proc/cpuinfo.

Although there is much less legacy impact on aarch64 than on arm, it does
break libvirt.

Basically, this patch reverts commit 326b16db9f69fd0d279be873c6c00f88c0a4aad5
("arm64: delay: don't bother reporting bogomips in /proc/cpuinfo"), but with
some tweaks due to context changes.

[1] https://lkml.org/lkml/2015/1/4/132

Acked-by: Will Deacon 
Cc:  #3.12+
Signed-off-by: Yang Shi 


Patch applied as a fix for stable, basically returning to the
pre-3.12 behaviour. If there is a need for some more useful information
here, it can be done as an additional patch, though without cc: stable.


Thanks. This approach sounds safer to userspace.

Yang



Thanks.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RESEND PATCH] arm64: bpf: add 'store immediate' instruction

2015-12-01 Thread Shi, Yang

On 11/30/2015 2:24 PM, Yang Shi wrote:

aarch64 doesn't have a native store-immediate instruction; such an operation
has to be implemented by the below instruction sequence:

Load immediate to register
Store register

Signed-off-by: Yang Shi 
CC: Zi Shen Lim 


I had an email exchange offline with Zi Shen Lim, since he is traveling and 
cannot send text-only mail; his reply is quoted below:


"I've given reviewed-by in response to original posting. Unless 
something has changed, feel free to add it."


Since nothing has changed, I added his reviewed-by.

Reviewed-by: Zi Shen Lim 

Thanks,
Yang


CC: Xi Wang 
---
This patch might be buried by the storm of the xadd discussion; however, it is
absolutely irrelevant to xadd, so I am resending the patch itself.

  arch/arm64/net/bpf_jit_comp.c | 20 +++-
  1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 6809647..49c1f1b 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -563,7 +563,25 @@ emit_cond_jmp:
case BPF_ST | BPF_MEM | BPF_H:
case BPF_ST | BPF_MEM | BPF_B:
case BPF_ST | BPF_MEM | BPF_DW:
-   goto notyet;
+   /* Load imm to a register then store it */
+   ctx->tmp_used = 1;
+   emit_a64_mov_i(1, tmp2, off, ctx);
+   emit_a64_mov_i(1, tmp, imm, ctx);
+   switch (BPF_SIZE(code)) {
+   case BPF_W:
+   emit(A64_STR32(tmp, dst, tmp2), ctx);
+   break;
+   case BPF_H:
+   emit(A64_STRH(tmp, dst, tmp2), ctx);
+   break;
+   case BPF_B:
+   emit(A64_STRB(tmp, dst, tmp2), ctx);
+   break;
+   case BPF_DW:
+   emit(A64_STR64(tmp, dst, tmp2), ctx);
+   break;
+   }
+   break;

/* STX: *(size *)(dst + off) = src */
case BPF_STX | BPF_MEM | BPF_W:





Re: [PATCH 1/7] trace/events: Add gup trace events

2015-12-01 Thread Shi, Yang

On 12/1/2015 3:56 PM, Steven Rostedt wrote:

On Tue,  1 Dec 2015 15:06:11 -0800
Yang Shi  wrote:


The page-faults event records invocations of handle_mm_fault, but the invocation
may come from do_page_fault or gup. In some use cases, a finer event count
may be needed, so add trace events support for:

__get_user_pages
__get_user_pages_fast
fixup_user_fault

Signed-off-by: Yang Shi 
---
  include/trace/events/gup.h | 77 ++
  1 file changed, 77 insertions(+)
  create mode 100644 include/trace/events/gup.h

diff --git a/include/trace/events/gup.h b/include/trace/events/gup.h
new file mode 100644
index 000..37d18f9
--- /dev/null
+++ b/include/trace/events/gup.h
@@ -0,0 +1,77 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM gup
+
+#if !defined(_TRACE_GUP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_GUP_H
+
+#include 
+#include 
+
+TRACE_EVENT(gup_fixup_user_fault,
+
+   TP_PROTO(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long address, unsigned int fault_flags),
+
+   TP_ARGS(tsk, mm, address, fault_flags),
+
+   TP_STRUCT__entry(
+   __array(char,   comm,   TASK_COMM_LEN   )


Why save the comm? The tracing infrastructure should keep track of that.


The code was modeled on kmem.h, which copies comm. If it is 
unnecessary, it can definitely be removed.





+   __field(unsigned long,  address )
+   ),
+
+   TP_fast_assign(
+   memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+   __entry->address = address;
+   ),
+
+   TP_printk("comm=%s address=%lx", __entry->comm, __entry->address)
+);
+
+TRACE_EVENT(gup_get_user_pages,
+
+   TP_PROTO(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long start, unsigned long nr_pages,
+   unsigned int gup_flags, struct page **pages,
+   struct vm_area_struct **vmas, int *nonblocking),
+
+   TP_ARGS(tsk, mm, start, nr_pages, gup_flags, pages, vmas, nonblocking),


Why so many arguments? Most are not used.


My understanding of TP_ARGS may not be right. Doesn't it require all the 
args defined by the function? If not, it could definitely be shrunk. 
Do we just need to keep the args used by TP_printk?


Thanks,
Yang



-- Steve


+
+   TP_STRUCT__entry(
+   __array(char,   comm,   TASK_COMM_LEN   )
+   __field(unsigned long,  start   )
+   __field(unsigned long,  nr_pages)
+   ),
+
+   TP_fast_assign(
+   memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+   __entry->start   = start;
+   __entry->nr_pages= nr_pages;
+   ),
+
+   TP_printk("comm=%s start=%lx nr_pages=%lu", __entry->comm, __entry->start, 
__entry->nr_pages)
+);
+
+TRACE_EVENT(gup_get_user_pages_fast,
+
+   TP_PROTO(unsigned long start, int nr_pages, int write,
+   struct page **pages),
+
+   TP_ARGS(start, nr_pages, write, pages),
+
+   TP_STRUCT__entry(
+   __field(unsigned long,  start   )
+   __field(unsigned long,  nr_pages)
+   ),
+
+   TP_fast_assign(
+   __entry->start   = start;
+   __entry->nr_pages= nr_pages;
+   ),
+
+   TP_printk("start=%lx nr_pages=%lu",  __entry->start, __entry->nr_pages)
+);
+
+#endif /* _TRACE_GUP_H */
+
+/* This part must be outside protection */
+#include 






Re: [PATCH 1/7] trace/events: Add gup trace events

2015-12-01 Thread Shi, Yang

On 12/1/2015 4:18 PM, Steven Rostedt wrote:

On Tue, 01 Dec 2015 16:07:44 -0800
"Shi, Yang"  wrote:


On 12/1/2015 3:56 PM, Steven Rostedt wrote:

On Tue,  1 Dec 2015 15:06:11 -0800
Yang Shi  wrote:


The page-faults event records invocations of handle_mm_fault, but the call
may come from do_page_fault or gup. In some use cases a finer-grained event
count may be needed, so add trace event support for:

__get_user_pages
__get_user_pages_fast
fixup_user_fault

Signed-off-by: Yang Shi 
---
   include/trace/events/gup.h | 77 
++
   1 file changed, 77 insertions(+)
   create mode 100644 include/trace/events/gup.h

diff --git a/include/trace/events/gup.h b/include/trace/events/gup.h
new file mode 100644
index 000..37d18f9
--- /dev/null
+++ b/include/trace/events/gup.h
@@ -0,0 +1,77 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM gup
+
+#if !defined(_TRACE_GUP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_GUP_H
+
+#include 
+#include 
+
+TRACE_EVENT(gup_fixup_user_fault,
+
+   TP_PROTO(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long address, unsigned int fault_flags),
+
+   TP_ARGS(tsk, mm, address, fault_flags),
+
+   TP_STRUCT__entry(
+   __array(char,   comm,   TASK_COMM_LEN   )


Why save the comm? The tracing infrastructure should keep track of that.


The code was modeled on kmem.h, which copies comm. If it is
unnecessary, it can definitely be removed.


Sometimes comm isn't that reliable. But really, the only tracepoint
that should record it is sched_switch, and sched_wakeup. With those
two, the rest of the trace points should be fine.






+   __field(unsigned long,  address )
+   ),
+
+   TP_fast_assign(
+   memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+   __entry->address = address;
+   ),
+
+   TP_printk("comm=%s address=%lx", __entry->comm, __entry->address)
+);
+
+TRACE_EVENT(gup_get_user_pages,
+
+   TP_PROTO(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long start, unsigned long nr_pages,
+   unsigned int gup_flags, struct page **pages,
+   struct vm_area_struct **vmas, int *nonblocking),
+
+   TP_ARGS(tsk, mm, start, nr_pages, gup_flags, pages, vmas, nonblocking),


Why so many arguments? Most are not used.


My understanding of TP_ARGS may not be right. Doesn't it require all the
args defined by the function? If not, it could definitely be shrunk.
Do we just need to keep the args used by TP_printk?


It only needs what is used by TP_fast_assign().
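
A minimal sketch of the trimmed event, passing only what TP_fast_assign()
consumes (illustrative; shown for the fixup_user_fault case):

	TRACE_EVENT(gup_fixup_user_fault,

		TP_PROTO(unsigned long address),

		TP_ARGS(address),

		TP_STRUCT__entry(
			__field(unsigned long,	address	)
		),

		TP_fast_assign(
			__entry->address = address;
		),

		/* print the faulting address recorded above */
		TP_printk("address=%lx", __entry->address)
	);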


Thanks, will fix them in V2.

Yang



-- Steve





Re: [PATCH V2 2/7] mm/gup: add gup trace points

2015-12-02 Thread Shi, Yang

On 12/2/2015 3:36 PM, Dave Hansen wrote:

On 12/02/2015 02:53 PM, Yang Shi wrote:

diff --git a/mm/gup.c b/mm/gup.c
index deafa2c..10245a4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -13,6 +13,9 @@
  #include 
  #include 

+#define CREATE_TRACE_POINTS
+#include 
+
  #include 
  #include 


This needs to be _the_ last thing that gets #included.  Otherwise, you
risk colliding with any other trace header that gets implicitly included
below.


Thanks for the suggestion, will move it to the last.




@@ -1340,6 +1346,8 @@ int __get_user_pages_fast(unsigned long start, int 
nr_pages, int write,
start, len)))
return 0;

+   trace_gup_get_user_pages_fast(start, nr_pages, write, pages);
+
/*
 * Disable interrupts.  We use the nested form as we can already have
 * interrupts disabled by get_futex_key.


It would be _really_ nice to be able to see return values from the
various gup calls as well.  Is that feasible?


I think it should be feasible. The kmem_cache_alloc trace event can show a 
return value, so I suppose the gup trace events should be able to do the 
same thing.
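
A rough sketch of what recording the return value could look like (the
event and field names here are illustrative, not the final form):

	TRACE_EVENT(gup_get_user_pages_ret,

		TP_PROTO(unsigned long start, unsigned long nr_pages, long ret),

		TP_ARGS(start, nr_pages, ret),

		TP_STRUCT__entry(
			__field(unsigned long,	start	)
			__field(unsigned long,	nr_pages)
			__field(long,		ret	)
		),

		TP_fast_assign(
			__entry->start	  = start;
			__entry->nr_pages = nr_pages;
			__entry->ret	  = ret;	/* pages pinned or -errno */
		),

		TP_printk("start=%lx nr_pages=%lu ret=%ld",
			  __entry->start, __entry->nr_pages, __entry->ret)
	);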


Regards,
Yang







Re: [PATCH V2 2/7] mm/gup: add gup trace points

2015-12-02 Thread Shi, Yang

On 12/2/2015 4:11 PM, Shi, Yang wrote:

On 12/2/2015 3:36 PM, Dave Hansen wrote:

On 12/02/2015 02:53 PM, Yang Shi wrote:

diff --git a/mm/gup.c b/mm/gup.c
index deafa2c..10245a4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -13,6 +13,9 @@
  #include 
  #include 

+#define CREATE_TRACE_POINTS
+#include 
+
  #include 
  #include 


This needs to be _the_ last thing that gets #included.  Otherwise, you
risk colliding with any other trace header that gets implicitly included
below.


Thanks for the suggestion, will move it to the last.




@@ -1340,6 +1346,8 @@ int __get_user_pages_fast(unsigned long start,
int nr_pages, int write,
  start, len)))
  return 0;

+trace_gup_get_user_pages_fast(start, nr_pages, write, pages);
+
  /*
   * Disable interrupts.  We use the nested form as we can
already have
   * interrupts disabled by get_futex_key.


It would be _really_ nice to be able to see return values from the
various gup calls as well.  Is that feasible?


I think it should be feasible. The kmem_cache_alloc trace event can show a
return value, so I suppose the gup trace events should be able to do the
same thing.


Just did a quick test, it is definitely feasible. Please check the below 
test log:


   trace-cmd-200   [000]99.221486: gup_get_user_pages: 
start=800ff0 nr_pages=1 ret=1
   trace-cmd-200   [000]99.223215: gup_get_user_pages: 
start=800fdb nr_pages=1 ret=1
   trace-cmd-200   [000]99.223298: gup_get_user_pages: 
start=800ed0 nr_pages=1 ret=1


nr_pages is the number of pages requested by gup; ret is the return 
value.


If nobody objects, I will add it in V3.

Regards,
Yang



Regards,
Yang









Re: [PATCH V2 2/7] mm/gup: add gup trace points

2015-12-03 Thread Shi, Yang

On 12/2/2015 8:13 PM, Steven Rostedt wrote:

On Wed, 2 Dec 2015 15:36:50 -0800
Dave Hansen  wrote:


On 12/02/2015 02:53 PM, Yang Shi wrote:

diff --git a/mm/gup.c b/mm/gup.c
index deafa2c..10245a4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -13,6 +13,9 @@
  #include 
  #include 

+#define CREATE_TRACE_POINTS
+#include 
+
  #include 
  #include 


This needs to be _the_ last thing that gets #included.  Otherwise, you
risk colliding with any other trace header that gets implicitly included
below.


Agreed.




@@ -1340,6 +1346,8 @@ int __get_user_pages_fast(unsigned long start, int 
nr_pages, int write,
start, len)))
return 0;

+   trace_gup_get_user_pages_fast(start, nr_pages, write, pages);
+
/*
 * Disable interrupts.  We use the nested form as we can already have
 * interrupts disabled by get_futex_key.


It would be _really_ nice to be able to see return values from the
various gup calls as well.  Is that feasible?


Only if you rewrite the functions to have a single return code path
that we can add a tracepoint too. Or have a wrapper function that gets


Yes. My preliminary test only covered the success case; gup can 
return an errno from a few different code paths.



called directly that calls these functions internally and the tracepoint
can trap the return value.


This would incur more changes in other subsystems (futex, kvm, etc.), so I'm 
not sure it is worth making such changes just to get the return value.



I can probably make function_graph tracer give return values, although
it will give a return value for void functions as well. And it may give
long long returns for int returns that may have bogus data in the
higher bits.


If the need for return values is not limited to gup, that approach 
sounds more reasonable.


Thanks,
Yang



-- Steve





Re: [PATCH V2 1/7] trace/events: Add gup trace events

2015-12-03 Thread Shi, Yang

On 12/2/2015 8:07 PM, Steven Rostedt wrote:

On Wed,  2 Dec 2015 14:53:27 -0800
Yang Shi  wrote:


The page-faults event records invocations of handle_mm_fault, but the call
may come from do_page_fault or gup. In some use cases a finer-grained event
count may be needed, so add trace event support for:

__get_user_pages
__get_user_pages_fast
fixup_user_fault

Signed-off-by: Yang Shi 
---
  include/trace/events/gup.h | 71 ++
  1 file changed, 71 insertions(+)
  create mode 100644 include/trace/events/gup.h

diff --git a/include/trace/events/gup.h b/include/trace/events/gup.h
new file mode 100644
index 000..03a4674
--- /dev/null
+++ b/include/trace/events/gup.h
@@ -0,0 +1,71 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM gup
+
+#if !defined(_TRACE_GUP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_GUP_H
+
+#include 
+#include 
+
+TRACE_EVENT(gup_fixup_user_fault,
+
+   TP_PROTO(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long address, unsigned int fault_flags),
+
+   TP_ARGS(tsk, mm, address, fault_flags),


Args that are added but not used by TP_fast_assign() will slow down the
code while tracing is enabled, as they still need to be passed in the
trace function call.


+
+   TP_STRUCT__entry(
+   __field(unsigned long,  address )
+   ),
+
+   TP_fast_assign(
+   __entry->address = address;
+   ),
+
+   TP_printk("address=%lx",  __entry->address)
+);
+
+TRACE_EVENT(gup_get_user_pages,
+
+   TP_PROTO(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long start, unsigned long nr_pages),
+
+   TP_ARGS(tsk, mm, start, nr_pages),


Here too but this is worse. See below.


+
+   TP_STRUCT__entry(
+   __field(unsigned long,  start   )
+   __field(unsigned long,  nr_pages)
+   ),
+
+   TP_fast_assign(
+   __entry->start   = start;
+   __entry->nr_pages= nr_pages;
+   ),
+
+   TP_printk("start=%lx nr_pages=%lu", __entry->start, __entry->nr_pages)
+);
+
+TRACE_EVENT(gup_get_user_pages_fast,
+
+   TP_PROTO(unsigned long start, int nr_pages, int write,
+   struct page **pages),
+
+   TP_ARGS(start, nr_pages, write, pages),


This and the above "gup_get_user_pages" have the same entry field,
assign and printk. They should be combined into a DECLARE_EVENT_CLASS()
and two DEFINE_EVENT()s. That will save on size as the
DECLARE_EVENT_CLASS() is the biggest part of each TRACE_EVENT().
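
A sketch of the suggested refactor (the class name is illustrative):

	DECLARE_EVENT_CLASS(gup_range,

		TP_PROTO(unsigned long start, unsigned long nr_pages),

		TP_ARGS(start, nr_pages),

		TP_STRUCT__entry(
			__field(unsigned long,	start	)
			__field(unsigned long,	nr_pages)
		),

		TP_fast_assign(
			__entry->start	  = start;
			__entry->nr_pages = nr_pages;
		),

		TP_printk("start=%lx nr_pages=%lu",
			  __entry->start, __entry->nr_pages)
	);

	/* Both events share the class above, so the heavyweight event
	 * code is generated only once.
	 */
	DEFINE_EVENT(gup_range, gup_get_user_pages,
		TP_PROTO(unsigned long start, unsigned long nr_pages),
		TP_ARGS(start, nr_pages)
	);

	DEFINE_EVENT(gup_range, gup_get_user_pages_fast,
		TP_PROTO(unsigned long start, unsigned long nr_pages),
		TP_ARGS(start, nr_pages)
	);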


Thanks for the suggestion, will fix them in V3.

Regards,
Yang



-- Steve



+
+   TP_STRUCT__entry(
+   __field(unsigned long,  start   )
+   __field(unsigned long,  nr_pages)
+   ),
+
+   TP_fast_assign(
+   __entry->start   = start;
+   __entry->nr_pages= nr_pages;
+   ),
+
+   TP_printk("start=%lx nr_pages=%lu",  __entry->start, __entry->nr_pages)
+);
+
+#endif /* _TRACE_GUP_H */
+
+/* This part must be outside protection */
+#include 






Re: [PATCH V2 2/7] mm/gup: add gup trace points

2015-12-03 Thread Shi, Yang

On 12/3/2015 11:06 AM, Steven Rostedt wrote:

On Thu, 03 Dec 2015 10:36:18 -0800
"Shi, Yang"  wrote:


called directly that calls these functions internally and the tracepoint
can trap the return value.


This will incur more changes in other subsystems (futex, kvm, etc), I'm
not sure if it is worth making such changes to get return value.


No, it wouldn't require any changes outside of this.

-long __get_user_pages(..)
+static long __get_user_pages_internal(..)
{
   [..]
}
+
+long __get_user_pages(..)
+{
+   long ret;
+   ret = __get_user_pages_internal(..);
+   trace_get_user_pages(.., ret);
+   return ret;
+}


Thanks for this. I just checked the fast version; it looks like it has a 
single return path, so this should only be needed for the slow version.







I can probably make function_graph tracer give return values, although
it will give a return value for void functions as well. And it may give
long long returns for int returns that may have bogus data in the
higher bits.


If the return value requirement is not limited to gup, the approach
sounds more reasonable.



Others have asked about it. Maybe I should do it.


If you are going to add return values in the common trace code, I won't do 
the gup-specific one in V3.


Thanks,
Yang



-- Steve





Re: [V2 PATCH] sparc64/gup: check address scope legitimacy

2015-12-03 Thread Shi, Yang

On 12/3/2015 12:38 PM, Sam Ravnborg wrote:

Hi Yang.

On Wed, Nov 25, 2015 at 02:45:43PM -0800, Yang Shi wrote:

Check whether the user address is accessible in the atomic version,
__get_user_pages_fast(), before walking the page table.
And check that end > start in get_user_pages_fast(); otherwise fall back to the
slow path.


Two different but related things in one patch is often a bad thing.
It would have been better to split it up.




Signed-off-by: Yang Shi 
---
Just found that the slow_irqon label was not defined; added it to avoid a compile error.

  arch/sparc/mm/gup.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 2e5c4fc..cf4fb47 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -173,6 +173,9 @@ int __get_user_pages_fast(unsigned long start, int 
nr_pages, int write,
addr = start;
len = (unsigned long) nr_pages << PAGE_SHIFT;
end = start + len;
+   if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+   (void __user *)start, len)))
+   return 0;

This change is not justified.
Why would we take the time to first do the access_ok() stuff.
If this had been an expensive operation then we had made this function
slower in the normal case ( assuming there were no access violations in the
normal case).
When I look at the implementation of access_ok() I get the impression that
this is not really a check we need.

access_ok() always returns 1.


Thanks for pointing it out. And I didn't notice that gup is only built 
for SPARC64; I thought it was built for both 64-bit and 32-bit.


A follow-up question: is there any reason sparc has a fast gup 
implementation only for 64-bit and not for 32-bit?







local_irq_save(flags);
pgdp = pgd_offset(mm, addr);
@@ -203,6 +206,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, 
int write,
addr = start;
len = (unsigned long) nr_pages << PAGE_SHIFT;
end = start + len;
+   if (end < start)
+   goto slow_irqon;


end can only be smaller than start if there is some overflow.
See how end is calculated just the line above.

This looks like a highly suspicious change.


I suppose this is meant to guard against overflow. I copied the code 
from other arches; actually, every arch has this check except sparc.
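
For illustration, a self-contained sketch of the wrap-around case the
check guards against (the start address is hypothetical):

	#include <stdio.h>

	int main(void)
	{
		/* A request starting on the last 4KB page of the address
		 * space, so start + len wraps past zero.
		 */
		unsigned long start = ~0UL & ~0xfffUL;
		unsigned long len = 2UL * 4096;		/* two pages */
		unsigned long end = start + len;	/* wraps: end < start */

		if (end < start)
			printf("wrap: end=%#lx < start=%#lx\n", end, start);
		return 0;
	}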


Thanks,
Yang



Sam





Re: [PATCH] arm64: add HAVE_LATENCYTOP_SUPPORT config

2015-12-14 Thread Shi, Yang

Hi folks,

I tried to enable latencytop for arm64 and came across this discussion. 
Is there any plan for when this will be merged into mainline? The 4.5 merge window?


Thanks,
Yang


On 11/10/2015 3:34 AM, Heiko Carstens wrote:

From: Will Deacon 
Date: Tue, 10 Nov 2015 11:10:04 +
Subject: [PATCH] Kconfig: remove HAVE_LATENCYTOP_SUPPORT

As illustrated by a3afe70b83fd ("[S390] latencytop s390 support."),
HAVE_LATENCYTOP_SUPPORT is defined by an architecture to advertise an
implementation of save_stack_trace_tsk.

However, as of 9212ddb5eada ("stacktrace: provide save_stack_trace_tsk()
weak alias") a dummy implementation is provided if STACKTRACE=y.
Given that LATENCYTOP already depends on STACKTRACE_SUPPORT and selects
STACKTRACE, we can remove HAVE_LATENCYTOP_SUPPORT altogether.

Signed-off-by: Will Deacon 
---
  arch/arc/Kconfig| 3 ---
  arch/arm/Kconfig| 5 -
  arch/metag/Kconfig  | 3 ---
  arch/microblaze/Kconfig | 3 ---
  arch/parisc/Kconfig | 3 ---
  arch/powerpc/Kconfig| 3 ---
  arch/s390/Kconfig   | 3 ---
  arch/sh/Kconfig | 3 ---
  arch/sparc/Kconfig  | 4 
  arch/unicore32/Kconfig  | 3 ---
  arch/x86/Kconfig| 3 ---
  lib/Kconfig.debug   | 1 -
  12 files changed, 37 deletions(-)


Acked-by: Heiko Carstens 







Re: [PATCH] arm64: restore bogomips information in /proc/cpuinfo

2015-11-25 Thread Shi, Yang

On 11/25/2015 7:16 AM, Nicolas Pitre wrote:

On Wed, 25 Nov 2015, Jon Masters wrote:


On 11/18/15, 1:15 PM, Yang Shi wrote:


As Pavel Machek reported [1], some userspace applications depend on the
bogomips value shown in /proc/cpuinfo.

Although the legacy impact on aarch64 is much smaller than on arm, it does
break libvirt.

Basically, this patch reverts commit
326b16db9f69fd0d279be873c6c00f88c0a4aad5
("arm64: delay: don't bother reporting bogomips in /proc/cpuinfo"), but with
some tweaks due to context changes.


On a total tangent, it would be ideal to (eventually) have something reported
in /proc/cpuinfo or dmesg during boot that does "accurately" map back to the
underlying core frequency (as opposed to the generic timer frequency). I have
seen almost countless silly situations in the industry (external to my own
organization) in which someone has taken a $VENDOR_X reference system that
they're not supposed to run benchmarks on, and they've done it anyway. But
usually on some silicon that's clocked multiples under what production would
be. Then silly rumors about performance get around because nobody can do
simple arithmetic and notice that they ought to have at least divided by some
factor.


Be my guest my friend.

According to the common wisdom, the bogomips reporting is completely
senseless at this point and no one should expect anything useful from
it. Therefore I attempted to rehabilitate some meaning into it given
that we just can't get rid of it either and it continues to cause
damage. You certainly saw where that has led me.


Or we could add a new field, e.g. "cpu MHz" like x86, and keep both 
in cpuinfo so that userspace can adopt it gradually?


Thanks,
Yang




Nicolas





Re: [PATCH] sparc64/gup: check address scope legitimacy

2015-11-25 Thread Shi, Yang

On 11/25/2015 4:26 PM, kbuild test robot wrote:

Hi Yang,

[auto build test ERROR on v4.4-rc2]
[also build test ERROR on next-20151124]

url:
https://github.com/0day-ci/linux/commits/Yang-Shi/sparc64-gup-check-address-scope-legitimacy/20151126-065342
config: sparc64-allnoconfig (attached as .config)
reproduce:
 wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
 chmod +x ~/bin/make.cross
 # save the attached .config to linux build tree
 make.cross ARCH=sparc64

All errors (new ones prefixed by >>):

arch/sparc/mm/gup.c: In function 'get_user_pages_fast':

arch/sparc/mm/gup.c:210:3: error: label 'slow_irqon' used but not defined

   goto slow_irqon;


Already fixed in v2.

Thanks,
Yang


   ^

vim +/slow_irqon +210 arch/sparc/mm/gup.c

204 
205 start &= PAGE_MASK;
206 addr = start;
207 len = (unsigned long) nr_pages << PAGE_SHIFT;
208 end = start + len;
209 if (end < start)
  > 210  goto slow_irqon;
211 
212 /*
213  * XXX: batch / limit 'nr', to avoid large irq off latency

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation





Re: [PATCH v3 2/7] mm/gup: add gup trace points

2015-12-08 Thread Shi, Yang

On 12/8/2015 12:25 PM, Steven Rostedt wrote:

On Tue,  8 Dec 2015 11:39:50 -0800
Yang Shi  wrote:


For the slow version, just add a trace point to the raw __get_user_pages,
since all slow variants ultimately call it to do the real work.

Signed-off-by: Yang Shi 
---
  mm/gup.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index deafa2c..44f05c9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -18,6 +18,9 @@

  #include "internal.h"

+#define CREATE_TRACE_POINTS
+#include 
+
  static struct page *no_page_table(struct vm_area_struct *vma,
unsigned int flags)
  {
@@ -462,6 +465,8 @@ long __get_user_pages(struct task_struct *tsk, struct 
mm_struct *mm,
if (!nr_pages)
return 0;

+   trace_gup_get_user_pages(start, nr_pages);
+
VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));

/*
@@ -599,6 +604,7 @@ int fixup_user_fault(struct task_struct *tsk, struct 
mm_struct *mm,
if (!(vm_flags & vma->vm_flags))
return -EFAULT;

+   trace_gup_fixup_user_fault(address);
ret = handle_mm_fault(mm, vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
@@ -1340,6 +1346,8 @@ int __get_user_pages_fast(unsigned long start, int 
nr_pages, int write,
start, len)))
return 0;

+   trace_gup_get_user_pages_fast(start, (unsigned long) nr_pages);


typecast shouldn't be needed. But I'm wondering, it would save space in
the ring buffer if we used unsigned int instead of long. Will nr_pages
ever be bigger than 4 billion?


The "unsigned long" comes from get_user_pages() definition, I'm not 
quite sure why "unsigned long" is used. The fast version uses int (I 
guess unsigned int sounds better since it will not go negative).


"unsigned int" could cover 0x pages (almost 16TB), it sounds 
good enough in the most use case to me. In my test, just 1 page is 
passed to nr_pages in the most cases.


Thanks,
Yang



-- Steve


+
/*
 * Disable interrupts.  We use the nested form as we can already have
 * interrupts disabled by get_futex_key.






Re: [PATCH v4 3/7] x86: mm/gup: add gup trace points

2015-12-09 Thread Shi, Yang

On 12/9/2015 1:07 PM, Steven Rostedt wrote:

On Wed,  9 Dec 2015 09:29:20 -0800
Yang Shi  wrote:


Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Signed-off-by: Yang Shi 
---
  arch/x86/mm/gup.c | 7 +++
  1 file changed, 7 insertions(+)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index ae9a37b..a96bcb7 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -12,6 +12,9 @@

  #include 

+#define CREATE_TRACE_POINTS
+#include >


First off, does the above even compile?

Second, you already created the tracepoints in mm/gup.c, why are you
creating them here again? CREATE_TRACE_POINTS must be defined only once
per events/.h file.


Sorry for that. The typo was introduced by a git amend without running a 
test build. The multiple definition of the trace points was not caught by 
my shaky test script.


Will fix them soon.

Thanks,
Yang



-- Steve


+
  static inline pte_t gup_get_pte(pte_t *ptep)
  {
  #ifndef CONFIG_X86_PAE
@@ -270,6 +273,8 @@ int __get_user_pages_fast(unsigned long start, int 
nr_pages, int write,
(void __user *)start, len)))
return 0;

+   trace_gup_get_user_pages_fast(start, nr_pages);
+
/*
 * XXX: batch / limit 'nr', to avoid large irq off latency
 * needs some instrumenting to determine the common sizes used by
@@ -373,6 +378,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, 
int write,
} while (pgdp++, addr = next, addr != end);
local_irq_enable();

+   trace_gup_get_user_pages_fast(start, nr_pages);
+
VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
return nr;







Re: [PATCH] ppc64: select HAVE_CONTEXT_TRACKING by default

2016-01-04 Thread Shi, Yang

Hi folks,

Any comment on this is appreciated.

Thanks,
Yang


On 12/9/2015 2:43 PM, Yang Shi wrote:

Context tracking functionality has been implemented for PPC64, and
HAVE_CONTEXT_TRACKING was selected by pseries by default.
Actually, it is applicable to all PPC64 platforms, so select it in the
generic PPC64 Kconfig.

NO_HZ_FULL depends on it, with this change NO_HZ_FULL could be enabled
for all PPC64 machines.

Signed-off-by: Yang Shi 
---
Following the instructions in Documentation/timers/NO_HZ.txt, I tested full nohz
on my FSL T2080 target; the trace log below shows it works well.

user_loop-574   [001] d..1   137.044892: tick_stop: success=yes msg=
user_loop-574   [001] d.h1   138.044880: hrtimer_expire_entry: 
hrtimer=c0007fd22d10 function=.tick_sched_timer now=137796002092
user_loop-574   [001] d.h1   139.044880: hrtimer_expire_entry: 
hrtimer=c0007fd22d10 function=.tick_sched_timer now=138796002129
user_loop-574   [001] d.h1   140.044880: hrtimer_expire_entry: 
hrtimer=c0007fd22d10 function=.tick_sched_timer now=139796002219
user_loop-574   [001] d.h1   141.044880: hrtimer_expire_entry: 
hrtimer=c0007fd22d10 function=.tick_sched_timer now=140796002229
user_loop-574   [001] d.h1   142.044879: hrtimer_expire_entry: 
hrtimer=c0007fd22d10 function=.tick_sched_timer now=141796002159
user_loop-574   [001] d.h1   143.044879: hrtimer_expire_entry: 
hrtimer=c0007fd22d10 function=.tick_sched_timer now=142796002142
user_loop-574   [001] d.h1   144.044878: hrtimer_expire_entry: 
hrtimer=c0007fd22d10 function=.tick_sched_timer now=143796002046
user_loop-574   [001] d.h1   145.044878: hrtimer_expire_entry: 
hrtimer=c0007fd22d10 function=.tick_sched_timer now=144796002189
user_loop-574   [001] d.h1   146.044878: hrtimer_expire_entry: 
hrtimer=c0007fd22d10 function=.tick_sched_timer now=145796002199
user_loop-574   [001] d.h1   147.044880: hrtimer_expire_entry: 
hrtimer=c0007fd22d10 function=.tick_sched_timer now=146796003836

It shows that the timer tick comes in just once per second.
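
For reference, per Documentation/timers/NO_HZ.txt, full nohz is typically
enabled by booting with something like the following (the CPU list is
illustrative; at least one housekeeping CPU must keep its tick):

	nohz_full=1-7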

The risk of problems on other PPC64 targets should be low, although they have
not been tested.

  arch/powerpc/platforms/Kconfig.cputype | 1 +
  arch/powerpc/platforms/pseries/Kconfig | 1 -
  2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index 142dff5..7b25dd1 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -3,6 +3,7 @@ config PPC64
default n
select HAVE_VIRT_CPU_ACCOUNTING
select ZLIB_DEFLATE
+   select HAVE_CONTEXT_TRACKING
help
  This option selects whether a 32-bit or a 64-bit kernel
  will be built.
diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index bec90fb..6853148 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -17,7 +17,6 @@ config PPC_PSERIES
select PPC_UDBG_16550
select PPC_NATIVE
select PPC_DOORBELL
-   select HAVE_CONTEXT_TRACKING
select HOTPLUG_CPU if SMP
select ARCH_RANDOM
select PPC_DOORBELL





Re: [PATCH] arm64: ubsan: select ARCH_HAS_UBSAN_SANITIZE_ALL

2016-02-08 Thread Shi, Yang

On 2/8/2016 3:46 AM, Mark Rutland wrote:

Hi,

On Fri, Feb 05, 2016 at 03:50:18PM -0800, Yang Shi wrote:

To enable UBSAN on arm64, ARCH_HAS_UBSAN_SANITIZE_ALL needs to be selected.

Basic kernel bootup test is passed on arm64 with CONFIG_UBSAN_SANITIZE_ALL
enabled.

Signed-off-by: Yang Shi 
---
  arch/arm64/Kconfig | 1 +
  1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 8cc6228..1c29e20 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -14,6 +14,7 @@ config ARM64
select ARCH_WANT_OPTIONAL_GPIOLIB
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
select ARCH_WANT_FRAME_POINTERS
+   select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARM_AMBA
select ARM_ARCH_TIMER
select ARM_GIC


I gave this a go, and I got a couple of splats (included below) when
booting an Ubuntu 14.04 arm64 rootfs. I'm using Linato 15.08 GCC 5.1 to
compile an arm64 defconfig, and I see the issue with v4.5-rc2 and
v4.5-rc3.

I will dig into that and file a report shortly, unless someone has
already reported the same issue.

Did you see any failures in your testing? For reference, which kernel
version, compiler, and config were you using?


Yes, I did. I'm using 4.5-rc1 with gcc 5.2. And, I got one more splat 
and was digging into it.


I saw your report to the ext4 maintainers. I tried a quick fix, but 
it doesn't seem to work well. And that code does look suspicious. Let's see 
what the ext4 maintainers say.
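
For context, a contrived, self-contained example of the class of undefined
behaviour UBSAN is flagging (not the actual ext4 code; build with
-fsanitize=undefined to see the reports):

	#include <stdio.h>

	int main(void)
	{
		volatile int order = -1;	/* the mballoc.c:1274 shape */
		volatile unsigned int e = ~0u;	/* 4294967295, the :2612 shape */

		int a = 1 << order;	/* shift exponent -1 is negative */
		int b = 1 << e;		/* shift exponent too large for 'int' */

		printf("%d %d\n", a, b);
		return 0;
	}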


Thanks,
Yang



This patch itself looks good, so FWIW:

Tested-by: Mark Rutland 

Thanks,
Mark.

[3.804750] 

[3.813176] UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2612:15
[3.819431] shift exponent 4294967295 is too large for 32-bit type 'int'
[3.826121] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.5.0-rc2+ #48
[3.832463] Hardware name: AMD Overdrive/Supercharger/Default string, BIOS 
ROD0085E 11/23/2015
[3.841060] Call trace:
[3.843499] [] dump_backtrace+0x0/0x298
[3.848887] [] show_stack+0x14/0x20
[3.853929] [] dump_stack+0xe0/0x178
[3.859056] [] ubsan_epilogue+0x14/0x50
[3.86] [] 
__ubsan_handle_shift_out_of_bounds+0xe0/0x138
[3.871655] [] ext4_mb_init+0x84c/0x920
[3.877043] [] ext4_fill_super+0x2eac/0x4958
[3.882866] [] mount_bdev+0x180/0x1e8
[3.888079] [] ext4_mount+0x14/0x20
[3.893118] [] mount_fs+0x44/0x1c8
[3.898073] [] vfs_kern_mount+0x50/0x1a8
[3.903547] [] do_mount+0x240/0x1478
[3.908673] [] SyS_mount+0x90/0xf8
[3.913627] [] mount_block_root+0x22c/0x3c4
[3.919361] [] mount_root+0x120/0x138
[3.924574] [] prepare_namespace+0x13c/0x184
[3.930396] [] kernel_init_freeable+0x390/0x3b4
[3.936479] [] kernel_init+0x10/0xe0
[3.941606] [] ret_from_fork+0x10/0x40
[3.946905] 


[5.566166] 

[5.574596] UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1274:11
[5.580851] shift exponent -1 is negative
[5.584851] CPU: 4 PID: 1028 Comm: mount Not tainted 4.5.0-rc2+ #48
[5.591105] Hardware name: AMD Overdrive/Supercharger/Default string, BIOS 
ROD0085E 11/23/2015
[5.599702] Call trace:
[5.602142] [] dump_backtrace+0x0/0x298
[5.607530] [] show_stack+0x14/0x20
[5.612572] [] dump_stack+0xe0/0x178
[5.617700] [] ubsan_epilogue+0x14/0x50
[5.623088] [] 
__ubsan_handle_shift_out_of_bounds+0xe0/0x138
[5.630300] [] mb_find_order_for_block+0x154/0x1b0
[5.636641] [] mb_find_extent+0xcc/0x548
[5.642116] [] ext4_mb_complex_scan_group+0xe8/0x4e8
[5.648632] [] ext4_mb_regular_allocator+0x2d4/0x648
[5.655148] [] ext4_mb_new_blocks+0x344/0x7e0
[5.661056] [] ext4_ext_map_blocks+0x684/0xf68
[5.667052] [] ext4_map_blocks+0x12c/0x500
[5.672699] [] ext4_writepages+0x47c/0xe38
[5.678348] [] do_writepages+0x48/0xc8
[5.683649] [] __filemap_fdatawrite_range+0x70/0xe8
[5.690078] [] filemap_flush+0x18/0x20
[5.695378] [] ext4_alloc_da_blocks+0x3c/0x78
[5.701285] [] ext4_rename+0x690/0xe38
[5.706585] [] ext4_rename2+0x1c/0x40
[5.711800] [] vfs_rename+0x2c0/0xa90
[5.717013] [] SyS_renameat2+0x464/0x5c0
[5.722486] [] SyS_renameat+0x10/0x18
[5.727700] [] el0_svc_naked+0x24/0x28
[5.732998] 






Re: [PATCH] arm64: disable kasan when accessing frame->fp in unwind_frame

2016-02-08 Thread Shi, Yang

On 2/8/2016 12:51 AM, Andrey Ryabinin wrote:



On 02/06/2016 02:04 AM, Yang Shi wrote:


  #include 
  #include 
@@ -64,7 +65,9 @@ int notrace unwind_frame(struct task_struct *tsk, struct 
stackframe *frame)
return -EINVAL;

frame->sp = fp + 0x10;
+   kasan_disable_current();
frame->fp = *(unsigned long *)(fp);


It would be better to use READ_ONCE_NOCHECK() here.
See f7d27c35ddff7 ("x86/mm, kasan: Silence KASAN warnings in get_wchan()") 
which solves the same problem for x86.
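
For reference, a minimal sketch of that variant (assuming the same frame
layout as the patch; READ_ONCE_NOCHECK() performs the load without KASAN
instrumentation instead of toggling it around the access):

	frame->sp = fp + 0x10;
	frame->fp = READ_ONCE_NOCHECK(*(unsigned long *)(fp));
	frame->pc = READ_ONCE_NOCHECK(*(unsigned long *)(fp + 8));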


Thanks for the suggestion, I'm going to try it soon.




+   kasan_enable_current();
frame->pc = *(unsigned long *)(fp + 8);


Why did you leave frame->pc out of scope? This line could trigger KASAN as well.


Actually, it was not reported as frequently as the first one. In my 
first couple of boot tests, it was not triggered before I came up with 
the patch. The first one is triggered every time.


It will be fixed in v2 too.

Thanks,
Yang





  #ifdef CONFIG_FUNCTION_GRAPH_TRACER





Re: [RFC V5] Add gup trace points support

2016-02-08 Thread Shi, Yang

Hi Andrew,

This series has already been acked by Steven and the arch maintainers, except 
for x86. How should I proceed? Any comment is appreciated.


Thanks,
Yang


On 1/14/2016 6:40 AM, Steven Rostedt wrote:


Andrew,

Do you want to pull in this series? You can add my Acked-by to the whole
set.

-- Steve


On Wed, 13 Jan 2016 10:14:24 -0800
"Shi, Yang"  wrote:


On 1/12/2016 12:10 PM, Steven Rostedt wrote:

On Tue, 12 Jan 2016 12:00:54 -0800
"Shi, Yang"  wrote:


Hi Steven,

Any more comments on this series? How should I proceed it?



The tracing part looks fine to me. Now you just need to get the arch
maintainers to ack each of the arch patches, and I can pull them in for
4.6. Too late for 4.5. Probably need Andrew Morton's ack for the
mm/gup.c patch.


Thanks Steven. Already sent email to x86, s390 and sparc maintainers.
Ralf already acked the MIPS part since v1.

Regards,
Yang



-- Steve







Re: [PATCH v2] arm64: disable kasan when accessing frame->fp in unwind_frame

2016-02-09 Thread Shi, Yang

On 2/9/2016 8:54 AM, Will Deacon wrote:

On Mon, Feb 08, 2016 at 09:13:09AM -0800, Yang Shi wrote:

When boot arm64 kernel with KASAN enabled, the below error is reported by
kasan:

BUG: KASAN: out-of-bounds in unwind_frame+0xec/0x260 at addr ffc064d57ba0
Read of size 8 by task pidof/499
page:ffbdc39355c0 count:0 mapcount:0 mapping:  (null) index:0x0
flags: 0x0()
page dumped because: kasan: bad access detected
CPU: 2 PID: 499 Comm: pidof Not tainted 4.5.0-rc1 #119
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Call trace:
[] dump_backtrace+0x0/0x290
[] show_stack+0x24/0x30
[] dump_stack+0x8c/0xd8
[] kasan_report_error+0x558/0x588
[] kasan_report+0x60/0x70
[] __asan_load8+0x60/0x78
[] unwind_frame+0xec/0x260
[] get_wchan+0x110/0x160
[] do_task_stat+0xb44/0xb68
[] proc_tgid_stat+0x40/0x50
[] proc_single_show+0x88/0xd8
[] seq_read+0x370/0x770
[] __vfs_read+0xc8/0x1d8
[] vfs_read+0x94/0x168
[] SyS_read+0xb8/0x128
[] el0_svc_naked+0x24/0x28
Memory state around the buggy address:
  ffc064d57a80: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 f4 f4
  ffc064d57b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

ffc064d57b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

   ^
  ffc064d57c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffc064d57c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Since the shadow byte pointed to by the report is 0, it may mean the access
just hit an out-of-bounds read in a non-current task. So, disable the
instrumentation to silence these warnings.


Curious, but how did you trigger this? I'm just trying to confirm that
mainline is affected, but my machine boots happily with KASAN and STACKTRACE
selected and I can cat /proc/self/{stack,stat} quite happily.

What am I missing?


I'm using the mainline 4.5-rc1 kernel with gcc 5.2. And my rootfs is NFS 
mounted.


Not sure if other kernel configs, e.g. the tracing options, have an impact 
on the trigger, since I'm not using the defconfig.


Thanks,
Yang



Will





Re: [PATCH v2] arm64: disable kasan when accessing frame->fp in unwind_frame

2016-02-09 Thread Shi, Yang

On 2/9/2016 9:23 AM, Will Deacon wrote:

On Tue, Feb 09, 2016 at 09:17:12AM -0800, Shi, Yang wrote:

On 2/9/2016 8:54 AM, Will Deacon wrote:

On Mon, Feb 08, 2016 at 09:13:09AM -0800, Yang Shi wrote:

When boot arm64 kernel with KASAN enabled, the below error is reported by
kasan:

BUG: KASAN: out-of-bounds in unwind_frame+0xec/0x260 at addr ffc064d57ba0
Read of size 8 by task pidof/499
page:ffbdc39355c0 count:0 mapcount:0 mapping:  (null) index:0x0
flags: 0x0()
page dumped because: kasan: bad access detected
CPU: 2 PID: 499 Comm: pidof Not tainted 4.5.0-rc1 #119
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Call trace:
[] dump_backtrace+0x0/0x290
[] show_stack+0x24/0x30
[] dump_stack+0x8c/0xd8
[] kasan_report_error+0x558/0x588
[] kasan_report+0x60/0x70
[] __asan_load8+0x60/0x78
[] unwind_frame+0xec/0x260
[] get_wchan+0x110/0x160
[] do_task_stat+0xb44/0xb68
[] proc_tgid_stat+0x40/0x50
[] proc_single_show+0x88/0xd8
[] seq_read+0x370/0x770
[] __vfs_read+0xc8/0x1d8
[] vfs_read+0x94/0x168
[] SyS_read+0xb8/0x128
[] el0_svc_naked+0x24/0x28
Memory state around the buggy address:
  ffc064d57a80: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 f4 f4
  ffc064d57b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

ffc064d57b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

   ^
  ffc064d57c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffc064d57c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Since the shadow byte pointed to by the report is 0, it may mean the access
just hit an out-of-bounds read in a non-current task. So, disable the
instrumentation to silence these warnings.


Curious, but how did you trigger this? I'm just trying to confirm that
mainline is affected, but my machine boots happily with KASAN and STACKTRACE
selected and I can cat /proc/self/{stack,stat} quite happily.

What am I missing?


I'm using the mainline 4.5-rc1 kernel with gcc 5.2. And my rootfs is NFS
mounted.

Not sure if other kernel configs, e.g. the tracing options, have an impact
on the trigger, since I'm not using the defconfig.


If you could put your .config somewhere, that would be helpful, please.


Attached. BTW, I ran the test on an LS2085a RDB board, which has 8 A57 
cores. Not sure if it can be reproduced on other boards easily.


Yang



Will



#
# Automatically generated file; DO NOT EDIT.
# Linux/arm64 4.5.0-rc1 Kernel Configuration
#
CONFIG_ARM64=y
CONFIG_64BIT=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=18
CONFIG_ARCH_MMAP_RND_BITS_MAX=24
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=11
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CSUM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ZONE_DMA=y
CONFIG_HAVE_GENERIC_RCU_GUP=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_SMP=y
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
CONFIG_KERNEL_MODE_NEON=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=3
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_CONSTRUCTORS=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
CONFIG_USELIB=y
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_SHOW_LEVEL=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_HANDLE_DOMAIN_IRQ=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_ARCH_HAS_TICK_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ_FULL_ALL=y
# CONFIG_NO_HZ_FULL_SYSIDLE is not set
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
# CONFIG_TASKSTATS is not set

#
# RCU Subsystem
#
CONFIG_PREEMPT_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
CONFIG_TASKS_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_CONTEXT_TRACKING=y
CONFIG_CONTEXT_TRACKING_FORCE=y
CONFIG_RCU_FANOUT=64
CONFI

Re: [PATCH v2] arm64: disable kasan when accessing frame->fp in unwind_frame

2016-02-09 Thread Shi, Yang

On 2/9/2016 9:31 AM, Shi, Yang wrote:

On 2/9/2016 9:23 AM, Will Deacon wrote:

On Tue, Feb 09, 2016 at 09:17:12AM -0800, Shi, Yang wrote:

On 2/9/2016 8:54 AM, Will Deacon wrote:

On Mon, Feb 08, 2016 at 09:13:09AM -0800, Yang Shi wrote:

When boot arm64 kernel with KASAN enabled, the below error is
reported by
kasan:

BUG: KASAN: out-of-bounds in unwind_frame+0xec/0x260 at addr
ffc064d57ba0
Read of size 8 by task pidof/499
page:ffbdc39355c0 count:0 mapcount:0 mapping:  (null)
index:0x0
flags: 0x0()
page dumped because: kasan: bad access detected
CPU: 2 PID: 499 Comm: pidof Not tainted 4.5.0-rc1 #119
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Call trace:
[] dump_backtrace+0x0/0x290
[] show_stack+0x24/0x30
[] dump_stack+0x8c/0xd8
[] kasan_report_error+0x558/0x588
[] kasan_report+0x60/0x70
[] __asan_load8+0x60/0x78
[] unwind_frame+0xec/0x260
[] get_wchan+0x110/0x160
[] do_task_stat+0xb44/0xb68
[] proc_tgid_stat+0x40/0x50
[] proc_single_show+0x88/0xd8
[] seq_read+0x370/0x770
[] __vfs_read+0xc8/0x1d8
[] vfs_read+0x94/0x168
[] SyS_read+0xb8/0x128
[] el0_svc_naked+0x24/0x28
Memory state around the buggy address:
  ffc064d57a80: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 f4 f4
  ffc064d57b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

ffc064d57b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

   ^
  ffc064d57c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffc064d57c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Since the shadow byte pointed to by the report is 0, it may mean the
access just hit an out-of-bounds read in a non-current task. So,
disable the instrumentation to silence these warnings.


Curious, but how did you trigger this? I'm just trying to confirm that
mainline is affected, but my machine boots happily with KASAN and
STACKTRACE
selected and I can cat /proc/self/{stack,stat} quite happily.

What am I missing?


I'm using the mainline 4.5-rc1 kernel with gcc 5.2. And my rootfs is NFS
mounted.

Not sure if other kernel configs, e.g. the tracing options, have an
impact on the trigger, since I'm not using the defconfig.


If you could put your .config somewhere, that would be helpful, please.


Attached. BTW, I ran the test on an LS2085a RDB board, which has 8 A57
cores. Not sure if it can be reproduced on other boards easily.


This config has GCOV enabled, but my test was run with it disabled. So, you 
may need to disable it to reproduce the trigger.


Regards,
Yang



Yang



Will







Very low responsiveness for pstore ftrace

2016-02-09 Thread Shi, Yang

Hi folks,

I'm a newbie to pstore. I just gave pstore ftrace a quick try, but I 
experienced a very serious responsiveness problem. When running "echo 1 > 
/sys/kernel/debug/pstore/record_ftrace", it took several minutes to 
return (i.e., to get the prompt back) on my arm64 machine (8 A57 
cores). The time command shows:


# time echo 1 > /sys/kernel/debug/pstore/record_ftrace
hrtimer: interrupt took 17532560 ns

real7m35.717s
user0m0.004s
sys 2m25.244s


I had a quick look at the code; it looks like there is a long critical 
section with IRQs disabled.


With CONFIG_PSTORE_CONSOLE enabled, the machine can't even respond to 
interrupts. The pstore console code also needs to acquire the same lock with 
IRQs disabled.


Such low responsiveness does not seem reasonable to me.

Any idea?

Thanks,
Yang







Re: [PATCH] scsi: fdomain: add missing CONFIG_ to PCMCIA

2016-02-09 Thread Shi, Yang

On 2/9/2016 3:25 PM, One Thousand Gnomes wrote:

On Tue,  9 Feb 2016 14:16:49 -0800
Yang Shi  wrote:


There are some "#ifdef PCMCIA" directives in the driver code, but all of them
are missing the CONFIG_ prefix. Correct PCMCIA to CONFIG_PCMCIA.

Signed-off-by: Yang Shi 


NAK. This breaks the driver completely


Read drivers/scsi/pcmcia/fdomain*


Got it. Thanks for spotting this. Sorry for the ill-considered change.

Yang



Alan





Re: [RFC PATCH] kernfs: create raw version kernfs_path_len and kernfs_path

2016-02-26 Thread Shi, Yang

On 2/26/2016 10:56 AM, Steven Rostedt wrote:

On Fri, 26 Feb 2016 10:15:05 -0800
Yang Shi  wrote:


commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback
tracepoints to report cgroup") made writeback tracepoints report cgroup
writeback, but it may trigger the bug below on the -rt kernel, since kernfs_path
and kernfs_path_len acquire a sleeping lock and are called from tracepoints.

BUG: sleeping function called from invalid context at 
kernel/locking/rtmutex.c:930
in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3
INFO: lockdep is turned off.
Preemption disabled at:[] wb_writeback+0xec/0x830

CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Workqueue: writeback wb_workfn (flush-7:0)
Call trace:
[] dump_backtrace+0x0/0x200
[] show_stack+0x24/0x30
[] dump_stack+0x88/0xa8
[] ___might_sleep+0x2ec/0x300
[] rt_spin_lock+0x38/0xb8
[] kernfs_path_len+0x30/0x90
[] trace_event_raw_event_writeback_work_class+0xe8/0x2e8
[] wb_writeback+0x620/0x830
[] wb_workfn+0x61c/0x950
[] process_one_work+0x3ac/0xb30
[] worker_thread+0x9c/0x7a8
[] kthread+0x190/0x1b0
[] ret_from_fork+0x10/0x30

Since kernfs_* functions are heavily used by cgroup, it does not sound
reasonable to convert kernfs_rename_lock to a raw lock.

Create raw versions of kernfs_path, kernfs_path_len and cgroup_path, which
don't acquire the lock and are used by tracepoints only.



And what prevents name from being freed while the tracepoint is reading
it?

Perhaps we need this change as well:


Thanks for pointing this out. Yes, it looks like we need this, since the 
tracepoints are protected by rcu_read_lock_sched.


Will add this in V2.

Yang



-- Steve

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 996b7742c90b..d2ef153145c0 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1397,6 +1397,12 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct 
kernfs_node *new_parent,
kn->hash = kernfs_name_hash(kn->name, kn->ns);
kernfs_link_sibling(kn);

+   /*
+* Tracepoints may be reading the old name. They are protected
+* by rcu_read_lock_sched().
+*/
+   synchronize_sched();
+
kernfs_put(old_parent);
kfree_const(old_name);






Re: [RFC PATCH] kernfs: create raw version kernfs_path_len and kernfs_path

2016-02-26 Thread Shi, Yang

On 2/26/2016 11:37 AM, Shi, Yang wrote:

On 2/26/2016 10:56 AM, Steven Rostedt wrote:

On Fri, 26 Feb 2016 10:15:05 -0800
Yang Shi  wrote:


commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update
writeback
tracepoints to report cgroup") made writeback tracepoints report cgroup
writeback, but it may trigger the bug below on the -rt kernel, since
kernfs_path and kernfs_path_len acquire a sleeping lock and are called
from tracepoints.

BUG: sleeping function called from invalid context at
kernel/locking/rtmutex.c:930
in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3
INFO: lockdep is turned off.
Preemption disabled at:[] wb_writeback+0xec/0x830

CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Workqueue: writeback wb_workfn (flush-7:0)
Call trace:
[] dump_backtrace+0x0/0x200
[] show_stack+0x24/0x30
[] dump_stack+0x88/0xa8
[] ___might_sleep+0x2ec/0x300
[] rt_spin_lock+0x38/0xb8
[] kernfs_path_len+0x30/0x90
[]
trace_event_raw_event_writeback_work_class+0xe8/0x2e8
[] wb_writeback+0x620/0x830
[] wb_workfn+0x61c/0x950
[] process_one_work+0x3ac/0xb30
[] worker_thread+0x9c/0x7a8
[] kthread+0x190/0x1b0
[] ret_from_fork+0x10/0x30

Since kernfs_* functions are heavily used by cgroup, it does not sound
reasonable to convert kernfs_rename_lock to a raw lock.

Create raw versions of kernfs_path, kernfs_path_len and cgroup_path,
which don't acquire the lock and are used by tracepoints only.



And what prevents name from being freed while the tracepoint is reading
it?

Perhaps we need this change as well:


Thanks for pointing this out. Yes, it looks like we need this, since the
tracepoints are protected by rcu_read_lock_sched.

Will add this in V2.


BTW, it sounds like this is not the only place where a kernfs_node can be 
updated; __kernfs_remove may need synchronize_sched too.


Thanks,
Yang



Yang



-- Steve

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 996b7742c90b..d2ef153145c0 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1397,6 +1397,12 @@ int kernfs_rename_ns(struct kernfs_node *kn,
struct kernfs_node *new_parent,
  kn->hash = kernfs_name_hash(kn->name, kn->ns);
  kernfs_link_sibling(kn);

+/*
+ * Tracepoints may be reading the old name. They are protected
+ * by rcu_read_lock_sched().
+ */
+synchronize_sched();
+
  kernfs_put(old_parent);
  kfree_const(old_name);








Re: [RFC PATCH] kernfs: create raw version kernfs_path_len and kernfs_path

2016-02-26 Thread Shi, Yang

On 2/26/2016 11:59 AM, Shi, Yang wrote:

On 2/26/2016 11:37 AM, Shi, Yang wrote:

On 2/26/2016 10:56 AM, Steven Rostedt wrote:

On Fri, 26 Feb 2016 10:15:05 -0800
Yang Shi  wrote:


commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update
writeback
tracepoints to report cgroup") made writeback tracepoints report cgroup
writeback, but it may trigger the bug below on the -rt kernel, since
kernfs_path and kernfs_path_len acquire a sleeping lock and are called
from tracepoints.

BUG: sleeping function called from invalid context at
kernel/locking/rtmutex.c:930
in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3
INFO: lockdep is turned off.
Preemption disabled at:[] wb_writeback+0xec/0x830

CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Workqueue: writeback wb_workfn (flush-7:0)
Call trace:
[] dump_backtrace+0x0/0x200
[] show_stack+0x24/0x30
[] dump_stack+0x88/0xa8
[] ___might_sleep+0x2ec/0x300
[] rt_spin_lock+0x38/0xb8
[] kernfs_path_len+0x30/0x90
[]
trace_event_raw_event_writeback_work_class+0xe8/0x2e8
[] wb_writeback+0x620/0x830
[] wb_workfn+0x61c/0x950
[] process_one_work+0x3ac/0xb30
[] worker_thread+0x9c/0x7a8
[] kthread+0x190/0x1b0
[] ret_from_fork+0x10/0x30

Since kernfs_* functions are heavily used by cgroup, it does not sound
reasonable to convert kernfs_rename_lock to a raw lock.

Create raw versions of kernfs_path, kernfs_path_len and cgroup_path,
which don't acquire the lock and are used by tracepoints only.



And what prevents name from being freed while the tracepoint is reading
it?

Perhaps we need this change as well:


Thanks for pointing this out. Yes, it looks like we need this, since the
tracepoints are protected by rcu_read_lock_sched.

Will add this in V2.


BTW, it sounds like this is not the only place where a kernfs_node can be
updated; __kernfs_remove may need synchronize_sched too.


Should synchronize_sched be moved into kernfs_put itself, just before 
the kfree call, since kernfs_put is called in some places other than 
kernfs_remove and kernfs_rename_ns?


diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index a4b1db1..89c0c16 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -456,6 +456,12 @@ void kernfs_put(struct kernfs_node *kn)
if (kernfs_type(kn) == KERNFS_LINK)
kernfs_put(kn->symlink.target_kn);

+   /*
+* Tracepoints may be reading the old name. They are protected
+* by rcu_read_lock_sched().
+*/
+   synchronize_sched();
+
kfree_const(kn->name);

if (kn->iattr) {


But kernfs_put might be called recursively; is that fine?

Thanks,
Yang



Thanks,
Yang



Yang



-- Steve

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 996b7742c90b..d2ef153145c0 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1397,6 +1397,12 @@ int kernfs_rename_ns(struct kernfs_node *kn,
struct kernfs_node *new_parent,
  kn->hash = kernfs_name_hash(kn->name, kn->ns);
  kernfs_link_sibling(kn);

+/*
+ * Tracepoints may be reading the old name. They are protected
+ * by rcu_read_lock_sched().
+ */
+synchronize_sched();
+
  kernfs_put(old_parent);
  kfree_const(old_name);










Re: [RFC PATCH] kernfs: create raw version kernfs_path_len and kernfs_path

2016-02-26 Thread Shi, Yang

On 2/26/2016 12:15 PM, Steven Rostedt wrote:

On Fri, 26 Feb 2016 11:59:40 -0800
"Shi, Yang"  wrote:



BTW, it sounds this is not the only point where kernfs_node could be
updated, __kernfs_remove should need synchronize_sched too.



Question is, can the kernfs of a cgroup be removed while the cgroup is
still active?


Hmm, it sounds unlikely to me.

Yang




I don't see the kernfs_rename_lock being held anywhere in that
remove. If it can be an issue there with this patch, then it's an issue
today, because the kernfs_mutex is not held by the tracepoint.

-- Steve





Re: [RFC V2 PATCH] kernfs: create raw version kernfs_path_len and kernfs_path

2016-02-26 Thread Shi, Yang

On 2/26/2016 3:01 PM, Greg KH wrote:

On Fri, Feb 26, 2016 at 01:47:39PM -0800, Yang Shi wrote:

commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback
tracepoints to report cgroup") made writeback tracepoints report cgroup
writeback, but it may trigger the below bug on -rt kernel since kernfs_path
and kernfs_path_len are called by tracepoints, which acquire sleeping lock.

BUG: sleeping function called from invalid context at 
kernel/locking/rtmutex.c:930
in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3
INFO: lockdep is turned off.
Preemption disabled at:[] wb_writeback+0xec/0x830

CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Workqueue: writeback wb_workfn (flush-7:0)
Call trace:
[] dump_backtrace+0x0/0x200
[] show_stack+0x24/0x30
[] dump_stack+0x88/0xa8
[] ___might_sleep+0x2ec/0x300
[] rt_spin_lock+0x38/0xb8
[] kernfs_path_len+0x30/0x90
[] trace_event_raw_event_writeback_work_class+0xe8/0x2e8
[] wb_writeback+0x620/0x830
[] wb_workfn+0x61c/0x950
[] process_one_work+0x3ac/0xb30
[] worker_thread+0x9c/0x7a8
[] kthread+0x190/0x1b0
[] ret_from_fork+0x10/0x30

Since kernfs_* functions are heavily used by cgroup, it doesn't sound
reasonable to convert kernfs_rename_lock to a raw lock.

Call synchronize_sched() when kernfs_node is updated since tracepoints are
protected by rcu_read_lock_sched.

Create raw versions of kernfs_path, kernfs_path_len and cgroup_path, which
don't acquire the lock and are used by tracepoints only.

Signed-off-by: Yang Shi 
---
It should be applicable to both the mainline and -rt kernels.
The change survives the ftrace stress test in LTP.

v2:
   * Adopted Steven's suggestion to call synchronize_sched in kernfs_rename_ns.

  fs/kernfs/dir.c  | 30 +++---
  include/linux/cgroup.h   | 10 ++
  include/linux/kernfs.h   | 10 ++
  include/trace/events/writeback.h |  4 ++--
  4 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 996b774..70a9687 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,7 +44,7 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
  }

-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
+static char * __must_check __kernfs_path(struct kernfs_node *kn, char *buf,
  size_t buflen)
  {
char *p = buf + buflen;
@@ -131,12 +131,33 @@ char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
char *p;

spin_lock_irqsave(&kernfs_rename_lock, flags);
-   p = kernfs_path_locked(kn, buf, buflen);
+   p = __kernfs_path(kn, buf, buflen);
spin_unlock_irqrestore(&kernfs_rename_lock, flags);
return p;
  }
  EXPORT_SYMBOL_GPL(kernfs_path);

+/* Raw version. Used by tracepoints only without acquiring lock */
+size_t _kernfs_path_len(struct kernfs_node *kn)
+{
+   size_t len = 0;
+
+   do {
+   len += strlen(kn->name) + 1;
+   kn = kn->parent;
+   } while (kn && kn->parent);
+
+   return len;
+}
+
+char *_kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+   char *p;
+
+   p = __kernfs_path(kn, buf, buflen);
+   return p;
+}
+
  /**
   * pr_cont_kernfs_name - pr_cont name of a kernfs_node
   * @kn: kernfs_node of interest
@@ -168,7 +189,7 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)

spin_lock_irqsave(&kernfs_rename_lock, flags);

-   p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
+   p = __kernfs_path(kn, kernfs_pr_cont_buf,
   sizeof(kernfs_pr_cont_buf));
if (p)
pr_cont("%s", p);
@@ -1397,6 +1418,9 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
kn->hash = kernfs_name_hash(kn->name, kn->ns);
kernfs_link_sibling(kn);

+   /* Tracepoints are protected by rcu_read_lock_sched */
+   synchronize_sched();
+
kernfs_put(old_parent);
kfree_const(old_name);

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2162dca..bbd88d3 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -538,6 +538,16 @@ static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
return kernfs_path(cgrp->kn, buf, buflen);
  }

+/*
+ * Without acquiring kernfs_rename_lock in _kernfs_path.
+ * Used by tracepoints only.
+ */
+static inline char * __must_check _cgroup_path(struct cgroup *cgrp, char *buf,
+ size_t buflen)
+{
+   return _kernfs_path(cgrp->kn, buf, buflen);
+}
+
  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
  {
pr_cont_kernfs_name(cgrp->kn);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index af51df3..14c01cdc 1006

Re: [RFC PATCH] writeback: move list_lock down into the for loop

2016-02-29 Thread Shi, Yang

On 2/29/2016 7:06 AM, Michal Hocko wrote:

On Fri 26-02-16 08:46:25, Yang Shi wrote:

The list_lock was moved outside the for loop by commit
e8dfc30582995ae12454cda517b17d6294175b07 ("writeback: elevate queue_io()
into wb_writeback())", however, the commit log says "No behavior change", so
it sounds safe to have the list_lock acquired inside the for loop as it did
before.
Leave tracepoints outside the critical area since tracepoints already have
preempt disabled.


The patch says what but it completely misses the why part.


I'm just wondering whether the finer-grained locking may achieve a little
better performance, i.e. more preemption opportunities and lower latency.


Thanks,
Yang





Signed-off-by: Yang Shi 
---
Tested with ltp on 8 cores Cortex-A57 machine.

  fs/fs-writeback.c | 12 +++-
  1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1f76d89..9b7b5f6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1623,7 +1623,6 @@ static long wb_writeback(struct bdi_writeback *wb,
work->older_than_this = &oldest_jif;

blk_start_plug(&plug);
-   spin_lock(&wb->list_lock);
for (;;) {
/*
 * Stop writeback when nr_pages has been consumed
@@ -1661,15 +1660,19 @@ static long wb_writeback(struct bdi_writeback *wb,
oldest_jif = jiffies;

trace_writeback_start(wb, work);
+
+   spin_lock(&wb->list_lock);
if (list_empty(&wb->b_io))
queue_io(wb, work);
if (work->sb)
progress = writeback_sb_inodes(work->sb, wb, work);
else
progress = __writeback_inodes_wb(wb, work);
-   trace_writeback_written(wb, work);

wb_update_bandwidth(wb, wb_start);
+   spin_unlock(&wb->list_lock);
+
+   trace_writeback_written(wb, work);

/*
 * Did we write something? Try for more
@@ -1693,15 +1696,14 @@ static long wb_writeback(struct bdi_writeback *wb,
 */
if (!list_empty(&wb->b_more_io))  {
trace_writeback_wait(wb, work);
+   spin_lock(&wb->list_lock);
inode = wb_inode(wb->b_more_io.prev);
-   spin_lock(&inode->i_lock);
spin_unlock(&wb->list_lock);
+   spin_lock(&inode->i_lock);
/* This function drops i_lock... */
inode_sleep_on_writeback(inode);
-   spin_lock(&wb->list_lock);
}
}
-   spin_unlock(&wb->list_lock);
blk_finish_plug(&plug);

return nr_pages - work->nr_pages;
--
2.0.2







Re: [RFC V2 PATCH] kernfs: create raw version kernfs_path_len and kernfs_path

2016-02-29 Thread Shi, Yang

On 2/27/2016 3:51 AM, Tejun Heo wrote:

Hello,

On Sat, Feb 27, 2016 at 6:45 AM, Thomas Gleixner  wrote:

It can be, but we can print out the ino and userland can match that up
with path if necessary.


Wouldn't be cgroup id the better choice?


AFAIK we aren't exposing cgroup id to userland anywhere right now.
Eventually, I think the right thing to do is using the same number for
both.


Thanks for all the comments.

The current tracepoints print the path length too, via
__trace_wb_cgroup_size and __trace_wbc_cgroup_size, but we can't get the
path length if we switch to the cgroup ino. So I suppose I have to drop
all the *_size stuff.


And, when CONFIG_CGROUP_WRITEBACK is not enabled,
__trace_wb_assign_cgroup and __trace_wbc_assign_cgroup return
"strcpy(buf, "/")", so to stay aligned with this, I need to print out the
ino of "/", right? But the ROOT_INO may vary across filesystems.
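
For the tracepoint side, what I have in mind for V4 is roughly the
following (a sketch only; it assumes cgroup_ino() and wb->memcg_css are
usable here, and the placeholder ino of 1 for "/" is exactly the open
question above):

TP_STRUCT__entry(
        __field(unsigned long, cgroup_ino)
),
TP_fast_assign(
        /* record the cgroup inode number instead of the path */
        __entry->cgroup_ino = wb->memcg_css ?
                cgroup_ino(wb->memcg_css->cgroup) : 1;
),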


All of them will be addressed in V4, but it may be delayed a bit since
I have to travel this week.


Regards,
Yang



Thanks.





Re: [PATCH] writeback: call writeback tracepoints without holding list_lock in wb_writeback()

2016-02-25 Thread Shi, Yang

On 2/24/2016 6:40 PM, Steven Rostedt wrote:

On Wed, 24 Feb 2016 14:47:23 -0800
Yang Shi  wrote:


commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback
tracepoints to report cgroup") made writeback tracepoints report cgroup
writeback, but it may trigger the below bug on -rt kernel due to the list_lock
held for the for loop in wb_writeback().


list_lock is a sleeping mutex, it's not disabling preemption. Moving it
doesn't make a difference.



BUG: sleeping function called from invalid context at 
kernel/locking/rtmutex.c:930
in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3


Something else disabled preemption. And note, nothing in the tracepoint
should have called a sleeping function.


Yes, it confuses me too. It sounds like the preempt_ip address is
not that accurate.






INFO: lockdep is turned off.
Preemption disabled at:[] wb_writeback+0xec/0x830

CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Workqueue: writeback wb_workfn (flush-7:0)
Call trace:
[] dump_backtrace+0x0/0x200
[] show_stack+0x24/0x30
[] dump_stack+0x88/0xa8
[] ___might_sleep+0x2ec/0x300
[] rt_spin_lock+0x38/0xb8
[] kernfs_path_len+0x30/0x90
[] trace_event_raw_event_writeback_work_class+0xe8/0x2e8


How accurate is this trace back? Here's the code that is executed in
this tracepoint:

TP_fast_assign(
struct device *dev = bdi->dev;
if (!dev)
dev = default_backing_dev_info.dev;
strncpy(__entry->name, dev_name(dev), 32);
__entry->nr_pages = work->nr_pages;
__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
__entry->sync_mode = work->sync_mode;
__entry->for_kupdate = work->for_kupdate;
__entry->range_cyclic = work->range_cyclic;
__entry->for_background  = work->for_background;
__entry->reason = work->reason;
),

See anything that would sleep?


According to the stack backtrace, kernfs_path_len takes a sleeping lock;
it is called from __trace_wb_cgroup_size(wb), used in __dynamic_array(char,
cgroup, __trace_wb_cgroup_size(wb)).


The below is the definition:

DECLARE_EVENT_CLASS(writeback_work_class,
TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work),
TP_ARGS(wb, work),
TP_STRUCT__entry(
__array(char, name, 32)
__field(long, nr_pages)
__field(dev_t, sb_dev)
__field(int, sync_mode)
__field(int, for_kupdate)
__field(int, range_cyclic)
__field(int, for_background)
__field(int, reason)
__dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
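
For reference, __trace_wb_cgroup_size is roughly the following (quoting
the CONFIG_CGROUP_WRITEBACK case of include/trace/events/writeback.h from
memory, so treat it as a sketch), which is where the sleeping
kernfs_rename_lock gets taken:

static inline size_t __trace_wb_cgroup_size(struct bdi_writeback *wb)
{
        return kernfs_path_len(wb->memcg_css->cgroup->kn) + 1;
}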

Thanks,
Yang




[] wb_writeback+0x620/0x830
[] wb_workfn+0x61c/0x950
[] process_one_work+0x3ac/0xb30
[] worker_thread+0x9c/0x7a8
[] kthread+0x190/0x1b0
[] ret_from_fork+0x10/0x30

The list_lock was moved outside the for loop by commit
e8dfc30582995ae12454cda517b17d6294175b07 ("writeback: elevate queue_io()
into wb_writeback())", however, the commit log says "No behavior change", so
it sounds safe to have the list_lock acquired inside the for loop as it did
before.

Just acquire list_lock at the necessary points and keep all writeback
tracepoints outside the critical area protected by list_lock in
wb_writeback().


But list_lock itself is a sleeping lock. This doesn't make sense.

This is not the bug you are looking for.

-- Steve



Signed-off-by: Yang Shi 
---
  fs/fs-writeback.c | 12 +++-
  1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1f76d89..9b7b5f6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1623,7 +1623,6 @@ static long wb_writeback(struct bdi_writeback *wb,
work->older_than_this = &oldest_jif;

blk_start_plug(&plug);
-   spin_lock(&wb->list_lock);
for (;;) {
/*
 * Stop writeback when nr_pages has been consumed
@@ -1661,15 +1660,19 @@ static long wb_writeback(struct bdi_writeback *wb,
oldest_jif = jiffies;

trace_writeback_start(wb, work);
+
+   spin_lock(&wb->list_lock);
if (list_empty(&wb->b_io))
queue_io(wb, work);
if (work->sb)
progress = writeback_sb_inodes(work->sb, wb, work);
else
progress = __writeback_inodes_wb(wb, work);
-   trace_writeback_written(wb, work);

wb_update_bandwidth(wb, wb_start);
+   spin_unlock(&wb->list_lock);
+
+   trace_writeback_written(wb, work);

/*
 * Did we write something? Try for more
@@ -1693,15 +1696,14 @@ static long wb_writeback(struct bdi_writeback *wb,
 

Re: [PATCH] writeback: call writeback tracepoints without holding list_lock in wb_writeback()

2016-02-25 Thread Shi, Yang

On 2/25/2016 11:54 AM, Steven Rostedt wrote:

On Thu, 25 Feb 2016 11:38:48 -0800
"Shi, Yang"  wrote:


On 2/24/2016 6:40 PM, Steven Rostedt wrote:

On Wed, 24 Feb 2016 14:47:23 -0800
Yang Shi  wrote:


commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback
tracepoints to report cgroup") made writeback tracepoints report cgroup
writeback, but it may trigger the below bug on -rt kernel due to the list_lock
held for the for loop in wb_writeback().


list_lock is a sleeping mutex, it's not disabling preemption. Moving it
doesn't make a difference.



BUG: sleeping function called from invalid context at 
kernel/locking/rtmutex.c:930
in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3


Something else disabled preemption. And note, nothing in the tracepoint
should have called a sleeping function.


Yes, it confuses me too. It sounds like the preempt_ip address is
not that accurate.


Yep, but the change you made doesn't look to be the fix.


Actually, regardless of whether this is the right fix for the splat, it
makes me wonder if the spin lock that protects the whole for loop
is really necessary. It sounds feasible to move it into the for loop and
just protect the necessary area.










INFO: lockdep is turned off.
Preemption disabled at:[] wb_writeback+0xec/0x830


Can you disassemble the vmlinux file to see exactly where that call is.
I use gdb to find the right locations.

  gdb> li *0xffc000374a5c
  gdb> disass 0xffc000374a5c


I use gdb to get the code too.

It does point to the spin_lock.

(gdb) list *0xffc000374a5c
0xffc000374a5c is in wb_writeback (fs/fs-writeback.c:1621).
1616
1617oldest_jif = jiffies;
1618work->older_than_this = &oldest_jif;
1619
1620blk_start_plug(&plug);
1621spin_lock(&wb->list_lock);
1622for (;;) {
1623/*
1624 * Stop writeback when nr_pages has been consumed
1625 */


The disassembly:
   0xffc000374a58 <+232>:   bl  0xffc0001300b0 


   0xffc000374a5c <+236>:   mov x0, x22
   0xffc000374a60 <+240>:   bl  0xffc000d5d518 





CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Workqueue: writeback wb_workfn (flush-7:0)
Call trace:
[] dump_backtrace+0x0/0x200
[] show_stack+0x24/0x30
[] dump_stack+0x88/0xa8
[] ___might_sleep+0x2ec/0x300
[] rt_spin_lock+0x38/0xb8
[] kernfs_path_len+0x30/0x90
[] trace_event_raw_event_writeback_work_class+0xe8/0x2e8


How accurate is this trace back? Here's the code that is executed in
this tracepoint:

TP_fast_assign(
struct device *dev = bdi->dev;
if (!dev)
dev = default_backing_dev_info.dev;
strncpy(__entry->name, dev_name(dev), 32);
__entry->nr_pages = work->nr_pages;
__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
__entry->sync_mode = work->sync_mode;
__entry->for_kupdate = work->for_kupdate;
__entry->range_cyclic = work->range_cyclic;
__entry->for_background  = work->for_background;
__entry->reason = work->reason;
),

See anything that would sleep?


According to the stack backtrace, kernfs_path_len takes a sleeping lock;
it is called from __trace_wb_cgroup_size(wb), used in __dynamic_array(char,
cgroup, __trace_wb_cgroup_size(wb)).

The below is the definition:

DECLARE_EVENT_CLASS(writeback_work_class,
  TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work),
  TP_ARGS(wb, work),
  TP_STRUCT__entry(
  __array(char, name, 32)
  __field(long, nr_pages)
  __field(dev_t, sb_dev)
  __field(int, sync_mode)
  __field(int, for_kupdate)
  __field(int, range_cyclic)
  __field(int, for_background)
  __field(int, reason)
  __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))



Ah, thanks for pointing that out. I missed that.


Then this doesn't sound correct, given that tracepoints don't allow sleeping.

I considered changing the sleeping lock to a raw lock in the kernfs_*
functions, but that doesn't sound reasonable since they are used heavily
by cgroup.


Thanks,
Yang



-- Steve





Re: [PATCH] writeback: call writeback tracepoints without holding list_lock in wb_writeback()

2016-02-25 Thread Shi, Yang

On 2/25/2016 3:31 PM, Steven Rostedt wrote:

On Thu, 25 Feb 2016 15:16:54 -0800
"Shi, Yang"  wrote:



Actually, regardless of whether this is the right fix for the splat, it
makes me wonder if the spin lock that protects the whole for loop
is really necessary. It sounds feasible to move it into the for loop and
just protect the necessary area.


That's a separate issue, which may have its own merits that should be
decided by the writeback folks.


Yes, definitely. I will rework my commit log for this part.













INFO: lockdep is turned off.
Preemption disabled at:[] wb_writeback+0xec/0x830


Can you disassemble the vmlinux file to see exactly where that call is.
I use gdb to find the right locations.

   gdb> li *0xffc000374a5c
   gdb> disass 0xffc000374a5c


I use gdb to get the code too.

It does point to the spin_lock.

(gdb) list *0xffc000374a5c
0xffc000374a5c is in wb_writeback (fs/fs-writeback.c:1621).
1616
1617oldest_jif = jiffies;
1618work->older_than_this = &oldest_jif;
1619
1620blk_start_plug(&plug);
1621spin_lock(&wb->list_lock);
1622for (;;) {
1623/*
1624 * Stop writeback when nr_pages has been consumed
1625 */


The disassembly:
 0xffc000374a58 <+232>:   bl  0xffc0001300b0


The above is the place it recorded. But I just realized, this isn't the
issue. I know where the problem is.




 0xffc000374a5c <+236>:   mov x0, x22
 0xffc000374a60 <+240>:   bl  0xffc000d5d518 








DECLARE_EVENT_CLASS(writeback_work_class,
   TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work),
   TP_ARGS(wb, work),
   TP_STRUCT__entry(
   __array(char, name, 32)
   __field(long, nr_pages)
   __field(dev_t, sb_dev)
   __field(int, sync_mode)
   __field(int, for_kupdate)
   __field(int, range_cyclic)
   __field(int, for_background)
   __field(int, reason)
   __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))



Ah, thanks for pointing that out. I missed that.


Then this doesn't sound correct, given that tracepoints don't allow sleeping.

I considered changing the sleeping lock to a raw lock in the kernfs_*
functions, but that doesn't sound reasonable since they are used heavily
by cgroup.


It is the kernfs_* that can't sleep. Tracepoints use
rcu_read_lock_sched_notrace(), which disables preemption, and not only
that, hides itself from lockdep as the last place to disable preemption.


Ah, thanks for pointing out this.



Is there a way to not use the kernfs_* function? At least for -rt?


I'm not quite sure there is a straightforward replacement. However, I'm
wondering if a lock-free version could be used by tracing.


For example, create a __kernfs_path_len that doesn't acquire any lock for
writeback tracing, as long as there is no race condition.
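
Something like this minimal sketch, i.e. the same parent walk as
kernfs_path_len but without taking kernfs_rename_lock (whether it can
race with a concurrent rename is exactly the open question):

static size_t __kernfs_path_len(struct kernfs_node *kn)
{
        size_t len = 0;

        do {
                len += strlen(kn->name) + 1;
                kn = kn->parent;
        } while (kn && kn->parent);

        return len;
}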


At least we could rule out preemption.

Thanks,
Yang



-- Steve





Re: [PATCH] writeback: call writeback tracepoints without holding list_lock in wb_writeback()

2016-02-25 Thread Shi, Yang

On 2/25/2016 3:47 PM, Shi, Yang wrote:

On 2/25/2016 3:31 PM, Steven Rostedt wrote:

On Thu, 25 Feb 2016 15:16:54 -0800
"Shi, Yang"  wrote:



Actually, regardless of whether this is the right fix for the splat, it
makes me wonder if the spin lock that protects the whole for loop
is really necessary. It sounds feasible to move it into the for loop and
just protect the necessary area.


That's a separate issue, which may have its own merits that should be
decided by the writeback folks.


Yes, definitely. I will rework my commit log for this part.













INFO: lockdep is turned off.
Preemption disabled at:[] wb_writeback+0xec/0x830


Can you disassemble the vmlinux file to see exactly where that call is.
I use gdb to find the right locations.

   gdb> li *0xffc000374a5c
   gdb> disass 0xffc000374a5c


I use gdb to get the code too.

It does point to the spin_lock.

(gdb) list *0xffc000374a5c
0xffc000374a5c is in wb_writeback (fs/fs-writeback.c:1621).
1616
1617oldest_jif = jiffies;
1618work->older_than_this = &oldest_jif;
1619
1620blk_start_plug(&plug);
1621spin_lock(&wb->list_lock);
1622for (;;) {
1623/*
1624 * Stop writeback when nr_pages has been
consumed
1625 */


The disassembly:
 0xffc000374a58 <+232>:   bl  0xffc0001300b0


The above is the place it recorded. But I just realized, this isn't the
issue. I know where the problem is.




 0xffc000374a5c <+236>:   mov x0, x22
 0xffc000374a60 <+240>:   bl  0xffc000d5d518









DECLARE_EVENT_CLASS(writeback_work_class,
   TP_PROTO(struct bdi_writeback *wb, struct
wb_writeback_work *work),
   TP_ARGS(wb, work),
   TP_STRUCT__entry(
   __array(char, name, 32)
   __field(long, nr_pages)
   __field(dev_t, sb_dev)
   __field(int, sync_mode)
   __field(int, for_kupdate)
   __field(int, range_cyclic)
   __field(int, for_background)
   __field(int, reason)
   __dynamic_array(char, cgroup,
__trace_wb_cgroup_size(wb))



Ah, thanks for pointing that out. I missed that.


Then this doesn't sound correct, given that tracepoints don't allow sleeping.

I considered changing the sleeping lock to a raw lock in the kernfs_*
functions, but that doesn't sound reasonable since they are used heavily
by cgroup.


It is the kernfs_* that can't sleep. Tracepoints use
rcu_read_lock_sched_notrace(), which disables preemption, and not only
that, hides itself from lockdep as the last place to disable preemption.


Ah, thanks for pointing out this.



Is there a way to not use the kernfs_* function? At least for -rt?


I'm not quite sure there is a straightforward replacement. However, I'm
wondering if a lock-free version could be used by tracing.

For example, create a __kernfs_path_len that doesn't acquire any lock for
writeback tracing, as long as there is no race condition.

At least we could rule out preemption.


Can we disable irqs in tracepoints, since spin_lock_irqsave is used by
the kernfs_* functions?


Thanks,
Yang






-- Steve







Re: [BUG linux-next] Kernel panic found with linux-next-20160414

2016-04-20 Thread Shi, Yang

On 4/20/2016 1:01 AM, Hugh Dickins wrote:

On Tue, 19 Apr 2016, Shi, Yang wrote:

Hi folks,

When I ran ltp on linux-next-20160414 on my ARM64 machine, I got the below
kernel panic:

Unable to handle kernel paging request at virtual address ffc007846000
pgd = ffc01e21d000
[ffc007846000] *pgd=, *pud=
Internal error: Oops: 9647 [#11] PREEMPT SMP
Modules linked in: loop
CPU: 7 PID: 274 Comm: systemd-journal Tainted: G  D
4.6.0-rc3-next-20160414-WR8.0.0.0_standard+ #9
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
task: ffc01e3fcf80 ti: ffc01ea8c000 task.ti: ffc01ea8c000
PC is at copy_page+0x38/0x120
LR is at migrate_page_copy+0x604/0x1660
pc : [] lr : [] pstate: 2145
sp : ffc01ea8ecd0
x29: ffc01ea8ecd0 x28: 
x27: 1ff7b80240f8 x26: ffc018196f20
x25: ffbdc01e1180 x24: ffbdc01e1180
x23:  x22: ffc01e3fcf80
x21: ffc00481f000 x20: ff900a31d000
x19: ffbdc01207c0 x18: 0f00
x17:  x16: 
x15:  x14: 
x13:  x12: 
x11:  x10: 
x9 :  x8 : 
x7 :  x6 : 
x5 :  x4 : 
x3 :  x2 : 
x1 : ffc00481f080 x0 : ffc007846000

Call trace:
Exception stack(0xffc021fc2ed0 to 0xffc021fc2ff0)
2ec0:   ffbdc00887c0 ff900a31d000
2ee0: ffc021fc30f0 ff9008ff2318 2145 0025
2f00: ffbdc025a280 ffc020adc4c0 41b58ab3 ff900a085fd0
2f20: ff9008200658   ffbdc00887c0
2f40: ff900b0f1320 ffc021fc3078 41b58ab3 ff900a0864f8
2f60: ff9008210010 ffc021fb8960 ff900867bacc 1ff8043f712d
2f80: ffc021fc2fb0 ff9008210564 ffc021fc3070 ffc021fb8940
2fa0: 08221f78 ff900862f9c8 ffc021fc2fe0 ff9008215dc8
2fc0: 1ff8043f8602 ffc021fc ffc00968a000 ffc00221f080
2fe0: f9407e11d1f0 d61f02209103e210
[] copy_page+0x38/0x120
[] migrate_page+0x74/0x98
[] nfs_migrate_page+0x58/0x80
[] move_to_new_page+0x15c/0x4d8
[] migrate_pages+0x7c8/0x11f0
[] compact_zone+0xdfc/0x2570
[] compact_zone_order+0xe0/0x170
[] try_to_compact_pages+0x2e8/0x8f8
[] __alloc_pages_direct_compact+0x100/0x540
[] __alloc_pages_nodemask+0xc40/0x1c58
[] khugepaged+0x468/0x19c8
[] kthread+0x248/0x2c0
[] ret_from_fork+0x10/0x40
Code: d281f012 91020021 f1020252 d503201f (a8000c02)


I did some initial investigation and found it is caused by DEBUG_PAGEALLOC
and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT. And, mainline 4.6-rc3 works well.

It should not be arch-specific, although I caught it on ARM64. I suspect
this might be caused by Hugh's huge tmpfs patches.


Thanks for testing.  It might be caused by my patches, but I don't think
that's very likely.  This is page migration for compaction, in the service
of anon THP's khugepaged; and I wonder if you were even exercising huge
tmpfs when running LTP here (it certainly can be done: I like to mount a
huge tmpfs on /opt/ltp and install there, with shmem_huge 2 so any other
tmpfs mounts are also huge).


Some further investigation shows I get the panic even when I don't
have tmpfs mounted with huge=1 and haven't set shmem_huge to 2.




There are compaction changes in linux-next too, but I don't see any
reason why they'd cause this.  I don't know arm64 traces enough to know
whether it's the source page or the destination page for the copy, but
it looks as if it has been freed (and DEBUG_PAGEALLOC unmapped) before
reaching migration's copy.


The fault address is passed in x0, which is the dest argument in the
implementation of copy_page, so it is the destination page.




Needs more debugging, I'm afraid: is it reproducible?


Yes, as long as I enable those two PAGEALLOC debug options, I get
the panic once I run LTP. But it is not caused by any specific LTP test
case directly; the panic happens randomly while LTP is running.


Thanks,
Yang



Hugh





[BUG linux-next] KASAN bug is raised on linux-next-20160414 with huge tmpfs on

2016-04-20 Thread Shi, Yang

Hi folks,

When I run the test below on my ARM64 machine with an NFS-mounted rootfs,
I get a KASAN bug report. The test runs well if /mnt is not mounted with
"huge=1".


# mount -t tmpfs -o huge=1 tmpfs /mnt
# cp -a /opt/ltp /mnt/

BUG: KASAN: use-after-free in nfs_readdir+0x2c4/0x848 at addr 
8b7f4000

Read of size 4 by task crond/446
page:7bffc02dfd00 count:2 mapcount:0 mapping:80001c2cae98 index:0x0
flags: 0x6c(referenced|uptodate|lru|active)
page dumped because: kasan: bad access detected
page->mem_cgroup:80002402da80
CPU: 0 PID: 446 Comm: crond Tainted: GW 
4.6.0-rc3-next-20160414-WR8.0.0.0_standard+ #13

Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Call trace:
[] dump_backtrace+0x0/0x2b8
[] show_stack+0x24/0x30
[] dump_stack+0xb0/0xe8
[] kasan_report_error+0x518/0x5c0
[] kasan_report+0x60/0x70
[] __asan_load4+0x64/0x80
[] nfs_readdir+0x2c4/0x848
[] iterate_dir+0x120/0x1d8
[] SyS_getdents64+0xdc/0x170
[] __sys_trace_return+0x0/0x4
Memory state around the buggy address:
 8b7f3f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 8b7f3f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>8b7f4000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
   ^
 8b7f4080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 8b7f4100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

BUG: KASAN: use-after-free in nfs_do_filldir+0x88/0x298 at addr 
8b7f4000

Read of size 4 by task crond/446
page:7bffc02dfd00 count:2 mapcount:0 mapping:80001c2cae98 index:0x0
flags: 0x6c(referenced|uptodate|lru|active)
page dumped because: kasan: bad access detected
page->mem_cgroup:80002402da80
CPU: 0 PID: 446 Comm: crond Tainted: GB   W 
4.6.0-rc3-next-20160414-WR8.0.0.0_standard+ #13

Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Call trace:
[] dump_backtrace+0x0/0x2b8
[] show_stack+0x24/0x30
[] dump_stack+0xb0/0xe8
[] kasan_report_error+0x518/0x5c0
[] kasan_report+0x60/0x70
[] __asan_load4+0x64/0x80
[] nfs_do_filldir+0x88/0x298
[] nfs_readdir+0x488/0x848
[] iterate_dir+0x120/0x1d8
[] SyS_getdents64+0xdc/0x170
[] __sys_trace_return+0x0/0x4
Memory state around the buggy address:
 8b7f3f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 8b7f3f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>8b7f4000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
   ^
 8b7f4080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 8b7f4100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

Thanks,
Yang


Re: [PATCH] mm: move huge_pmd_set_accessed out of huge_memory.c

2016-04-20 Thread Shi, Yang

Hi folks,

I didn't realize pmd_* functions are protected by
CONFIG_TRANSPARENT_HUGEPAGE on most architectures before I made this
change.

Before I fix all the affected architectures' code, I want to check whether
you guys think this change is worthwhile or not.


Thanks,
Yang

On 4/20/2016 11:24 AM, Yang Shi wrote:

huge_pmd_set_accessed is only called by __handle_mm_fault from memory.c,
move the definition to memory.c and make it static like create_huge_pmd and
wp_huge_pmd.

Signed-off-by: Yang Shi 
---
  include/linux/huge_mm.h |  4 
  mm/huge_memory.c| 23 ---
  mm/memory.c | 23 +++
  3 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7008623..c218ab7b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -8,10 +8,6 @@ extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
  extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 struct vm_area_struct *vma);
-extern void huge_pmd_set_accessed(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pmd_t orig_pmd, int dirty);
  extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
   unsigned long address, pmd_t *pmd,
   pmd_t orig_pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fecbbc5..6c14cb6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1137,29 +1137,6 @@ out:
return ret;
  }

-void huge_pmd_set_accessed(struct mm_struct *mm,
-  struct vm_area_struct *vma,
-  unsigned long address,
-  pmd_t *pmd, pmd_t orig_pmd,
-  int dirty)
-{
-   spinlock_t *ptl;
-   pmd_t entry;
-   unsigned long haddr;
-
-   ptl = pmd_lock(mm, pmd);
-   if (unlikely(!pmd_same(*pmd, orig_pmd)))
-   goto unlock;
-
-   entry = pmd_mkyoung(orig_pmd);
-   haddr = address & HPAGE_PMD_MASK;
-   if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty))
-   update_mmu_cache_pmd(vma, address, pmd);
-
-unlock:
-   spin_unlock(ptl);
-}
-
  static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index 93897f2..6ced4eb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3287,6 +3287,29 @@ static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_FALLBACK;
  }

+static void huge_pmd_set_accessed(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ pmd_t *pmd, pmd_t orig_pmd,
+ int dirty)
+{
+   spinlock_t *ptl;
+   pmd_t entry;
+   unsigned long haddr;
+
+   ptl = pmd_lock(mm, pmd);
+   if (unlikely(!pmd_same(*pmd, orig_pmd)))
+   goto unlock;
+
+   entry = pmd_mkyoung(orig_pmd);
+   haddr = address & HPAGE_PMD_MASK;
+   if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty))
+   update_mmu_cache_pmd(vma, address, pmd);
+
+unlock:
+   spin_unlock(ptl);
+}
+
  /*
   * These routines also need to handle stuff like marking pages dirty
   * and/or accessed for architectures that don't do it in hardware (most





[BUG] set_pte_at: racy dirty state clearing warning

2016-04-20 Thread Shi, Yang

Hi Will and Catalin,

When I enable memory comact via

# echo 1 > /proc/sys/vm/compact_memory

I got the below WARNING:

set_pte_at: racy dirty state clearing: 0x006899371bd3 -> 
0x006899371fd3

[ cut here ]
WARNING: CPU: 5 PID: 294 at ./arch/arm64/include/asm/pgtable.h:227 
ptep_set_access_flags+0x138/0x1b8

Modules linked in:

CPU: 5 PID: 294 Comm: systemd-journal Not tainted 
4.6.0-rc3-next-20160414 #13

Hardware name: Freescale Layerscape 2085a RDB Board (DT)
task: 80001e4f8080 ti: 80001e8b4000 task.ti: 80001e8b4000
PC is at ptep_set_access_flags+0x138/0x1b8
LR is at ptep_set_access_flags+0x138/0x1b8
pc : [] lr : [] pstate: 2145
sp : 80001e8b7bc0
x29: 80001e8b7bc0 x28: 80001e843ac8
x27: 0040 x26: 80001e9ae0d8
x25: 29901000 x24: 80001f32a938
x23: 0001 x22: 80001e9ae088
x21: 7eb59000 x20: 80001e843ac8
x19: 006899371fd3 x18: fe09
x17: 7ea48c88 x16: cb5afb20
x15: 003b9aca x14: 307830203e2d2033
x13: 6462313733393930 x12: 3030303836303078
x11: 30203a676e697261 x10: 656c632065746174
x9 : 7320797472696420 x8 : 79636172203a7461
x7 : 5f6574705f746573 x6 : 28300ab8
x5 : 0003 x4 : 
x3 : 0003 x2 : 13d16f64
x1 : dfff2000 x0 : 004f

---[ end trace d75cd9bb88364c80 ]---
Call trace:
Exception stack(0x80001e8b79a0 to 0x80001e8b7ac0)
79a0: 006899371fd3 80001e843ac8 80001e8b7bc0 28497b70
79c0: 2145 003d 29901000 28301558
79e0: 41b58ab3 296870d0 28200668 292b0e40
7a00: 0001 80001f32a938 29901000 80001e9ae0d8
7a20: 0040 80001e843ac8 29901000 80001e9ae0d8
7a40: 2a9dcf60  0a6d9320 2000
7a60: 80001e8b7bc0 80001e8b7bc0 80001e8b7b80 ffc8
7a80: 80001e8b7ad0 28415418 80001e8b4000 13d16f64
7aa0: 004f dfff2000 13d16f64 0003
[] ptep_set_access_flags+0x138/0x1b8
[] handle_mm_fault+0xa24/0xfa0
[] do_page_fault+0x3d4/0x4c0
[] do_mem_abort+0xac/0x140


My kernel has ARM64_HW_AFDBM enabled, but LS2085 is not ARMv8.1.

The code shows it just checks whether ARM64_HW_AFDBM is enabled or not,
but doesn't check whether the CPU really has the capability.

So, it might be better to have the capability checked at runtime?
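
Something along these lines, perhaps (a sketch only; the HAFDBS field
layout is from the ARMv8.1 spec, and whether read_cpuid() is the right
accessor for this is an assumption):

static inline bool cpu_has_hw_afdbm(void)
{
        u64 mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);

        /* HAFDBS, bits [3:0]: 0 = none, 1 = AF only, 2 = AF and DBM */
        return (mmfr1 & 0xf) >= 2;
}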

Thanks,
Yang


Re: [PATCH] mm: move huge_pmd_set_accessed out of huge_memory.c

2016-04-21 Thread Shi, Yang

On 4/21/2016 12:30 AM, Kirill A. Shutemov wrote:

On Wed, Apr 20, 2016 at 02:00:11PM -0700, Shi, Yang wrote:

Hi folks,

I didn't realize pmd_* functions are protected by
CONFIG_TRANSPARENT_HUGEPAGE on most architectures before I made this
change.

Before I fix all the affected architectures' code, I want to check whether
you guys think this change is worthwhile or not.

Thanks,
Yang

On 4/20/2016 11:24 AM, Yang Shi wrote:

huge_pmd_set_accessed is only called by __handle_mm_fault from memory.c,
move the definition to memory.c and make it static like create_huge_pmd and
wp_huge_pmd.

Signed-off-by: Yang Shi 


On pte side we have the same functionality open-coded. Should we do the
same for pmd? Or change pte side the same way?


Sorry, I don't quite understand you. Do you mean pte_* functions?

Thanks,
Yang







Re: [PATCH] mm: move huge_pmd_set_accessed out of huge_memory.c

2016-04-21 Thread Shi, Yang

On 4/21/2016 2:15 AM, Hugh Dickins wrote:

On Wed, 20 Apr 2016, Shi, Yang wrote:


Hi folks,

I didn't realize pmd_* functions are protected by CONFIG_TRANSPARENT_HUGEPAGE
on most architectures before I made this change.

Before I fix all the affected architectures' code, I want to check whether
you guys think this change is worthwhile or not.


Thanks for asking: no, it is not worthwhile.

I would much prefer not to have to consider these trivial cleanups
in the huge memory area at this time.  Kirill and I have urgent work
to do in this area, and coping with patch conflicts between different
versions of the source will not help any of us.


Thanks for your suggestion. I will consider putting such cleanup work on
the back burner.


Yang



Thanks,
Hugh



Thanks,
Yang

On 4/20/2016 11:24 AM, Yang Shi wrote:

huge_pmd_set_accessed is only called by __handle_mm_fault from memory.c,
move the definition to memory.c and make it static like create_huge_pmd and
wp_huge_pmd.

Signed-off-by: Yang Shi 
---
   include/linux/huge_mm.h |  4 
   mm/huge_memory.c| 23 ---
   mm/memory.c | 23 +++
   3 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7008623..c218ab7b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -8,10 +8,6 @@ extern int do_huge_pmd_anonymous_page(struct mm_struct
*mm,
   extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct
*src_mm,
 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 struct vm_area_struct *vma);
-extern void huge_pmd_set_accessed(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pmd_t orig_pmd, int dirty);
   extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct
vm_area_struct *vma,
   unsigned long address, pmd_t *pmd,
   pmd_t orig_pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fecbbc5..6c14cb6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1137,29 +1137,6 @@ out:
return ret;
   }

-void huge_pmd_set_accessed(struct mm_struct *mm,
-  struct vm_area_struct *vma,
-  unsigned long address,
-  pmd_t *pmd, pmd_t orig_pmd,
-  int dirty)
-{
-   spinlock_t *ptl;
-   pmd_t entry;
-   unsigned long haddr;
-
-   ptl = pmd_lock(mm, pmd);
-   if (unlikely(!pmd_same(*pmd, orig_pmd)))
-   goto unlock;
-
-   entry = pmd_mkyoung(orig_pmd);
-   haddr = address & HPAGE_PMD_MASK;
-   if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty))
-   update_mmu_cache_pmd(vma, address, pmd);
-
-unlock:
-   spin_unlock(ptl);
-}
-
   static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index 93897f2..6ced4eb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3287,6 +3287,29 @@ static int wp_huge_pmd(struct mm_struct *mm, struct
vm_area_struct *vma,
return VM_FAULT_FALLBACK;
   }

+static void huge_pmd_set_accessed(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ pmd_t *pmd, pmd_t orig_pmd,
+ int dirty)
+{
+   spinlock_t *ptl;
+   pmd_t entry;
+   unsigned long haddr;
+
+   ptl = pmd_lock(mm, pmd);
+   if (unlikely(!pmd_same(*pmd, orig_pmd)))
+   goto unlock;
+
+   entry = pmd_mkyoung(orig_pmd);
+   haddr = address & HPAGE_PMD_MASK;
+   if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty))
+   update_mmu_cache_pmd(vma, address, pmd);
+
+unlock:
+   spin_unlock(ptl);
+}
+
   /*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most




[BUG linux-next] kernel NULL pointer dereference on linux-next-20160420

2016-04-21 Thread Shi, Yang

Hi folks,

I did the below test with huge tmpfs on linux-next-20160420:

# mount -t tmpfs -o huge=1 tmpfs /mnt
# cd /mnt
Then clone the Linux kernel there.

Then I got the bug below; the same test works well on non-huge tmpfs.

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [] release_freepages+0x18/0xa0
PGD 0
Oops:  [#1] PREEMPT SMP
Modules linked in:
CPU: 6 PID: 110 Comm: kcompactd0 Not tainted 
4.6.0-rc4-next-20160420-WR7.0.0.0_standard #4
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS 
S5500.86B.01.10.0025.030220091519 03/02/2009

task: 880361708040 ti: 880361704000 task.ti: 880361704000
RIP: 0010:[]  [] 
release_freepages+0x18/0xa0

RSP: 0018:880361707cf8  EFLAGS: 00010282
RAX:  RBX: 88036ffde7c0 RCX: 0009
RDX: 1bf1 RSI: 000f RDI: 880361707dd0
RBP: 880361707d20 R08: 0007 R09: 1600
R10: 88036ffde7c0 R11:  R12: 
R13: 880361707dd0 R14: 880361707dc0 R15: 880361707db0
FS:  () GS:880363cc() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 02206000 CR4: 06e0
Stack:
 88036ffde7c0  1a00 880361707dc0
 880361707db0 880361707da0 8119f13d 81196239
  880361708040 0001 0010
Call Trace:
 [] compact_zone+0x55d/0x9f0
 [] ? fragmentation_index+0x19/0x70
 [] kcompactd_do_work+0x10f/0x230
 [] kcompactd+0x90/0x1e0
 [] ? wait_woken+0xa0/0xa0
 [] ? kcompactd_do_work+0x230/0x230
 [] kthread+0xdd/0x100
 [] ret_from_fork+0x22/0x40
 [] ? kthread_create_on_node+0x180/0x180
Code: c1 fa 06 31 f6 e8 a9 9b fd ff eb 98 0f 1f 80 00 00 00 00 66 66 66 
66 90 55 48 89 e5 41 57 41 56 41 55 49 89 fd 41 54 53 48 8b 07 <48> 8b 
10 48 8d 78 e0 49 39 c5 4c 8d 62 e0 74 70 49 be 00 00 00

RIP  [] release_freepages+0x18/0xa0
 RSP 
CR2: 
---[ end trace 855da7e142f7311f ]---


Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

2016-04-18 Thread Shi, Yang

Hi Kirill,

Finally, I got some time to look into and try yours and Hugh's patches,
and I hit two problems.


1. A quick boot-up test on my ARM64 machine with your v7 tree shows some
unexpected errors:


systemd-journald[285]: Failed to save stream data 
/run/systemd/journal/streams/8:16863: No space left on device
systemd-journald[285]: Failed to save stream data 
/run/systemd/journal/streams/8:16865: No space left on device
 Starting DNS forwarder and DHCP server.systemd-journald[285]: 
Failed to save stream data /run/systemd/journal/streams/8:16867: No 
space left on device

..
systemd-journald[285]: Failed to save stream data 
/run/systemd/journal/streams/8:16869: No space left on device

 Starting Postfix Mail Transport Agent...
systemd-journald[285]: Failed to save stream data 
/run/systemd/journal/streams/8:16871: No space left on device

 Starting Berkeley Internet Name Domain (DNS)...
 Starting Wait for Network to be Configured...
systemd-journald[285]: Failed to save stream data 
/run/systemd/journal/streams/8:2422: No space left on device

[  OK  ] Started /etc/rc.local Compatibility.
[FAILED] Failed to start DNS forwarder and DHCP server.
See 'systemctl status dnsmasq.service' for details.
systemd-journald[285]: Failed to save stream data 
/run/systemd/journal/streams/8:2425: No space left on device

[  OK  ] Started Serial Getty on ttyS1.
[  OK  ] Started Serial Getty on ttyS0.
[  OK  ] Started Getty on tty1.
systemd-journald[285]: Failed to save stream data 
/run/systemd/journal/streams/8:2433: No space left on device

[FAILED] Failed to start Berkeley Internet Name Domain (DNS).
See 'systemctl status named.service' for details.


The /run dir is mounted as tmpfs.

An x86 boot doesn't hit such errors. And Hugh's patches don't have this
problem.


2. I ran my THP test (a generated program with a 4MB text section) on both
x86-64 and ARM64 with yours and Hugh's patches (linux-next tree). I got
the program execution time reduced by ~12% on x86-64, which looks very
impressive.

But on ARM64 there is just a ~3% change, and sometimes huge tmpfs may
show even worse numbers than non-hugepage.


Both yours and Hugh's patches have the same behavior.

Any idea?

Thanks,
Yang


On 4/15/2016 5:23 PM, Kirill A. Shutemov wrote:

This is probably the last update before the mm summit. The main focus is on
khugepaged stability.

khugepaged is in more reasonable shape now. I missed quite a few corner
cases on first try. I run this version via LTP, trinity and syzkaller
without crashes so far.

The patchset is on top of v4.6-rc3 plus Hugh's "easy preliminaries to
THPagecache" and Ebru's khugepaged swapin patches form -mm tree.

Git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v7

== Changelog ==

v7:
   - khugepaged updates:
 + fix page leak/page cache corruption on collapse fail;
  + filter out VMAs not suitable for huge pages due to misaligned vm_pgoff;
 + fix build without CONFIG_SHMEM;
 + drop few over-protective checks;
   - fix bogus VM_BUG_ON() in __delete_from_page_cache();

v6:
   - experimental collapse support;
   - fix swapout mapped huge pages;
   - fix page leak in faularound code;
   - fix excessive huge page allocation with huge=within_size;
   - rename VM_NO_THP to VM_NO_KHUGEPAGED;
   - fix condition in hugepage_madvise();
   - accounting reworked again;

v5:
   - add FileHugeMapped to /proc/PID/smaps;
   - make FileHugeMapped in meminfo aligned with other fields;
   - Documentation/vm/transhuge.txt updated;

v4:
   - first four patch were applied to -mm tree;
   - drop pages beyond i_size on split_huge_pages;
   - few small random bugfixes;

v3:
   - huge= mountoption now can have values always, within_size, advice and
 never;
   - sysctl handle is replaced with sysfs knob;
   - MADV_HUGEPAGE/MADV_NOHUGEPAGE is now respected on page allocation via
 page fault;
   - mlock() handling had been fixed;
   - bunch of smaller bugfixes and cleanups.

== Design overview ==

Huge pages are allocated by shmem when it's allowed (by mount option) and
there's no entries for the range in radix-tree. Huge page is represented by
HPAGE_PMD_NR entries in radix-tree.

MM core maps a page with PMD if ->fault() returns huge page and the VMA is
suitable for huge pages (size, alignment). There's no need into two
requests to file system: filesystem returns huge page if it can,
graceful fallback to small pages otherwise.

As with DAX, split_huge_pmd() is implemented by unmapping the PMD: we can
re-fault the page with PTEs later.

Basic scheme for split_huge_page() is the same as for anon-THP.
Few differences:

   - File pages are on radix-tree, so we have head->_count offset by
 HPAGE_PMD_NR. The count got distributed to small pages during split.

   - mapping->tree_lock prevents non-lockless access to pages under split
 over radix-tree;

   - Lockless access is prevented by setting the head->_count to 0 during
 split, so get_page_unless_z

Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

2016-04-19 Thread Shi, Yang

On 4/19/2016 7:33 AM, Jerome Marchand wrote:

On 04/19/2016 12:55 AM, Shi, Yang wrote:

2. I ran my THP test (a generated program with a 4MB text section) on both
x86-64 and ARM64 with yours and Hugh's patches (linux-next tree). I got
the program execution time reduced by ~12% on x86-64, which looks very
impressive.

But on ARM64 there is just a ~3% change, and sometimes huge tmpfs may
show even worse numbers than non-hugepage.

Both yours and Hugh's patches have the same behavior.

Any idea?


Just a shot in the dark, but what page size do you use? If you use 4k
pages, then hugepage size should be the same as on x86 and a similar


I do use 4K pages for both x86-64 and ARM64 in my testing.

Thanks,
Yang


behavior could be expected. Otherwise, hugepages would be too big to be
taken advantage of by your test program.

Jerome





KASAN can't change FRAME_WARN

2016-04-19 Thread Shi, Yang

Hi Andrey,

When I enable KASAN on 4.5 and 4.6 (I didn't try older versions),
I get a FRAME_WARN warning because the frame size exceeds 2048 bytes.


Then I found the kconfig looks like:

range 0 8192
default 0 if KASAN
default 1024 if !64BIT
default 2048 if 64BIT

In my understanding, FRAME_WARN should default to 0 once KASAN is enabled,
but it is still 2048. I tried a couple of fixes, e.g.


default 0 if KASAN
default 1024 if (!KASAN && !64BIT)
default 2048 if (!KASAN && 64BIT)

But nothing works, so I have to add "depends on !KASAN" to disable
FRAME_WARN completely, which causes the kernel image size to increase.


Any hint is appreciated.

Thanks,
Yang


[BUG linux-next] Kernel panic found with linux-next-20160414

2016-04-19 Thread Shi, Yang

Hi folks,

When I ran ltp on linux-next-20160414 on my ARM64 machine, I got the 
below kernel panic:


Unable to handle kernel paging request at virtual address ffc007846000
pgd = ffc01e21d000
[ffc007846000] *pgd=, *pud=
Internal error: Oops: 9647 [#11] PREEMPT SMP
Modules linked in: loop
CPU: 7 PID: 274 Comm: systemd-journal Tainted: G  D 
4.6.0-rc3-next-20160414-WR8.0.0.0_standard+ #9

Hardware name: Freescale Layerscape 2085a RDB Board (DT)
task: ffc01e3fcf80 ti: ffc01ea8c000 task.ti: ffc01ea8c000
PC is at copy_page+0x38/0x120
LR is at migrate_page_copy+0x604/0x1660
pc : [] lr : [] pstate: 2145
sp : ffc01ea8ecd0
x29: ffc01ea8ecd0 x28: 
x27: 1ff7b80240f8 x26: ffc018196f20
x25: ffbdc01e1180 x24: ffbdc01e1180
x23:  x22: ffc01e3fcf80
x21: ffc00481f000 x20: ff900a31d000
x19: ffbdc01207c0 x18: 0f00
x17:  x16: 
x15:  x14: 
x13:  x12: 
x11:  x10: 
x9 :  x8 : 
x7 :  x6 : 
x5 :  x4 : 
x3 :  x2 : 
x1 : ffc00481f080 x0 : ffc007846000

Call trace:
Exception stack(0xffc021fc2ed0 to 0xffc021fc2ff0)
2ec0:   ffbdc00887c0 ff900a31d000
2ee0: ffc021fc30f0 ff9008ff2318 2145 0025
2f00: ffbdc025a280 ffc020adc4c0 41b58ab3 ff900a085fd0
2f20: ff9008200658   ffbdc00887c0
2f40: ff900b0f1320 ffc021fc3078 41b58ab3 ff900a0864f8
2f60: ff9008210010 ffc021fb8960 ff900867bacc 1ff8043f712d
2f80: ffc021fc2fb0 ff9008210564 ffc021fc3070 ffc021fb8940
2fa0: 08221f78 ff900862f9c8 ffc021fc2fe0 ff9008215dc8
2fc0: 1ff8043f8602 ffc021fc ffc00968a000 ffc00221f080
2fe0: f9407e11d1f0 d61f02209103e210
[] copy_page+0x38/0x120
[] migrate_page+0x74/0x98
[] nfs_migrate_page+0x58/0x80
[] move_to_new_page+0x15c/0x4d8
[] migrate_pages+0x7c8/0x11f0
[] compact_zone+0xdfc/0x2570
[] compact_zone_order+0xe0/0x170
[] try_to_compact_pages+0x2e8/0x8f8
[] __alloc_pages_direct_compact+0x100/0x540
[] __alloc_pages_nodemask+0xc40/0x1c58
[] khugepaged+0x468/0x19c8
[] kthread+0x248/0x2c0
[] ret_from_fork+0x10/0x40
Code: d281f012 91020021 f1020252 d503201f (a8000c02)


I did some initial investigation and found it is caused by 
DEBUG_PAGEALLOC and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT. And, mainline 
4.6-rc3 works well.


It should not be arch-specific, although I caught it on ARM64. I
suspect this might be caused by Hugh's huge tmpfs patches.


Thanks,
Yang


Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

2016-04-19 Thread Shi, Yang

On 4/19/2016 9:50 AM, Andrea Arcangeli wrote:

Hello,

On Mon, Apr 18, 2016 at 03:55:44PM -0700, Shi, Yang wrote:

Hi Kirill,

Finally, I got some time to look into and try yours and Hugh's patches,
and I hit two problems.


One thing that come to mind to test is this: qemu with -machine
accel=kvm -mem-path=/dev/shm/,share=on .


Thanks for the suggestion, I will definitely have a try with KVM.

It would be better if Kirill and Hugh could share what benchmark they
ran and how much improvement they got, since my test case is very simple
and may cover just a small part of it.


Yang



The THP Compound approach in tmpfs may just happen to work already
with KVM (or at worst it'd require minor adjustments) because it uses
the exact same model KVM is already aware about from THP in anonymous
memory, example from arch/x86/kvm/mmu.c:

static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
gfn_t *gfnp, kvm_pfn_t *pfnp,
int *levelp)
{
kvm_pfn_t pfn = *pfnp;
gfn_t gfn = *gfnp;
int level = *levelp;

/*
 * Check if it's a transparent hugepage. If this would be an
 * hugetlbfs page, level wouldn't be set to
 * PT_PAGE_TABLE_LEVEL and there would be no adjustment done
 * here.
 */
if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
level == PT_PAGE_TABLE_LEVEL &&
PageTransCompound(pfn_to_page(pfn)) &&
!mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {

Not using two different models between THP in tmpfs and THP in anon is
essential not just to significantly reduce the size of the kernel
code, but also because THP knowledge can't be self contained in the
mm/shmem.c file. Having to support two different models would
complicate things for secondary MMU drivers (i.e. mmu notifer users)
like KVM who also need to create huge mapping in the shadow pagetable
layer in arch/x86/kvm if the primary MMU allows for it.


x86-64 and ARM64 with yours and Hugh's patches (linux-next tree), I got
the program execution time reduced by ~12% on x86-64, which looks very
impressive.


Agreed, both patchsets are impressive work and achieve amazing
results!

My view is that in terms of long-lived computation from userland point
of view, both models are malleable enough and could achieve everything
we need in the end, but as far as the overall kernel efficiency is
concerned the compound model will always retain a slight advantage in
performance by leveraging a native THP compound refcounting that
requires just one atomic_inc/dec per THP mapcount instead of 512 of
them. Other advantages of the compound model is that it's half in code
size despite already including khugepaged (i.e. the same
split_huge_page works for both tmpfs and anon) and like said above it
won't introduce much complications for drivers like KVM as the model
didn't change.

Thanks,
Andrea





Warning triggered by lockdep checks for sock_owned_by_user on linux-next-20160420

2016-04-22 Thread Shi, Yang

Hi David,

When I ran some tests on an NFS-mounted rootfs, I got the warning below
with LOCKDEP enabled on linux-next-20160420:


WARNING: CPU: 9 PID: 0 at include/net/sock.h:1408 
udp_queue_rcv_skb+0x3d0/0x660

Modules linked in:
CPU: 9 PID: 0 Comm: swapper/9 Tainted: G  D 
4.6.0-rc4-next-20160420-WR7.0.0.0_standard+ #6
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS 
S5500.86B.01.10.0025.030220091519 03/02/2009

  88066fd03a70 8155855f 
  88066fd03ab0 81062803 058061318ec8
 88065d1e39c0 880661318e40  880661318ec8
Call Trace:
   [] dump_stack+0x67/0x98
Checking out fil [] __warn+0xd3/0xf0
 [] warn_slowpath_null+0x1d/0x20
 [] udp_queue_rcv_skb+0x3d0/0x660
 [] __udp4_lib_rcv+0x4dc/0xc00
 [] udp_rcv+0x1a/0x20
 [] ip_local_deliver_finish+0xd1/0x2e0
es:  57% (30585/ [] ? ip_local_deliver_finish+0x3f/0x2e0
 [] ip_local_deliver+0xc2/0xd0
 [] ip_rcv_finish+0x1e2/0x5a0
 [] ip_rcv+0x2dc/0x410
 [] ? __pskb_pull_tail+0x82/0x400
 [] __netif_receive_skb_core+0x3a8/0xa80
 [] ? netif_receive_skb_internal+0x1b/0xf0
 [] __netif_receive_skb+0x1d/0x60
 [] netif_receive_skb_internal+0x55/0xf0
 [] ? netif_receive_skb_internal+0x1b/0xf0
 [] napi_gro_receive+0xc2/0x180
 [] igb_poll+0x5ea/0xdf0
 [] net_rx_action+0x15c/0x3d0
 [] __do_softirq+0x161/0x413
 [] irq_exit+0xd1/0x110
 [] do_IRQ+0x62/0xf0
 [] common_interrupt+0x8e/0x8e
   [] ? cpuidle_enter_state+0xc6/0x290
 [] cpuidle_enter+0x17/0x20
 [] call_cpuidle+0x33/0x50
 [] cpu_startup_entry+0x229/0x3b0
 [] start_secondary+0x144/0x150
---[ end trace ba508c424f0d52bf ]---


The warning is triggered by commit 
fafc4e1ea1a4c1eb13a30c9426fb799f5efacbc3 ("sock: tigthen lockdep checks 
for sock_owned_by_user"), which checks if slock is held before locking 
"owned".


It looks fine for lock_sock, which just calls lock_sock_nested. But
bh_lock_sock is different: it just calls spin_lock, so it doesn't touch
the dep_map, and the check will fail even though the lock is held.


So, I'm wondering what the right fix for it should be:

1. Replace bh_lock_sock with bh_lock_sock_nested in the protocol
implementations, but there are a lot of places calling it.

2. Just like lock_sock, have bh_lock_sock itself take the lock with the
nested annotation instead of a plain spin_lock; see the sketch below.

Or are both approaches wrong or not ideal?
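
For reference, these are the two existing macros (quoting
include/net/sock.h from memory); option 2 would mean making the plain
variant take the lock with the same lockdep annotation the nested
variant already uses:

#define bh_lock_sock(__sk)      spin_lock(&((__sk)->sk_lock.slock))
#define bh_lock_sock_nested(__sk) \
        spin_lock_nested(&((__sk)->sk_lock.slock), SINGLE_DEPTH_NESTING)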

Thanks,
Yang


Re: Warning triggered by lockdep checks for sock_owned_by_user on linux-next-20160420

2016-04-25 Thread Shi, Yang

On 4/22/2016 9:50 PM, Eric Dumazet wrote:

On Fri, 2016-04-22 at 21:02 -0700, Shi, Yang wrote:

Hi David,

When I ran some tests on an NFS-mounted rootfs, I got the warning below
with LOCKDEP enabled on linux-next-20160420:

WARNING: CPU: 9 PID: 0 at include/net/sock.h:1408
udp_queue_rcv_skb+0x3d0/0x660
Modules linked in:
CPU: 9 PID: 0 Comm: swapper/9 Tainted: G  D
4.6.0-rc4-next-20160420-WR7.0.0.0_standard+ #6
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS
S5500.86B.01.10.0025.030220091519 03/02/2009
    88066fd03a70 8155855f 
    88066fd03ab0 81062803 058061318ec8
   88065d1e39c0 880661318e40  880661318ec8
Call Trace:
 [] dump_stack+0x67/0x98
Checking out fil [] __warn+0xd3/0xf0
   [] warn_slowpath_null+0x1d/0x20
   [] udp_queue_rcv_skb+0x3d0/0x660
   [] __udp4_lib_rcv+0x4dc/0xc00
   [] udp_rcv+0x1a/0x20
   [] ip_local_deliver_finish+0xd1/0x2e0
es:  57% (30585/ [] ? ip_local_deliver_finish+0x3f/0x2e0
   [] ip_local_deliver+0xc2/0xd0
   [] ip_rcv_finish+0x1e2/0x5a0
   [] ip_rcv+0x2dc/0x410
   [] ? __pskb_pull_tail+0x82/0x400
   [] __netif_receive_skb_core+0x3a8/0xa80
   [] ? netif_receive_skb_internal+0x1b/0xf0
   [] __netif_receive_skb+0x1d/0x60
   [] netif_receive_skb_internal+0x55/0xf0
   [] ? netif_receive_skb_internal+0x1b/0xf0
   [] napi_gro_receive+0xc2/0x180
   [] igb_poll+0x5ea/0xdf0
   [] net_rx_action+0x15c/0x3d0
   [] __do_softirq+0x161/0x413
   [] irq_exit+0xd1/0x110
   [] do_IRQ+0x62/0xf0
   [] common_interrupt+0x8e/0x8e
 [] ? cpuidle_enter_state+0xc6/0x290
   [] cpuidle_enter+0x17/0x20
   [] call_cpuidle+0x33/0x50
   [] cpu_startup_entry+0x229/0x3b0
   [] start_secondary+0x144/0x150
---[ end trace ba508c424f0d52bf ]---


The warning is triggered by commit
fafc4e1ea1a4c1eb13a30c9426fb799f5efacbc3 ("sock: tigthen lockdep checks
for sock_owned_by_user"), which checks if slock is held before locking
"owned".

The check is fine for lock_sock(), which just calls lock_sock_nested(). But
bh_lock_sock() is different: it just calls spin_lock(), so it doesn't
touch the dep_map, and the check fails even though the lock is held.


?? spin_lock() definitely is lockdep friendly.


Yes, this is what I thought too. But I couldn't figure out why the 
warning was still reported even though spin_lock() is called.


So, I'm wondering what the right fix should be:

1. Replace bh_lock_sock with bh_lock_sock_nested in the protocol
implementations, but there are a lot of places calling it.

2. Just like lock_sock, have bh_lock_sock simply call
bh_lock_sock_nested instead of spin_lock.

Or are both approaches wrong or not ideal?


I sent a patch yesterday, I am not sure what the status is.


Thanks for the patch. I just found your original patch and the 
discussion with Valdis. I think I ran into the same problem: a kernel 
BUG is triggered before the warning, but the "lockdep is off" 
information is not printed out, although lock debugging really is off.


Just tried your patch, it works for me.

Thanks,
Yang



diff --git a/include/net/sock.h b/include/net/sock.h
index d997ec13a643..db8301c76d50 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1350,7 +1350,8 @@ static inline bool lockdep_sock_is_held(const struct sock *csk)
  {
struct sock *sk = (struct sock *)csk;

-   return lockdep_is_held(&sk->sk_lock) ||
+   return !debug_locks ||
+  lockdep_is_held(&sk->sk_lock) ||
   lockdep_is_held(&sk->sk_lock.slock);
  }
  #endif
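
For context on why the extra test works: once an oops has called 
debug_locks_off(), debug_locks is 0 and lockdep_is_held() can no longer 
be trusted to report held locks, so having lockdep_sock_is_held() 
short-circuit to true via !debug_locks suppresses the false warning 
instead of consulting stale lockdep state.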


Re: [linux-next PATCH] sched: cgroup: enable interrupt before calling threadgroup_change_begin

2016-04-25 Thread Shi, Yang

On 4/23/2016 2:14 AM, Peter Zijlstra wrote:

On Fri, Apr 22, 2016 at 08:56:28PM -0700, Yang Shi wrote:

When a kernel oops happens in some kernel thread, e.g. kcompactd in this test,
the below bug might be triggered by the oops handler:


What are you trying to fix? You already oopsed; the thing is wrecked.


Actually, I ran into the below kernel BUG:

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [] release_freepages+0x18/0xa0
PGD 0
Oops:  [#1] PREEMPT SMP
Modules linked in:
CPU: 6 PID: 110 Comm: kcompactd0 Not tainted 4.6.0-rc4-next-20160420 #4
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS 
S5500.86B.01.10.0025.030220091519 03/02/2009

task: 880361732680 ti: 88036173c000 task.ti: 88036173c000
RIP: 0010:[]  [] 
release_freepages+0x18/0xa0

RSP: 0018:88036173fcf8  EFLAGS: 00010282
RAX:  RBX: 88036ffde7c0 RCX: 0009
RDX: 1bf1 RSI: 000f RDI: 88036173fdd0
RBP: 88036173fd20 R08: 0007 R09: 1600
R10: 88036ffde7c0 R11:  R12: 
R13: 88036173fdd0 R14: 88036173fdc0 R15: 88036173fdb0
FS:  () GS:880363cc() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 02206000 CR4: 06e0
Stack:
 88036ffde7c0  1a00 88036173fdc0
 88036173fdb0 88036173fda0 8119f13d 81196239
  880361732680 0001 0010
Call Trace:
 [] compact_zone+0x55d/0x9f0
 [] ? fragmentation_index+0x19/0x70
 [] kcompactd_do_work+0x10f/0x230
 [] kcompactd+0x90/0x1e0
 [] ? wait_woken+0xa0/0xa0
 [] ? kcompactd_do_work+0x230/0x230
 [] kthread+0xdd/0x100
 [] ret_from_fork+0x22/0x40
 [] ? kthread_create_on_node+0x180/0x180
Code: c1 fa 06 31 f6 e8 a9 9b fd ff eb 98 0f 1f 80 00 00 00 00 66 66 66 
66 90 55 48 89 e5 41 57 41 56 41 55 49 89 fd 41 54 53 48 8b 07 <48> 8b 
10 48 8d 78 e0 49 39 c5 4c 8d 62 e0 74 70 49 be 00 00 00

RIP  [] release_freepages+0x18/0xa0
 RSP 
CR2: 
---[ end trace 2e96d09e0ba6342f ]---

Then the "schedule in atomic context" bug is triggered, which causes the 
system to hang. Without that bug the system stays alive: the earlier 
null pointer dereference doesn't bring the system down other than 
killing the kcompactd kthread.


Thanks,
Yang







Re: [linux-next PATCH] sched: cgroup: enable interrupt before calling threadgroup_change_begin

2016-04-25 Thread Shi, Yang

On 4/25/2016 10:35 AM, Shi, Yang wrote:

On 4/23/2016 2:14 AM, Peter Zijlstra wrote:

On Fri, Apr 22, 2016 at 08:56:28PM -0700, Yang Shi wrote:

When a kernel oops happens in some kernel thread, e.g. kcompactd in this
test, the below bug might be triggered by the oops handler:


What are you trying to fix? You already oopsed; the thing is wrecked.


Actually, I ran into the below kernel BUG:

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [] release_freepages+0x18/0xa0
PGD 0
Oops:  [#1] PREEMPT SMP
Modules linked in:
CPU: 6 PID: 110 Comm: kcompactd0 Not tainted 4.6.0-rc4-next-20160420 #4
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS
S5500.86B.01.10.0025.030220091519 03/02/2009
task: 880361732680 ti: 88036173c000 task.ti: 88036173c000
RIP: 0010:[]  []
release_freepages+0x18/0xa0
RSP: 0018:88036173fcf8  EFLAGS: 00010282
RAX:  RBX: 88036ffde7c0 RCX: 0009
RDX: 1bf1 RSI: 000f RDI: 88036173fdd0
RBP: 88036173fd20 R08: 0007 R09: 1600
R10: 88036ffde7c0 R11:  R12: 
R13: 88036173fdd0 R14: 88036173fdc0 R15: 88036173fdb0
FS:  () GS:880363cc()
knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 02206000 CR4: 06e0
Stack:
  88036ffde7c0  1a00 88036173fdc0
  88036173fdb0 88036173fda0 8119f13d 81196239
   880361732680 0001 0010
Call Trace:
  [] compact_zone+0x55d/0x9f0
  [] ? fragmentation_index+0x19/0x70
  [] kcompactd_do_work+0x10f/0x230
  [] kcompactd+0x90/0x1e0
  [] ? wait_woken+0xa0/0xa0
  [] ? kcompactd_do_work+0x230/0x230
  [] kthread+0xdd/0x100
  [] ret_from_fork+0x22/0x40
  [] ? kthread_create_on_node+0x180/0x180
Code: c1 fa 06 31 f6 e8 a9 9b fd ff eb 98 0f 1f 80 00 00 00 00 66 66 66
66 90 55 48 89 e5 41 57 41 56 41 55 49 89 fd 41 54 53 48 8b 07 <48> 8b
10 48 8d 78 e0 49 39 c5 4c 8d 62 e0 74 70 49 be 00 00 00
RIP  [] release_freepages+0x18/0xa0
  RSP 
CR2: 
---[ end trace 2e96d09e0ba6342f ]---

Then the "schedule in atomic context" bug is triggered, which causes the
system to hang. Without that bug the system stays alive: the earlier
null pointer dereference doesn't bring the system down other than
killing the kcompactd kthread.


BTW, I don't have "panic on oops" set. So, the kernel doesn't panic.

Thanks,
Yang



Thanks,
Yang









Re: [PATCH] panic: lockdep: correct lock debugging state check

2016-04-26 Thread Shi, Yang

On 4/26/2016 5:39 AM, Peter Zijlstra wrote:

On Mon, Apr 25, 2016 at 08:36:37PM -0700, Yang Shi wrote:

When kernel oops happens, lock debugging is turned off by debug_locks_off()
in oops_enter() via calling __debug_locks_off() which set debug_locks to 0
via xchg(). But, calling to __debug_locks_off() to check lock debugging state
in add_taint() called by oops_end() will always return false since xchg()
returns the old value of debug_locks which is cleared in oops_enter() already.

This prevents add_taint() from printing out lock debugging disable information
although LOCKDEP_NOW_UNRELIABLE is passed to it.

Check lock debugging state via !debug_locks to fix this. Although
!__debug_locks_off() could do the same thing, it may look confusing.


What are you smoking? This is the second completely insane patch you've
sent this week.

This breaks add_taint() and gains us nothing except trivialities. Who


I apologize in advance: if I've misunderstood the code, please ignore 
all the bullshit below.


In my understanding, add_taint() should call that pr_warn if 
LOCKDEP_NOW_UNRELIABLE is passed and lock debugging is disabled. This is 
what the code tells me.


LOCKDEP_NOW_UNRELIABLE is passed via the lock_ok parameter, and lock 
debugging has already been turned off by debug_locks_off(), so it should 
print something. But it doesn't, since __debug_locks_off() always 
returns 0 at that point.
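
A paraphrased sketch of the pieces involved (trimmed from 
include/linux/debug_locks.h and kernel/panic.c; not the verbatim source):

/* include/linux/debug_locks.h: xchg() returns the OLD value */
#define __debug_locks_off() xchg(&debug_locks, 0)

/* kernel/panic.c */
void add_taint(unsigned flag, enum lockdep_ok lockdep_ok)
{
	if (lockdep_ok == LOCKDEP_NOW_UNRELIABLE && __debug_locks_off())
		pr_warn("Disabling lock debugging due to kernel taint\n");

	set_bit(flag, &tainted_mask);
}

Since oops_enter() has already zeroed debug_locks, the xchg() here reads
back 0 and the pr_warn() is skipped.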


So it looks like the if statement logic is broken.

There are alternative ways to fix it; I may have picked the less ideal one.


bloody cares about that print if you've just had an OOPS.


I do agree that not many people care about that print, and the 
information is too trivial to draw much attention. However, that 
doesn't mean the oops output is a good place to hide something wrong. I 
happened to find this while checking the oops information for clues 
about another issue, and it looked like a quick fix that would make the 
kernel a little better.


Thanks,
Yang







Re: [PATCH] arm64: remove redundant FRAME_POINTER kconfig option

2015-11-06 Thread Shi, Yang

On 11/6/2015 8:25 AM, Will Deacon wrote:

On Fri, Nov 06, 2015 at 04:21:09PM +, Catalin Marinas wrote:

On Fri, Nov 06, 2015 at 12:50:02PM +, Mark Rutland wrote:

On Fri, Nov 06, 2015 at 12:30:09PM +, Will Deacon wrote:

On Wed, Nov 04, 2015 at 09:37:51AM -0800, Yang Shi wrote:

FRAME_POINTER is defined in lib/Kconfig.debug, it is unnecessary to redefine
it in arch/arm64/Kconfig.debug.


It might be worth noting that this adds a dependency on DEBUG_KERNEL
for building with frame pointers. I'm ok with that (it appears to be
enabled in defconfig and follows the vast majority of other archs) but
it is a change in behaviour.

With that:

   Acked-by: Will Deacon 


The code in arch/arm64/kernel/stacktrace.c assumes we have frame
pointers regardless of FRAME_POINTER. Depending on what the compiler
decides to use x29 for, we could get some weird fake unwinding and/or
dodgy memory accesses.

I think we should first audit the uses of frame pointers to ensure that
they are guarded for !FRAME_POINTER.


Or we just select FRAME_POINTER in the ARM64 Kconfig entry.


Yang, did you see any benefit disabling frame pointers, or was this patch
purely based on you spotting a duplicate Kconfig entry?


No real benefit; it was purely based on spotting the duplicate Kconfig entry.

FRAME_POINTER is defined in both lib/Kconfig.debug and 
arch/arm64/Kconfig.debug.


The lib/Kconfig.debug one looks like:

config FRAME_POINTER
bool "Compile the kernel with frame pointers"
depends on DEBUG_KERNEL && \
(CRIS || M68K || FRV || UML || \
 AVR32 || SUPERH || BLACKFIN || MN10300 || METAG) || \
ARCH_WANT_FRAME_POINTERS
default y if (DEBUG_INFO && UML) || ARCH_WANT_FRAME_POINTERS
help
  If you say Y here the resulting kernel image will be slightly
  larger and slower, but it gives very useful debugging information
  in case of kernel bugs. (precise oopses/stacktraces/warnings)

The common one just depends on DEBUG_KERNEL && ARCH_WANT_FRAME_POINTERS.

ARCH_WANT_FRAME_POINTERS is selected by the ARM64 Kconfig entry.
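
For reference, the arm64 entry selecting it looks roughly like this
(abbreviated from arch/arm64/Kconfig; only the relevant line shown):

config ARM64
	...
	select ARCH_WANT_FRAME_POINTERS
	...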

To answer Catalin's question about:


However, the patch would allow one to
disable FRAME_POINTERS (not sure it has any effect on the aarch64 gcc
though).


No, it doesn't. Actually, FRAME_POINTER could be disabled regardless of 
the patch.


Thanks,
Yang



Will





Re: [PATCH] arm64: remove redundant FRAME_POINTER kconfig option

2015-11-06 Thread Shi, Yang

On 11/6/2015 9:35 AM, Catalin Marinas wrote:

On Fri, Nov 06, 2015 at 09:23:38AM -0800, Shi, Yang wrote:

On 11/6/2015 8:25 AM, Will Deacon wrote:

However, the patch would allow one to
disable FRAME_POINTERS (not sure it has any effect on the aarch64 gcc
though).


No, it doesn't. Actually, FRAME_POINTER could be disabled regardless of the
patch.


In which case I suggest that we always select it just as a clearer
statement that the feature cannot be disabled (and you never know what
the compiler people decide to do in the future).


Do you mean select FRAME_POINTER in ARCH_WANT_FRAME_POINTERS?

Yes, we could, but this may cause other architectures which select 
ARCH_WANT_FRAME_POINTERS to have FRAME_POINTER selected too.


Or we could do:

select FRAME_POINTER if ARM64

Thanks,
Yang







Re: [PATCH] arm64: remove redundant FRAME_POINTER kconfig option

2015-11-06 Thread Shi, Yang

On 11/6/2015 9:51 AM, Catalin Marinas wrote:

On Fri, Nov 06, 2015 at 09:39:07AM -0800, Shi, Yang wrote:

On 11/6/2015 9:35 AM, Catalin Marinas wrote:

On Fri, Nov 06, 2015 at 09:23:38AM -0800, Shi, Yang wrote:

On 11/6/2015 8:25 AM, Will Deacon wrote:

However, the patch would allow one to
disable FRAME_POINTERS (not sure it has any effect on the aarch64 gcc
though).


No, it doesn't. Actually, FRAME_POINTER could be disabled regardless of the
patch.


In which case I suggest that we always select it just as a clearer
statement that the feature cannot be disabled (and you never know what
the compiler people decide to do in the future).


Do you mean select FRAME_POINTER in ARCH_WANT_FRAME_POINTERS?

Yes, we could, but this may cause other architectures which select
ARCH_WANT_FRAME_POINTERS to have FRAME_POINTER selected too.


This would have been the ideal option, something like:

--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -322,7 +322,7 @@ config ARCH_WANT_FRAME_POINTERS
help

  config FRAME_POINTER
-   bool "Compile the kernel with frame pointers"
-   bool "Compile the kernel with frame pointers" if !ARCH_WANT_FRAME_POINTERS
depends on DEBUG_KERNEL && \
(CRIS || M68K || FRV || UML || \
 AVR32 || SUPERH || BLACKFIN || MN10300 || METAG) || \

But, as you said, we would need to check the other architectures
selecting ARCH_WANT_FRAME_POINTERS.


How about:

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 1d1521c..709255a 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -319,6 +319,7 @@ config DEBUG_SECTION_MISMATCH
 #
 config ARCH_WANT_FRAME_POINTERS
bool
+   select FRAME_POINTER if ARM64
help

 config FRAME_POINTER

If other architectures want the same behavior, they could easily be 
appended to the if statement. And if all arches which select 
ARCH_WANT_FRAME_POINTERS end up wanting it, the if statement could just 
be removed.


Yang



In the meantime:

--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -27,6 +27,7 @@ config ARM64
select CPU_PM if (SUSPEND || CPU_IDLE)
select DCACHE_WORD_ACCESS
select EDAC_SUPPORT
+   select FRAME_POINTER
select GENERIC_ALLOCATOR
select GENERIC_CLOCKEVENTS
select GENERIC_CLOCKEVENTS_BROADCAST





Re: [PATCH] arm64: bpf: fix JIT stack setup

2015-11-06 Thread Shi, Yang

Please ignore this one; I forgot to cc the linux-arm-kernel list.

Sorry for the inconvenience.

Yang


On 11/6/2015 9:34 PM, Yang Shi wrote:

ARM64 JIT used FP (x29) as the eBPF fp register, but FP is subject to
change during function calls, so the BPF prog stack base address may
change too. Moreover, it pointed to the bottom of the BPF prog stack
instead of the top.

So, when copying data via bpf_probe_read, the data is copied to
(SP - offset), and it may overwrite the saved FP/LR.

Use x25 to replace FP as the BPF stack base register (fp). Since x25 is
a callee-saved register, it stays intact across function calls. It is
initialized in the BPF prog prologue every time the BPF prog starts to
run. When the BPF prog exits, it can just be tossed.

Other than this, the BPF prog stack base needs to be set up before the
function call stack.

So, the BPF stack layout looks like:

  high
  original A64_SP =>   0:+-+ BPF prologue
 | | FP/LR and callee saved registers
  BPF fp register => +64:+-+
 | |
 | ... | BPF prog stack
 | |
 | |
  current A64_SP =>  +-+
 | |
 | ... | Function call stack
 | |
 +-+
   low

Signed-off-by: Yang Shi 
CC: Zi Shen Lim 
CC: Xi Wang 
---
  arch/arm64/net/bpf_jit_comp.c | 38 +++---
  1 file changed, 31 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index a44e529..6809647 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -50,7 +50,7 @@ static const int bpf2a64[] = {
[BPF_REG_8] = A64_R(21),
[BPF_REG_9] = A64_R(22),
/* read-only frame pointer to access stack */
-   [BPF_REG_FP] = A64_FP,
+   [BPF_REG_FP] = A64_R(25),
/* temporary register for internal BPF JIT */
[TMP_REG_1] = A64_R(23),
[TMP_REG_2] = A64_R(24),
@@ -155,18 +155,42 @@ static void build_prologue(struct jit_ctx *ctx)
stack_size += 4; /* extra for skb_copy_bits buffer */
stack_size = STACK_ALIGN(stack_size);

+   /*
+* BPF prog stack layout
+*
+* high
+* original A64_SP =>   0:+-+ BPF prologue
+*| | FP/LR and callee saved registers
+* BPF fp register => +64:+-+
+*| |
+*| ... | BPF prog stack
+*| |
+*| |
+* current A64_SP =>  +-+
+*| |
+*| ... | Function call stack
+*| |
+*+-+
+*  low
+*
+*/
+
+   /* Save FP and LR registers to stay aligned with the ARM64 AAPCS */
+   emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
+
/* Save callee-saved register */
emit(A64_PUSH(r6, r7, A64_SP), ctx);
emit(A64_PUSH(r8, r9, A64_SP), ctx);
if (ctx->tmp_used)
emit(A64_PUSH(tmp1, tmp2, A64_SP), ctx);

-   /* Set up BPF stack */
-   emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);
-
-   /* Set up frame pointer */
+   /* Set up BPF prog stack base register (x25) */
emit(A64_MOV(1, fp, A64_SP), ctx);

+   /* Set up function call stack */
+   emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);
+   emit(A64_MOV(1, A64_FP, A64_SP), ctx);
+
/* Clear registers A and X */
emit_a64_mov_i64(ra, 0, ctx);
emit_a64_mov_i64(rx, 0, ctx);
@@ -196,8 +220,8 @@ static void build_epilogue(struct jit_ctx *ctx)
emit(A64_POP(r8, r9, A64_SP), ctx);
emit(A64_POP(r6, r7, A64_SP), ctx);

-   /* Restore frame pointer */
-   emit(A64_MOV(1, fp, A64_SP), ctx);
+   /* Restore FP/LR registers */
+   emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);

/* Set return value */
emit(A64_MOV(1, A64_R(0), r0), ctx);
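
For background on the register choice (my gloss, not from the patch 
itself): AAPCS64 designates x19-x28 as callee-saved, so any function the 
JITed program calls has to preserve x25, whereas x29 (FP) is repointed 
at each callee's own frame by its prologue, which is exactly why FP was 
unreliable as the eBPF fp register here.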





Re: [PATCH] bpf: convert hashtab lock to raw lock

2015-11-02 Thread Shi, Yang

On 11/2/2015 12:59 AM, Thomas Gleixner wrote:

On Sun, 1 Nov 2015, Alexei Starovoitov wrote:

On Sat, Oct 31, 2015 at 09:47:36AM -0400, Steven Rostedt wrote:

On Fri, 30 Oct 2015 17:03:58 -0700
Alexei Starovoitov  wrote:


On Fri, Oct 30, 2015 at 03:16:26PM -0700, Yang Shi wrote:

When running bpf samples on rt kernel, it reports the below warning:

BUG: sleeping function called from invalid context at 
kernel/locking/rtmutex.c:917
in_atomic(): 1, irqs_disabled(): 128, pid: 477, name: ping
Preemption disabled at:[] kprobe_perf_func+0x30/0x228

...

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 83c209d..972b76b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -17,7 +17,7 @@
  struct bpf_htab {
struct bpf_map map;
struct hlist_head *buckets;
-   spinlock_t lock;
+   raw_spinlock_t lock;


How do we address such things in general?
I bet there are tons of places around the kernel that
call spin_lock from atomic.
I'd hate to lose the benefits of lockdep of non-raw spin_lock
just to make rt happy.


You wont lose any benefits of lockdep. Lockdep still checks
raw_spin_lock(). The only difference between raw_spin_lock and
spin_lock is that in -rt spin_lock turns into an rt_mutex() and
raw_spin_lock stays a spin lock.


I see. The patch makes sense then.
Would be good to document this peculiarity of spin_lock.


I'm working on a document.


Thanks Steven and Thomas for your elaboration and comment.

Yang



Thanks,

tglx





Re: [PATCH] bpf: convert hashtab lock to raw lock

2015-11-02 Thread Shi, Yang

On 10/31/2015 11:37 AM, Daniel Borkmann wrote:

On 10/31/2015 02:47 PM, Steven Rostedt wrote:

On Fri, 30 Oct 2015 17:03:58 -0700
Alexei Starovoitov  wrote:

On Fri, Oct 30, 2015 at 03:16:26PM -0700, Yang Shi wrote:

When running bpf samples on rt kernel, it reports the below warning:

BUG: sleeping function called from invalid context at
kernel/locking/rtmutex.c:917
in_atomic(): 1, irqs_disabled(): 128, pid: 477, name: ping
Preemption disabled at:[] kprobe_perf_func+0x30/0x228

...

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 83c209d..972b76b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -17,7 +17,7 @@
  struct bpf_htab {
  struct bpf_map map;
  struct hlist_head *buckets;
-spinlock_t lock;
+raw_spinlock_t lock;


How do we address such things in general?
I bet there are tons of places around the kernel that
call spin_lock from atomic.
I'd hate to lose the benefits of lockdep of non-raw spin_lock
just to make rt happy.


You wont lose any benefits of lockdep. Lockdep still checks
raw_spin_lock(). The only difference between raw_spin_lock and
spin_lock is that in -rt spin_lock turns into an rt_mutex() and
raw_spin_lock stays a spin lock.


( Btw, Yang, would have been nice if your commit description would have
   already included such info, not only that you convert it, but also why
   it's okay to do so. )


I think Thomas's document will include all the information about rt spin 
lock/raw spin lock, etc.


Alexei & Daniel,

If you think such info is necessary, I definitely could add it into the 
commit log in v2.





The error is that in -rt, you called a mutex and not a spin lock while
atomic.


You are right, I think this happens due to the preempt_disable() in the
trace_call_bpf() handler. So, I think the patch seems okay. The dep_map
is btw union'ed in the struct spinlock case to the same offset of the
dep_map from raw_spinlock.

It's a bit inconvenient, though, when we add other library code as maps
in future, f.e. things like rhashtable as they would first need to be
converted to raw_spinlock_t as well, but judging from the git log, it
looks like common practice.


Yes, converting a sleepable spin lock to a raw spin lock is common 
practice in -rt to avoid the scheduling-in-atomic-context bug.
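
As an illustration of the difference (a minimal sketch, not taken from 
the patch; the lock names and function are made up):

#include <linux/spinlock.h>

static DEFINE_RAW_SPINLOCK(example_raw_lock); /* stays a spinning lock on -rt */
static DEFINE_SPINLOCK(example_lock);         /* becomes a sleeping rt_mutex on -rt */

static void example_atomic_path(void)
{
	unsigned long flags;

	/* Safe in atomic context on -rt: never sleeps. */
	raw_spin_lock_irqsave(&example_raw_lock, flags);
	/* ... critical section ... */
	raw_spin_unlock_irqrestore(&example_raw_lock, flags);

	/* On -rt this can sleep, so taking it with preemption disabled
	 * (e.g. under kprobe_perf_func()) triggers the "sleeping function
	 * called from invalid context" splat seen above. */
	spin_lock(&example_lock);
	spin_unlock(&example_lock);
}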


Thanks,
Yang



Thanks,
Daniel



