Re: [PATCH] arm64: replace read_lock to rcu lock in call_break_hook
On 10/1/2015 8:07 AM, Catalin Marinas wrote:

On Wed, Sep 30, 2015 at 03:59:04PM -0700, Yang Shi wrote:

diff --git a/arch/arm64/kernel/debug-monitors.c b/arch/arm64/kernel/debug-monitors.c
index cebf786..eb520d0 100644
--- a/arch/arm64/kernel/debug-monitors.c
+++ b/arch/arm64/kernel/debug-monitors.c
@@ -292,11 +292,11 @@ static int call_break_hook(struct pt_regs *regs, unsigned int esr)
 	struct break_hook *hook;
 	int (*fn)(struct pt_regs *regs, unsigned int esr) = NULL;

-	read_lock(&break_hook_lock);
+	rcu_read_lock();
 	list_for_each_entry(hook, &break_hook, node)
 		if ((esr & hook->esr_mask) == hook->esr_val)
 			fn = hook->fn;
-	read_unlock(&break_hook_lock);
+	rcu_read_unlock();

 	return fn ? fn(regs, esr) : DBG_HOOK_ERROR;
 }

That's not enough, you also need list_(add|del)_rcu where the list is modified, together with list_for_each_entry_rcu() here.

Thanks for pointing it out. I'm going to prepare v2 with list_(add|del)_rcu.

Yang
Re: [v2 PATCH] arm64: replace read_lock to rcu lock in call_break_hook
On 10/1/2015 10:08 AM, Steven Rostedt wrote:

On Thu, 1 Oct 2015 09:37:37 -0700 Yang Shi wrote:

BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
in_atomic(): 0, irqs_disabled(): 128, pid: 342, name: perf
1 lock held by perf/342:
 #0: (break_hook_lock){+.+...}, at: [] call_break_hook+0x34/0xd0
irq event stamp: 62224
hardirqs last enabled at (62223): [] __call_rcu.constprop.59+0x104/0x270
hardirqs last disabled at (62224): [] vprintk_emit+0x68/0x640
softirqs last enabled at (0): [] copy_process.part.8+0x428/0x17f8
softirqs last disabled at (0): [< (null)>] (null)
CPU: 0 PID: 342 Comm: perf Not tainted 4.1.6-rt5 #4
Hardware name: linux,dummy-virt (DT)
Call trace:
 [] dump_backtrace+0x0/0x128
 [] show_stack+0x20/0x30
 [] dump_stack+0x7c/0xa0
 [] ___might_sleep+0x174/0x260
 [] __rt_spin_lock+0x28/0x40
 [] rt_read_lock+0x60/0x80
 [] call_break_hook+0x30/0xd0
 [] brk_handler+0x30/0x98
 [] do_debug_exception+0x50/0xb8
Exception stack(0xffc00514fe30 to 0xffc00514ff50)
fe20: c1594680 007f
fe40: 92063940 007f 0550dcd8 ffc0
fe60: 0514fe70 ffc0 000be1f8 ffc0 0514feb0 ffc0 0008948c ffc0
fe80: 0004 0514fed0 ffc0 9282a948 007f
fea0: 9282b708 007f c1592820 007f 00083914 ffc0
fec0: 0010 0064 0001
fee0: 005101e0 c1594680 007f c1594740 007f ffd8 ff80
ff00: c1594770 007f c1594770 007f
ff20: 00665e10 7f7f7f7f 7f7f7f7f 01010101 01010101
ff40: 928e4cc0 007f 91ff11e8 007f

call_break_hook is called in atomic context (hard irq disabled), so replace the sleepable lock to rcu lock and replace relevant list operations to rcu version.

Signed-off-by: Yang Shi

---
v1 -> v2: Replace list operations to rcu version.

 arch/arm64/kernel/debug-monitors.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kernel/debug-monitors.c b/arch/arm64/kernel/debug-monitors.c
index cebf786..cf0e4fc 100644
--- a/arch/arm64/kernel/debug-monitors.c
+++ b/arch/arm64/kernel/debug-monitors.c
@@ -276,14 +276,14 @@ static DEFINE_RWLOCK(break_hook_lock);
 void register_break_hook(struct break_hook *hook)
 {
 	write_lock(&break_hook_lock);
-	list_add(&hook->node, &break_hook);
+	list_add_rcu(&hook->node, &break_hook);
 	write_unlock(&break_hook_lock);
 }

 void unregister_break_hook(struct break_hook *hook)
 {
 	write_lock(&break_hook_lock);
-	list_del(&hook->node);
+	list_del_rcu(&hook->node);
 	write_unlock(&break_hook_lock);
 }

Shouldn't there be a synchronize_rcu() somewhere?

So far kgdb is the only user of unregister_break_hook in the mainline kernel. Just read Documentation/RCU/checklist.txt, it says:

Note that synchronize_rcu() -only- guarantees to wait until all currently executing rcu_read_lock()-protected RCU read-side critical sections complete.

For kgdb, the unregister is just called in kgdb_arch_exit by kgdb_unregister_io_module, which is called when the kgdb module is rmmod'ed. The break point handler is done synchronously. So, it sounds like it should not be a problem without calling synchronize_rcu().

Yang

--
Steve

@@ -292,11 +292,11 @@ static int call_break_hook(struct pt_regs *regs, unsigned int esr)
 	struct break_hook *hook;
 	int (*fn)(struct pt_regs *regs, unsigned int esr) = NULL;

-	read_lock(&break_hook_lock);
-	list_for_each_entry(hook, &break_hook, node)
+	rcu_read_lock();
+	list_for_each_entry_rcu(hook, &break_hook, node)
 		if ((esr & hook->esr_mask) == hook->esr_val)
 			fn = hook->fn;
-	read_unlock(&break_hook_lock);
+	rcu_read_unlock();

 	return fn ? fn(regs, esr) : DBG_HOOK_ERROR;
 }
Re: [v2 PATCH] arm64: replace read_lock to rcu lock in call_break_hook
On 10/1/2015 2:27 PM, Paul E. McKenney wrote:

On Thu, Oct 01, 2015 at 01:53:51PM -0700, Shi, Yang wrote:

On 10/1/2015 10:08 AM, Steven Rostedt wrote:

On Thu, 1 Oct 2015 09:37:37 -0700 Yang Shi wrote:

BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
in_atomic(): 0, irqs_disabled(): 128, pid: 342, name: perf
1 lock held by perf/342:
 #0: (break_hook_lock){+.+...}, at: [] call_break_hook+0x34/0xd0
irq event stamp: 62224
hardirqs last enabled at (62223): [] __call_rcu.constprop.59+0x104/0x270
hardirqs last disabled at (62224): [] vprintk_emit+0x68/0x640
softirqs last enabled at (0): [] copy_process.part.8+0x428/0x17f8
softirqs last disabled at (0): [< (null)>] (null)
CPU: 0 PID: 342 Comm: perf Not tainted 4.1.6-rt5 #4
Hardware name: linux,dummy-virt (DT)
Call trace:
 [] dump_backtrace+0x0/0x128
 [] show_stack+0x20/0x30
 [] dump_stack+0x7c/0xa0
 [] ___might_sleep+0x174/0x260
 [] __rt_spin_lock+0x28/0x40
 [] rt_read_lock+0x60/0x80
 [] call_break_hook+0x30/0xd0
 [] brk_handler+0x30/0x98
 [] do_debug_exception+0x50/0xb8
Exception stack(0xffc00514fe30 to 0xffc00514ff50)
fe20: c1594680 007f
fe40: 92063940 007f 0550dcd8 ffc0
fe60: 0514fe70 ffc0 000be1f8 ffc0 0514feb0 ffc0 0008948c ffc0
fe80: 0004 0514fed0 ffc0 9282a948 007f
fea0: 9282b708 007f c1592820 007f 00083914 ffc0
fec0: 0010 0064 0001
fee0: 005101e0 c1594680 007f c1594740 007f ffd8 ff80
ff00: c1594770 007f c1594770 007f
ff20: 00665e10 7f7f7f7f 7f7f7f7f 01010101 01010101
ff40: 928e4cc0 007f 91ff11e8 007f

call_break_hook is called in atomic context (hard irq disabled), so replace the sleepable lock to rcu lock and replace relevant list operations to rcu version.

Signed-off-by: Yang Shi

---
v1 -> v2: Replace list operations to rcu version.

 arch/arm64/kernel/debug-monitors.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kernel/debug-monitors.c b/arch/arm64/kernel/debug-monitors.c
index cebf786..cf0e4fc 100644
--- a/arch/arm64/kernel/debug-monitors.c
+++ b/arch/arm64/kernel/debug-monitors.c
@@ -276,14 +276,14 @@ static DEFINE_RWLOCK(break_hook_lock);
 void register_break_hook(struct break_hook *hook)
 {
 	write_lock(&break_hook_lock);
-	list_add(&hook->node, &break_hook);
+	list_add_rcu(&hook->node, &break_hook);
 	write_unlock(&break_hook_lock);
 }

 void unregister_break_hook(struct break_hook *hook)
 {
 	write_lock(&break_hook_lock);
-	list_del(&hook->node);
+	list_del_rcu(&hook->node);
 	write_unlock(&break_hook_lock);
 }

Shouldn't there be a synchronize_rcu() somewhere?

So far kgdb is the only user of unregister_break_hook in the mainline kernel. Just read Documentation/RCU/checklist.txt, it says:

Note that synchronize_rcu() -only- guarantees to wait until all currently executing rcu_read_lock()-protected RCU read-side critical sections complete.

For kgdb, the unregister is just called in kgdb_arch_exit by kgdb_unregister_io_module, which is called when the kgdb module is rmmod'ed. The break point handler is done synchronously. So, it sounds like it should not be a problem without calling synchronize_rcu().

OK, I will bite... What does "synchronously" mean here? Unless you have somehow guaranteed that all current readers in call_break_hook() are done between the time you call unregister_break_hook() to remove a given break_hook structure and the time you call register_break_hook() to add that same structure back in, you have a problem.

For the kgdb use case, this might be guaranteed. Generally, the kgdb module is loaded, then register_break_hook() is called. Then connect kgdb from the host or via kdb, set a breakpoint, wait for the break point to be hit, and run some commands to debug. Then finish debugging and rmmod kgdb, which will call unregister_break_hook(). It sounds like the current readers in call_break_hook() would be done during that time, otherwise I wouldn't be able to continue my debugging when the break point is hit.

What you have now only protects against invoking register_break_hook() on a newly allocated and initialized break_hook structure. But the only calls to register_break_hook() that I see in v4.2 use compile-time initialized structures. So the only failure from using non-RCU list primitives would be due to the list_head's ->next pointer initialization. This could momentarily make the list appear to have only the new element, but not the old element. Unless you do a series of register_break_hook() and unregister_break_hook() calls, in which case a previously deleted structure could
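The pattern being debated in this thread — an RCU-protected hook list whose unregister path waits out in-flight readers — would look roughly like the sketch below. This is an illustrative composite of the v2 patch plus the synchronize_rcu() Steve and Paul are asking about, assuming the arm64 struct break_hook definition; it is not necessarily the code that was eventually merged.

#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(break_hook_lock);	/* serializes writers only */
static LIST_HEAD(break_hook);

void register_break_hook(struct break_hook *hook)
{
	spin_lock(&break_hook_lock);
	list_add_rcu(&hook->node, &break_hook);
	spin_unlock(&break_hook_lock);
}

void unregister_break_hook(struct break_hook *hook)
{
	spin_lock(&break_hook_lock);
	list_del_rcu(&hook->node);
	spin_unlock(&break_hook_lock);
	/* Wait for in-flight call_break_hook() readers to finish, so the
	 * caller may safely free or immediately re-register 'hook'. */
	synchronize_rcu();
}

static int call_break_hook(struct pt_regs *regs, unsigned int esr)
{
	struct break_hook *hook;
	int (*fn)(struct pt_regs *regs, unsigned int esr) = NULL;

	/* Safe in atomic context, unlike the rwlock that -rt turns
	 * into a sleeping lock. */
	rcu_read_lock();
	list_for_each_entry_rcu(hook, &break_hook, node)
		if ((esr & hook->esr_mask) == hook->esr_val)
			fn = hook->fn;
	rcu_read_unlock();

	return fn ? fn(regs, esr) : DBG_HOOK_ERROR;
}

Note that synchronize_rcu() sleeps, which is fine in an unregister path but is also why it cannot appear on the read side.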
Re: [PATCH] perf: change samples type to unsigned long long
On 10/2/2015 12:10 PM, Arnaldo Carvalho de Melo wrote:

On Fri, Oct 02, 2015 at 04:08:38PM -0300, Arnaldo Carvalho de Melo wrote:

On Tue, Sep 29, 2015 at 02:49:43PM -0700, Yang Shi wrote:

When running "perf record -e", the number of samples shown is wrong on some 32 bit systems, i.e. powerpc and arm. For example, run the below commands on 32 bit powerpc:

perf probe -x /lib/libc.so.6 malloc
perf record -e probe_libc:malloc -a ls perf.data
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.036 MB perf.data (13829241621624967218 samples) ]

Actually, "perf script" just shows 21 samples. The number of samples is also absurd since samples is long type, but it is printed as PRIu64. Build test is run on x86-64, x86, aarch64, arm, mips, ppc and ppc64.

Sure?

AR /tmp/build/perf/libperf.a
builtin-record.c: In function ‘__cmd_record’:
builtin-record.c:689:12: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare]

You missed something important, I am fixing this and applying your patch:

Thanks a lot. Yes, it was wrong; however, my compiler just throws a warning instead of an error, and the warning was hidden by my OE build environment.

Regards,
Yang

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 974065f8ce80..24ace2f318c1 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -667,7 +667,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		auxtrace_snapshot_enabled = 1;

 	for (;;) {
-		int hits = rec->samples;
+		unsigned long long hits = rec->samples;

 		if (record__mmap_read_all(rec) < 0) {
 			auxtrace_snapshot_enabled = 0;
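The fix works because rec->samples is a 64-bit counter while the original hits was a plain int. A standalone toy program (my illustration, not perf code) shows both halves of the problem Arnaldo's compiler flagged — the narrowing assignment and the signed/unsigned comparison:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
	unsigned long long samples = 13829241621624967218ULL; /* the absurd count */
	int hits = samples;	/* narrows to 32 bits; the stored value no
				 * longer matches the counter */

	printf("as u64: %" PRIu64 "\n", (uint64_t)samples);
	printf("as int: %d\n", hits);

	/* -Wsign-compare fires here: 'hits' is converted back to unsigned
	 * long long, and a negative 'hits' becomes a huge value, so the
	 * test does not mean what it appears to mean. */
	printf("equal?  %s\n", hits == samples ? "yes" : "no");
	return 0;
}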
Re: [PATCH] arm64: Kconfig: make SCHED_MC and SCHED_SMT depend on SMP
On 4/14/2016 1:47 AM, Will Deacon wrote:

On Wed, Apr 13, 2016 at 05:54:12PM -0700, Yang Shi wrote:

SCHED_MC and SCHED_SMT are pointless when SMP is disabled. Although SMP is rarely disabled for ARM64, it looks more consistent to have such a dependency in Kconfig.

You can't disable CONFIG_SMP for arm64 -- we select it unconditionally in the kconfig.

Thanks Will. I didn't realize ARM64 has SMP selected unconditionally; it looks like the patch is pointless.

A follow-up question: I know ARM64 has no UP implementation now, so it makes sense to have SMP selected unconditionally. However, it might be more flexible to make SMP configurable like x86 and other architectures do, and it may also help to find more bugs when !SMP.

Thanks,
Yang

Will
Re: [PATCH] locktorture: make verbose writable and control stats print
On 4/15/2016 4:26 PM, Paul E. McKenney wrote:

On Fri, Apr 15, 2016 at 01:28:11PM -0700, Yang Shi wrote:

When building the locktorture test into the kernel image, it keeps printing out stats information even though there is no lock type specified. There is already a verbose parameter to control printing, but it is read-only, so it can't be changed at runtime. Make verbose read-write and have it control stats printing.

Signed-off-by: Yang Shi

Interesting change! But just out of curiosity, when you boot with locktorture built in, do you specify the shutdown_secs boot parameter? If so, another

No, just use the default value, which is 0 for shutdown_secs.

approach would be to shutdown immediately upon detecting an error during initialization.

In my case, it looks like there is no error involved.

If not, I would like to know more about your use case.

In my test, I just built the locktorture test into the kernel instead of as a module, then checked how it behaves, no specific purpose. It sounds like not a normal way to use it.

Thanks,
Yang

Thanx, Paul

---
 include/linux/torture.h | 2 ++
 kernel/locking/locktorture.c | 11 +++++++----
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/torture.h b/include/linux/torture.h
index 7759fc3..86d6e54 100644
--- a/include/linux/torture.h
+++ b/include/linux/torture.h
@@ -48,6 +48,8 @@
 	do { if (verbose) pr_alert("%s" TORTURE_FLAG " %s\n", torture_type, s); } while (0)
 #define VERBOSE_TOROUT_ERRSTRING(s) \
 	do { if (verbose) pr_alert("%s" TORTURE_FLAG "!!! %s\n", torture_type, s); } while (0)
+#define VERBOSE_STRING(s) \
+	do { if (verbose) pr_alert("%s", s); } while (0)

 /* Definitions for online/offline exerciser. */
 int torture_onoff_init(long ooholdoff, long oointerval);

diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index 8ef1919..d9838a3 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -55,8 +55,11 @@
 torture_param(int, shutdown_secs, 0, "Shutdown time (j), <= zero to disable.");
 torture_param(int, stat_interval, 60, "Number of seconds between stats printk()s");
 torture_param(int, stutter, 5, "Number of jiffies to run/halt test, 0=disable");
-torture_param(bool, verbose, true,
-	"Enable verbose debugging printk()s");
+
+static bool verbose = true;
+module_param(verbose, bool, 0644);
+MODULE_PARM_DESC(verbose,
+	"Enable verbose debugging printk()s");

 static char *torture_type = "spin_lock";
 module_param(torture_type, charp, 0444);
@@ -693,7 +696,7 @@ static void lock_torture_stats_print(void)
 	}
 	__torture_print_stats(buf, cxt.lwsa, true);
-	pr_alert("%s", buf);
+	VERBOSE_STRING(buf);
 	kfree(buf);

 	if (cxt.cur_ops->readlock) {
@@ -705,7 +708,7 @@ static void lock_torture_stats_print(void)
 	}
 	__torture_print_stats(buf, cxt.lrsa, false);
-	pr_alert("%s", buf);
+	VERBOSE_STRING(buf);
 	kfree(buf);
 	}
 }
--
2.0.2
Re: [PATCH] locktorture: make verbose writable and control stats print
On 4/15/2016 5:09 PM, Paul E. McKenney wrote:

On Fri, Apr 15, 2016 at 04:45:32PM -0700, Shi, Yang wrote:

On 4/15/2016 4:26 PM, Paul E. McKenney wrote:

On Fri, Apr 15, 2016 at 01:28:11PM -0700, Yang Shi wrote:

When building the locktorture test into the kernel image, it keeps printing out stats information even though there is no lock type specified. There is already a verbose parameter to control printing, but it is read-only, so it can't be changed at runtime. Make verbose read-write and have it control stats printing.

Signed-off-by: Yang Shi

Interesting change! But just out of curiosity, when you boot with locktorture built in, do you specify the shutdown_secs boot parameter? If so, another

No, just use the default value, which is 0 for shutdown_secs.

approach would be to shutdown immediately upon detecting an error during initialization.

In my case, it looks like there is no error involved.

You said that there is no lock type specified, but that should mean that the default ("spin_lock") is chosen. If so, I would expect it to just

Yes, spin_lock is chosen by default.

do the test, at least if locktorture.torture_runnable has been set.

But, the default value of torture_runnable is 0, and it is a read-only parameter too. This prevents torture from running if it is built into the kernel instead of as a module. Actually, I'm confused why there is no LOCK_TORTURE_TEST_RUNNABLE Kconfig like RCU torture has?

Thanks,
Yang

Either way, the usual way to make locktorture shut up would be to boot with locktorture.stat_interval=0.

If not, I would like to know more about your use case.

In my test, I just built the locktorture test into the kernel instead of as a module, then checked how it behaves, no specific purpose. It sounds like not a normal way to use it.

Agreed, I do believe that this is a case of "working as designed".

Thanx, Paul
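For reference: with module_param(verbose, bool, 0644) as in the patch above, the parameter should become writable at runtime through sysfs, e.g. "echo 0 > /sys/module/locktorture/parameters/verbose" as root. Module parameters are exposed under /sys/module/<name>/parameters/ whether the code is modular or built in, which is why the permission change from 0444 to 0644 is the substantive part of the patch.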
Re: [BUG linux-next] Kernel panic found with linux-next-20160414
On 4/27/2016 1:14 AM, Hugh Dickins wrote:

On Wed, 20 Apr 2016, Shi, Yang wrote:

On 4/20/2016 1:01 AM, Hugh Dickins wrote:

On Tue, 19 Apr 2016, Shi, Yang wrote:

Hi folks,

When I ran ltp on linux-next-20160414 on my ARM64 machine, I got the below kernel panic:

Unable to handle kernel paging request at virtual address ffc007846000
pgd = ffc01e21d000
[ffc007846000] *pgd=, *pud=
Internal error: Oops: 9647 [#11] PREEMPT SMP
Modules linked in: loop
CPU: 7 PID: 274 Comm: systemd-journal Tainted: G D 4.6.0-rc3-next-20160414-WR8.0.0.0_standard+ #9
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
task: ffc01e3fcf80 ti: ffc01ea8c000 task.ti: ffc01ea8c000
PC is at copy_page+0x38/0x120
LR is at migrate_page_copy+0x604/0x1660
pc : [] lr : [] pstate: 2145
sp : ffc01ea8ecd0
x29: ffc01ea8ecd0 x28:
x27: 1ff7b80240f8 x26: ffc018196f20
x25: ffbdc01e1180 x24: ffbdc01e1180
x23: x22: ffc01e3fcf80
x21: ffc00481f000 x20: ff900a31d000
x19: ffbdc01207c0 x18: 0f00
x17: x16:
x15: x14:
x13: x12:
x11: x10:
x9 : x8 :
x7 : x6 :
x5 : x4 :
x3 : x2 :
x1 : ffc00481f080 x0 : ffc007846000
Call trace:
Exception stack(0xffc021fc2ed0 to 0xffc021fc2ff0)
2ec0: ffbdc00887c0 ff900a31d000
2ee0: ffc021fc30f0 ff9008ff2318 2145 0025
2f00: ffbdc025a280 ffc020adc4c0 41b58ab3 ff900a085fd0
2f20: ff9008200658 ffbdc00887c0
2f40: ff900b0f1320 ffc021fc3078 41b58ab3 ff900a0864f8
2f60: ff9008210010 ffc021fb8960 ff900867bacc 1ff8043f712d
2f80: ffc021fc2fb0 ff9008210564 ffc021fc3070 ffc021fb8940
2fa0: 08221f78 ff900862f9c8 ffc021fc2fe0 ff9008215dc8
2fc0: 1ff8043f8602 ffc021fc ffc00968a000 ffc00221f080
2fe0: f9407e11d1f0 d61f02209103e210
 [] copy_page+0x38/0x120
 [] migrate_page+0x74/0x98
 [] nfs_migrate_page+0x58/0x80
 [] move_to_new_page+0x15c/0x4d8
 [] migrate_pages+0x7c8/0x11f0
 [] compact_zone+0xdfc/0x2570
 [] compact_zone_order+0xe0/0x170
 [] try_to_compact_pages+0x2e8/0x8f8
 [] __alloc_pages_direct_compact+0x100/0x540
 [] __alloc_pages_nodemask+0xc40/0x1c58
 [] khugepaged+0x468/0x19c8
 [] kthread+0x248/0x2c0
 [] ret_from_fork+0x10/0x40
Code: d281f012 91020021 f1020252 d503201f (a8000c02)

I did some initial investigation and found it is caused by DEBUG_PAGEALLOC and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT. And, mainline 4.6-rc3 works well. It should not be arch specific although I caught it on ARM64. I suspect this might be caused by Hugh's huge tmpfs patches.

Thanks for testing. It might be caused by my patches, but I don't think that's very likely. This is page migration for compaction, in the service of anon THP's khugepaged; and I wonder if you were even exercising huge tmpfs when running LTP here (it certainly can be done: I like to mount a huge tmpfs on /opt/ltp and install there, with shmem_huge 2 so any other tmpfs mounts are also huge).

Some further investigation shows I got the panic even though I don't have tmpfs mounted with huge=1 or set shmem_huge to 2.

There are compaction changes in linux-next too, but I don't see any reason why they'd cause this. I don't know arm64 traces enough to know whether it's the source page or the destination page for the copy, but it looks as if it has been freed (and DEBUG_PAGEALLOC unmapped) before reaching migration's copy.

The fault address is passed by x0, which is dest in the implementation of copy_page, so it is the destination page.

Needs more debugging, I'm afraid: is it reproducible?

Yes, as long as I enable those two PAGEALLOC debug options, I can get the panic once I run ltp. But, it is not caused by any specific ltp test case directly; the panic happens randomly while ltp is running.

Your ping on the crash in release_freepages() reminded me to take another look at this one. And found that I only needed to enable DEBUG_PAGEALLOC and run LTP to get it on x86_64 too, as you suspected. It's another of those compaction errors, in mmotm and linux-next of a week or two ago, whose patch has since been withdrawn (but huge tmpfs has also been withdrawn for now, so you're right to stick with the older linux-next for testing it).

Yes, I saw the discussion at LSFMM 2016 and the patches are gone in my latest update from linux-next. I will stick to 20160420 for the huge tmpfs testing.

I believe the patch below fixes it; but I've not done full diligence on
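As background for Hugh's "freed (and DEBUG_PAGEALLOC unmapped)" remark, here is a rough sketch of the mechanism, assuming the kernel_map_pages() interface of that era (simplified; the real hooks live inside the page allocator):

#include <linux/mm.h>

/*
 * CONFIG_DEBUG_PAGEALLOC, roughly: unmap pages from the kernel linear
 * mapping on free, remap them on allocation. A stale access -- such as
 * copy_page() writing to an already-freed migration destination --
 * then faults immediately instead of silently corrupting memory.
 */
static void debug_pagealloc_on_free(struct page *page, unsigned int order)
{
	kernel_map_pages(page, 1 << order, 0);	/* enable = 0: unmap */
}

static void debug_pagealloc_on_alloc(struct page *page, unsigned int order)
{
	kernel_map_pages(page, 1 << order, 1);	/* enable = 1: map again */
}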
Re: [PATCH] mm: move huge_pmd_set_accessed out of huge_memory.c
On 4/22/2016 2:48 AM, Kirill A. Shutemov wrote:

On Thu, Apr 21, 2016 at 03:56:07PM -0700, Shi, Yang wrote:

On 4/21/2016 12:30 AM, Kirill A. Shutemov wrote:

On Wed, Apr 20, 2016 at 02:00:11PM -0700, Shi, Yang wrote:

Hi folks,

I didn't realize pmd_* functions are protected by CONFIG_TRANSPARENT_HUGEPAGE on most architectures before I made this change. Before I fix all the affected architectures' code, I want to check whether you think this change is worthwhile or not.

Thanks,
Yang

On 4/20/2016 11:24 AM, Yang Shi wrote:

huge_pmd_set_accessed is only called by __handle_mm_fault from memory.c; move the definition to memory.c and make it static like create_huge_pmd and wp_huge_pmd.

Signed-off-by: Yang Shi

On the pte side we have the same functionality open-coded. Should we do the same for pmd? Or change the pte side the same way?

Sorry, I don't quite understand you. Do you mean pte_* functions?

See handle_pte_fault(): we do the same for pte there that huge_pmd_set_accessed() does for pmd.

Thanks for directing me to this code. I think we should be consistent here: either both are abstracted into functions or both open-coded. I suppose functions sound better. However, do_wp_page has to be called with the pte lock acquired, so the abstracted function would have to call it.

Thanks,
Yang
[BUG] Null pointer dereference when freeing pages on 4.6-rc6
Hi folks,

When I enable the below kernel configs on 4.6-rc6, I come across a null pointer dereference issue in the boot stage.

CONFIG_SPARSEMEM
CONFIG_DEFERRED_STRUCT_PAGE_INIT
CONFIG_DEBUG_PAGEALLOC
CONFIG_PAGE_EXTENSION
CONFIG_DEBUG_VM

The splat is:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [] page_is_buddy+0x7b/0xe0
PGD 0
Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 3 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc6 #8
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
task: 88017c1d0040 ti: 88017c1d4000 task.ti: 88017c1d4000
RIP: 0010:[] [] page_is_buddy+0x7b/0xe0
RSP: :88017c1d7bf0 EFLAGS: 00010046
RAX: RBX: ea0019810040 RCX:
RDX: RSI: ea0019810040 RDI: 00660401
RBP: 88017c1d7c08 R08: 0001 R09:
R10: 01af R11: 0001 R12: ea001981
R13: R14: 0009 R15: ea001981
FS: () GS:88066cc4() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: CR3: 02406000 CR4: 06e0
Stack:
 1981 88066cfe6080 88017c1d7c70 8118bfea
 000a 1600 0001 0401
 ea0019810040 0400 88066cc5aca8
Call Trace:
 [] __free_one_page+0x23a/0x450
 [] free_pcppages_bulk+0x136/0x360
 [] free_hot_cold_page+0x168/0x1b0
 [] __free_pages+0x5c/0x90
 [] __free_pages_boot_core.isra.70+0x11a/0x14d
 [] deferred_free_range+0x50/0x62
 [] deferred_init_memmap+0x220/0x3c3
 [] ? deferred_free_range+0x62/0x62
 [] kthread+0xf8/0x110
 [] ret_from_fork+0x22/0x40
 [] ? kthread_create_on_node+0x200/0x200
Code: 75 7b 48 89 d8 8b 40 1c 85 c0 74 50 48 c7 c6 38 bd 0d 82 48 89 df e8 25 e2 02 00 0f 0b 48 89 f7 89 55 ec e8 18 cb 07 00 8b 55 ec <48> 8b 00 a8 02 74 9d 3b 53 30 75 98 49 8b 14 24 48 8b 03 48 c1
RIP [] page_is_buddy+0x7b/0xe0
RSP
CR2:
---[ end trace e0c05a86b43d97f9 ]---
note: pgdatinit1[106] exited with preempt_count 1

I changed page_is_buddy and __free_one_page to non-inline to get a more accurate stack trace. Then I did some investigation on it, printing the address of page and buddy; please see the below log:

@@__free_one_page:715: page is at ea0005f05c00 buddy is at ea0005f05c80, order is 1
@@__free_one_page:715: page is at ea0005f05c00 buddy is at ea0005f05d00, order is 2
@@__free_one_page:715: page is at ea0005f05c00 buddy is at ea0005f05e00, order is 3
@@__free_one_page:715: page is at ea001981 buddy is at ea0019810040, order is 0

call trace splat

@@__free_one_page:715: page is at ea0005f05bc0 buddy is at ea0005f05b80, order is 0
@@__free_one_page:715: page is at ea0005f05b80 buddy is at ea0005f05bc0, order is 0
@@__free_one_page:715: page is at ea0005f05b80 buddy is at ea0005f05b00, order is 1
@@__free_one_page:715: page is at ea0005f05b40 buddy is at ea0005f05b00, order is 0
@@__free_one_page:715: page is at ea0005f05b00 buddy is at ea0005f05b40, order is 0

It shows that just before the call trace splat, the page address jumped to ea001981 from ea0005f05xxx; not sure why this is happening. Any hint is appreciated.

And, reading the code leads me to the below call path:

page_is_buddy() --> page_is_guard() --> lookup_page_ext()

Then lookup_page_ext() just returns NULL due to the below code:

#if defined(CONFIG_DEBUG_VM) || defined(CONFIG_PAGE_POISONING)
	/*
	 * The sanity checks the page allocator does upon freeing a
	 * page can reach here before the page_ext arrays are
	 * allocated when feeding a range of pages to the allocator
	 * for the first time during bootup or memory hotplug.
	 *
	 * This check is also necessary for ensuring page poisoning
	 * works as expected when enabled
	 */
	if (!section->page_ext)
		return NULL;
#endif

So, according to the comment, it looks like there should be a WARN or BUG if it returns NULL? And, almost no code checks whether the returned pointer is NULL after lookup_page_ext() is called.

Thanks,
Yang
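The defensive check asked about here is essentially what the later "mm: page_is_guard return false when page_ext arrays are not allocated yet" patch (discussed in the threads below) does. As a sketch, not necessarily the exact merged hunk:

/*
 * Sketch of the defensive pattern: treat a missing page_ext array as
 * "no debug guard" instead of dereferencing the NULL return.
 */
static inline bool page_is_guard(struct page *page)
{
	struct page_ext *page_ext;

	if (!debug_guardpage_enabled())
		return false;

	page_ext = lookup_page_ext(page);
	if (unlikely(!page_ext))
		return false;	/* page_ext arrays not allocated yet */

	return test_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags);
}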
Re: [PATCH] mm: slab: remove ZONE_DMA_FLAG
On 5/5/2016 4:49 AM, Michal Hocko wrote:

On Wed 04-05-16 10:01:37, Yang Shi wrote:

Now we have the IS_ENABLED helper to check if a Kconfig option is enabled or not, so ZONE_DMA_FLAG no longer sounds useful. And, the use of ZONE_DMA_FLAG in slab looks pointless according to the comment [1] from Johannes Weiner, so remove them; ORing the passed-in flags with the cache gfp flags has been done in kmem_getpages().

[1] https://lkml.org/lkml/2014/9/25/553

I haven't checked the patch but I have a formal suggestion. lkml.org tends to break and forget, please use http://lkml.kernel.org/r/$msg-id instead. In this case http://lkml.kernel.org/r/20140925185047.ga21...@cmpxchg.org

Thanks for the suggestion. Will use msg-id in later posts.

Regards,
Yang

Thanks!
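The IS_ENABLED() idiom the changelog leans on folds a Kconfig test into ordinary C, which is what makes a separate runtime flag like ZONE_DMA_FLAG redundant. A minimal sketch (illustrative, not the actual slab code):

#include <linux/gfp.h>
#include <linux/kconfig.h>

static inline gfp_t maybe_add_dma(gfp_t flags)
{
	/* Constant-folded: the branch compiles away when !CONFIG_ZONE_DMA,
	 * yet the code inside still gets type-checked. */
	if (IS_ENABLED(CONFIG_ZONE_DMA))
		return flags | __GFP_DMA;
	return flags;
}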
Re: [v2 PATCH] mm: move page_ext_init after all struct pages are initialized
On 5/20/2016 6:16 AM, Michal Hocko wrote:

On Thu 19-05-16 15:13:26, Yang Shi wrote:
[...]

diff --git a/init/main.c b/init/main.c
index b3c6e36..2075faf 100644
--- a/init/main.c
+++ b/init/main.c
@@ -606,7 +606,6 @@ asmlinkage __visible void __init start_kernel(void)
 		initrd_start = 0;
 	}
 #endif
-	page_ext_init();
 	debug_objects_mem_init();
 	kmemleak_init();
 	setup_per_cpu_pageset();
@@ -1004,6 +1003,8 @@ static noinline void __init kernel_init_freeable(void)
 	sched_init_smp();

 	page_alloc_init_late();
+	/* Initialize page ext after all struct pages are initializaed */
+	page_ext_init();

 	do_basic_setup();

I might be missing something but don't we have the same problem with CONFIG_FLATMEM? page_ext_init_flatmem is called way earlier. Or CONFIG_DEFERRED_STRUCT_PAGE_INIT is never enabled for CONFIG_FLATMEM?

Yes, CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on MEMORY_HOTPLUG which depends on SPARSEMEM. So, this config is not valid for FLATMEM at all.

Thanks,
Yang
Re: [PATCH] mm: page_is_guard return false when page_ext arrays are not allocated yet
On 5/19/2016 7:40 PM, Joonsoo Kim wrote:

2016-05-20 2:18 GMT+09:00 Shi, Yang:

On 5/18/2016 5:28 PM, Joonsoo Kim wrote:

Vlastimil, thanks for ccing me on the original bug report.

On Wed, May 18, 2016 at 03:23:45PM -0700, Yang Shi wrote:

When enabling the below kernel configs:

CONFIG_DEFERRED_STRUCT_PAGE_INIT
CONFIG_DEBUG_PAGEALLOC
CONFIG_PAGE_EXTENSION
CONFIG_DEBUG_VM

kernel bootup may fail due to the following oops:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [] free_pcppages_bulk+0x2d2/0x8d0
PGD 0
Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
task: 88017c080040 ti: 88017c084000 task.ti: 88017c084000
RIP: 0010:[] [] free_pcppages_bulk+0x2d2/0x8d0
RSP: :88017c087c48 EFLAGS: 00010046
RAX: RBX: RCX: 0001
RDX: 0980 RSI: 0080 RDI: 00660401
RBP: 88017c087cd0 R08: 0401 R09: 0009
R10: 88017c080040 R11: 000a R12: 0400
R13: ea001981 R14: ea0019810040 R15: 88066cfe6080
FS: () GS:88066cd4() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: CR3: 02406000 CR4: 06e0
Stack:
 88066cd5bbd8 88066cfe6640 001f001f 88066cd5bbe8
 ea001981 8118f53e 0009 0401
 000a 0001
Call Trace:
 [] free_hot_cold_page+0x192/0x1d0
 [] __free_pages+0x5c/0x90
 [] __free_pages_boot_core+0x11a/0x14e
 [] deferred_free_range+0x50/0x62
 [] deferred_init_memmap+0x220/0x3c3
 [] ? setup_per_cpu_pageset+0x35/0x35
 [] kthread+0xf8/0x110
 [] ret_from_fork+0x22/0x40
 [] ? kthread_create_on_node+0x200/0x200
Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
RIP [] free_pcppages_bulk+0x2d2/0x8d0
RSP
CR2:

The problem is lookup_page_ext() returns NULL, then page_is_guard() tried to access it in page freeing. page_is_guard() depends on the PAGE_EXT_DEBUG_GUARD bit of the page extension flag, but a freeing page might reach here before the page_ext arrays are allocated when feeding a range of pages to the allocator for the first time during bootup or memory hotplug.

The patch itself looks fine to me because I also found that this kind of problem happens during memory hotplug. So, we need to fix more sites, all callers of lookup_page_ext().

Yes, I agree. I will come up with a patch or a couple of patches to check the return value of lookup_page_ext().

But, I'd like to know how your problem occurs during bootup. debug_guardpage_enabled() is turned to 'enable' after page_ext is initialized. Before that, page_is_guard() unconditionally returns false, so I think that the problem you mentioned can't happen. Could you check when debug_guardpage_enabled() returns 'enable' and init_section_page_ext() is called?

I think the problem is I have CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled, which will defer some struct page initialization to a "pgdatinitX" kernel thread in page_alloc_init_late(). But, page_ext_init() is called before it. So, it leads debug_guardpage_enabled() to return true, while the page extension is not allocated yet for the struct pages initialized by "pgdatinitX".

No. After page_ext_init(), it is ensured that all page extension is initialized.

It sounds like page_ext_init() should be called after page_alloc_init_late(). Or it should be just incompatible with CONFIG_DEFERRED_STRUCT_PAGE_INIT. I will try to move the init call around.

We need to investigate more. I guess that problem is introduced by CONFIG_DEFERRED_STRUCT_PAGE_INIT. It makes pfn_to_nid() invalid until page_alloc_init_late() is done. That is a big side-effect. If there is a pfn walker and it uses pfn_to_nid() between memmap_init_zone() and page_alloc_init_late(), it also has the same problem. So, we need to think about how to fix it more carefully.

Thanks for the analysis. I think you are correct, since pfn_to_nid() depends on the memmap which has not been fully set up until page_alloc_init_late() is done. So, for such a use case early_pfn_to_nid() should be used.

Anyway, to make sure that my assumption is true, could you confirm that the below change fixes your problem?

Yes, it does.

Thanks.

->8--
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 2d864e6..cac5dc9 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -391,7 +391,7 @@ void __init page_ext_init(void)
 * -pfn---
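The diff above is cut off, but its gist follows from the discussion: in page_ext_init()'s walk over each node's pfn range, the node check switches from pfn_to_nid() to early_pfn_to_nid(), which consults the memblock-derived table rather than struct pages that CONFIG_DEFERRED_STRUCT_PAGE_INIT has not initialized yet. Reconstructed as a sketch (the verbatim hunk may differ):

	for (pfn = start_pfn; pfn < end_pfn;
	     pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) {
		if (!pfn_valid(pfn))
			continue;
		/*
		 * Node ranges can interleave, so verify the pfn really
		 * belongs to this node -- using the early variant, which
		 * does not depend on fully initialized struct pages.
		 */
		if (early_pfn_to_nid(pfn) != nid)
			continue;
		if (init_section_page_ext(pfn, nid))
			goto oom;
	}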
Re: [v2 PATCH] mm: move page_ext_init after all struct pages are initialized
On 5/23/2016 12:31 AM, Michal Hocko wrote:

On Fri 20-05-16 08:41:09, Shi, Yang wrote:

On 5/20/2016 6:16 AM, Michal Hocko wrote:

On Thu 19-05-16 15:13:26, Yang Shi wrote:
[...]

diff --git a/init/main.c b/init/main.c
index b3c6e36..2075faf 100644
--- a/init/main.c
+++ b/init/main.c
@@ -606,7 +606,6 @@ asmlinkage __visible void __init start_kernel(void)
 		initrd_start = 0;
 	}
 #endif
-	page_ext_init();
 	debug_objects_mem_init();
 	kmemleak_init();
 	setup_per_cpu_pageset();
@@ -1004,6 +1003,8 @@ static noinline void __init kernel_init_freeable(void)
 	sched_init_smp();

 	page_alloc_init_late();
+	/* Initialize page ext after all struct pages are initializaed */
+	page_ext_init();

 	do_basic_setup();

I might be missing something but don't we have the same problem with CONFIG_FLATMEM? page_ext_init_flatmem is called way earlier. Or CONFIG_DEFERRED_STRUCT_PAGE_INIT is never enabled for CONFIG_FLATMEM?

Yes, CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on MEMORY_HOTPLUG which depends on SPARSEMEM. So, this config is not valid for FLATMEM at all.

Well

config MEMORY_HOTPLUG
	bool "Allow for memory hot-add"
	depends on SPARSEMEM || X86_64_ACPI_NUMA
	depends on ARCH_ENABLE_MEMORY_HOTPLUG

I wasn't really sure about the X86_64_ACPI_NUMA dependency branch, which depends on X86_64 && NUMA && ACPI && PCI, and that didn't sound like SPARSEMEM only. If FLATMEM shouldn't exist with

Actually, FLATMEM depends on !NUMA.

CONFIG_DEFERRED_STRUCT_PAGE_INIT can we make that explicit please?

Sure, it makes the condition clearer and more readable.

Thanks,
Yang
Re: [PATCH] mm: make CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on !FLATMEM explicitly
On 5/23/2016 11:22 AM, Michal Hocko wrote:

On Mon 23-05-16 09:54:31, Yang Shi wrote:

Per the suggestion from Michal Hocko [1], CONFIG_DEFERRED_STRUCT_PAGE_INIT should be incompatible with FLATMEM, so make this explicit in Kconfig.

I guess the changelog could benefit from some clarification. What do you think about the following:

"
DEFERRED_STRUCT_PAGE_INIT requires some ordering wrt other initialization operations, e.g. page_ext_init has to happen after the whole memmap is initialized properly. For SPARSEMEM this requires to wait for page_alloc_init_late. Other memory models (e.g. flatmem) might have different initialization layouts (page_ext_init_flatmem). Currently DEFERRED_STRUCT_PAGE_INIT depends on MEMORY_HOTPLUG which in turn

	depends on SPARSEMEM || X86_64_ACPI_NUMA
	depends on ARCH_ENABLE_MEMORY_HOTPLUG

and X86_64_ACPI_NUMA depends on NUMA which in turn disables the FLATMEM memory model:

config ARCH_FLATMEM_ENABLE
	def_bool y
	depends on X86_32 && !NUMA

so FLATMEM is ruled out via the dependency maze. Be explicit and disable FLATMEM for DEFERRED_STRUCT_PAGE_INIT so that we do not reintroduce subtle initialization bugs
"

Thanks a lot. It sounds way better. Will address in V2.

Yang

[1] http://lkml.kernel.org/r/20160523073157.gd2...@dhcp22.suse.cz

Signed-off-by: Yang Shi
---
 mm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 2664c11..22fa818 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -649,6 +649,7 @@ config DEFERRED_STRUCT_PAGE_INIT
 	default n
 	depends on ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
 	depends on MEMORY_HOTPLUG
+	depends on !FLATMEM
 	help
 	  Ordinarily all struct pages are initialised during early boot in a
 	  single thread. On very large machines this can take a considerable
--
2.0.2
Re: [PATCH] mm: fix build problems from lookup_page_ext
Hi Arnd,

Thanks a lot for the patch. My bad, sorry for the inconvenience. I omitted that the specific page_idle change is for 32 bit only. And, my host compiler looks old, it is still 4.8, so it might not catch the warning. I will update my compiler.

Regards,
Yang

On 5/24/2016 3:08 AM, Arnd Bergmann wrote:

A patch for lookup_page_ext introduced several build errors and warnings, e.g.

mm/page_owner.c: In function '__set_page_owner':
mm/page_owner.c:71:2: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
include/linux/page_idle.h: In function 'set_page_young':
include/linux/page_idle.h:62:3: error: expected ')' before 'return'

This fixes all of them. Please fold into the original patch.

Signed-off-by: Arnd Bergmann
Fixes: 38c4fffbad3c ("mm: check the return value of lookup_page_ext for all call sites")

diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 569c3a180625..fec40271339f 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -48,7 +48,7 @@ static inline bool page_is_young(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);

-	if (unlikely(!page_ext)
+	if (unlikely(!page_ext))
 		return false;

 	return test_bit(PAGE_EXT_YOUNG, &page_ext->flags);
@@ -58,7 +58,7 @@ static inline void set_page_young(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);

-	if (unlikely(!page_ext)
+	if (unlikely(!page_ext))
 		return;

 	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
@@ -68,7 +68,7 @@ static inline bool test_and_clear_page_young(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);

-	if (unlikely(!page_ext)
+	if (unlikely(!page_ext))
 		return false;

 	return test_and_clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
@@ -78,7 +78,7 @@ static inline bool page_is_idle(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);

-	if (unlikely(!page_ext)
+	if (unlikely(!page_ext))
 		return false;

 	return test_bit(PAGE_EXT_IDLE, &page_ext->flags);
@@ -88,7 +88,7 @@ static inline void set_page_idle(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);

-	if (unlikely(!page_ext)
+	if (unlikely(!page_ext))
 		return;

 	set_bit(PAGE_EXT_IDLE, &page_ext->flags);
@@ -98,7 +98,7 @@ static inline void clear_page_idle(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);

-	if (unlikely(!page_ext)
+	if (unlikely(!page_ext))
 		return;

 	clear_bit(PAGE_EXT_IDLE, &page_ext->flags);

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 902e39813295..c6cda3e36212 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -65,9 +65,6 @@ void __set_page_owner(struct page *page, unsigned int order, gfp_t gfp_mask)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);

-	if (unlikely(!page_ext))
-		return;
-
 	struct stack_trace trace = {
 		.nr_entries = 0,
 		.max_entries = ARRAY_SIZE(page_ext->trace_entries),
@@ -75,6 +72,9 @@ void __set_page_owner(struct page *page, unsigned int order, gfp_t gfp_mask)
 		.skip = 3,
 	};

+	if (unlikely(!page_ext))
+		return;
+
 	save_stack_trace(&trace);

 	page_ext->order = order;
@@ -111,12 +111,11 @@ void __copy_page_owner(struct page *oldpage, struct page *newpage)
 {
 	struct page_ext *old_ext = lookup_page_ext(oldpage);
 	struct page_ext *new_ext = lookup_page_ext(newpage);
+	int i;

 	if (unlikely(!old_ext || !new_ext))
 		return;

-	int i;
-
 	new_ext->order = old_ext->order;
 	new_ext->gfp_mask = old_ext->gfp_mask;
 	new_ext->nr_entries = old_ext->nr_entries;
@@ -204,11 +203,6 @@ err:
 void __dump_page_owner(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);

-	if (unlikely(!page_ext)) {
-		pr_alert("There is not page extension available.\n");
-		return;
-	}
-
 	struct stack_trace trace = {
 		.nr_entries = page_ext->nr_entries,
 		.entries = &page_ext->trace_entries[0],
@@ -216,6 +210,11 @@ void __dump_page_owner(struct page *page)
 	gfp_t gfp_mask = page_ext->gfp_mask;
 	int mt = gfpflags_to_migratetype(gfp_mask);

+	if (unlikely(!page_ext)) {
+		pr_alert("There is not page extension available.\n");
+		return;
+	}
+
 	if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags)) {
 		pr_alert("page_owner info is not active (free page?)\n");
 		return;
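The page_owner.c hunks above follow the C90 discipline the first error message names: declarations must precede the first statement of a block, so the NULL checks move below the declarations rather than sit between them. A standalone illustration (hypothetical code, not a kernel file):

#include <stddef.h>

struct ext { int order; };
extern struct ext *lookup(void *page);

int copy_order(void *oldpage, void *newpage)
{
	struct ext *old_ext = lookup(oldpage);
	struct ext *new_ext = lookup(newpage);
	int i = 0;				/* declared up front, C90-style */

	if (old_ext == NULL || new_ext == NULL)	/* statements follow */
		return -1;

	new_ext->order = old_ext->order;
	return i;
}

One caveat worth noting: in the __dump_page_owner hunk, the struct stack_trace and gfp_mask initializers still dereference page_ext at declaration time, before the relocated NULL check runs, so that path is not fully covered by the reshuffle alone.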
Re: [BUG] Null pointer dereference when freeing pages on 4.6-rc6
On 5/16/2016 5:44 AM, Vlastimil Babka wrote:

[+CC Joonsoo based on git blame]

On 05/05/2016 11:13 PM, Shi, Yang wrote:

Hi folks,

When I enable the below kernel configs on 4.6-rc6, I come across a null pointer dereference issue in the boot stage.

CONFIG_SPARSEMEM
CONFIG_DEFERRED_STRUCT_PAGE_INIT
CONFIG_DEBUG_PAGEALLOC
CONFIG_PAGE_EXTENSION
CONFIG_DEBUG_VM

The splat is:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [] page_is_buddy+0x7b/0xe0
PGD 0
Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 3 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc6 #8
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
task: 88017c1d0040 ti: 88017c1d4000 task.ti: 88017c1d4000
RIP: 0010:[] [] page_is_buddy+0x7b/0xe0
RSP: :88017c1d7bf0 EFLAGS: 00010046
RAX: RBX: ea0019810040 RCX:
RDX: RSI: ea0019810040 RDI: 00660401
RBP: 88017c1d7c08 R08: 0001 R09:
R10: 01af R11: 0001 R12: ea001981
R13: R14: 0009 R15: ea001981
FS: () GS:88066cc4() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: CR3: 02406000 CR4: 06e0
Stack:
 1981 88066cfe6080 88017c1d7c70 8118bfea
 000a 1600 0001 0401
 ea0019810040 0400 88066cc5aca8
Call Trace:
 [] __free_one_page+0x23a/0x450
 [] free_pcppages_bulk+0x136/0x360
 [] free_hot_cold_page+0x168/0x1b0
 [] __free_pages+0x5c/0x90
 [] __free_pages_boot_core.isra.70+0x11a/0x14d
 [] deferred_free_range+0x50/0x62
 [] deferred_init_memmap+0x220/0x3c3
 [] ? deferred_free_range+0x62/0x62
 [] kthread+0xf8/0x110
 [] ret_from_fork+0x22/0x40
 [] ? kthread_create_on_node+0x200/0x200
Code: 75 7b 48 89 d8 8b 40 1c 85 c0 74 50 48 c7 c6 38 bd 0d 82 48 89 df e8 25 e2 02 00 0f 0b 48 89 f7 89 55 ec e8 18 cb 07 00 8b 55 ec <48> 8b 00 a8 02 74 9d 3b 53 30 75 98 49 8b 14 24 48 8b 03 48 c1
RIP [] page_is_buddy+0x7b/0xe0
RSP
CR2:
---[ end trace e0c05a86b43d97f9 ]---
note: pgdatinit1[106] exited with preempt_count 1

I changed page_is_buddy and __free_one_page to non-inline to get a more accurate stack trace. Then I did some investigation on it, printing the address of page and buddy; please see the below log:

@@__free_one_page:715: page is at ea0005f05c00 buddy is at ea0005f05c80, order is 1
@@__free_one_page:715: page is at ea0005f05c00 buddy is at ea0005f05d00, order is 2
@@__free_one_page:715: page is at ea0005f05c00 buddy is at ea0005f05e00, order is 3
@@__free_one_page:715: page is at ea001981 buddy is at ea0019810040, order is 0

call trace splat

@@__free_one_page:715: page is at ea0005f05bc0 buddy is at ea0005f05b80, order is 0
@@__free_one_page:715: page is at ea0005f05b80 buddy is at ea0005f05bc0, order is 0
@@__free_one_page:715: page is at ea0005f05b80 buddy is at ea0005f05b00, order is 1
@@__free_one_page:715: page is at ea0005f05b40 buddy is at ea0005f05b00, order is 0
@@__free_one_page:715: page is at ea0005f05b00 buddy is at ea0005f05b40, order is 0

It shows that just before the call trace splat, the page address jumped to ea001981 from ea0005f05xxx; not sure why this is happening. Any hint is appreciated.

I think the page address didn't jump; it was that the previous call of __free_one_page() finished at order=3 and now another one was invoked that started at order=0 page ea001981 and immediately hit the splat.

And, reading the code leads me to the below call path:

page_is_buddy() --> page_is_guard() --> lookup_page_ext()

From a quick decodecode I guess that's indeed the path that crashed. You could try verifying with e.g. addr2line on your vmlinux?

I tracked it down there via gdb.

Then lookup_page_ext() just returns NULL due to the below code:

#if defined(CONFIG_DEBUG_VM) || defined(CONFIG_PAGE_POISONING)
	/*
	 * The sanity checks the page allocator does upon freeing a
	 * page can reach here before the page_ext arrays are
	 * allocated when feeding a range of pages to the allocator
	 * for the first time during bootup or memory hotplug.
	 *
	 * This check is also necessary for ensuring page poisoning
	 * works as expected when enabled
	 */
	if (!section->page_ext)
		return NULL;
#endif

So, according to the comment, it looks like there should be a WARN or BUG if it returns NULL?

The comment reads like it's unlikely, but possible, so WARN/BUG doesn't sound right.

And, almost no code
Re: [PATCH] mm: page_is_guard return false when page_ext arrays are not allocated yet
On 5/18/2016 5:28 PM, Joonsoo Kim wrote:

Vlastimil, thanks for ccing me on the original bug report.

On Wed, May 18, 2016 at 03:23:45PM -0700, Yang Shi wrote:

When enabling the below kernel configs:

CONFIG_DEFERRED_STRUCT_PAGE_INIT
CONFIG_DEBUG_PAGEALLOC
CONFIG_PAGE_EXTENSION
CONFIG_DEBUG_VM

kernel bootup may fail due to the following oops:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [] free_pcppages_bulk+0x2d2/0x8d0
PGD 0
Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
task: 88017c080040 ti: 88017c084000 task.ti: 88017c084000
RIP: 0010:[] [] free_pcppages_bulk+0x2d2/0x8d0
RSP: :88017c087c48 EFLAGS: 00010046
RAX: RBX: RCX: 0001
RDX: 0980 RSI: 0080 RDI: 00660401
RBP: 88017c087cd0 R08: 0401 R09: 0009
R10: 88017c080040 R11: 000a R12: 0400
R13: ea001981 R14: ea0019810040 R15: 88066cfe6080
FS: () GS:88066cd4() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: CR3: 02406000 CR4: 06e0
Stack:
 88066cd5bbd8 88066cfe6640 001f001f 88066cd5bbe8
 ea001981 8118f53e 0009 0401
 000a 0001
Call Trace:
 [] free_hot_cold_page+0x192/0x1d0
 [] __free_pages+0x5c/0x90
 [] __free_pages_boot_core+0x11a/0x14e
 [] deferred_free_range+0x50/0x62
 [] deferred_init_memmap+0x220/0x3c3
 [] ? setup_per_cpu_pageset+0x35/0x35
 [] kthread+0xf8/0x110
 [] ret_from_fork+0x22/0x40
 [] ? kthread_create_on_node+0x200/0x200
Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
RIP [] free_pcppages_bulk+0x2d2/0x8d0
RSP
CR2:

The problem is lookup_page_ext() returns NULL, then page_is_guard() tried to access it in page freeing. page_is_guard() depends on the PAGE_EXT_DEBUG_GUARD bit of the page extension flag, but a freeing page might reach here before the page_ext arrays are allocated when feeding a range of pages to the allocator for the first time during bootup or memory hotplug.

The patch itself looks fine to me because I also found that this kind of problem happens during memory hotplug. So, we need to fix more sites, all callers of lookup_page_ext().

Yes, I agree. I will come up with a patch or a couple of patches to check the return value of lookup_page_ext().

But, I'd like to know how your problem occurs during bootup. debug_guardpage_enabled() is turned to 'enable' after page_ext is initialized. Before that, page_is_guard() unconditionally returns false, so I think that the problem you mentioned can't happen. Could you check when debug_guardpage_enabled() returns 'enable' and init_section_page_ext() is called?

I think the problem is I have CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled, which will defer some struct page initialization to a "pgdatinitX" kernel thread in page_alloc_init_late(). But, page_ext_init() is called before it. So, it leads debug_guardpage_enabled() to return true, while the page extension is not allocated yet for the struct pages initialized by "pgdatinitX".

It sounds like page_ext_init() should be called after page_alloc_init_late(). Or it should be just incompatible with CONFIG_DEFERRED_STRUCT_PAGE_INIT. I will try to move the init call around.

Thanks,
Yang

And, the above comment would be stale because it comes from when memcg used this struct page extension functionality. Now, memcg doesn't use it and there are some changes in this area, so I'm not sure that is still true.

Thanks.
Re: [PATCH] mm: move page_ext_init after all struct pages are initialized
On 5/19/2016 3:30 PM, Andrew Morton wrote: On Thu, 19 May 2016 14:29:05 -0700 Yang Shi wrote: When DEFERRED_STRUCT_PAGE_INIT is enabled, only a subset of the memmap is initialized at boot; the rest is initialized in parallel by starting a one-off "pgdatinitX" kernel thread for each node X. If page_ext_init is called before that, some pages will not have a valid extension, so move page_ext_init() after it. When fixing a bug, please fully describe the end-user impact of that bug. The kernel ran into the below oops, which is the same as the oops reported in http://ozlabs.org/~akpm/mmots/broken-out/mm-page_is_guard-return-false-when-page_ext-arrays-are-not-allocated-yet.patch. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [] free_pcppages_bulk+0x2d2/0x8d0 PGD 0 Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26 Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009 task: 88017c080040 ti: 88017c084000 task.ti: 88017c084000 RIP: 0010:[] [] free_pcppages_bulk+0x2d2/0x8d0 RSP: :88017c087c48 EFLAGS: 00010046 RAX: RBX: RCX: 0001 RDX: 0980 RSI: 0080 RDI: 00660401 RBP: 88017c087cd0 R08: 0401 R09: 0009 R10: 88017c080040 R11: 000a R12: 0400 R13: ea001981 R14: ea0019810040 R15: 88066cfe6080 FS: () GS:88066cd4() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: CR3: 02406000 CR4: 06e0 Stack: 88066cd5bbd8 88066cfe6640 001f001f 88066cd5bbe8 ea001981 8118f53e 0009 0401 000a 0001 Call Trace: [] free_hot_cold_page+0x192/0x1d0 [] __free_pages+0x5c/0x90 [] __free_pages_boot_core+0x11a/0x14e [] deferred_free_range+0x50/0x62 [] deferred_init_memmap+0x220/0x3c3 [] ? setup_per_cpu_pageset+0x35/0x35 [] kthread+0xf8/0x110 [] ret_from_fork+0x22/0x40 [] ? kthread_create_on_node+0x200/0x200 Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff RIP [] free_pcppages_bulk+0x2d2/0x8d0 RSP CR2: I will add the oops info into the commit log in V2. Thanks, Yang
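A sketch of the resulting init order in init/main.c's kernel_init_freeable(); the surrounding calls are approximated from memory, not quoted:

    static noinline void __init kernel_init_freeable(void)
    {
            /* ... */
            smp_init();
            sched_init_smp();

            /*
             * Finish the deferred struct page initialization on the
             * pgdatinitX threads first, then allocate the page_ext
             * arrays, so every initialized struct page has a valid
             * extension.
             */
            page_alloc_init_late();
            page_ext_init();

            do_basic_setup();
            /* ... */
    }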
Re: [PATCH] mm: move page_ext_init after all struct pages are initialized
On 5/19/2016 4:21 PM, Andrew Morton wrote: On Thu, 19 May 2016 15:35:15 -0700 "Shi, Yang" wrote: On 5/19/2016 3:30 PM, Andrew Morton wrote: On Thu, 19 May 2016 14:29:05 -0700 Yang Shi wrote: When DEFERRED_STRUCT_PAGE_INIT is enabled, only a subset of the memmap is initialized at boot; the rest is initialized in parallel by starting a one-off "pgdatinitX" kernel thread for each node X. If page_ext_init is called before that, some pages will not have a valid extension, so move page_ext_init() after it. When fixing a bug, please fully describe the end-user impact of that bug. The kernel ran into the below oops, which is the same as the oops reported in http://ozlabs.org/~akpm/mmots/broken-out/mm-page_is_guard-return-false-when-page_ext-arrays-are-not-allocated-yet.patch. So this patch makes mm-page_is_guard-return-false-when-page_ext-arrays-are-not-allocated-yet.patch obsolete? Actually, no. Checking the return value of lookup_page_ext() is still needed. But the commit log needs to be amended, since that bootup oops won't happen anymore with this patch applied. Thanks, Yang
Re: [PATCH] Fixing compilation error from file include/linux/page_idle.h
On 5/25/2016 3:11 AM, shakilk1...@gmail.com wrote: From: shakil khan Thanks for the patch, but this issue has already been fixed by Arnd. Yang --- include/linux/page_idle.h | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h index 569c3a1..fec4027 100644 --- a/include/linux/page_idle.h +++ b/include/linux/page_idle.h @@ -48,7 +48,7 @@ static inline bool page_is_young(struct page *page) { struct page_ext *page_ext = lookup_page_ext(page); - if (unlikely(!page_ext) + if (unlikely(!page_ext)) return false; return test_bit(PAGE_EXT_YOUNG, &page_ext->flags); @@ -58,7 +58,7 @@ static inline void set_page_young(struct page *page) { struct page_ext *page_ext = lookup_page_ext(page); - if (unlikely(!page_ext) + if (unlikely(!page_ext)) return; set_bit(PAGE_EXT_YOUNG, &page_ext->flags); @@ -68,7 +68,7 @@ static inline bool test_and_clear_page_young(struct page *page) { struct page_ext *page_ext = lookup_page_ext(page); - if (unlikely(!page_ext) + if (unlikely(!page_ext)) return false; return test_and_clear_bit(PAGE_EXT_YOUNG, &page_ext->flags); @@ -78,7 +78,7 @@ static inline bool page_is_idle(struct page *page) { struct page_ext *page_ext = lookup_page_ext(page); - if (unlikely(!page_ext) + if (unlikely(!page_ext)) return false; return test_bit(PAGE_EXT_IDLE, &page_ext->flags); @@ -88,7 +88,7 @@ static inline void set_page_idle(struct page *page) { struct page_ext *page_ext = lookup_page_ext(page); - if (unlikely(!page_ext) + if (unlikely(!page_ext)) return; set_bit(PAGE_EXT_IDLE, &page_ext->flags); @@ -98,7 +98,7 @@ static inline void clear_page_idle(struct page *page) { struct page_ext *page_ext = lookup_page_ext(page); - if (unlikely(!page_ext) + if (unlikely(!page_ext)) return; clear_bit(PAGE_EXT_IDLE, &page_ext->flags);
Re: [PATCH] mm: use early_pfn_to_nid in register_page_bootmem_info_node
On 5/25/2016 3:23 PM, Andrew Morton wrote: On Wed, 25 May 2016 14:00:07 -0700 Yang Shi wrote: register_page_bootmem_info_node() is invoked in mem_init(), so it will be called before page_alloc_init_late() if CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled. But pfn_to_nid() depends on the memmap, which won't be fully set up until page_alloc_init_late() is done, so replace pfn_to_nid() with early_pfn_to_nid(). What are the runtime effects of this fix? I didn't experience any problem without the fix. While working on the page_ext_init() fix (replacing with early_pfn_to_nid()), I added a printk before each pfn_to_nid() call to check which ones might be called before page_alloc_init_late(), and this one was caught. From the code perspective, it doesn't look right, since register_page_bootmem_info_section() may miss some pfns when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, just like the problem that happened in page_ext_init(). Thanks, Yang
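The substitution is a one-liner; a sketch of its shape inside register_page_bootmem_info_node() in mm/memory_hotplug.c (the loop context is approximated, not quoted):

    for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
            /*
             * early_pfn_to_nid() works from the early boot data and does
             * not require the struct page to be initialized, unlike
             * pfn_to_nid(), which reads the (possibly still deferred)
             * memmap.
             */
            if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node))
                    register_page_bootmem_info_section(pfn);
    }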
Re: [PATCH] mm: check the return value of lookup_page_ext for all call sites
On 5/25/2016 5:37 PM, Minchan Kim wrote: On Tue, May 24, 2016 at 11:58:11AM +0900, Minchan Kim wrote: On Mon, May 23, 2016 at 10:16:08AM -0700, Yang Shi wrote: Per the discussion with Joonsoo Kim [1], we need to check the return value of lookup_page_ext() for all call sites since it might return NULL in some cases, although it is unlikely, i.e. memory hotplug. Tested with ltp with "page_owner=0". [1] http://lkml.kernel.org/r/20160519002809.GA10245@js1304-P5Q-DELUXE Signed-off-by: Yang Shi I didn't read the code in detail to see how the page_ext memory space is allocated in boot code and memory hotplug, but to me it's not good to check for NULL whenever we call lookup_page_ext. The more dangerous thing is that page_ext is now used by optional features (i.e. not critical for system stability), but if we want to use page_ext as another important tool for the system in the future, it could be a serious problem. Can we put some hooks for page_ext into memory hotplug to guarantee that the page_ext memory space is allocated together with the memmap space? IOW, once any PFN walker finds a page is valid, page_ext is valid too, so lookup_page_ext never returns NULL on a valid page by design. I hope we consider this direction, too. Yang, could you think about this? Thanks a lot for the suggestion. Sorry for the late reply, I was busy preparing patches. I do agree this is a direction we should look into, but I haven't had time to think about it more deeply. I hope Joonsoo can chime in too since he is the original author of page extension. Also, your patch was broken, I think. It doesn't work with !CONFIG_DEBUG_VM && !CONFIG_PAGE_POISONING because lookup_page_ext doesn't return NULL in that case. Actually, I think the #ifdef should be removed if lookup_page_ext() can possibly return NULL. It doesn't make sense to return NULL only when DEBUG_VM is enabled; it should return NULL no matter what debug config is selected. If Joonsoo agrees with me, I'm going to come up with a patch to fix it. Regards, Yang Thanks.
--- include/linux/page_idle.h | 43 --- mm/page_alloc.c | 6 ++ mm/page_owner.c | 27 +++ mm/page_poison.c | 8 +++- mm/vmstat.c | 2 ++ 5 files changed, 78 insertions(+), 8 deletions(-) diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h index bf268fa..8f5d4ad 100644 --- a/include/linux/page_idle.h +++ b/include/linux/page_idle.h @@ -46,33 +46,62 @@ extern struct page_ext_operations page_idle_ops; static inline bool page_is_young(struct page *page) { - return test_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags); + struct page_ext *page_ext; + page_ext = lookup_page_ext(page); + if (unlikely(!page_ext) + return false; + + return test_bit(PAGE_EXT_YOUNG, &page_ext->flags); } static inline void set_page_young(struct page *page) { - set_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags); + struct page_ext *page_ext; + page_ext = lookup_page_ext(page); + if (unlikely(!page_ext) + return; + + set_bit(PAGE_EXT_YOUNG, &page_ext->flags); } static inline bool test_and_clear_page_young(struct page *page) { - return test_and_clear_bit(PAGE_EXT_YOUNG, - &lookup_page_ext(page)->flags); + struct page_ext *page_ext; + page_ext = lookup_page_ext(page); + if (unlikely(!page_ext) + return false; + + return test_and_clear_bit(PAGE_EXT_YOUNG, &page_ext->flags); } static inline bool page_is_idle(struct page *page) { - return test_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags); + struct page_ext *page_ext; + page_ext = lookup_page_ext(page); + if (unlikely(!page_ext) + return false; + + return test_bit(PAGE_EXT_IDLE, &page_ext->flags); } static inline void set_page_idle(struct page *page) { - set_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags); + struct page_ext *page_ext; + page_ext = lookup_page_ext(page); + if (unlikely(!page_ext) + return; + + set_bit(PAGE_EXT_IDLE, &page_ext->flags); } static inline void clear_page_idle(struct page *page) { - clear_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags); + struct page_ext *page_ext; + page_ext = lookup_page_ext(page); + if (unlikely(!page_ext) + return; + + clear_bit(PAGE_EXT_IDLE, &page_ext->flags); } #endif /* CONFIG_64BIT */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f8f3bfc..d27e8b9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -656,6 +656,9 @@ static inline void set_page_guard(struct zone *zone, struct page *page, return; page_ext = lookup_page_ext(page); + if (unlikely(!page_ext)) + return; + __set_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flag
Re: [PATCH] arm64: kasan: instrument user memory access API
On 5/27/2016 4:02 AM, Andrey Ryabinin wrote: On 05/26/2016 09:43 PM, Yang Shi wrote: The upstream commit 1771c6e1a567ea0ba20a4ffe68a1419fd8ef ("x86/kasan: instrument user memory access API") added KASAN instrumentation to the x86 user memory access API, so add such instrumentation to ARM64 too. Tested with the test_kasan module. Signed-off-by: Yang Shi --- arch/arm64/include/asm/uaccess.h | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) Please, cover __copy_from_user() and __copy_to_user() too. Unlike x86, your patch doesn't instrument these two. I should have elaborated on this in my review. Yes, I did think about it, but unlike x86, __copy_to/from_user are implemented by asm code on ARM64. If I add kasan_check_read/write into them, I have to move the registers around to prepare the parameters for the kasan calls, then restore them after the call, for example the below code for __copy_to_user: mov x9, x0 mov x10, x1 mov x11, x2 mov x0, x10 mov x1, x11 bl kasan_check_read mov x0, x9 mov x1, x10 So, I'm wondering whether it is worth it, since __copy_to/from_user are just called in a couple of places, e.g. sctp, a couple of drivers, etc., and not used much. Actually, I think some of them could be replaced by __copy_to/from_user_inatomic. Any idea is appreciated. Thanks, Yang
Re: [PATCH] arm64: kasan: instrument user memory access API
On 5/27/2016 10:46 AM, Mark Rutland wrote: On Fri, May 27, 2016 at 09:34:03AM -0700, Shi, Yang wrote: On 5/27/2016 4:02 AM, Andrey Ryabinin wrote: On 05/26/2016 09:43 PM, Yang Shi wrote: The upstream commit 1771c6e1a567ea0ba20a4ffe68a1419fd8ef ("x86/kasan: instrument user memory access API") added KASAN instrumentation to the x86 user memory access API, so add such instrumentation to ARM64 too. Tested with the test_kasan module. Signed-off-by: Yang Shi --- arch/arm64/include/asm/uaccess.h | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) Please, cover __copy_from_user() and __copy_to_user() too. Unlike x86, your patch doesn't instrument these two. Argh, I missed those when reviewing. My bad. I should have elaborated on this in my review. Yes, I did think about it, but unlike x86, __copy_to/from_user are implemented by asm code on ARM64. If I add kasan_check_read/write into them, I have to move the registers around to prepare the parameters for the kasan calls, then restore them after the call, for example the below code for __copy_to_user: mov x9, x0 mov x10, x1 mov x11, x2 mov x0, x10 mov x1, x11 bl kasan_check_read mov x0, x9 mov x1, x10 There's no need to alter the assembly. Rename the functions (e.g. have __arch_raw_copy_from_user), and add static inline wrappers in uaccess.h that do the kasan calls before calling the assembly functions. That gives the compiler the freedom to do the right thing, and avoids horrible ifdeffery in the assembly code. Thanks for the suggestion, will address in v2. Yang So, I'm wondering whether it is worth it, since __copy_to/from_user are just called in a couple of places, e.g. sctp, a couple of drivers, etc., and not used much. [mark@leverpostej:~/src/linux]% git grep -w __copy_to_user -- ^arch | wc -l 63 [mark@leverpostej:~/src/linux]% git grep -w __copy_from_user -- ^arch | wc -l 47 That's a reasonable number of callsites. If we're going to bother adding this, it should be complete. So please do update __copy_from_user and __copy_to_user. Actually, I think some of them could be replaced by __copy_to/from_user_inatomic. Given the number of existing callers outside of arch code, I think we'll get far more traction reworking the arm64 parts for now. Thanks, Mark.
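A sketch of the wrapper approach Mark describes, using his example naming (__arch_raw_copy_*; the merged patch may have used different names). kasan_check_read/kasan_check_write come from include/linux/kasan-checks.h, introduced by the x86 commit cited above:

    #include <linux/kasan-checks.h>

    /* renamed asm entry points (names follow Mark's example) */
    extern unsigned long __arch_raw_copy_from_user(void *to,
                    const void __user *from, unsigned long n);
    extern unsigned long __arch_raw_copy_to_user(void __user *to,
                    const void *from, unsigned long n);

    static inline unsigned long __copy_from_user(void *to,
                    const void __user *from, unsigned long n)
    {
            kasan_check_write(to, n);       /* destination buffer is written */
            return __arch_raw_copy_from_user(to, from, n);
    }

    static inline unsigned long __copy_to_user(void __user *to,
                    const void *from, unsigned long n)
    {
            kasan_check_read(from, n);      /* source buffer is read */
            return __arch_raw_copy_to_user(to, from, n);
    }

This keeps the assembly untouched and lets the compiler schedule the checks, avoiding the manual register shuffling shown above.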
Re: [PATCH] mm: check the return value of lookup_page_ext for all call sites
On 5/27/2016 1:11 AM, Minchan Kim wrote: On Fri, May 27, 2016 at 03:08:39PM +0900, Joonsoo Kim wrote: On Fri, May 27, 2016 at 02:14:32PM +0900, Minchan Kim wrote: On Thu, May 26, 2016 at 04:15:28PM -0700, Shi, Yang wrote: On 5/25/2016 5:37 PM, Minchan Kim wrote: On Tue, May 24, 2016 at 11:58:11AM +0900, Minchan Kim wrote: On Mon, May 23, 2016 at 10:16:08AM -0700, Yang Shi wrote: Per the discussion with Joonsoo Kim [1], we need to check the return value of lookup_page_ext() for all call sites since it might return NULL in some cases, although it is unlikely, i.e. memory hotplug. Tested with ltp with "page_owner=0". [1] http://lkml.kernel.org/r/20160519002809.GA10245@js1304-P5Q-DELUXE Signed-off-by: Yang Shi I didn't read the code in detail to see how the page_ext memory space is allocated in boot code and memory hotplug, but to me it's not good to check for NULL whenever we call lookup_page_ext. The more dangerous thing is that page_ext is now used by optional features (i.e. not critical for system stability), but if we want to use page_ext as another important tool for the system in the future, it could be a serious problem. Hello, Minchan. Hi Joonsoo, I wonder how pages that aren't managed by the kernel yet would cause a serious problem. Until onlining, these pages are out of our scope. Any information about them would be useless until they are actually activated. I guess that returning NULL for those pages will not hurt any functionality. Do you have any possible scenario where this causes a serious problem? I don't have any specific use case now. That's why I said "in the future". And I don't want to argue whether there is a possible scenario that makes the feature useful or not; if you want, I could write a novel. One example that pops into my mind: xen, hv, and even memory_hotplug itself might want to use page_ext to extend their functionality to hook guest pages. My opinion is that page_ext is an extension of struct page, so it would be better to allow any operation on struct page without limitation if we can do it. Whether it's useful or useless depends on the use case, and we don't need to limit it that way from the beginning. However, the current design allows deferred page_ext population, so any user of page_ext should keep that in mind and either have a fallback plan or not use page_ext for those cases. If we decide to go this way after discussion, we should at least take this chance to make such a limitation clearer somewhere, maybe around the page_ext_operations->need comment. My comment's point is that we should at least consider that approach. It's worth discussing the pros and cons, what's best, and what stands in the way if we can't. And, allocating such memory space doesn't come for free. If someone just adds a memory device and doesn't online it, this memory will be Here are several questions. Cc'ed the hotplug guys. 1. If someone just adds a memory device without onlining it, can kernel code return pfn_valid == true on the offlined pages? 2. If so, it means the memmap for offline memory is already populated somewhere. Where is the memmap allocated? Part of the offlined memory space or other memory? 3. Could we allocate page_ext in part of the offline memory space so that it doesn't consume online memory? wasted. I don't know if there is such a use case, but it's a possible scenario. Can we put some hooks for page_ext into memory hotplug to guarantee that the page_ext memory space is allocated together with the memmap space?
IOW, once any PFN walker finds a page is valid, page_ext is valid too, so lookup_page_ext never returns NULL on a valid page by design. I hope we consider this direction, too. Yang, could you think about this? Thanks a lot for the suggestion. Sorry for the late reply, I was busy preparing patches. I do agree this is a direction we should look into, but I haven't had time to think about it more deeply. I hope Joonsoo can chime in too since he is the original author of page extension. Also, your patch was broken, I think. It doesn't work with !CONFIG_DEBUG_VM && !CONFIG_PAGE_POISONING because lookup_page_ext doesn't return NULL in that case. Actually, I think the #ifdef should be removed if lookup_page_ext() can possibly return NULL. It doesn't make sense to return NULL only when DEBUG_VM is enabled; it should return NULL no matter what debug config is selected. If Joonsoo agrees with me, I'm going to come up with a patch to fix it. Agreed, but let's wait for Minchan's response. If we go this way, how do we guard against this race? Thanks for pointing this out. It sounds reasonable. However, this should only be possible on 32 bit, since just the 32-bit version of page_is_idle() calls lookup_page_ext(); it doesn't on 64 bit. And such a race condition should exist regardless of whether DEBUG_VM is enabled or not, right?
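For context, mm/page_ext.c of this era already hooks memory hotplug through a notifier of roughly the following shape (abridged sketch; details approximated from the 4.6-era code), which is the natural place for the "allocate page_ext together with the memmap" guarantee Minchan is asking for:

    static int __meminit page_ext_callback(struct notifier_block *self,
                                           unsigned long action, void *arg)
    {
            struct memory_notify *mn = arg;
            int ret = 0;

            switch (action) {
            case MEM_GOING_ONLINE:
                    /* allocate page_ext before the pages become visible */
                    ret = online_page_ext(mn->start_pfn, mn->nr_pages,
                                          mn->status_change_nid);
                    break;
            case MEM_OFFLINE:
            case MEM_CANCEL_ONLINE:
                    offline_page_ext(mn->start_pfn, mn->nr_pages);
                    break;
            }

            return notifier_from_errno(ret);
    }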
Re: [PATCH] mm: check the return value of lookup_page_ext for all call sites
On 5/27/2016 1:02 PM, Andrew Morton wrote: On Thu, 26 May 2016 16:15:28 -0700 "Shi, Yang" wrote: I hope we consider this direction, too. Yang, could you think about this? Thanks a lot for the suggestion. Sorry for the late reply, I was busy preparing patches. I do agree this is a direction we should look into, but I haven't had time to think about it more deeply. I hope Joonsoo can chime in too since he is the original author of page extension. Also, your patch was broken, I think. It doesn't work with !CONFIG_DEBUG_VM && !CONFIG_PAGE_POISONING because lookup_page_ext doesn't return NULL in that case. Actually, I think the #ifdef should be removed if lookup_page_ext() can possibly return NULL. It doesn't make sense to return NULL only when DEBUG_VM is enabled; it should return NULL no matter what debug config is selected. If Joonsoo agrees with me, I'm going to come up with a patch to fix it. I've lost the plot here. What is the status of this patch? Latest version: Yes, this is the latest version. We are discussing some future optimizations. And Minchan Kim pointed out a possible race condition which exists even before this patch. I proposed a quick fix; as long as they are happy with the fix, I will post it to the mailing list. Thanks, Yang From: Yang Shi Subject: mm: check the return value of lookup_page_ext for all call sites Per the discussion with Joonsoo Kim [1], we need to check the return value of lookup_page_ext() for all call sites since it might return NULL in some cases, although it is unlikely, i.e. memory hotplug. Tested with ltp with "page_owner=0". [1] http://lkml.kernel.org/r/20160519002809.GA10245@js1304-P5Q-DELUXE [a...@linux-foundation.org: fix build-breaking typos] [a...@arndb.de: fix build problems from lookup_page_ext] Link: http://lkml.kernel.org/r/6285269.2CksypHdYp@wuerfel [a...@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1464023768-31025-1-git-send-email-yang@linaro.org Signed-off-by: Yang Shi Signed-off-by: Arnd Bergmann Cc: Joonsoo Kim Signed-off-by: Andrew Morton --- include/linux/page_idle.h | 43 ++-- mm/page_alloc.c |6 + mm/page_owner.c | 26 + mm/page_poison.c |8 +- mm/vmstat.c |2 + 5 files changed, 77 insertions(+), 8 deletions(-) diff -puN include/linux/page_idle.h~mm-check-the-return-value-of-lookup_page_ext-for-all-call-sites include/linux/page_idle.h --- a/include/linux/page_idle.h~mm-check-the-return-value-of-lookup_page_ext-for-all-call-sites +++ a/include/linux/page_idle.h @@ -46,33 +46,62 @@ extern struct page_ext_operations page_i static inline bool page_is_young(struct page *page) { - return test_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags); + struct page_ext *page_ext = lookup_page_ext(page); + + if (unlikely(!page_ext)) + return false; + + return test_bit(PAGE_EXT_YOUNG, &page_ext->flags); } static inline void set_page_young(struct page *page) { - set_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags); + struct page_ext *page_ext = lookup_page_ext(page); + + if (unlikely(!page_ext)) + return; + + set_bit(PAGE_EXT_YOUNG, &page_ext->flags); } static inline bool test_and_clear_page_young(struct page *page) { - return test_and_clear_bit(PAGE_EXT_YOUNG, - &lookup_page_ext(page)->flags); + struct page_ext *page_ext = lookup_page_ext(page); + + if (unlikely(!page_ext)) + return false; + + return test_and_clear_bit(PAGE_EXT_YOUNG, &page_ext->flags); } static inline bool page_is_idle(struct page *page) { - return test_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags); + struct page_ext *page_ext = 
lookup_page_ext(page); + + if (unlikely(!page_ext)) + return false; + + return test_bit(PAGE_EXT_IDLE, &page_ext->flags); } static inline void set_page_idle(struct page *page) { - set_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags); + struct page_ext *page_ext = lookup_page_ext(page); + + if (unlikely(!page_ext)) + return; + + set_bit(PAGE_EXT_IDLE, &page_ext->flags); } static inline void clear_page_idle(struct page *page) { - clear_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags); + struct page_ext *page_ext = lookup_page_ext(page); + + if (unlikely(!page_ext)) + return; + + clear_bit(PAGE_EXT_IDLE, &page_ext->flags); } #endif /* CONFIG_64BIT */ diff -puN mm/page_alloc.c~mm-check-the-return-value-of-lookup_page_ext-for-all-call-sites mm/page_alloc.c --- a/mm/page_alloc.c~mm-check-the-return-value-of-lookup_page_ext-for-al
Re: [BUG linux-next] kernel NULL pointer dereference on linux-next-20160420
Did some preliminary investigation on this one. The problem is that the cc->freepages list is empty but cc->nr_freepages > 0; it looks like the list and nr_freepages get out of sync somewhere. Any hint is appreciated. Thanks, Yang On 4/21/2016 5:38 PM, Shi, Yang wrote: Hi folks, I did the below test with huge tmpfs on linux-next-20160420: # mount -t tmpfs huge=1 tmpfs /mnt # cd /mnt Then cloned the Linux kernel. Then I got the below bug; the same test works well on non-huge tmpfs. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [] release_freepages+0x18/0xa0 PGD 0 Oops: [#1] PREEMPT SMP Modules linked in: CPU: 6 PID: 110 Comm: kcompactd0 Not tainted 4.6.0-rc4-next-20160420-WR7.0.0.0_standard #4 Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009 task: 880361708040 ti: 880361704000 task.ti: 880361704000 RIP: 0010:[] [] release_freepages+0x18/0xa0 RSP: 0018:880361707cf8 EFLAGS: 00010282 RAX: RBX: 88036ffde7c0 RCX: 0009 RDX: 1bf1 RSI: 000f RDI: 880361707dd0 RBP: 880361707d20 R08: 0007 R09: 1600 R10: 88036ffde7c0 R11: R12: R13: 880361707dd0 R14: 880361707dc0 R15: 880361707db0 FS: () GS:880363cc() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: CR3: 02206000 CR4: 06e0 Stack: 88036ffde7c0 1a00 880361707dc0 880361707db0 880361707da0 8119f13d 81196239 880361708040 0001 0010 Call Trace: [] compact_zone+0x55d/0x9f0 [] ? fragmentation_index+0x19/0x70 [] kcompactd_do_work+0x10f/0x230 [] kcompactd+0x90/0x1e0 [] ? wait_woken+0xa0/0xa0 [] ? kcompactd_do_work+0x230/0x230 [] kthread+0xdd/0x100 [] ret_from_fork+0x22/0x40 [] ? kthread_create_on_node+0x180/0x180 Code: c1 fa 06 31 f6 e8 a9 9b fd ff eb 98 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 56 41 55 49 89 fd 41 54 53 48 8b 07 <48> 8b 10 48 8d 78 e0 49 39 c5 4c 8d 62 e0 74 70 49 be 00 00 00 RIP [] release_freepages+0x18/0xa0 RSP CR2: ---[ end trace 855da7e142f7311f ]---
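For orientation, release_freepages() at this point is roughly the following (approximate shape from 4.6-era mm/compaction.c, not quoted from the crashing build); the faulting access is in the list walk below:

    static unsigned long release_freepages(struct list_head *freelist)
    {
            struct page *page, *next;
            unsigned long high_pfn = 0;

            /* return each isolated free page to the buddy allocator,
             * remembering the highest PFN seen */
            list_for_each_entry_safe(page, next, freelist, lru) {
                    unsigned long pfn = page_to_pfn(page);

                    list_del(&page->lru);
                    __free_page(page);
                    if (pfn > high_pfn)
                            high_pfn = pfn;
            }

            return high_pfn;
    }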
Re: [BUG] set_pte_at: racy dirty state clearing warning
On 4/21/2016 1:49 AM, Catalin Marinas wrote: On Wed, Apr 20, 2016 at 04:01:39PM -0700, Shi, Yang wrote: When I enable memory compaction via # echo 1 > /proc/sys/vm/compact_memory I got the below WARNING: set_pte_at: racy dirty state clearing: 0x006899371bd3 -> 0x006899371fd3 [ cut here ] WARNING: CPU: 5 PID: 294 at ./arch/arm64/include/asm/pgtable.h:227 ptep_set_access_flags+0x138/0x1b8 Modules linked in: Do you have this patch applied: http://article.gmane.org/gmane.linux.ports.arm.kernel/492239 It's also queued into -next as commit 66dbd6e61a52. No, but I just applied it, and it works. Thanks, Yang My kernel has ARM64_HW_AFDBM enabled, but LS2085 is not ARMv8.1. The code shows it just checks whether ARM64_HW_AFDBM is enabled, but doesn't check whether the CPU really has that capability. So, it might be better to have the capability checked at runtime? The warnings are there to spot any incorrect uses of the pte accessors even before you run on AF/DBM-capable hardware.
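If a runtime gate were wanted, a minimal sketch, assuming the HAFDBS field of ID_AA64MMFR1_EL1 (bits [3:0], where 1 means hardware access flag and 2 adds dirty bit management); this is an illustration, not the mainline check:

    #include <asm/cputype.h>

    static bool cpu_has_hw_afdbm(void)
    {
            u64 mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);

            /* HAFDBS >= 2: hardware dirty bit management available */
            return (mmfr1 & 0xf) >= 2;
    }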
Re: [PATCH net-next] bpf: arm64: remove callee-save registers use for tmp registers
On 5/16/2016 4:45 PM, Z Lim wrote: Hi Yang, On Mon, May 16, 2016 at 4:09 PM, Yang Shi wrote: In the current implementation of the ARM64 eBPF JIT, R23 and R24 are used for tmp registers, which are callee-saved registers. This leads to a variable-size JIT prologue and epilogue. The latest constant-blinding change prefers a constant-size prologue and epilogue. AAPCS reserves R9 ~ R15 as temp registers, which need not be saved/restored across function calls. So, replace R23 and R24 with R10 and R11, and remove the tmp_used flag. CC: Zi Shen Lim CC: Daniel Borkmann Signed-off-by: Yang Shi --- Couple suggestions, but otherwise: Acked-by: Zi Shen Lim 1. Update the diagram. I think it should now be: -* BPF fp register => -80:+-+ <= (BPF_FP) +* BPF fp register => -64:+-+ <= (BPF_FP) Nice catch. I forgot the stack diagram. 2. Add a comment in the commit log along the lines of: this is an optimization saving 2 instructions per jited BPF program. Sure, will address in V2. Thanks, Yang Thanks :) z Apply on top of Daniel's blinding constant patchset. arch/arm64/net/bpf_jit_comp.c | 32 1 file changed, 4 insertions(+), 28 deletions(-)
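The heart of the change is the tmp register mapping in arch/arm64/net/bpf_jit_comp.c; a sketch with the rest of the table elided (TMP_REG_* definitions follow the 4.6-era file):

    /* temporary registers for internal BPF JIT */
    #define TMP_REG_1 (MAX_BPF_REG + 0)
    #define TMP_REG_2 (MAX_BPF_REG + 1)

    static const int bpf2a64[] = {
            /* ... mappings for BPF_REG_0 .. BPF_REG_FP elided ... */

            /* x23/x24 were callee-saved and had to be spilled/filled in
             * the prologue/epilogue when used; x10/x11 are AAPCS
             * temporaries (x9-x15) and need no save/restore, so the
             * prologue and epilogue become constant-size. */
            [TMP_REG_1] = A64_R(10),        /* was A64_R(23) */
            [TMP_REG_2] = A64_R(11),        /* was A64_R(24) */
    };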
Re: [PATCH 1/2] arm64: bpf: add 'store immediate' instruction
On 11/11/2015 4:39 AM, Will Deacon wrote: On Wed, Nov 11, 2015 at 12:12:56PM +, Will Deacon wrote: On Tue, Nov 10, 2015 at 06:45:39PM -0800, Z Lim wrote: On Tue, Nov 10, 2015 at 2:41 PM, Yang Shi wrote: aarch64 doesn't have a native store-immediate instruction; such an operation Actually, aarch64 does have "STR (immediate)". For the arm64 JIT, we can consider using it as an optimization. Yes, I'd definitely like to see that in preference to moving via a temporary register. Wait a second, we're both talking rubbish here :) The STR (immediate) form is referring to the addressing mode, whereas this patch wants to store an immediate value to memory, which does need moving to a register first. Yes, the immediate means an immediate offset for the addressing index, not storing an immediate to memory. I don't think any load-store architecture has a store-immediate instruction. Thanks, Yang So the original patch is fine. Will
Re: [PATCH 2/2] arm64: bpf: make BPF prologue and epilogue align with ARM64 AAPCS
On 11/12/2015 7:28 PM, Z Lim wrote: On Thu, Nov 12, 2015 at 1:57 PM, Yang Shi wrote: Save and restore FP/LR in the BPF prog prologue and epilogue, and save SP to FP in the prologue in order to get the correct stack backtrace. However, the ARM64 JIT used FP (x29) as the eBPF fp register; FP is subject to change during function calls, so it may cause the BPF prog stack base address to change too. Use x25 to replace FP as the BPF stack base register (fp). Since x25 is a callee-saved register, it will stay intact across function calls. It is initialized in the BPF prog prologue every time the BPF prog starts to run. When the BPF prog exits, it can just be tossed. So, the BPF stack layout looks like: high original A64_SP => 0:+-+ BPF prologue | | FP/LR and callee saved registers BPF fp register => -64:+-+ | | | ... | BPF prog stack | | | | current A64_SP/FP => +-+ | | | ... | Function call stack | | +-+ low Yang, for stack unwinding to work, shouldn't it be something like the following? Yes, thanks for catching this. v3 will be posted soon. Yang | LR | A64_FP => | FP | | .. |
Re: [PATCH] arm64: reenable interrupt when handling ptrace breakpoint
On 12/16/2015 3:13 AM, Will Deacon wrote: On Tue, Dec 15, 2015 at 04:18:08PM -0800, Yang Shi wrote: The kernel just sends a SIGTRAP signal when handling a ptrace breakpoint in the debug exception, so it sounds safe to have interrupts enabled if they are not disabled by the parent process. Is this actually fixing an issue you're seeing, or did you just spot this? Given that force_sig_info disables interrupts, I don't think this is really worth doing. I should have given more context in the first place; sorry for the inconvenience. I did run into some problems on the -rt kernel with CRIU restore, please see the below kernel bug log: BUG: sleeping function called from invalid context at /kernel-source/kernel/locking/rtmutex.c:917 in_atomic(): 0, irqs_disabled(): 128, pid: 551, name: test.sh CPU: 5 PID: 551 Comm: test.sh Not tainted 4.1.13-rt13 #7 Hardware name: Freescale Layerscape 2085a RDB Board (DT) Call trace: [] dump_backtrace+0x0/0x128 [] show_stack+0x24/0x30 [] dump_stack+0x80/0xa0 [] ___might_sleep+0x128/0x1a0 [] rt_spin_lock+0x2c/0x40 [] force_sig_info+0xcc/0x210 [] brk_handler.part.2+0x6c/0x80 [] brk_handler+0xd8/0xe8 [] do_debug_exception+0x58/0xb8 Exception stack(0x80834b6e3e30 to 0x80834b6e3f50) 3e20: 004e6000 3e40: 0044 d280 0007 3e60: 0021 011a 00de 3e80: 4ab9ef50 8083 00086324 8000 e587f780 000839b0 8000 3ea0: 004e6000 0044 3ec0: 6000 0015 aa3e6000 d280 3ee0: 0007 0021 3f00: 00de 0008 3f20: 004eff00 004e4ff0 004f0490 e587f730 3f40: 0046f508 0028 It is because force_sig_info() calls spin_lock_irqsave(), which could sleep on an -rt kernel with irqs disabled. However, it just happens at brk_handler in my test. But I saw that single_step has the same code path, so I expanded it to single step too. Since this is rt related, cc to the rt mailing list too. Signed-off-by: Yang Shi --- arch/arm64/kernel/debug-monitors.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/arm64/kernel/debug-monitors.c b/arch/arm64/kernel/debug-monitors.c index 8aee3ae..90d70e4 100644 --- a/arch/arm64/kernel/debug-monitors.c +++ b/arch/arm64/kernel/debug-monitors.c @@ -239,6 +239,9 @@ static int single_step_handler(unsigned long addr, unsigned int esr, return 0; if (user_mode(regs)) { + if (interrupts_enabled(regs)) + local_irq_enable(); + My worry here is that we take an interrupt and, on the return path, decide to reschedule due to CONFIG_PREEMPT. If we somehow end up back in the debugger, I'm concerned that it could remove the breakpoint and then later see an unexpected SIGTRAP from the child. Having said that, I've failed to construct a non-racy scenario in which that can happen, but I'm just really uncomfortable making this change unless there's a real problem being solved. The patch is inspired by the similar code in other architectures, e.g. x86 and powerpc, which have a hardware debug exception to handle breakpoints and single step like arm64. And they have interrupts enabled in both breakpoint and single step. So, I suppose arm64 could do the same thing. For the preempt case, it might be possible, but it sounds like an exception handling problem to me. Preemption should not be allowed in the debug exception (the current arm64 kernel does this), and in the interrupt return path the code should check whether debug is on or not. If debug is on, preemption should just be skipped. Or we could disable preemption in the debug exception. I also checked the handling in x86 and powerpc; they go different ways. 1. x86: disable preemption in the IST exception since it uses a per-CPU stack. 2. powerpc: check whether debug is on in the interrupt return path. 
Powerpc has DBCR0_IDM to indicate whether the processor is in debug mode. For ARM64, I can't find such a bit. So, I may consider disabling preemption. Thanks, Yang Will
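A sketch of the "disable preemption across the debug exception" option being considered; the handler call is a hypothetical placeholder, not the actual arm64 code:

    asmlinkage int do_debug_exception(unsigned long addr, unsigned int esr,
                                      struct pt_regs *regs)
    {
            int ret;

            /*
             * Keep the task from being preempted (and possibly migrated)
             * while the breakpoint/single-step hooks run with interrupts
             * re-enabled.
             */
            preempt_disable();
            ret = call_debug_fault_handler(addr, esr, regs); /* hypothetical */
            preempt_enable();

            return ret;
    }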
Re: [PATCH] writeback: initialize m_dirty to avoid compile warning
On 11/18/2015 1:53 AM, Arnd Bergmann wrote: On Tuesday 17 November 2015 15:38:55 Andrew Morton wrote: On Fri, 13 Nov 2015 10:26:41 -0800 Yang Shi wrote: When building the kernel with gcc 5.2, the below warning is raised: mm/page-writeback.c: In function 'balance_dirty_pages.isra.10': mm/page-writeback.c:1545:17: warning: 'm_dirty' may be used uninitialized in this function [-Wmaybe-uninitialized] unsigned long m_dirty, m_thresh, m_bg_thresh; The m_{dirty,thresh,bg_thresh} variables are initialized in the "if (mdtc)" block, so if mdtc is NULL, they won't be initialized before being used. Initialize m_dirty to zero, and also initialize m_thresh and m_bg_thresh for consistency. They are used later by the if condition: !mdtc || m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh) If mdtc is NULL, dirty_freerun_ceiling will not be called at all, so the initialization will not change any behavior other than silencing the compile warning. Geeze I hate that warning. gcc really could be a bit smarter about it and this is such a case. --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1542,7 +1542,7 @@ static void balance_dirty_pages(struct address_space *mapping, for (;;) { unsigned long now = jiffies; unsigned long dirty, thresh, bg_thresh; - unsigned long m_dirty, m_thresh, m_bg_thresh; + unsigned long m_dirty = 0, m_thresh = 0, m_bg_thresh = 0; /* * Unstable writes are a feature of certain networked Adding runtime overhead to suppress a compile-time warning is Just Wrong. With gcc-4.4.4 the above patch actually reduces page-writeback.o's .text by 36 bytes, lol. With gcc-4.8.4 the patch saves 19 bytes. No idea what's going on there... I've done tons of build tests and never got the warning for any variables other than m_dirty, and that one also just with very few configurations (e.g. ARM omap2plus_defconfig). Yes, I just got the warning for m_dirty too. I just initialized m_thresh and m_bg_thresh for consistency (not sure if it is necessary). And I'm a little bit confused why gcc just reports m_dirty but ignores the others. How about initializing only m_dirty but not the others? Fine with me. Thanks, Yang Arnd
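The gcc limitation reproduces in isolation; a minimal sketch with illustrative names:

    /*
     * The read of `val` is guarded by the same condition that guards its
     * initialization, but gcc's -Wmaybe-uninitialized analysis cannot
     * connect the two across the short-circuit `!p ||`.
     */
    unsigned long demo(unsigned long *p, unsigned long ceiling)
    {
            unsigned long val;      /* may trigger -Wmaybe-uninitialized */

            if (p)
                    val = *p;       /* the only initialization */

            if (!p || val <= ceiling)
                    return 0;

            return val;
    }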
Re: [PATCH] writeback: initialize m_dirty to avoid compile warning
On 11/18/2015 10:11 AM, Tejun Heo wrote: Hello, On Tue, Nov 17, 2015 at 03:38:55PM -0800, Andrew Morton wrote: --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1542,7 +1542,7 @@ static void balance_dirty_pages(struct address_space *mapping, for (;;) { unsigned long now = jiffies; unsigned long dirty, thresh, bg_thresh; - unsigned long m_dirty, m_thresh, m_bg_thresh; + unsigned long m_dirty = 0, m_thresh = 0, m_bg_thresh = 0; /* * Unstable writes are a feature of certain networked Adding runtime overhead to suppress a compile-time warning is Just Wrong. With gcc-4.4.4 the above patch actually reduces page-writeback.o's .text by 36 bytes, lol. With gcc-4.8.4 the patch saves 19 bytes. No idea what's going on there... And initializing locals in the above fashion can hide real bugs - looky: This was the main reason the code was structured the way it is. If cgroup writeback is not enabled, any derefs of mdtc variables should trigger warnings. Ugh... I don't know. The compiler really should be able to tell this much. Thanks for the explanation. It sounds like a compiler problem. If you think it is still worth silencing the compile warning, maybe we could just assign it an insane value as Andrew suggested, maybe 0xdeadbeef. Thanks, Yang Thanks.
Re: [PATCH] writeback: initialize m_dirty to avoid compile warning
On 11/18/2015 10:33 AM, Tejun Heo wrote: Hello, On Wed, Nov 18, 2015 at 10:27:32AM -0800, Shi, Yang wrote: This was the main reason the code was structured the way it is. If cgroup writeback is not enabled, any derefs of mdtc variables should trigger warnings. Ugh... I don't know. The compiler really should be able to tell this much. Thanks for the explanation. It sounds like a compiler problem. If you think it is still worth silencing the compile warning, maybe we could If this is gonna be a problem with new gcc versions, I don't think we have any other options. :( just assign it an insane value as Andrew suggested, maybe 0xdeadbeef. I'd just keep it at zero. Whatever we do, the effect is gonna be difficult to track down - it's not gonna blow up in an obvious way. Can you please add a comment tho explaining that this is to work around a compiler deficiency? Sure. Other than this, in v2 I will just initialize m_dirty, since the compiler only reports it as uninitialized. Thanks, Yang Thanks.
Re: [PATCH] arm64: restore bogomips information in /proc/cpuinfo
On 11/18/2015 10:47 AM, Will Deacon wrote: Hello, On Wed, Nov 18, 2015 at 10:15:05AM -0800, Yang Shi wrote: As Pavel Machek reported [1], some userspace applications depend on the BogoMIPS value shown by /proc/cpuinfo. Although there is much less legacy impact on aarch64 than arm, it does break libvirt. Basically, this patch reverts commit 326b16db9f69fd0d279be873c6c00f88c0a4aad5 ("arm64: delay: don't bother reporting bogomips in /proc/cpuinfo"), but with some tweaks due to context changes. [1] https://lkml.org/lkml/2015/1/4/132 I lost this argument last time around, so I won't re-tread that path this time around. I do, however, have some comments on the patch. Signed-off-by: Yang Shi --- arch/arm64/kernel/cpuinfo.c | 5 + arch/arm64/kernel/smp.c | 7 ++- 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c index 706679d..8d4ba77 100644 --- a/arch/arm64/kernel/cpuinfo.c +++ b/arch/arm64/kernel/cpuinfo.c @@ -30,6 +30,7 @@ #include #include #include +#include /* * In case the boot CPU is hotpluggable, we record its initial state and @@ -112,6 +113,10 @@ static int c_show(struct seq_file *m, void *v) */ seq_printf(m, "processor\t: %d\n", i); + seq_printf(m, "BogoMIPS\t: %lu.%02lu\n\n", This double newline makes /proc/cpuinfo look really odd. Can we just have one, please? + loops_per_jiffy / (500000UL/HZ), + loops_per_jiffy / (5000UL/HZ) % 100); + /* * Dump out the common processor features in a single line. * Userspace should read the hwcaps with getauxval(AT_HWCAP) diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index b1adc51..1bed772 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -326,7 +326,12 @@ static void __init hyp_mode_check(void) void __init smp_cpus_done(unsigned int max_cpus) { - pr_info("SMP: Total of %d processors activated.\n", num_online_cpus()); + unsigned long bogosum = loops_per_jiffy * num_online_cpus(); + + pr_info("SMP: Total of %d processors activated (%lu.%02lu BogoMIPS).\n", + num_online_cpus(), bogosum / (500000/HZ), + (bogosum / (5000/HZ)) % 100); Can we drop this hunk? I don't see a pressing need to print this in dmesg. Sure, will address both in v2. Thanks, Yang With those two changes: Acked-by: Will Deacon I guess this needs Cc'ing to stable, too. Thanks, Will
Re: [PATCH] writeback: initialize m_dirty to avoid compile warning
On 11/18/2015 10:55 AM, Andrew Morton wrote: On Wed, 18 Nov 2015 10:39:23 -0800 "Shi, Yang" wrote: On 11/18/2015 10:33 AM, Tejun Heo wrote: Hello, On Wed, Nov 18, 2015 at 10:27:32AM -0800, Shi, Yang wrote: This was the main reason the code was structured the way it is. If cgroup writeback is not enabled, any derefs of mdtc variables should trigger warnings. Ugh... I don't know. The compiler really should be able to tell this much. Thanks for the explanation. It sounds like a compiler problem. If you think it is still worth silencing the compile warning, maybe we could If this is gonna be a problem with new gcc versions, I don't think we have any other options. :( just assign it an insane value as Andrew suggested, maybe 0xdeadbeef. I'd just keep it at zero. Whatever we do, the effect is gonna be difficult to track down - it's not gonna blow up in an obvious way. Can you please add a comment tho explaining that this is to work around a compiler deficiency? Sure. Other than this, in v2 I will just initialize m_dirty, since the compiler only reports it as uninitialized. gcc-4.4.4 and gcc-4.8.4 warn about all three variables. It sounds like 5.x is smarter :-) --- a/mm/page-writeback.c~writeback-initialize-m_dirty-to-avoid-compile-warning-fix +++ a/mm/page-writeback.c @@ -1542,7 +1542,9 @@ static void balance_dirty_pages(struct a for (;;) { unsigned long now = jiffies; unsigned long dirty, thresh, bg_thresh; - unsigned long m_dirty = 0, m_thresh = 0, m_bg_thresh = 0; + unsigned long m_dirty = 0; /* stop bogus uninit warnings */ + unsigned long m_thresh = 0; + unsigned long m_bg_thresh = 0; Still need v2? Thanks, Yang /* * Unstable writes are a feature of certain networked
Re: [PATCH] arm64: bpf: fix buffer pointer
On 11/18/2015 12:56 AM, Zi Shen Lim wrote: During code review, I noticed we were passing a bad buffer pointer to the bpf_load_pointer helper function called by jitted code. Point to the buffer allocated by the JIT, so we don't silently corrupt other parts of the stack. Signed-off-by: Zi Shen Lim --- arch/arm64/net/bpf_jit_comp.c | 27 +-- 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index d6a53ef..7cf032b 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -139,6 +139,12 @@ static inline int epilogue_offset(const struct jit_ctx *ctx) /* Stack must be multiples of 16B */ #define STACK_ALIGN(sz) (((sz) + 15) & ~15) +#define _STACK_SIZE \ + (MAX_BPF_STACK \ ++ 4 /* extra for skb_copy_bits buffer */) + +#define STACK_SIZE STACK_ALIGN(_STACK_SIZE) + static void build_prologue(struct jit_ctx *ctx) { const u8 r6 = bpf2a64[BPF_REG_6]; @@ -150,10 +156,6 @@ static void build_prologue(struct jit_ctx *ctx) const u8 rx = bpf2a64[BPF_REG_X]; const u8 tmp1 = bpf2a64[TMP_REG_1]; const u8 tmp2 = bpf2a64[TMP_REG_2]; - int stack_size = MAX_BPF_STACK; - - stack_size += 4; /* extra for skb_copy_bits buffer */ - stack_size = STACK_ALIGN(stack_size); /* * BPF prog stack layout @@ -165,12 +167,13 @@ static void build_prologue(struct jit_ctx *ctx) *| ... | callee saved registers *+-+ *| | x25/x26 -* BPF fp register => -80:+-+ +* BPF fp register => -80:+-+ <= (BPF_FP) *| | *| ... | BPF prog stack *| | -*| | -* current A64_SP => +-+ +*+-+ <= (BPF_FP - MAX_BPF_STACK) +*|RSVD | JIT scratchpad +* current A64_SP => +-+ <= (BPF_FP - STACK_SIZE) *| | *| ... | Function call stack *| | @@ -196,7 +199,7 @@ static void build_prologue(struct jit_ctx *ctx) emit(A64_MOV(1, fp, A64_SP), ctx); /* Set up function call stack */ - emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx); + emit(A64_SUB_I(1, A64_SP, A64_SP, STACK_SIZE), ctx); /* Clear registers A and X */ emit_a64_mov_i64(ra, 0, ctx); @@ -213,13 +216,9 @@ static void build_epilogue(struct jit_ctx *ctx) const u8 fp = bpf2a64[BPF_REG_FP]; const u8 tmp1 = bpf2a64[TMP_REG_1]; const u8 tmp2 = bpf2a64[TMP_REG_2]; - int stack_size = MAX_BPF_STACK; - - stack_size += 4; /* extra for skb_copy_bits buffer */ - stack_size = STACK_ALIGN(stack_size); /* We're done with BPF stack */ - emit(A64_ADD_I(1, A64_SP, A64_SP, stack_size), ctx); + emit(A64_ADD_I(1, A64_SP, A64_SP, STACK_SIZE), ctx); /* Restore fs (x25) and x26 */ emit(A64_POP(fp, A64_R(26), A64_SP), ctx); @@ -658,7 +657,7 @@ emit_cond_jmp: return -EINVAL; } emit_a64_mov_i64(r3, size, ctx); - emit(A64_ADD_I(1, r4, fp, MAX_BPF_STACK), ctx); + emit(A64_SUB_I(1, r4, fp, STACK_SIZE), ctx); Shouldn't it subtract MAX_BPF_STACK? If you subtract STACK_SIZE here, the buffer pointer will point to the bottom of the reserved area. Your stack layout change also shows this: +*+-+ <= (BPF_FP - MAX_BPF_STACK) +*|RSVD | JIT scratchpad +* current A64_SP => +-+ <= (BPF_FP - STACK_SIZE) Thanks, Yang emit_a64_mov_i64(r5, (unsigned long)bpf_load_pointer, ctx); emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx); emit(A64_MOV(1, A64_FP, A64_SP), ctx);
Re: [PATCH] arm64: bpf: fix buffer pointer
On 11/18/2015 1:41 PM, Z Lim wrote: On Wed, Nov 18, 2015 at 1:07 PM, Shi, Yang wrote: On 11/18/2015 12:56 AM, Zi Shen Lim wrote: emit_a64_mov_i64(r3, size, ctx); - emit(A64_ADD_I(1, r4, fp, MAX_BPF_STACK), ctx); + emit(A64_SUB_I(1, r4, fp, STACK_SIZE), ctx); Shouldn't it subtract MAX_BPF_STACK? No, if it's at (BPF_FP - MAX_BPF_STACK), we'll be writing into the BPF stack area, which should only be used by the BPF program. If you subtract STACK_SIZE here, the buffer pointer will point to the bottom of the reserved area. Yes, that's the idea. The buffer is allocated here. Right now we're using this "reserved" space for this buffer only. OK, I see. The buffer grows from low to high. Thanks for the elaboration. Acked-by: Yang Shi Yang Your stack layout change also shows this: +*+-+ <= (BPF_FP - MAX_BPF_STACK) +*|RSVD | JIT scratchpad +* current A64_SP => +-+ <= (BPF_FP - STACK_SIZE) Yes, this diagram reflects the code and intention. Thanks for reviewing, we definitely need more of these :)
Re: [V2 PATCH] arm64: restore bogomips information in /proc/cpuinfo
On 11/19/2015 9:59 AM, Catalin Marinas wrote: On Wed, Nov 18, 2015 at 10:48:55AM -0800, Yang Shi wrote: As Pavel Machek reported [1], some userspace applications depend on the BogoMIPS value shown by /proc/cpuinfo. Although there is much less legacy impact on aarch64 than arm, it does break libvirt. Basically, this patch reverts commit 326b16db9f69fd0d279be873c6c00f88c0a4aad5 ("arm64: delay: don't bother reporting bogomips in /proc/cpuinfo"), but with some tweaks due to context changes. [1] https://lkml.org/lkml/2015/1/4/132 Acked-by: Will Deacon Cc: #3.12+ Signed-off-by: Yang Shi Patch applied as a fix for stable, basically returning to the pre-3.12 behaviour. If there is a need for some more useful information here, it can be done as an additional patch, though without cc: stable. Thanks. This approach sounds safer to userspace. Yang Thanks.
Re: [RESEND PATCH] arm64: bpf: add 'store immediate' instruction
On 11/30/2015 2:24 PM, Yang Shi wrote: aarch64 doesn't have a native store-immediate instruction; such an operation has to be implemented by the below instruction sequence: Load immediate to register Store register Signed-off-by: Yang Shi CC: Zi Shen Lim I had an offline email exchange with Zi Shen Lim since he is traveling and cannot send text-only mail; his reply is quoted below: "I've given reviewed-by in response to original posting. Unless something has changed, feel free to add it." Since nothing has changed, I added his reviewed-by. Reviewed-by: Zi Shen Lim Thanks, Yang CC: Xi Wang --- This patch might be buried by the storm of the xadd discussion; however, it is absolutely irrelevant to xadd, so I am resending the patch itself. arch/arm64/net/bpf_jit_comp.c | 20 +++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index 6809647..49c1f1b 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -563,7 +563,25 @@ emit_cond_jmp: case BPF_ST | BPF_MEM | BPF_H: case BPF_ST | BPF_MEM | BPF_B: case BPF_ST | BPF_MEM | BPF_DW: - goto notyet; + /* Load imm to a register then store it */ + ctx->tmp_used = 1; + emit_a64_mov_i(1, tmp2, off, ctx); + emit_a64_mov_i(1, tmp, imm, ctx); + switch (BPF_SIZE(code)) { + case BPF_W: + emit(A64_STR32(tmp, dst, tmp2), ctx); + break; + case BPF_H: + emit(A64_STRH(tmp, dst, tmp2), ctx); + break; + case BPF_B: + emit(A64_STRB(tmp, dst, tmp2), ctx); + break; + case BPF_DW: + emit(A64_STR64(tmp, dst, tmp2), ctx); + break; + } + break; /* STX: *(size *)(dst + off) = src */ case BPF_STX | BPF_MEM | BPF_W:
Re: [PATCH 1/7] trace/events: Add gup trace events
On 12/1/2015 3:56 PM, Steven Rostedt wrote: On Tue, 1 Dec 2015 15:06:11 -0800 Yang Shi wrote: The page-faults event records invocations of handle_mm_fault, but the invocation may come from do_page_fault or gup. In some use cases, a finer event count may be needed, so add trace event support for: __get_user_pages __get_user_pages_fast fixup_user_fault Signed-off-by: Yang Shi --- include/trace/events/gup.h | 77 ++ 1 file changed, 77 insertions(+) create mode 100644 include/trace/events/gup.h diff --git a/include/trace/events/gup.h b/include/trace/events/gup.h new file mode 100644 index 000..37d18f9 --- /dev/null +++ b/include/trace/events/gup.h @@ -0,0 +1,77 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM gup + +#if !defined(_TRACE_GUP_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_GUP_H + +#include +#include + +TRACE_EVENT(gup_fixup_user_fault, + + TP_PROTO(struct task_struct *tsk, struct mm_struct *mm, + unsigned long address, unsigned int fault_flags), + + TP_ARGS(tsk, mm, address, fault_flags), + + TP_STRUCT__entry( + __array(char, comm, TASK_COMM_LEN ) Why save the comm? The tracing infrastructure should keep track of that. The code was modeled on kmem.h, which copies comm. If it is unnecessary, it definitely could be removed. + __field(unsigned long, address ) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->address = address; + ), + + TP_printk("comm=%s address=%lx", __entry->comm, __entry->address) +); + +TRACE_EVENT(gup_get_user_pages, + + TP_PROTO(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas, int *nonblocking), + + TP_ARGS(tsk, mm, start, nr_pages, gup_flags, pages, vmas, nonblocking), Why so many arguments? Most are not used. My understanding of TP_ARGS may not be right. Doesn't it require all the args defined by the function? If not, it could definitely be shrunk. Do we just need to keep the args used by TP_printk? Thanks, Yang -- Steve + + TP_STRUCT__entry( + __array(char, comm, TASK_COMM_LEN ) + __field(unsigned long, start ) + __field(unsigned long, nr_pages) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->start = start; + __entry->nr_pages= nr_pages; + ), + + TP_printk("comm=%s start=%lx nr_pages=%lu", __entry->comm, __entry->start, __entry->nr_pages) +); + +TRACE_EVENT(gup_get_user_pages_fast, + + TP_PROTO(unsigned long start, int nr_pages, int write, + struct page **pages), + + TP_ARGS(start, nr_pages, write, pages), + + TP_STRUCT__entry( + __field(unsigned long, start ) + __field(unsigned long, nr_pages) + ), + + TP_fast_assign( + __entry->start = start; + __entry->nr_pages= nr_pages; + ), + + TP_printk("start=%lx nr_pages=%lu", __entry->start, __entry->nr_pages) +); + +#endif /* _TRACE_GUP_H */ + +/* This part must be outside protection */ +#include 
Re: [PATCH 1/7] trace/events: Add gup trace events
On 12/1/2015 4:18 PM, Steven Rostedt wrote: On Tue, 01 Dec 2015 16:07:44 -0800 "Shi, Yang" wrote: On 12/1/2015 3:56 PM, Steven Rostedt wrote: On Tue, 1 Dec 2015 15:06:11 -0800 Yang Shi wrote: The page-faults event records invocations of handle_mm_fault, but the invocation may come from do_page_fault or gup. In some use cases, a finer event count may be needed, so add trace event support for: __get_user_pages __get_user_pages_fast fixup_user_fault Signed-off-by: Yang Shi --- include/trace/events/gup.h | 77 ++ 1 file changed, 77 insertions(+) create mode 100644 include/trace/events/gup.h diff --git a/include/trace/events/gup.h b/include/trace/events/gup.h new file mode 100644 index 000..37d18f9 --- /dev/null +++ b/include/trace/events/gup.h @@ -0,0 +1,77 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM gup + +#if !defined(_TRACE_GUP_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_GUP_H + +#include +#include + +TRACE_EVENT(gup_fixup_user_fault, + + TP_PROTO(struct task_struct *tsk, struct mm_struct *mm, + unsigned long address, unsigned int fault_flags), + + TP_ARGS(tsk, mm, address, fault_flags), + + TP_STRUCT__entry( + __array(char, comm, TASK_COMM_LEN ) Why save the comm? The tracing infrastructure should keep track of that. The code was modeled on kmem.h, which copies comm. If it is unnecessary, it definitely could be removed. Sometimes comm isn't that reliable. But really, the only tracepoints that should record it are sched_switch and sched_wakeup. With those two, the rest of the trace points should be fine. + __field(unsigned long, address ) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->address = address; + ), + + TP_printk("comm=%s address=%lx", __entry->comm, __entry->address) +); + +TRACE_EVENT(gup_get_user_pages, + + TP_PROTO(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas, int *nonblocking), + + TP_ARGS(tsk, mm, start, nr_pages, gup_flags, pages, vmas, nonblocking), Why so many arguments? Most are not used. My understanding of TP_ARGS may not be right. Doesn't it require all the args defined by the function? If not, it could definitely be shrunk. Do we just need to keep the args used by TP_printk? It only needs what is used by TP_fast_assign(). Thanks, will fix them in V2. Yang -- Steve
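Putting the two review points together, the slimmed-down v2 event would look roughly like this (sketch; header boilerplate omitted):

    TRACE_EVENT(gup_get_user_pages,

            TP_PROTO(unsigned long start, unsigned long nr_pages),

            TP_ARGS(start, nr_pages),

            TP_STRUCT__entry(
                    /* comm dropped: sched_switch/sched_wakeup track it */
                    __field(unsigned long, start)
                    __field(unsigned long, nr_pages)
            ),

            TP_fast_assign(
                    __entry->start = start;
                    __entry->nr_pages = nr_pages;
            ),

            TP_printk("start=%lx nr_pages=%lu",
                      __entry->start, __entry->nr_pages)
    );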
Re: [PATCH V2 2/7] mm/gup: add gup trace points
On 12/2/2015 3:36 PM, Dave Hansen wrote:
On 12/02/2015 02:53 PM, Yang Shi wrote:

diff --git a/mm/gup.c b/mm/gup.c
index deafa2c..10245a4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -13,6 +13,9 @@
 #include
 #include
+#define CREATE_TRACE_POINTS
+#include <trace/events/gup.h>
+
 #include
 #include

This needs to be _the_ last thing that gets #included. Otherwise, you risk colliding with any other trace header that gets implicitly included below.

Thanks for the suggestion, will move it to the last.

@@ -1340,6 +1346,8 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 					start, len)))
 		return 0;
+
+	trace_gup_get_user_pages_fast(start, nr_pages, write, pages);
+
 	/*
 	 * Disable interrupts. We use the nested form as we can already have
 	 * interrupts disabled by get_futex_key.

It would be _really_ nice to be able to see return values from the various gup calls as well. Is that feasible?

I think it should be feasible. The kmem_cache_alloc trace event can show a return value; I suppose the gup trace events should be able to do the same thing.

Regards,
Yang
Re: [PATCH V2 2/7] mm/gup: add gup trace points
On 12/2/2015 4:11 PM, Shi, Yang wrote:
On 12/2/2015 3:36 PM, Dave Hansen wrote:
On 12/02/2015 02:53 PM, Yang Shi wrote:

diff --git a/mm/gup.c b/mm/gup.c
index deafa2c..10245a4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -13,6 +13,9 @@
 #include
 #include
+#define CREATE_TRACE_POINTS
+#include <trace/events/gup.h>
+
 #include
 #include

This needs to be _the_ last thing that gets #included. Otherwise, you risk colliding with any other trace header that gets implicitly included below.

Thanks for the suggestion, will move it to the last.

@@ -1340,6 +1346,8 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 					start, len)))
 		return 0;
+
+	trace_gup_get_user_pages_fast(start, nr_pages, write, pages);
+
 	/*
 	 * Disable interrupts. We use the nested form as we can already have
 	 * interrupts disabled by get_futex_key.

It would be _really_ nice to be able to see return values from the various gup calls as well. Is that feasible?

I think it should be feasible. The kmem_cache_alloc trace event can show a return value; I suppose the gup trace events should be able to do the same thing.

Just did a quick test, it is definitely feasible. Please check the below test log:

trace-cmd-200 [000] 99.221486: gup_get_user_pages: start=800ff0 nr_pages=1 ret=1
trace-cmd-200 [000] 99.223215: gup_get_user_pages: start=800fdb nr_pages=1 ret=1
trace-cmd-200 [000] 99.223298: gup_get_user_pages: start=800ed0 nr_pages=1 ret=1

nr_pages is the number of pages requested by the gup; ret is the return value.

If nobody objects, I will add it into V3.

Regards,
Yang

Regards,
Yang
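[Editor's note: for reference, a hedged sketch of how the return value shown in that log could be carried by the event. Field layout is illustrative only, not the posted V3:]

	TRACE_EVENT(gup_get_user_pages,

		TP_PROTO(unsigned long start, unsigned long nr_pages, long ret),

		TP_ARGS(start, nr_pages, ret),

		TP_STRUCT__entry(
			__field(unsigned long, start)
			__field(unsigned long, nr_pages)
			__field(long, ret)	/* pages pinned, or -errno */
		),

		TP_fast_assign(
			__entry->start = start;
			__entry->nr_pages = nr_pages;
			__entry->ret = ret;
		),

		TP_printk("start=%lx nr_pages=%lu ret=%ld",
			  __entry->start, __entry->nr_pages, __entry->ret)
	);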
Re: [PATCH V2 2/7] mm/gup: add gup trace points
On 12/2/2015 8:13 PM, Steven Rostedt wrote:
On Wed, 2 Dec 2015 15:36:50 -0800 Dave Hansen wrote:
On 12/02/2015 02:53 PM, Yang Shi wrote:

diff --git a/mm/gup.c b/mm/gup.c
index deafa2c..10245a4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -13,6 +13,9 @@
 #include
 #include
+#define CREATE_TRACE_POINTS
+#include <trace/events/gup.h>
+
 #include
 #include

This needs to be _the_ last thing that gets #included. Otherwise, you risk colliding with any other trace header that gets implicitly included below.

Agreed.

@@ -1340,6 +1346,8 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 					start, len)))
 		return 0;
+
+	trace_gup_get_user_pages_fast(start, nr_pages, write, pages);
+
 	/*
 	 * Disable interrupts. We use the nested form as we can already have
 	 * interrupts disabled by get_futex_key.

It would be _really_ nice to be able to see return values from the various gup calls as well. Is that feasible?

Only if you rewrite the functions to have a single return code path that we can add a tracepoint to. Or have a wrapper function that gets

Yes. My preliminary test could only cover the success case; gup can return an errno from a few different code paths.

called directly that calls these functions internally and the tracepoint can trap the return value.

This will incur more changes in other subsystems (futex, kvm, etc), so I'm not sure it is worth making such changes to get the return value.

I can probably make the function_graph tracer give return values, although it will give a return value for void functions as well. And it may give long long returns for int returns that may have bogus data in the higher bits.

If the return value requirement is not limited to gup, that approach sounds more reasonable.

Thanks,
Yang

-- Steve
Re: [PATCH V2 1/7] trace/events: Add gup trace events
On 12/2/2015 8:07 PM, Steven Rostedt wrote:
On Wed, 2 Dec 2015 14:53:27 -0800 Yang Shi wrote:

The page-faults event records invocations of handle_mm_fault, but the invocation may come from do_page_fault or gup. In some use cases a finer event count may be needed, so add trace events support for:

__get_user_pages
__get_user_pages_fast
fixup_user_fault

Signed-off-by: Yang Shi
---
 include/trace/events/gup.h | 71 ++
 1 file changed, 71 insertions(+)
 create mode 100644 include/trace/events/gup.h

diff --git a/include/trace/events/gup.h b/include/trace/events/gup.h
new file mode 100644
index 000..03a4674
--- /dev/null
+++ b/include/trace/events/gup.h
@@ -0,0 +1,71 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM gup
+
+#if !defined(_TRACE_GUP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_GUP_H
+
+#include
+#include
+
+TRACE_EVENT(gup_fixup_user_fault,
+
+	TP_PROTO(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long address, unsigned int fault_flags),
+
+	TP_ARGS(tsk, mm, address, fault_flags),

Args added and not used by TP_fast_assign() will slow down the code while tracing is enabled, as they need to be added to the trace function call.

+
+	TP_STRUCT__entry(
+		__field(unsigned long, address)
+	),
+
+	TP_fast_assign(
+		__entry->address = address;
+	),
+
+	TP_printk("address=%lx", __entry->address)
+);
+
+TRACE_EVENT(gup_get_user_pages,
+
+	TP_PROTO(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, unsigned long nr_pages),
+
+	TP_ARGS(tsk, mm, start, nr_pages),

Here too, but this is worse. See below.

+
+	TP_STRUCT__entry(
+		__field(unsigned long, start)
+		__field(unsigned long, nr_pages)
+	),
+
+	TP_fast_assign(
+		__entry->start = start;
+		__entry->nr_pages = nr_pages;
+	),
+
+	TP_printk("start=%lx nr_pages=%lu", __entry->start, __entry->nr_pages)
+);
+
+TRACE_EVENT(gup_get_user_pages_fast,
+
+	TP_PROTO(unsigned long start, int nr_pages, int write,
+		struct page **pages),
+
+	TP_ARGS(start, nr_pages, write, pages),

This and the above "gup_get_user_pages" have the same entry fields, assign and printk. They should be combined into a DECLARE_EVENT_CLASS() and two DEFINE_EVENT()s. That will save on size, as the DECLARE_EVENT_CLASS() is the biggest part of each TRACE_EVENT().

Thanks for the suggestion, will fix them in V3.

Regards,
Yang

-- Steve

+
+	TP_STRUCT__entry(
+		__field(unsigned long, start)
+		__field(unsigned long, nr_pages)
+	),
+
+	TP_fast_assign(
+		__entry->start = start;
+		__entry->nr_pages = nr_pages;
+	),
+
+	TP_printk("start=%lx nr_pages=%lu", __entry->start, __entry->nr_pages)
+);
+
+#endif /* _TRACE_GUP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
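[Editor's note: a sketch of the combination Steve suggests. The class name gup_range_class is hypothetical; the point is that the two events share one TP_STRUCT__entry/TP_fast_assign/TP_printk body:]

	DECLARE_EVENT_CLASS(gup_range_class,

		TP_PROTO(unsigned long start, unsigned long nr_pages),

		TP_ARGS(start, nr_pages),

		TP_STRUCT__entry(
			__field(unsigned long, start)
			__field(unsigned long, nr_pages)
		),

		TP_fast_assign(
			__entry->start = start;
			__entry->nr_pages = nr_pages;
		),

		TP_printk("start=%lx nr_pages=%lu", __entry->start, __entry->nr_pages)
	);

	/* each DEFINE_EVENT reuses the class body, so only one copy is emitted */
	DEFINE_EVENT(gup_range_class, gup_get_user_pages,
		TP_PROTO(unsigned long start, unsigned long nr_pages),
		TP_ARGS(start, nr_pages));

	DEFINE_EVENT(gup_range_class, gup_get_user_pages_fast,
		TP_PROTO(unsigned long start, unsigned long nr_pages),
		TP_ARGS(start, nr_pages));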
Re: [PATCH V2 2/7] mm/gup: add gup trace points
On 12/3/2015 11:06 AM, Steven Rostedt wrote:
On Thu, 03 Dec 2015 10:36:18 -0800 "Shi, Yang" wrote:

called directly that calls these functions internally and the tracepoint can trap the return value.

This will incur more changes in other subsystems (futex, kvm, etc), so I'm not sure it is worth making such changes to get the return value.

No, it wouldn't require any changes outside of this.

-long __get_user_pages(..)
+static long __get_user_pages_internal(..)
 {
 	[..]
 }
+
+long __get_user_pages(..)
+{
+	long ret;
+	ret = __get_user_pages_internal(..);
+	trace_get_user_pages(.., ret)
+}

Thanks for this. I just checked the fast version; it looks like it has just a single return path, so this should only be needed for the slow version.

I can probably make the function_graph tracer give return values, although it will give a return value for void functions as well. And it may give long long returns for int returns that may have bogus data in the higher bits.

If the return value requirement is not limited to gup, that approach sounds more reasonable.

Others have asked about it. Maybe I should do it.

If you are going to add return values in common trace code, I won't do the gup-specific one in V3.

Thanks,
Yang

-- Steve
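[Editor's note: fleshing out Steve's wrapper idea with the full signature __get_user_pages had at the time, taken from the TP_PROTO earlier in the thread. A sketch of the shape only, not the merged code:]

	static long __get_user_pages_internal(struct task_struct *tsk,
			struct mm_struct *mm, unsigned long start,
			unsigned long nr_pages, unsigned int gup_flags,
			struct page **pages, struct vm_area_struct **vmas,
			int *nonblocking)
	{
		/* former __get_user_pages() body, with its many return paths */
	}

	long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
			unsigned long start, unsigned long nr_pages,
			unsigned int gup_flags, struct page **pages,
			struct vm_area_struct **vmas, int *nonblocking)
	{
		long ret;

		/* single exit, so one tracepoint sees every return value */
		ret = __get_user_pages_internal(tsk, mm, start, nr_pages,
						gup_flags, pages, vmas,
						nonblocking);
		trace_gup_get_user_pages(start, nr_pages, ret);
		return ret;
	}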
Re: [V2 PATCH] sparc64/gup: check address scope legitimacy
On 12/3/2015 12:38 PM, Sam Ravnborg wrote:

Hi Yang.

On Wed, Nov 25, 2015 at 02:45:43PM -0800, Yang Shi wrote:

Check if the user address is accessible in the atomic version __get_user_pages_fast() before walking the page table. And check if end > start in get_user_pages_fast(); otherwise fall back to the slow path.

Putting two different but related things in one patch is often a bad thing. It would have been better to split it up.

Signed-off-by: Yang Shi
---
Just found the slow_irqon label is not defined; added it to avoid a compile error.

 arch/sparc/mm/gup.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 2e5c4fc..cf4fb47 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -173,6 +173,9 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	addr = start;
 	len = (unsigned long) nr_pages << PAGE_SHIFT;
 	end = start + len;
+	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+					(void __user *)start, len)))
+		return 0;

This change is not justified. Why would we take the time to first do the access_ok() stuff? If this had been an expensive operation, then we would have made this function slower in the normal case (assuming there were no access violations in the normal case). When I look at the implementation of access_ok() I get the impression that this is not really a check we need. access_ok() always returns 1.

Thanks for pointing it out. And, I didn't notice that gup is just built for SPARC64. I thought it was built for both 64-bit and 32-bit. A follow-up question: is there any reason to have a sparc-specific fast gup only for 64-bit and not for 32-bit?

 	local_irq_save(flags);
 	pgdp = pgd_offset(mm, addr);
@@ -203,6 +206,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	addr = start;
 	len = (unsigned long) nr_pages << PAGE_SHIFT;
 	end = start + len;
+	if (end < start)
+		goto slow_irqon;

end can only be smaller than start if there is some overflow. See how end is calculated just the line above. This looks like a highly suspicious change.

I suppose this is used to protect against the overflow. I copied the code from other arches. Actually, every arch has this except sparc.

Thanks,
Yang

	Sam
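[Editor's note: the overflow Sam and Yang are discussing is easy to see with concrete numbers. A sketch of the guard, with a worked example in the comment; it assumes the slow_irqon fallback label exists, as fixed in v2:]

	len = (unsigned long) nr_pages << PAGE_SHIFT;
	end = start + len;
	/*
	 * If the range wraps past ULONG_MAX, end comes out below start:
	 * e.g. start = 0xffffffffffff0000 with len = 0x20000 gives
	 * end = 0x10000 < start. Punt to the slow path rather than
	 * walking page tables over a bogus range.
	 */
	if (end < start)
		goto slow_irqon;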
Re: [PATCH] arm64: add HAVE_LATENCYTOP_SUPPORT config
Hi folks,

I tried to enable latencytop for arm64 and came across this discussion; is there any plan for when this will get merged into mainline? The 4.5 merge window?

Thanks,
Yang

On 11/10/2015 3:34 AM, Heiko Carstens wrote:

From: Will Deacon
Date: Tue, 10 Nov 2015 11:10:04 +
Subject: [PATCH] Kconfig: remove HAVE_LATENCYTOP_SUPPORT

As illustrated by a3afe70b83fd ("[S390] latencytop s390 support."), HAVE_LATENCYTOP_SUPPORT is defined by an architecture to advertise an implementation of save_stack_trace_tsk. However, as of 9212ddb5eada ("stacktrace: provide save_stack_trace_tsk() weak alias") a dummy implementation is provided if STACKTRACE=y. Given that LATENCYTOP already depends on STACKTRACE_SUPPORT and selects STACKTRACE, we can remove HAVE_LATENCYTOP_SUPPORT altogether.

Signed-off-by: Will Deacon
---
 arch/arc/Kconfig | 3 ---
 arch/arm/Kconfig | 5 -
 arch/metag/Kconfig | 3 ---
 arch/microblaze/Kconfig | 3 ---
 arch/parisc/Kconfig | 3 ---
 arch/powerpc/Kconfig | 3 ---
 arch/s390/Kconfig | 3 ---
 arch/sh/Kconfig | 3 ---
 arch/sparc/Kconfig | 4 
 arch/unicore32/Kconfig | 3 ---
 arch/x86/Kconfig | 3 ---
 lib/Kconfig.debug | 1 -
 12 files changed, 37 deletions(-)

Acked-by: Heiko Carstens
Re: [PATCH] arm64: restore bogomips information in /proc/cpuinfo
On 11/25/2015 7:16 AM, Nicolas Pitre wrote:
On Wed, 25 Nov 2015, Jon Masters wrote:
On 11/18/15, 1:15 PM, Yang Shi wrote:

As Pavel Machek reported [1], some userspace applications depend on the bogomips shown in /proc/cpuinfo. Although there is much less legacy impact on aarch64 than on arm, it does break libvirt. Basically, this patch reverts commit 326b16db9f69fd0d279be873c6c00f88c0a4aad5 ("arm64: delay: don't bother reporting bogomips in /proc/cpuinfo"), but with some tweaks due to context changes.

On a total tangent, it would be ideal to (eventually) have something reported in /proc/cpuinfo or dmesg during boot that does "accurately" map back to the underlying core frequency (as opposed to the generic timer frequency). I have seen almost countless silly situations in the industry (external to my own organization) in which someone has taken a $VENDOR_X reference system that they're not supposed to run benchmarks on, and they've done it anyway. But usually on some silicon that's clocked multiples under what production would be. Then silly rumors about performance get around because nobody can do simple arithmetic and notice that they ought to have at least divided by some factor.

Be my guest my friend. According to the common wisdom, the bogomips reporting is completely senseless at this point and no one should expect anything useful from it. Therefore I attempted to rehabilitate some meaning into it given that we just can't get rid of it either and it continues to cause damage. You certainly saw where that has led me.

Or we could create a new field, e.g. "cpu MHz" like x86? Then we keep both in cpuinfo so that userspace can adopt it gradually.

Thanks,
Yang

	Nicolas
Re: [PATCH] sparc64/gup: check address scope legitimacy
On 11/25/2015 4:26 PM, kbuild test robot wrote:

Hi Yang,

[auto build test ERROR on v4.4-rc2]
[also build test ERROR on next-20151124]

url: https://github.com/0day-ci/linux/commits/Yang-Shi/sparc64-gup-check-address-scope-legitimacy/20151126-065342
config: sparc64-allnoconfig (attached as .config)
reproduce:
  wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
  chmod +x ~/bin/make.cross
  # save the attached .config to linux build tree
  make.cross ARCH=sparc64

All errors (new ones prefixed by >>):

>> arch/sparc/mm/gup.c: In function 'get_user_pages_fast':
>> arch/sparc/mm/gup.c:210:3: error: label 'slow_irqon' used but not defined
     goto slow_irqon;
     ^

Already fixed in v2.

Thanks,
Yang

vim +/slow_irqon +210 arch/sparc/mm/gup.c

   204
   205		start &= PAGE_MASK;
   206		addr = start;
   207		len = (unsigned long) nr_pages << PAGE_SHIFT;
   208		end = start + len;
   209		if (end < start)
 > 210			goto slow_irqon;
   211
   212		/*
   213		 * XXX: batch / limit 'nr', to avoid large irq off latency

---
0-DAY kernel test infrastructure		Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all	Intel Corporation
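[Editor's note: for context, the missing label in v1 is the classic fast-gup fallback shape. A sketch modeled on the x86 gup.c of that era; the sparc details are illustrative, not the actual v2 diff:]

	slow:
		local_irq_enable();
	slow_irqon:
		/* fall back: pin the remaining pages with the sleeping gup */
		start += nr << PAGE_SHIFT;
		pages += nr;
		ret = get_user_pages_unlocked(current, mm, start,
					      (end - start) >> PAGE_SHIFT,
					      write, 0, pages);
		/* combine pages pinned fast with pages pinned slow */
		if (nr > 0)
			ret = (ret < 0) ? nr : ret + nr;
		return ret;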
Re: [PATCH v3 2/7] mm/gup: add gup trace points
On 12/8/2015 12:25 PM, Steven Rostedt wrote:
On Tue, 8 Dec 2015 11:39:50 -0800 Yang Shi wrote:

For the slow version, just add a trace point for the raw __get_user_pages, since all slow variants call it to do the real work in the end.

Signed-off-by: Yang Shi
---
 mm/gup.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index deafa2c..44f05c9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -18,6 +18,9 @@
 #include "internal.h"
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/gup.h>
+
 static struct page *no_page_table(struct vm_area_struct *vma,
 		unsigned int flags)
 {
@@ -462,6 +465,8 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	if (!nr_pages)
 		return 0;
+
+	trace_gup_get_user_pages(start, nr_pages);
+
 	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
 	/*
@@ -599,6 +604,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 	if (!(vm_flags & vma->vm_flags))
 		return -EFAULT;
+	trace_gup_fixup_user_fault(address);
 	ret = handle_mm_fault(mm, vma, address, fault_flags);
 	if (ret & VM_FAULT_ERROR) {
 		if (ret & VM_FAULT_OOM)
@@ -1340,6 +1346,8 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 					start, len)))
 		return 0;
+
+	trace_gup_get_user_pages_fast(start, (unsigned long) nr_pages);

The typecast shouldn't be needed. But I'm wondering: it would save space in the ring buffer if we used unsigned int instead of long. Will nr_pages ever be bigger than 4 billion?

The "unsigned long" comes from the get_user_pages() definition; I'm not quite sure why "unsigned long" is used. The fast version uses int (I guess unsigned int sounds better since it will not go negative). "unsigned int" could cover 0xffffffff pages (almost 16TB), which sounds good enough for most use cases to me. In my tests, just 1 page is passed in nr_pages in most cases.

Thanks,
Yang

-- Steve

+
 	/*
 	 * Disable interrupts. We use the nested form as we can already have
 	 * interrupts disabled by get_futex_key.
Re: [PATCH v4 3/7] x86: mm/gup: add gup trace points
On 12/9/2015 1:07 PM, Steven Rostedt wrote:
On Wed, 9 Dec 2015 09:29:20 -0800 Yang Shi wrote:

Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: x...@kernel.org
Signed-off-by: Yang Shi
---
 arch/x86/mm/gup.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index ae9a37b..a96bcb7 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -12,6 +12,9 @@
 #include
+#define CREATE_TRACE_POINTS
+#include <trace/events/gup.h>

First off, does the above even compile? Second, you already created the tracepoints in mm/gup.c, why are you creating them here again? CREATE_TRACE_POINTS must be defined only once per events/*.h file.

Sorry for that. The typo was introduced by a git amend without running a test build. The multiple definition of trace points was not caught by my shaky test script. Will fix them soon.

Thanks,
Yang

-- Steve

+
 static inline pte_t gup_get_pte(pte_t *ptep)
 {
 #ifndef CONFIG_X86_PAE
@@ -270,6 +273,8 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 					(void __user *)start, len)))
 		return 0;
+
+	trace_gup_get_user_pages_fast(start, nr_pages);
+
 	/*
 	 * XXX: batch / limit 'nr', to avoid large irq off latency
 	 * needs some instrumenting to determine the common sizes used by
@@ -373,6 +378,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	} while (pgdp++, addr = next, addr != end);
 	local_irq_enable();
+
+	trace_gup_get_user_pages_fast(start, nr_pages);
+
 	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
 	return nr;
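[Editor's note: the rule Steve cites is worth spelling out. A sketch of the intended split; the header path is the one this series creates:]

	/* mm/gup.c -- exactly one translation unit instantiates the events */
	#define CREATE_TRACE_POINTS
	#include <trace/events/gup.h>

	/* arch/x86/mm/gup.c -- every other user only declares them */
	#include <trace/events/gup.h>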
Re: [PATCH] ppc64: select HAVE_CONTEXT_TRACKING by default
Hi folks,

Any comment on this is appreciated.

Thanks,
Yang

On 12/9/2015 2:43 PM, Yang Shi wrote:

The functionality of context tracking has been implemented by PPC64, and HAVE_CONTEXT_TRACKING was selected by pseries by default. Actually, it is applicable to all PPC64 platforms, so select it in the generic PPC64 Kconfig. NO_HZ_FULL depends on it; with this change NO_HZ_FULL can be enabled for all PPC64 machines.

Signed-off-by: Yang Shi
---
Following the instructions in Documentation/timers/NO_HZ.txt, I tested full nohz on my FSL T2080 target; the below trace log shows it works well.

user_loop-574 [001] d..1 137.044892: tick_stop: success=yes msg=
user_loop-574 [001] d.h1 138.044880: hrtimer_expire_entry: hrtimer=c0007fd22d10 function=.tick_sched_timer now=137796002092
user_loop-574 [001] d.h1 139.044880: hrtimer_expire_entry: hrtimer=c0007fd22d10 function=.tick_sched_timer now=138796002129
user_loop-574 [001] d.h1 140.044880: hrtimer_expire_entry: hrtimer=c0007fd22d10 function=.tick_sched_timer now=139796002219
user_loop-574 [001] d.h1 141.044880: hrtimer_expire_entry: hrtimer=c0007fd22d10 function=.tick_sched_timer now=140796002229
user_loop-574 [001] d.h1 142.044879: hrtimer_expire_entry: hrtimer=c0007fd22d10 function=.tick_sched_timer now=141796002159
user_loop-574 [001] d.h1 143.044879: hrtimer_expire_entry: hrtimer=c0007fd22d10 function=.tick_sched_timer now=142796002142
user_loop-574 [001] d.h1 144.044878: hrtimer_expire_entry: hrtimer=c0007fd22d10 function=.tick_sched_timer now=143796002046
user_loop-574 [001] d.h1 145.044878: hrtimer_expire_entry: hrtimer=c0007fd22d10 function=.tick_sched_timer now=144796002189
user_loop-574 [001] d.h1 146.044878: hrtimer_expire_entry: hrtimer=c0007fd22d10 function=.tick_sched_timer now=145796002199
user_loop-574 [001] d.h1 147.044880: hrtimer_expire_entry: hrtimer=c0007fd22d10 function=.tick_sched_timer now=146796003836

It shows the timer tick just comes in every second. It should be low risk to cause problems on other PPC64 targets, although they are not tested.

 arch/powerpc/platforms/Kconfig.cputype | 1 +
 arch/powerpc/platforms/pseries/Kconfig | 1 -
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 142dff5..7b25dd1 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -3,6 +3,7 @@ config PPC64
 	default n
 	select HAVE_VIRT_CPU_ACCOUNTING
 	select ZLIB_DEFLATE
+	select HAVE_CONTEXT_TRACKING
 	help
 	  This option selects whether a 32-bit or a 64-bit kernel will be built.

diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
index bec90fb..6853148 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -17,7 +17,6 @@ config PPC_PSERIES
 	select PPC_UDBG_16550
 	select PPC_NATIVE
 	select PPC_DOORBELL
-	select HAVE_CONTEXT_TRACKING
 	select HOTPLUG_CPU if SMP
 	select ARCH_RANDOM
 	select PPC_DOORBELL
Re: [PATCH] arm64: ubsan: select ARCH_HAS_UBSAN_SANITIZE_ALL
On 2/8/2016 3:46 AM, Mark Rutland wrote: Hi, On Fri, Feb 05, 2016 at 03:50:18PM -0800, Yang Shi wrote: To enable UBSAN on arm64, ARCH_HAS_UBSAN_SANITIZE_ALL need to be selected. Basic kernel bootup test is passed on arm64 with CONFIG_UBSAN_SANITIZE_ALL enabled. Signed-off-by: Yang Shi --- arch/arm64/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 8cc6228..1c29e20 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -14,6 +14,7 @@ config ARM64 select ARCH_WANT_OPTIONAL_GPIOLIB select ARCH_WANT_COMPAT_IPC_PARSE_VERSION select ARCH_WANT_FRAME_POINTERS + select ARCH_HAS_UBSAN_SANITIZE_ALL select ARM_AMBA select ARM_ARCH_TIMER select ARM_GIC I gave this a go, and I got a couple of splats (included below) when booting an Ubuntu 14.04 arm64 rootfs. I'm using Linato 15.08 GCC 5.1 to compile an arm64 defconfig, and I see the issue with v4.5-rc2 and v4.5-rc3. I will dig into that and file a report shortl, unless someone has already reported the same issue. Did you see any failures in your testing? For reference, which kernel version, compiler, and config were you using? Yes, I did. I'm using 4.5-rc1 with gcc 5.2. And, I got one more splat and was digging into it. I saw your report to ext4 maintainers. I tried to have a quick fix, but it sounds not work well. And, that code does look suspicious. Let's see what the ext4 maintainers say. Thanks, Yang This patch itself looks good, so FWIW: Tested-by: Mark Rutland Thanks, Mark. [3.804750] [3.813176] UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2612:15 [3.819431] shift exponent 4294967295 is too large for 32-bit type 'int' [3.826121] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.5.0-rc2+ #48 [3.832463] Hardware name: AMD Overdrive/Supercharger/Default string, BIOS ROD0085E 11/23/2015 [3.841060] Call trace: [3.843499] [] dump_backtrace+0x0/0x298 [3.848887] [] show_stack+0x14/0x20 [3.853929] [] dump_stack+0xe0/0x178 [3.859056] [] ubsan_epilogue+0x14/0x50 [3.86] [] __ubsan_handle_shift_out_of_bounds+0xe0/0x138 [3.871655] [] ext4_mb_init+0x84c/0x920 [3.877043] [] ext4_fill_super+0x2eac/0x4958 [3.882866] [] mount_bdev+0x180/0x1e8 [3.888079] [] ext4_mount+0x14/0x20 [3.893118] [] mount_fs+0x44/0x1c8 [3.898073] [] vfs_kern_mount+0x50/0x1a8 [3.903547] [] do_mount+0x240/0x1478 [3.908673] [] SyS_mount+0x90/0xf8 [3.913627] [] mount_block_root+0x22c/0x3c4 [3.919361] [] mount_root+0x120/0x138 [3.924574] [] prepare_namespace+0x13c/0x184 [3.930396] [] kernel_init_freeable+0x390/0x3b4 [3.936479] [] kernel_init+0x10/0xe0 [3.941606] [] ret_from_fork+0x10/0x40 [3.946905] [5.566166] [5.574596] UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1274:11 [5.580851] shift exponent -1 is negative [5.584851] CPU: 4 PID: 1028 Comm: mount Not tainted 4.5.0-rc2+ #48 [5.591105] Hardware name: AMD Overdrive/Supercharger/Default string, BIOS ROD0085E 11/23/2015 [5.599702] Call trace: [5.602142] [] dump_backtrace+0x0/0x298 [5.607530] [] show_stack+0x14/0x20 [5.612572] [] dump_stack+0xe0/0x178 [5.617700] [] ubsan_epilogue+0x14/0x50 [5.623088] [] __ubsan_handle_shift_out_of_bounds+0xe0/0x138 [5.630300] [] mb_find_order_for_block+0x154/0x1b0 [5.636641] [] mb_find_extent+0xcc/0x548 [5.642116] [] ext4_mb_complex_scan_group+0xe8/0x4e8 [5.648632] [] ext4_mb_regular_allocator+0x2d4/0x648 [5.655148] [] ext4_mb_new_blocks+0x344/0x7e0 [5.661056] [] ext4_ext_map_blocks+0x684/0xf68 [5.667052] [] ext4_map_blocks+0x12c/0x500 [5.672699] [] ext4_writepages+0x47c/0xe38 [5.678348] [] do_writepages+0x48/0xc8 [5.683649] [] 
__filemap_fdatawrite_range+0x70/0xe8 [5.690078] [] filemap_flush+0x18/0x20 [5.695378] [] ext4_alloc_da_blocks+0x3c/0x78 [5.701285] [] ext4_rename+0x690/0xe38 [5.706585] [] ext4_rename2+0x1c/0x40 [5.711800] [] vfs_rename+0x2c0/0xa90 [5.717013] [] SyS_renameat2+0x464/0x5c0 [5.722486] [] SyS_renameat+0x10/0x18 [5.727700] [] el0_svc_naked+0x24/0x28 [5.732998]
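[Editor's note: both splats boil down to shifting a 32-bit int by a count outside [0, 31], which is undefined behaviour in C. A generic sketch of the guard UBSAN pushes you toward; compute_order() is hypothetical, not the actual ext4 code:]

	int order = compute_order(sb);	/* may come back as -1 or huge */
	unsigned int max;

	/* Before: undefined if order is negative or >= 32 */
	max = 1 << order;

	/* After: reject bad exponents, then shift an unsigned value */
	if (order < 0 || order >= (int)(8 * sizeof(unsigned int)))
		return -EINVAL;
	max = 1U << order;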
Re: [PATCH] arm64: disable kasan when accessing frame->fp in unwind_frame
On 2/8/2016 12:51 AM, Andrey Ryabinin wrote:
On 02/06/2016 02:04 AM, Yang Shi wrote:

 #include
 #include
@@ -64,7 +65,9 @@ int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
 		return -EINVAL;
 	frame->sp = fp + 0x10;
+	kasan_disable_current();
 	frame->fp = *(unsigned long *)(fp);

It would be better to use READ_ONCE_NOCHECK() here. See f7d27c35ddff7 ("x86/mm, kasan: Silence KASAN warnings in get_wchan()"), which solves the same problem for x86.

Thanks for the suggestion, I'm going to try it soon.

+	kasan_enable_current();
 	frame->pc = *(unsigned long *)(fp + 8);

Why did you leave frame->pc out of the scope? This line could trigger kasan as well.

Actually, it was not reported as frequently as the first one. In my first couple of boot tests it was not triggered before I came up with the patch; the first one is triggered every time. It will be fixed in v2 too.

Thanks,
Yang

 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
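[Editor's note: with Andrey's suggestion applied to both loads, the fix being discussed would look roughly like this. A sketch of the direction, not the posted v2:]

	frame->sp = fp + 0x10;
	/*
	 * KASAN-unchecked loads: fp may point into another task's stack,
	 * which is legitimate here, so bypass the instrumentation rather
	 * than toggling it around the accesses.
	 */
	frame->fp = READ_ONCE_NOCHECK(*(unsigned long *)(fp));
	frame->pc = READ_ONCE_NOCHECK(*(unsigned long *)(fp + 8));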
Re: [RFC V5] Add gup trace points support
Hi Andrew,

This series already got acked by Steven and the arch maintainers except for x86. How should I proceed? Any comment is appreciated.

Thanks,
Yang

On 1/14/2016 6:40 AM, Steven Rostedt wrote:

Andrew, do you want to pull in this series? You can add my Acked-by to the whole set.

-- Steve

On Wed, 13 Jan 2016 10:14:24 -0800 "Shi, Yang" wrote:
On 1/12/2016 12:10 PM, Steven Rostedt wrote:
On Tue, 12 Jan 2016 12:00:54 -0800 "Shi, Yang" wrote:

Hi Steven, any more comments on this series? How should I proceed?

The tracing part looks fine to me. Now you just need to get the arch maintainers to ack each of the arch patches, and I can pull them in for 4.6. Too late for 4.5. Probably need Andrew Morton's ack for the mm/gup.c patch.

Thanks Steven. Already sent email to the x86, s390 and sparc maintainers. Ralf acked the MIPS part back in v1.

Regards,
Yang

-- Steve
Re: [PATCH v2] arm64: disable kasan when accessing frame->fp in unwind_frame
On 2/9/2016 8:54 AM, Will Deacon wrote: On Mon, Feb 08, 2016 at 09:13:09AM -0800, Yang Shi wrote: When boot arm64 kernel with KASAN enabled, the below error is reported by kasan: BUG: KASAN: out-of-bounds in unwind_frame+0xec/0x260 at addr ffc064d57ba0 Read of size 8 by task pidof/499 page:ffbdc39355c0 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0x0() page dumped because: kasan: bad access detected CPU: 2 PID: 499 Comm: pidof Not tainted 4.5.0-rc1 #119 Hardware name: Freescale Layerscape 2085a RDB Board (DT) Call trace: [] dump_backtrace+0x0/0x290 [] show_stack+0x24/0x30 [] dump_stack+0x8c/0xd8 [] kasan_report_error+0x558/0x588 [] kasan_report+0x60/0x70 [] __asan_load8+0x60/0x78 [] unwind_frame+0xec/0x260 [] get_wchan+0x110/0x160 [] do_task_stat+0xb44/0xb68 [] proc_tgid_stat+0x40/0x50 [] proc_single_show+0x88/0xd8 [] seq_read+0x370/0x770 [] __vfs_read+0xc8/0x1d8 [] vfs_read+0x94/0x168 [] SyS_read+0xb8/0x128 [] el0_svc_naked+0x24/0x28 Memory state around the buggy address: ffc064d57a80: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 f4 f4 ffc064d57b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffc064d57b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ ffc064d57c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffc064d57c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Since the shadow byte pointed by the report is 0, so it may mean it is just hit oob in non-current task. So, disable the instrumentation to silence these warnings. Curious, but how did you trigger this? I'm just trying to confirm that mainline is affected, but my machine boots happily with KASAN and STACKTRACE selected and I can cat /proc/self/{stack,stat} quite happily. What am I missing? I'm using mainline 4.5-rc1 kernel with gcc 5.2. And, my rootfs is NFS mounted. Not sure if other kernel configs, i.e tracing stuff will have impact on the trigger since I'm not using the defconfig. Thanks, Yang Will
Re: [PATCH v2] arm64: disable kasan when accessing frame->fp in unwind_frame
On 2/9/2016 9:23 AM, Will Deacon wrote: On Tue, Feb 09, 2016 at 09:17:12AM -0800, Shi, Yang wrote: On 2/9/2016 8:54 AM, Will Deacon wrote: On Mon, Feb 08, 2016 at 09:13:09AM -0800, Yang Shi wrote: When boot arm64 kernel with KASAN enabled, the below error is reported by kasan: BUG: KASAN: out-of-bounds in unwind_frame+0xec/0x260 at addr ffc064d57ba0 Read of size 8 by task pidof/499 page:ffbdc39355c0 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0x0() page dumped because: kasan: bad access detected CPU: 2 PID: 499 Comm: pidof Not tainted 4.5.0-rc1 #119 Hardware name: Freescale Layerscape 2085a RDB Board (DT) Call trace: [] dump_backtrace+0x0/0x290 [] show_stack+0x24/0x30 [] dump_stack+0x8c/0xd8 [] kasan_report_error+0x558/0x588 [] kasan_report+0x60/0x70 [] __asan_load8+0x60/0x78 [] unwind_frame+0xec/0x260 [] get_wchan+0x110/0x160 [] do_task_stat+0xb44/0xb68 [] proc_tgid_stat+0x40/0x50 [] proc_single_show+0x88/0xd8 [] seq_read+0x370/0x770 [] __vfs_read+0xc8/0x1d8 [] vfs_read+0x94/0x168 [] SyS_read+0xb8/0x128 [] el0_svc_naked+0x24/0x28 Memory state around the buggy address: ffc064d57a80: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 f4 f4 ffc064d57b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffc064d57b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ ffc064d57c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffc064d57c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Since the shadow byte pointed by the report is 0, so it may mean it is just hit oob in non-current task. So, disable the instrumentation to silence these warnings. Curious, but how did you trigger this? I'm just trying to confirm that mainline is affected, but my machine boots happily with KASAN and STACKTRACE selected and I can cat /proc/self/{stack,stat} quite happily. What am I missing? I'm using mainline 4.5-rc1 kernel with gcc 5.2. And, my rootfs is NFS mounted. Not sure if other kernel configs, i.e tracing stuff will have impact on the trigger since I'm not using the defconfig. If you could put your .config somewhere, that would be helpful, please. Attached it. BTW, I run the test on LS2085a RDB board which has 8 A57 cores. Not sure if it could be reproduced on other boards easily. Yang Will # # Automatically generated file; DO NOT EDIT. 
# Linux/arm64 4.5.0-rc1 Kernel Configuration # CONFIG_ARM64=y CONFIG_64BIT=y CONFIG_ARCH_PHYS_ADDR_T_64BIT=y CONFIG_MMU=y CONFIG_ARCH_MMAP_RND_BITS_MIN=18 CONFIG_ARCH_MMAP_RND_BITS_MAX=24 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=11 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16 CONFIG_STACKTRACE_SUPPORT=y CONFIG_ILLEGAL_POINTER_VALUE=0xdead CONFIG_LOCKDEP_SUPPORT=y CONFIG_TRACE_IRQFLAGS_SUPPORT=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y CONFIG_GENERIC_HWEIGHT=y CONFIG_GENERIC_CSUM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_ZONE_DMA=y CONFIG_HAVE_GENERIC_RCU_GUP=y CONFIG_ARCH_DMA_ADDR_T_64BIT=y CONFIG_NEED_DMA_MAP_STATE=y CONFIG_NEED_SG_DMA_LENGTH=y CONFIG_SMP=y CONFIG_SWIOTLB=y CONFIG_IOMMU_HELPER=y CONFIG_KERNEL_MODE_NEON=y CONFIG_FIX_EARLYCON_MEM=y CONFIG_PGTABLE_LEVELS=3 CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_CONSTRUCTORS=y CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_CROSS_COMPILE="" # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_POSIX_MQUEUE_SYSCTL=y CONFIG_CROSS_MEMORY_ATTACH=y CONFIG_FHANDLE=y CONFIG_USELIB=y # CONFIG_AUDIT is not set CONFIG_HAVE_ARCH_AUDITSYSCALL=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_IRQ_SHOW_LEVEL=y CONFIG_GENERIC_IRQ_MIGRATION=y CONFIG_HARDIRQS_SW_RESEND=y CONFIG_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_HIERARCHY=y CONFIG_GENERIC_MSI_IRQ=y CONFIG_GENERIC_MSI_IRQ_DOMAIN=y CONFIG_HANDLE_DOMAIN_IRQ=y # CONFIG_IRQ_DOMAIN_DEBUG is not set CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_ARCH_HAS_TICK_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set # CONFIG_NO_HZ_IDLE is not set CONFIG_NO_HZ_FULL=y CONFIG_NO_HZ_FULL_ALL=y # CONFIG_NO_HZ_FULL_SYSIDLE is not set # CONFIG_NO_HZ is not set CONFIG_HIGH_RES_TIMERS=y # # CPU/Task time and stats accounting # CONFIG_VIRT_CPU_ACCOUNTING=y CONFIG_VIRT_CPU_ACCOUNTING_GEN=y CONFIG_BSD_PROCESS_ACCT=y CONFIG_BSD_PROCESS_ACCT_V3=y # CONFIG_TASKSTATS is not set # # RCU Subsystem # CONFIG_PREEMPT_RCU=y CONFIG_RCU_EXPERT=y CONFIG_SRCU=y CONFIG_TASKS_RCU=y CONFIG_RCU_STALL_COMMON=y CONFIG_CONTEXT_TRACKING=y CONFIG_CONTEXT_TRACKING_FORCE=y CONFIG_RCU_FANOUT=64 CONFI
Re: [PATCH v2] arm64: disable kasan when accessing frame->fp in unwind_frame
On 2/9/2016 9:31 AM, Shi, Yang wrote: On 2/9/2016 9:23 AM, Will Deacon wrote: On Tue, Feb 09, 2016 at 09:17:12AM -0800, Shi, Yang wrote: On 2/9/2016 8:54 AM, Will Deacon wrote: On Mon, Feb 08, 2016 at 09:13:09AM -0800, Yang Shi wrote: When boot arm64 kernel with KASAN enabled, the below error is reported by kasan: BUG: KASAN: out-of-bounds in unwind_frame+0xec/0x260 at addr ffc064d57ba0 Read of size 8 by task pidof/499 page:ffbdc39355c0 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0x0() page dumped because: kasan: bad access detected CPU: 2 PID: 499 Comm: pidof Not tainted 4.5.0-rc1 #119 Hardware name: Freescale Layerscape 2085a RDB Board (DT) Call trace: [] dump_backtrace+0x0/0x290 [] show_stack+0x24/0x30 [] dump_stack+0x8c/0xd8 [] kasan_report_error+0x558/0x588 [] kasan_report+0x60/0x70 [] __asan_load8+0x60/0x78 [] unwind_frame+0xec/0x260 [] get_wchan+0x110/0x160 [] do_task_stat+0xb44/0xb68 [] proc_tgid_stat+0x40/0x50 [] proc_single_show+0x88/0xd8 [] seq_read+0x370/0x770 [] __vfs_read+0xc8/0x1d8 [] vfs_read+0x94/0x168 [] SyS_read+0xb8/0x128 [] el0_svc_naked+0x24/0x28 Memory state around the buggy address: ffc064d57a80: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 f4 f4 ffc064d57b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffc064d57b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ ffc064d57c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffc064d57c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Since the shadow byte pointed by the report is 0, so it may mean it is just hit oob in non-current task. So, disable the instrumentation to silence these warnings. Curious, but how did you trigger this? I'm just trying to confirm that mainline is affected, but my machine boots happily with KASAN and STACKTRACE selected and I can cat /proc/self/{stack,stat} quite happily. What am I missing? I'm using mainline 4.5-rc1 kernel with gcc 5.2. And, my rootfs is NFS mounted. Not sure if other kernel configs, i.e tracing stuff will have impact on the trigger since I'm not using the defconfig. If you could put your .config somewhere, that would be helpful, please. Attached it. BTW, I run the test on LS2085a RDB board which has 8 A57 cores. Not sure if it could be reproduced on other boards easily. This config has GCOV enabled, my test is run with it disabled. So, you may need disable it to replicate the trigger. Regards, Yang Yang Will
Very low responsiveness for pstore ftrace
Hi folks,

I'm a newbie to pstore. I just did a quick try of pstore ftrace, but I experienced a very serious responsiveness problem. When running "echo 1 > /sys/kernel/debug/pstore/record_ftrace", it took a couple of minutes to return (here "return" means to get the prompt back) on my arm64 machine (8 A57 cores). The time command shows:

# time echo 1 > /sys/kernel/debug/pstore/record_ftrace
hrtimer: interrupt took 17532560 ns

real	7m35.717s
user	0m0.004s
sys	2m25.244s

I had a quick look at the code, and it looks like there is a long critical section with irqs disabled. With CONFIG_PSTORE_CONSOLE enabled, the machine can't even respond to interrupts; the pstore console code also needs to acquire the same lock with irqs disabled.

This low responsiveness does not seem reasonable to me. Any idea?

Thanks,
Yang
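[Editor's note: for readers unfamiliar with the code path, the reported stall is consistent with a per-record lock taken with interrupts disabled on every traced function call. A simplified sketch of that pattern; the lock and record names are illustrative, not the exact fs/pstore code:]

	static DEFINE_SPINLOCK(pstore_buf_lock);

	static void pstore_ftrace_call(unsigned long ip, unsigned long parent_ip)
	{
		unsigned long flags;
		struct pstore_ftrace_record rec = {
			.ip		= ip,
			.parent_ip	= parent_ip,
		};

		/* taken once per traced function, with irqs off */
		spin_lock_irqsave(&pstore_buf_lock, flags);
		/* ... copy the record into the persistent backend ... */
		spin_unlock_irqrestore(&pstore_buf_lock, flags);
	}

If the console path contends on the same lock, every printk ends up serialized behind the function-trace hot path, which would explain minutes-long stalls.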
Re: [PATCH] scsi: fdomain: add missing CONFIG_ to PCMCIA
On 2/9/2016 3:25 PM, One Thousand Gnomes wrote:
On Tue, 9 Feb 2016 14:16:49 -0800 Yang Shi wrote:

There are some "#ifdef PCMCIA" in the driver code, but all of them miss the CONFIG_ prefix. Correct PCMCIA to CONFIG_PCMCIA.

Signed-off-by: Yang Shi

NAK. This breaks the driver completely. Read drivers/scsi/pcmcia/fdomain*

Got it. Thanks for spotting this. Sorry for the ill-considered change.

Yang

Alan
Re: [RFC PATCH] kernfs: create raw version kernfs_path_len and kernfs_path
On 2/26/2016 10:56 AM, Steven Rostedt wrote: On Fri, 26 Feb 2016 10:15:05 -0800 Yang Shi wrote: commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback tracepoints to report cgroup") made writeback tracepoints report cgroup writeback, but it may trigger the below bug on -rt kernel since kernfs_path and kernfs_path_len are called by tracepoints, which acquire sleeping lock. BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930 in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3 INFO: lockdep is turned off. Preemption disabled at:[] wb_writeback+0xec/0x830 CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20 Hardware name: Freescale Layerscape 2085a RDB Board (DT) Workqueue: writeback wb_workfn (flush-7:0) Call trace: [] dump_backtrace+0x0/0x200 [] show_stack+0x24/0x30 [] dump_stack+0x88/0xa8 [] ___might_sleep+0x2ec/0x300 [] rt_spin_lock+0x38/0xb8 [] kernfs_path_len+0x30/0x90 [] trace_event_raw_event_writeback_work_class+0xe8/0x2e8 [] wb_writeback+0x620/0x830 [] wb_workfn+0x61c/0x950 [] process_one_work+0x3ac/0xb30 [] worker_thread+0x9c/0x7a8 [] kthread+0x190/0x1b0 [] ret_from_fork+0x10/0x30 Since kernfs_* functions are heavily used by cgroup, so it sounds not reasonable to convert kernfs_rename_lock to raw lock. Create raw version kernfs_path, kernfs_path_len and cgroup_path, which don't acquire lock and are used by tracepoints only. And what prevents name from being freed while the tracepoint is reading it? Perhaps we need this change as well: Thanks for pointing this out. Yes, it looks we need this since the tracepoints are protected by rcu_read_lock_sched. Will add this in V2. Yang -- Steve diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 996b7742c90b..d2ef153145c0 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -1397,6 +1397,12 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent, kn->hash = kernfs_name_hash(kn->name, kn->ns); kernfs_link_sibling(kn); + /* +* Tracepoints may be reading the old name. They are protected +* by rcu_read_lock_sched(). +*/ + synchronize_sched(); + kernfs_put(old_parent); kfree_const(old_name);
Re: [RFC PATCH] kernfs: create raw version kernfs_path_len and kernfs_path
On 2/26/2016 11:37 AM, Shi, Yang wrote: On 2/26/2016 10:56 AM, Steven Rostedt wrote: On Fri, 26 Feb 2016 10:15:05 -0800 Yang Shi wrote: commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback tracepoints to report cgroup") made writeback tracepoints report cgroup writeback, but it may trigger the below bug on -rt kernel since kernfs_path and kernfs_path_len are called by tracepoints, which acquire sleeping lock. BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930 in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3 INFO: lockdep is turned off. Preemption disabled at:[] wb_writeback+0xec/0x830 CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20 Hardware name: Freescale Layerscape 2085a RDB Board (DT) Workqueue: writeback wb_workfn (flush-7:0) Call trace: [] dump_backtrace+0x0/0x200 [] show_stack+0x24/0x30 [] dump_stack+0x88/0xa8 [] ___might_sleep+0x2ec/0x300 [] rt_spin_lock+0x38/0xb8 [] kernfs_path_len+0x30/0x90 [] trace_event_raw_event_writeback_work_class+0xe8/0x2e8 [] wb_writeback+0x620/0x830 [] wb_workfn+0x61c/0x950 [] process_one_work+0x3ac/0xb30 [] worker_thread+0x9c/0x7a8 [] kthread+0x190/0x1b0 [] ret_from_fork+0x10/0x30 Since kernfs_* functions are heavily used by cgroup, so it sounds not reasonable to convert kernfs_rename_lock to raw lock. Create raw version kernfs_path, kernfs_path_len and cgroup_path, which don't acquire lock and are used by tracepoints only. And what prevents name from being freed while the tracepoint is reading it? Perhaps we need this change as well: Thanks for pointing this out. Yes, it looks we need this since the tracepoints are protected by rcu_read_lock_sched. Will add this in V2. BTW, it sounds this is not the only point where kernfs_node could be updated, __kernfs_remove should need synchronize_sched too. Thanks, Yang Yang -- Steve diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 996b7742c90b..d2ef153145c0 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -1397,6 +1397,12 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent, kn->hash = kernfs_name_hash(kn->name, kn->ns); kernfs_link_sibling(kn); +/* + * Tracepoints may be reading the old name. They are protected + * by rcu_read_lock_sched(). + */ +synchronize_sched(); + kernfs_put(old_parent); kfree_const(old_name);
Re: [RFC PATCH] kernfs: create raw version kernfs_path_len and kernfs_path
On 2/26/2016 11:59 AM, Shi, Yang wrote: On 2/26/2016 11:37 AM, Shi, Yang wrote: On 2/26/2016 10:56 AM, Steven Rostedt wrote: On Fri, 26 Feb 2016 10:15:05 -0800 Yang Shi wrote: commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback tracepoints to report cgroup") made writeback tracepoints report cgroup writeback, but it may trigger the below bug on -rt kernel since kernfs_path and kernfs_path_len are called by tracepoints, which acquire sleeping lock. BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930 in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3 INFO: lockdep is turned off. Preemption disabled at:[] wb_writeback+0xec/0x830 CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20 Hardware name: Freescale Layerscape 2085a RDB Board (DT) Workqueue: writeback wb_workfn (flush-7:0) Call trace: [] dump_backtrace+0x0/0x200 [] show_stack+0x24/0x30 [] dump_stack+0x88/0xa8 [] ___might_sleep+0x2ec/0x300 [] rt_spin_lock+0x38/0xb8 [] kernfs_path_len+0x30/0x90 [] trace_event_raw_event_writeback_work_class+0xe8/0x2e8 [] wb_writeback+0x620/0x830 [] wb_workfn+0x61c/0x950 [] process_one_work+0x3ac/0xb30 [] worker_thread+0x9c/0x7a8 [] kthread+0x190/0x1b0 [] ret_from_fork+0x10/0x30 Since kernfs_* functions are heavily used by cgroup, so it sounds not reasonable to convert kernfs_rename_lock to raw lock. Create raw version kernfs_path, kernfs_path_len and cgroup_path, which don't acquire lock and are used by tracepoints only. And what prevents name from being freed while the tracepoint is reading it? Perhaps we need this change as well: Thanks for pointing this out. Yes, it looks we need this since the tracepoints are protected by rcu_read_lock_sched. Will add this in V2. BTW, it sounds this is not the only point where kernfs_node could be updated, __kernfs_remove should need synchronize_sched too. Should be synchronize_sched moved into kernfs_put itself, just before the kfree call since it is called at some places other than kernfs_remove and kernfs_rename_ns. diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index a4b1db1..89c0c16 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -456,6 +456,12 @@ void kernfs_put(struct kernfs_node *kn) if (kernfs_type(kn) == KERNFS_LINK) kernfs_put(kn->symlink.target_kn); + /* +* Tracepoints may be reading the old name. They are protected +* by rcu_read_lock_sched(). +*/ + synchronize_sched(); + kfree_const(kn->name); if (kn->iattr) { But, kernfs_put might be called recursively, is it fine? Thanks, Yang Thanks, Yang Yang -- Steve diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 996b7742c90b..d2ef153145c0 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -1397,6 +1397,12 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent, kn->hash = kernfs_name_hash(kn->name, kn->ns); kernfs_link_sibling(kn); +/* + * Tracepoints may be reading the old name. They are protected + * by rcu_read_lock_sched(). + */ +synchronize_sched(); + kernfs_put(old_parent); kfree_const(old_name);
Re: [RFC PATCH] kernfs: create raw version kernfs_path_len and kernfs_path
On 2/26/2016 12:15 PM, Steven Rostedt wrote: On Fri, 26 Feb 2016 11:59:40 -0800 "Shi, Yang" wrote: BTW, it sounds this is not the only point where kernfs_node could be updated, __kernfs_remove should need synchronize_sched too. Question is, can the kernfs of a cgroup be removed while the cgroup is still active? Hmm, it sounds unlikely to me. Yang I don't see the kernfs_rename_lock being held anywhere in that remove. If it can be an issue there with this patch, then it's an issue today, because the kernfs_mutex is not held by the tracepoint. -- Steve
Re: [RFC V2 PATCH] kernfs: create raw version kernfs_path_len and kernfs_path
On 2/26/2016 3:01 PM, Greg KH wrote: On Fri, Feb 26, 2016 at 01:47:39PM -0800, Yang Shi wrote: commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback tracepoints to report cgroup") made writeback tracepoints report cgroup writeback, but it may trigger the below bug on -rt kernel since kernfs_path and kernfs_path_len are called by tracepoints, which acquire sleeping lock. BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930 in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3 INFO: lockdep is turned off. Preemption disabled at:[] wb_writeback+0xec/0x830 CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20 Hardware name: Freescale Layerscape 2085a RDB Board (DT) Workqueue: writeback wb_workfn (flush-7:0) Call trace: [] dump_backtrace+0x0/0x200 [] show_stack+0x24/0x30 [] dump_stack+0x88/0xa8 [] ___might_sleep+0x2ec/0x300 [] rt_spin_lock+0x38/0xb8 [] kernfs_path_len+0x30/0x90 [] trace_event_raw_event_writeback_work_class+0xe8/0x2e8 [] wb_writeback+0x620/0x830 [] wb_workfn+0x61c/0x950 [] process_one_work+0x3ac/0xb30 [] worker_thread+0x9c/0x7a8 [] kthread+0x190/0x1b0 [] ret_from_fork+0x10/0x30 Since kernfs_* functions are heavily used by cgroup, so it sounds not reasonable to convert kernfs_rename_lock to raw lock. Call synchronize_sched() when kernfs_node is updated since tracepoints are protected by rcu_read_lock_sched. Create raw version kernfs_path, kernfs_path_len and cgroup_path, which don't acquire lock and are used by tracepoints only. Signed-off-by: Yang Shi --- It should be applicable to both mainline and -rt kernel. The change survives ftrace stress test in ltp. v2: * Adopted Steven's suggestion to call synchronize_sched in kernfs_rename_ns. fs/kernfs/dir.c | 30 +++--- include/linux/cgroup.h | 10 ++ include/linux/kernfs.h | 10 ++ include/trace/events/writeback.h | 4 ++-- 4 files changed, 49 insertions(+), 5 deletions(-) diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 996b774..70a9687 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -44,7 +44,7 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen) return strlcpy(buf, kn->parent ? kn->name : "/", buflen); } -static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf, +static char * __must_check __kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen) { char *p = buf + buflen; @@ -131,12 +131,33 @@ char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen) char *p; spin_lock_irqsave(&kernfs_rename_lock, flags); - p = kernfs_path_locked(kn, buf, buflen); + p = __kernfs_path(kn, buf, buflen); spin_unlock_irqrestore(&kernfs_rename_lock, flags); return p; } EXPORT_SYMBOL_GPL(kernfs_path); +/* Raw version. 
Used by tracepoints only without acquiring lock */ +size_t _kernfs_path_len(struct kernfs_node *kn) +{ + size_t len = 0; + + do { + len += strlen(kn->name) + 1; + kn = kn->parent; + } while (kn && kn->parent); + + return len; +} + +char *_kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen) +{ + char *p; + + p = __kernfs_path(kn, buf, buflen); + return p; +} + /** * pr_cont_kernfs_name - pr_cont name of a kernfs_node * @kn: kernfs_node of interest @@ -168,7 +189,7 @@ void pr_cont_kernfs_path(struct kernfs_node *kn) spin_lock_irqsave(&kernfs_rename_lock, flags); - p = kernfs_path_locked(kn, kernfs_pr_cont_buf, + p = __kernfs_path(kn, kernfs_pr_cont_buf, sizeof(kernfs_pr_cont_buf)); if (p) pr_cont("%s", p); @@ -1397,6 +1418,9 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent, kn->hash = kernfs_name_hash(kn->name, kn->ns); kernfs_link_sibling(kn); + /* Tracepoints are protected by rcu_read_lock_sched */ + synchronize_sched(); + kernfs_put(old_parent); kfree_const(old_name); diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 2162dca..bbd88d3 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -538,6 +538,16 @@ static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf, return kernfs_path(cgrp->kn, buf, buflen); } +/* + * Without acquiring kernfs_rename_lock in _kernfs_path. + * Used by tracepoints only. + */ +static inline char * __must_check _cgroup_path(struct cgroup *cgrp, char *buf, + size_t buflen) +{ + return _kernfs_path(cgrp->kn, buf, buflen); +} + static inline void pr_cont_cgroup_name(struct cgroup *cgrp) { pr_cont_kernfs_name(cgrp->kn); diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h index af51df3..14c01cdc 1006
Re: [RFC PATCH] writeback: move list_lock down into the for loop
On 2/29/2016 7:06 AM, Michal Hocko wrote:
On Fri 26-02-16 08:46:25, Yang Shi wrote:

The list_lock was moved outside the for loop by commit e8dfc30582995ae12454cda517b17d6294175b07 ("writeback: elevate queue_io() into wb_writeback()"); however, the commit log says "No behavior change", so it sounds safe to have the list_lock acquired inside the for loop as it was before. Leave the tracepoints outside the critical area, since tracepoints already have preemption disabled.

The patch says what, but it completely misses the why part.

I was just wondering whether the finer-grained locking might achieve a little better performance, i.e. more preemption opportunities and lower latency.

Thanks,
Yang

Signed-off-by: Yang Shi
---
Tested with LTP on an 8-core Cortex-A57 machine.

 fs/fs-writeback.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1f76d89..9b7b5f6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1623,7 +1623,6 @@ static long wb_writeback(struct bdi_writeback *wb,
 	work->older_than_this = &oldest_jif;
 	blk_start_plug(&plug);
-	spin_lock(&wb->list_lock);
 	for (;;) {
 		/*
 		 * Stop writeback when nr_pages has been consumed
@@ -1661,15 +1660,19 @@ static long wb_writeback(struct bdi_writeback *wb,
 			oldest_jif = jiffies;
 		trace_writeback_start(wb, work);
+
+		spin_lock(&wb->list_lock);
 		if (list_empty(&wb->b_io))
 			queue_io(wb, work);
 		if (work->sb)
 			progress = writeback_sb_inodes(work->sb, wb, work);
 		else
 			progress = __writeback_inodes_wb(wb, work);
-		trace_writeback_written(wb, work);
 		wb_update_bandwidth(wb, wb_start);
+		spin_unlock(&wb->list_lock);
+
+		trace_writeback_written(wb, work);
 		/*
 		 * Did we write something? Try for more
@@ -1693,15 +1696,14 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 */
 		if (!list_empty(&wb->b_more_io)) {
 			trace_writeback_wait(wb, work);
+			spin_lock(&wb->list_lock);
 			inode = wb_inode(wb->b_more_io.prev);
-			spin_lock(&inode->i_lock);
 			spin_unlock(&wb->list_lock);
+			spin_lock(&inode->i_lock);
 			/* This function drops i_lock... */
 			inode_sleep_on_writeback(inode);
-			spin_lock(&wb->list_lock);
 		}
 	}
-	spin_unlock(&wb->list_lock);
 	blk_finish_plug(&plug);
 	return nr_pages - work->nr_pages;
--
2.0.2
Re: [RFC V2 PATCH] kernfs: create raw version kernfs_path_len and kernfs_path
On 2/27/2016 3:51 AM, Tejun Heo wrote:

Hello,

On Sat, Feb 27, 2016 at 6:45 AM, Thomas Gleixner wrote:

It can be, but we can print out the ino and userland can match that up with the path if necessary.

Wouldn't the cgroup id be the better choice?

AFAIK we aren't exposing the cgroup id to userland anywhere right now. Eventually, I think the right thing to do is using the same number for both.

Thanks for all the comments.

The current tracepoints also print the path length via __trace_wb_cgroup_size and __trace_wbc_cgroup_size, but we can't get the path length if we switch to the cgroup ino. So I suppose I have to drop all the *_size stuff.

And, when CONFIG_CGROUP_WRITEBACK is not enabled, __trace_wb_assign_cgroup and __trace_wbc_assign_cgroup return "strcpy(buf, "/")", so to stay aligned with this I need to print out the ino of "/", right? But the ROOT_INO may vary between filesystems.

All of this will be addressed in V4, but it may experience some delay since I have to travel this week.

Regards,
Yang

Thanks.
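[Editor's note: a sketch of the direction Tejun suggests — reporting the cgroup inode number instead of walking the path — as it might look in the writeback tracepoint helpers. Hypothetical; V4 had not been posted at this point in the thread:]

	#ifdef CONFIG_CGROUP_WRITEBACK
	static inline unsigned int __trace_wb_assign_cgroup(struct bdi_writeback *wb)
	{
		return wb->memcg_css->cgroup->kn->ino;
	}
	#else
	static inline unsigned int __trace_wb_assign_cgroup(struct bdi_writeback *wb)
	{
		return 1;	/* stands in for the root, replacing strcpy(buf, "/") */
	}
	#endif

	/* in each event: a fixed-size field, no path walk, no locks */
	__field(unsigned int, cgroup_ino)

	__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);

Userland can then match the ino back to a path on its own, which removes both the variable-length __dynamic_array and the kernfs_rename_lock acquisition from the tracepoint's hot path.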
Re: [PATCH] writeback: call writeback tracepoints without holding list_lock in wb_writeback()
On 2/24/2016 6:40 PM, Steven Rostedt wrote: On Wed, 24 Feb 2016 14:47:23 -0800 Yang Shi wrote: commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback tracepoints to report cgroup") made writeback tracepoints report cgroup writeback, but it may trigger the below bug on -rt kernel due to the list_lock held for the for loop in wb_writeback(). list_lock is a sleeping mutex, it's not disabling preemption. Moving it doesn't make a difference. BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930 in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3 Something else disabled preemption. And note, nothing in the tracepoint should have called a sleeping function. Yes, it confuses me too. It sounds like the preempt_ip address is not that accurate. INFO: lockdep is turned off. Preemption disabled at:[] wb_writeback+0xec/0x830 CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20 Hardware name: Freescale Layerscape 2085a RDB Board (DT) Workqueue: writeback wb_workfn (flush-7:0) Call trace: [] dump_backtrace+0x0/0x200 [] show_stack+0x24/0x30 [] dump_stack+0x88/0xa8 [] ___might_sleep+0x2ec/0x300 [] rt_spin_lock+0x38/0xb8 [] kernfs_path_len+0x30/0x90 [] trace_event_raw_event_writeback_work_class+0xe8/0x2e8 How accurate is this trace back? Here's the code that is executed in this tracepoint: TP_fast_assign( struct device *dev = bdi->dev; if (!dev) dev = default_backing_dev_info.dev; strncpy(__entry->name, dev_name(dev), 32); __entry->nr_pages = work->nr_pages; __entry->sb_dev = work->sb ? work->sb->s_dev : 0; __entry->sync_mode = work->sync_mode; __entry->for_kupdate = work->for_kupdate; __entry->range_cyclic = work->range_cyclic; __entry->for_background = work->for_background; __entry->reason = work->reason; ), See anything that would sleep? According to the stack backtrace, kernfs_path_len takes a sleeping lock, and it is called by __trace_wb_cgroup_size(wb) in __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb)). The below is the definition: DECLARE_EVENT_CLASS(writeback_work_class, TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work), TP_ARGS(wb, work), TP_STRUCT__entry( __array(char, name, 32) __field(long, nr_pages) __field(dev_t, sb_dev) __field(int, sync_mode) __field(int, for_kupdate) __field(int, range_cyclic) __field(int, for_background) __field(int, reason) __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb)) Thanks, Yang [] wb_writeback+0x620/0x830 [] wb_workfn+0x61c/0x950 [] process_one_work+0x3ac/0xb30 [] worker_thread+0x9c/0x7a8 [] kthread+0x190/0x1b0 [] ret_from_fork+0x10/0x30 The list_lock was moved outside the for loop by commit e8dfc30582995ae12454cda517b17d6294175b07 ("writeback: elevate queue_io() into wb_writeback())", however, the commit log says "No behavior change", so it sounds safe to have the list_lock acquired inside the for loop as it did before. Just acquire list_lock at the necessary points and keep all writeback tracepoints outside the critical area protected by list_lock in wb_writeback(). But list_lock itself is a sleeping lock. This doesn't make sense. This is not the bug you are looking for.
-- Steve Signed-off-by: Yang Shi --- fs/fs-writeback.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 1f76d89..9b7b5f6 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1623,7 +1623,6 @@ static long wb_writeback(struct bdi_writeback *wb, work->older_than_this = &oldest_jif; blk_start_plug(&plug); - spin_lock(&wb->list_lock); for (;;) { /* * Stop writeback when nr_pages has been consumed @@ -1661,15 +1660,19 @@ static long wb_writeback(struct bdi_writeback *wb, oldest_jif = jiffies; trace_writeback_start(wb, work); + + spin_lock(&wb->list_lock); if (list_empty(&wb->b_io)) queue_io(wb, work); if (work->sb) progress = writeback_sb_inodes(work->sb, wb, work); else progress = __writeback_inodes_wb(wb, work); - trace_writeback_written(wb, work); wb_update_bandwidth(wb, wb_start); + spin_unlock(&wb->list_lock); + + trace_writeback_written(wb, work); /* * Did we write something? Try for more @@ -1693,15 +1696,14 @@ static long wb_writeback(struct bdi_writeback *wb,
Re: [PATCH] writeback: call writeback tracepoints without holding list_lock in wb_writeback()
On 2/25/2016 11:54 AM, Steven Rostedt wrote: On Thu, 25 Feb 2016 11:38:48 -0800 "Shi, Yang" wrote: On 2/24/2016 6:40 PM, Steven Rostedt wrote: On Wed, 24 Feb 2016 14:47:23 -0800 Yang Shi wrote: commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback tracepoints to report cgroup") made writeback tracepoints report cgroup writeback, but it may trigger the below bug on -rt kernel due to the list_lock held for the for loop in wb_writeback(). list_lock is a sleeping mutex, it's not disabling preemption. Moving it doesn't make a difference. BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930 in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3 Something else disabled preemption. And note, nothing in the tracepoint should have called a sleeping function. Yes, it confuses me too. It sounds like the preempt_ip address is not that accurate. Yep, but the change you made doesn't look to be the fix. Actually, regardless of whether this is the right fix for the splat, it makes me wonder whether the spin lock that protects the whole for loop is really necessary. It sounds feasible to move it into the for loop and just protect the necessary area. INFO: lockdep is turned off. Preemption disabled at:[] wb_writeback+0xec/0x830 Can you disassemble the vmlinux file to see exactly where that call is. I use gdb to find the right locations. gdb> li *0xffc000374a5c gdb> disass 0xffc000374a5c I use gdb to get the code too. It does point to the spin_lock. (gdb) list *0xffc000374a5c 0xffc000374a5c is in wb_writeback (fs/fs-writeback.c:1621). 1616 1617oldest_jif = jiffies; 1618work->older_than_this = &oldest_jif; 1619 1620blk_start_plug(&plug); 1621spin_lock(&wb->list_lock); 1622for (;;) { 1623/* 1624 * Stop writeback when nr_pages has been consumed 1625 */ The disassembly: 0xffc000374a58 <+232>: bl 0xffc0001300b0 0xffc000374a5c <+236>: mov x0, x22 0xffc000374a60 <+240>: bl 0xffc000d5d518 CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20 Hardware name: Freescale Layerscape 2085a RDB Board (DT) Workqueue: writeback wb_workfn (flush-7:0) Call trace: [] dump_backtrace+0x0/0x200 [] show_stack+0x24/0x30 [] dump_stack+0x88/0xa8 [] ___might_sleep+0x2ec/0x300 [] rt_spin_lock+0x38/0xb8 [] kernfs_path_len+0x30/0x90 [] trace_event_raw_event_writeback_work_class+0xe8/0x2e8 How accurate is this trace back? Here's the code that is executed in this tracepoint: TP_fast_assign( struct device *dev = bdi->dev; if (!dev) dev = default_backing_dev_info.dev; strncpy(__entry->name, dev_name(dev), 32); __entry->nr_pages = work->nr_pages; __entry->sb_dev = work->sb ? work->sb->s_dev : 0; __entry->sync_mode = work->sync_mode; __entry->for_kupdate = work->for_kupdate; __entry->range_cyclic = work->range_cyclic; __entry->for_background = work->for_background; __entry->reason = work->reason; ), See anything that would sleep? According to the stack backtrace, kernfs_path_len takes a sleeping lock, and it is called by __trace_wb_cgroup_size(wb) in __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb)). The below is the definition: DECLARE_EVENT_CLASS(writeback_work_class, TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work), TP_ARGS(wb, work), TP_STRUCT__entry( __array(char, name, 32) __field(long, nr_pages) __field(dev_t, sb_dev) __field(int, sync_mode) __field(int, for_kupdate) __field(int, range_cyclic) __field(int, for_background) __field(int, reason) __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb)) Ah, thanks for pointing that out.
I missed that. It doesn't sound correct if tracepoints don't allow sleeping. I considered changing the sleeping lock to a raw lock in the kernfs_* functions, but that doesn't sound reasonable since they are used heavily by cgroup. Thanks, Yang -- Steve
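To make the sizing point concrete: the __dynamic_array() size expression is evaluated inside the generated probe itself, so whatever it calls inherits the probe's atomic context. A condensed, hypothetical rendering of what the TRACE_EVENT machinery expands to (not the literal generated code):

/* Condensed sketch: the dynamic-array sizing runs inside the probe. */
static notrace void probe_writeback_work_class(struct bdi_writeback *wb,
					       struct wb_writeback_work *work)
{
	/* evaluated first, to size the ring-buffer reservation ... */
	int cgroup_len = __trace_wb_cgroup_size(wb); /* -> kernfs_path_len() */

	/* ... then the TP_fast_assign() body quoted above is executed */
	(void)cgroup_len;
}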
Re: [PATCH] writeback: call writeback tracepoints without holding list_lock in wb_writeback()
On 2/25/2016 3:31 PM, Steven Rostedt wrote: On Thu, 25 Feb 2016 15:16:54 -0800 "Shi, Yang" wrote: Actually, regardless of whether this is the right fix for the splat, it makes me wonder whether the spin lock that protects the whole for loop is really necessary. It sounds feasible to move it into the for loop and just protect the necessary area. That's a separate issue, which may have its own merits that should be decided by the writeback folks. Yes, definitely. I will rework my commit log for this part. INFO: lockdep is turned off. Preemption disabled at:[] wb_writeback+0xec/0x830 Can you disassemble the vmlinux file to see exactly where that call is. I use gdb to find the right locations. gdb> li *0xffc000374a5c gdb> disass 0xffc000374a5c I use gdb to get the code too. It does point to the spin_lock. (gdb) list *0xffc000374a5c 0xffc000374a5c is in wb_writeback (fs/fs-writeback.c:1621). 1616 1617oldest_jif = jiffies; 1618work->older_than_this = &oldest_jif; 1619 1620blk_start_plug(&plug); 1621spin_lock(&wb->list_lock); 1622for (;;) { 1623/* 1624 * Stop writeback when nr_pages has been consumed 1625 */ The disassembly: 0xffc000374a58 <+232>: bl 0xffc0001300b0 The above is the place it recorded. But I just realized, this isn't the issue. I know where the problem is. 0xffc000374a5c <+236>: mov x0, x22 0xffc000374a60 <+240>: bl 0xffc000d5d518 DECLARE_EVENT_CLASS(writeback_work_class, TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work), TP_ARGS(wb, work), TP_STRUCT__entry( __array(char, name, 32) __field(long, nr_pages) __field(dev_t, sb_dev) __field(int, sync_mode) __field(int, for_kupdate) __field(int, range_cyclic) __field(int, for_background) __field(int, reason) __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb)) Ah, thanks for pointing that out. I missed that. It doesn't sound correct if tracepoints don't allow sleeping. I considered changing the sleeping lock to a raw lock in the kernfs_* functions, but that doesn't sound reasonable since they are used heavily by cgroup. It is the kernfs_* that can't sleep. Tracepoints use rcu_read_lock_sched_notrace(), which disables preemption, and not only that, hides itself from lockdep as the last place to disable preemption. Ah, thanks for pointing this out. Is there a way to not use the kernfs_* function? At least for -rt? I'm not quite sure whether there is a straightforward replacement. However, I'm wondering whether a lock-free version could be used by tracing. For example, create a __kernfs_path_len which doesn't acquire any lock for writeback tracing, as long as there isn't any race condition. At least we could rule out preemption. Thanks, Yang -- Steve
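The primitive Steve refers to is, in include/linux/rcupdate.h of that era, essentially:

static inline notrace void rcu_read_lock_sched_notrace(void)
{
	preempt_disable_notrace();	/* invisible to lockdep and ftrace */
	__acquire(RCU_SCHED);
}

So every probe, including the __dynamic_array() sizing above, runs in a preemption-disabled window, and on -rt a spinlock-turned-rtmutex taken inside that window trips ___might_sleep().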
Re: [PATCH] writeback: call writeback tracepoints without holding list_lock in wb_writeback()
On 2/25/2016 3:47 PM, Shi, Yang wrote: On 2/25/2016 3:31 PM, Steven Rostedt wrote: On Thu, 25 Feb 2016 15:16:54 -0800 "Shi, Yang" wrote: Actually, regardless of whether this is the right fix for the splat, it makes me wonder whether the spin lock that protects the whole for loop is really necessary. It sounds feasible to move it into the for loop and just protect the necessary area. That's a separate issue, which may have its own merits that should be decided by the writeback folks. Yes, definitely. I will rework my commit log for this part. INFO: lockdep is turned off. Preemption disabled at:[] wb_writeback+0xec/0x830 Can you disassemble the vmlinux file to see exactly where that call is. I use gdb to find the right locations. gdb> li *0xffc000374a5c gdb> disass 0xffc000374a5c I use gdb to get the code too. It does point to the spin_lock. (gdb) list *0xffc000374a5c 0xffc000374a5c is in wb_writeback (fs/fs-writeback.c:1621). 1616 1617oldest_jif = jiffies; 1618work->older_than_this = &oldest_jif; 1619 1620blk_start_plug(&plug); 1621spin_lock(&wb->list_lock); 1622for (;;) { 1623/* 1624 * Stop writeback when nr_pages has been consumed 1625 */ The disassembly: 0xffc000374a58 <+232>: bl 0xffc0001300b0 The above is the place it recorded. But I just realized, this isn't the issue. I know where the problem is. 0xffc000374a5c <+236>: mov x0, x22 0xffc000374a60 <+240>: bl 0xffc000d5d518 DECLARE_EVENT_CLASS(writeback_work_class, TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work), TP_ARGS(wb, work), TP_STRUCT__entry( __array(char, name, 32) __field(long, nr_pages) __field(dev_t, sb_dev) __field(int, sync_mode) __field(int, for_kupdate) __field(int, range_cyclic) __field(int, for_background) __field(int, reason) __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb)) Ah, thanks for pointing that out. I missed that. It doesn't sound correct if tracepoints don't allow sleeping. I considered changing the sleeping lock to a raw lock in the kernfs_* functions, but that doesn't sound reasonable since they are used heavily by cgroup. It is the kernfs_* that can't sleep. Tracepoints use rcu_read_lock_sched_notrace(), which disables preemption, and not only that, hides itself from lockdep as the last place to disable preemption. Ah, thanks for pointing this out. Is there a way to not use the kernfs_* function? At least for -rt? I'm not quite sure whether there is a straightforward replacement. However, I'm wondering whether a lock-free version could be used by tracing. For example, create a __kernfs_path_len which doesn't acquire any lock for writeback tracing, as long as there isn't any race condition. At least we could rule out preemption. Can we disable IRQs in tracepoints, since spin_lock_irqsave is used by the kernfs_* functions? Thanks, Yang Thanks, Yang -- Steve
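The RFC V2 whose tail is quoted near the top of this archive implements exactly this idea; its lock-free length walk is simply:

size_t _kernfs_path_len(struct kernfs_node *kn)
{
	size_t len = 0;

	do {
		len += strlen(kn->name) + 1;
		kn = kn->parent;
	} while (kn && kn->parent);

	return len;
}

with a synchronize_sched() added to kernfs_rename_ns() so that a concurrent rename cannot pull the parent chain out from under a preemption-disabled probe.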
Re: [BUG linux-next] Kernel panic found with linux-next-20160414
On 4/20/2016 1:01 AM, Hugh Dickins wrote: On Tue, 19 Apr 2016, Shi, Yang wrote: Hi folks, When I ran ltp on linux-next-20160414 on my ARM64 machine, I got the below kernel panic: Unable to handle kernel paging request at virtual address ffc007846000 pgd = ffc01e21d000 [ffc007846000] *pgd=, *pud= Internal error: Oops: 9647 [#11] PREEMPT SMP Modules linked in: loop CPU: 7 PID: 274 Comm: systemd-journal Tainted: G D 4.6.0-rc3-next-20160414-WR8.0.0.0_standard+ #9 Hardware name: Freescale Layerscape 2085a RDB Board (DT) task: ffc01e3fcf80 ti: ffc01ea8c000 task.ti: ffc01ea8c000 PC is at copy_page+0x38/0x120 LR is at migrate_page_copy+0x604/0x1660 pc : [] lr : [] pstate: 2145 sp : ffc01ea8ecd0 x29: ffc01ea8ecd0 x28: x27: 1ff7b80240f8 x26: ffc018196f20 x25: ffbdc01e1180 x24: ffbdc01e1180 x23: x22: ffc01e3fcf80 x21: ffc00481f000 x20: ff900a31d000 x19: ffbdc01207c0 x18: 0f00 x17: x16: x15: x14: x13: x12: x11: x10: x9 : x8 : x7 : x6 : x5 : x4 : x3 : x2 : x1 : ffc00481f080 x0 : ffc007846000 Call trace: Exception stack(0xffc021fc2ed0 to 0xffc021fc2ff0) 2ec0: ffbdc00887c0 ff900a31d000 2ee0: ffc021fc30f0 ff9008ff2318 2145 0025 2f00: ffbdc025a280 ffc020adc4c0 41b58ab3 ff900a085fd0 2f20: ff9008200658 ffbdc00887c0 2f40: ff900b0f1320 ffc021fc3078 41b58ab3 ff900a0864f8 2f60: ff9008210010 ffc021fb8960 ff900867bacc 1ff8043f712d 2f80: ffc021fc2fb0 ff9008210564 ffc021fc3070 ffc021fb8940 2fa0: 08221f78 ff900862f9c8 ffc021fc2fe0 ff9008215dc8 2fc0: 1ff8043f8602 ffc021fc ffc00968a000 ffc00221f080 2fe0: f9407e11d1f0 d61f02209103e210 [] copy_page+0x38/0x120 [] migrate_page+0x74/0x98 [] nfs_migrate_page+0x58/0x80 [] move_to_new_page+0x15c/0x4d8 [] migrate_pages+0x7c8/0x11f0 [] compact_zone+0xdfc/0x2570 [] compact_zone_order+0xe0/0x170 [] try_to_compact_pages+0x2e8/0x8f8 [] __alloc_pages_direct_compact+0x100/0x540 [] __alloc_pages_nodemask+0xc40/0x1c58 [] khugepaged+0x468/0x19c8 [] kthread+0x248/0x2c0 [] ret_from_fork+0x10/0x40 Code: d281f012 91020021 f1020252 d503201f (a8000c02) I did some initial investigation and found it is caused by DEBUG_PAGEALLOC and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT. And, mainline 4.6-rc3 works well. It should not be arch-specific although I caught it on ARM64. I suspect this might be caused by Hugh's huge tmpfs patches. Thanks for testing. It might be caused by my patches, but I don't think that's very likely. This is page migraton for compaction, in the service of anon THP's khugepaged; and I wonder if you were even exercising huge tmpfs when running LTP here (it certainly can be done: I like to mount a huge tmpfs on /opt/ltp and install there, with shmem_huge 2 so any other tmpfs mounts are also huge). Some further investigation shows I got the panic even though I don't have tmpfs mounted with huge=1 or shmem_huge set to 2. There are compaction changes in linux-next too, but I don't see any reason why they'd cause this. I don't know arm64 traces enough to know whether it's the source page or the destination page for the copy, but it looks as if it has been freed (and DEBUG_PAGEALLOC unmapped) before reaching migration's copy. The fault address is passed in x0, which is dest in the implementation of copy_page, so it is the destination page. Needs more debugging, I'm afraid: is it reproducible? Yes, as long as I enable those two PAGEALLOC debug options, I can get the panic once I run ltp. But, it is not caused by any specific ltp test case directly; the panic happens randomly while ltp is running. Thanks, Yang Hugh
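For readers following along: with CONFIG_DEBUG_PAGEALLOC, freed pages are removed from the kernel linear mapping, which is what turns a stray access to a freed page into the hard fault above. A rough sketch of the mechanism (simplified; not the exact mm/page_alloc.c call chain):

/* Pages are unmapped on free and mapped back on allocation, so any
 * access to a freed page faults immediately instead of silently
 * corrupting memory. */
static void debug_pagealloc_free(struct page *page, unsigned int order)
{
	kernel_map_pages(page, 1 << order, 0);	/* unmap from linear map */
}

static void debug_pagealloc_alloc(struct page *page, unsigned int order)
{
	kernel_map_pages(page, 1 << order, 1);	/* map back in */
}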
[BUG linux-next] KASAN bug is raised on linux-next-20160414 with huge tmpfs on
Hi folks, When I run the below test on my ARM64 machine with an NFS-mounted rootfs, I got a KASAN bug report. The test runs well if /mnt is not mounted with "huge=1". # mount -t tmpfs -o huge=1 tmpfs /mnt # cp -a /opt/ltp /mnt/ BUG: KASAN: use-after-free in nfs_readdir+0x2c4/0x848 at addr 8b7f4000 Read of size 4 by task crond/446 page:7bffc02dfd00 count:2 mapcount:0 mapping:80001c2cae98 index:0x0 flags: 0x6c(referenced|uptodate|lru|active) page dumped because: kasan: bad access detected page->mem_cgroup:80002402da80 CPU: 0 PID: 446 Comm: crond Tainted: GW 4.6.0-rc3-next-20160414-WR8.0.0.0_standard+ #13 Hardware name: Freescale Layerscape 2085a RDB Board (DT) Call trace: [] dump_backtrace+0x0/0x2b8 [] show_stack+0x24/0x30 [] dump_stack+0xb0/0xe8 [] kasan_report_error+0x518/0x5c0 [] kasan_report+0x60/0x70 [] __asan_load4+0x64/0x80 [] nfs_readdir+0x2c4/0x848 [] iterate_dir+0x120/0x1d8 [] SyS_getdents64+0xdc/0x170 [] __sys_trace_return+0x0/0x4 Memory state around the buggy address: 8b7f3f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 8b7f3f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >8b7f4000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ 8b7f4080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 8b7f4100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff BUG: KASAN: use-after-free in nfs_do_filldir+0x88/0x298 at addr 8b7f4000 Read of size 4 by task crond/446 page:7bffc02dfd00 count:2 mapcount:0 mapping:80001c2cae98 index:0x0 flags: 0x6c(referenced|uptodate|lru|active) page dumped because: kasan: bad access detected page->mem_cgroup:80002402da80 CPU: 0 PID: 446 Comm: crond Tainted: GB W 4.6.0-rc3-next-20160414-WR8.0.0.0_standard+ #13 Hardware name: Freescale Layerscape 2085a RDB Board (DT) Call trace: [] dump_backtrace+0x0/0x2b8 [] show_stack+0x24/0x30 [] dump_stack+0xb0/0xe8 [] kasan_report_error+0x518/0x5c0 [] kasan_report+0x60/0x70 [] __asan_load4+0x64/0x80 [] nfs_do_filldir+0x88/0x298 [] nfs_readdir+0x488/0x848 [] iterate_dir+0x120/0x1d8 [] SyS_getdents64+0xdc/0x170 [] __sys_trace_return+0x0/0x4 Memory state around the buggy address: 8b7f3f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 8b7f3f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >8b7f4000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ 8b7f4080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 8b7f4100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff Thanks, Yang
Re: [PATCH] mm: move huge_pmd_set_accessed out of huge_memory.c
Hi folks, I didn't realize pmd_* functions are protected by CONFIG_TRANSPARENT_HUGEPAGE on most architectures before I made this change. Before I fix all the affected architectures' code, I want to check whether you guys think this change is worthwhile or not. Thanks, Yang On 4/20/2016 11:24 AM, Yang Shi wrote: huge_pmd_set_accessed is only called by __handle_mm_fault from memory.c, move the definition to memory.c and make it static like create_huge_pmd and wp_huge_pmd. Signed-off-by: Yang Shi --- include/linux/huge_mm.h | 4 mm/huge_memory.c| 23 --- mm/memory.c | 23 +++ 3 files changed, 23 insertions(+), 27 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 7008623..c218ab7b 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -8,10 +8,6 @@ extern int do_huge_pmd_anonymous_page(struct mm_struct *mm, extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *vma); -extern void huge_pmd_set_accessed(struct mm_struct *mm, - struct vm_area_struct *vma, - unsigned long address, pmd_t *pmd, - pmd_t orig_pmd, int dirty); extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, pmd_t orig_pmd); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index fecbbc5..6c14cb6 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1137,29 +1137,6 @@ out: return ret; } -void huge_pmd_set_accessed(struct mm_struct *mm, - struct vm_area_struct *vma, - unsigned long address, - pmd_t *pmd, pmd_t orig_pmd, - int dirty) -{ - spinlock_t *ptl; - pmd_t entry; - unsigned long haddr; - - ptl = pmd_lock(mm, pmd); - if (unlikely(!pmd_same(*pmd, orig_pmd))) - goto unlock; - - entry = pmd_mkyoung(orig_pmd); - haddr = address & HPAGE_PMD_MASK; - if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty)) - update_mmu_cache_pmd(vma, address, pmd); - -unlock: - spin_unlock(ptl); -} - static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, diff --git a/mm/memory.c b/mm/memory.c index 93897f2..6ced4eb 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3287,6 +3287,29 @@ static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma, return VM_FAULT_FALLBACK; } +static void huge_pmd_set_accessed(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long address, + pmd_t *pmd, pmd_t orig_pmd, + int dirty) +{ + spinlock_t *ptl; + pmd_t entry; + unsigned long haddr; + + ptl = pmd_lock(mm, pmd); + if (unlikely(!pmd_same(*pmd, orig_pmd))) + goto unlock; + + entry = pmd_mkyoung(orig_pmd); + haddr = address & HPAGE_PMD_MASK; + if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty)) + update_mmu_cache_pmd(vma, address, pmd); + +unlock: + spin_unlock(ptl); +} + /* * These routines also need to handle stuff like marking pages dirty * and/or accessed for architectures that don't do it in hardware (most
[BUG] set_pte_at: racy dirty state clearing warning
Hi Will and Catalin, When I enable memory compaction via # echo 1 > /proc/sys/vm/compact_memory I got the below WARNING: set_pte_at: racy dirty state clearing: 0x006899371bd3 -> 0x006899371fd3 [ cut here ] WARNING: CPU: 5 PID: 294 at ./arch/arm64/include/asm/pgtable.h:227 ptep_set_access_flags+0x138/0x1b8 Modules linked in: CPU: 5 PID: 294 Comm: systemd-journal Not tainted 4.6.0-rc3-next-20160414 #13 Hardware name: Freescale Layerscape 2085a RDB Board (DT) task: 80001e4f8080 ti: 80001e8b4000 task.ti: 80001e8b4000 PC is at ptep_set_access_flags+0x138/0x1b8 LR is at ptep_set_access_flags+0x138/0x1b8 pc : [] lr : [] pstate: 2145 sp : 80001e8b7bc0 x29: 80001e8b7bc0 x28: 80001e843ac8 x27: 0040 x26: 80001e9ae0d8 x25: 29901000 x24: 80001f32a938 x23: 0001 x22: 80001e9ae088 x21: 7eb59000 x20: 80001e843ac8 x19: 006899371fd3 x18: fe09 x17: 7ea48c88 x16: cb5afb20 x15: 003b9aca x14: 307830203e2d2033 x13: 6462313733393930 x12: 3030303836303078 x11: 30203a676e697261 x10: 656c632065746174 x9 : 7320797472696420 x8 : 79636172203a7461 x7 : 5f6574705f746573 x6 : 28300ab8 x5 : 0003 x4 : x3 : 0003 x2 : 13d16f64 x1 : dfff2000 x0 : 004f ---[ end trace d75cd9bb88364c80 ]--- Call trace: Exception stack(0x80001e8b79a0 to 0x80001e8b7ac0) 79a0: 006899371fd3 80001e843ac8 80001e8b7bc0 28497b70 79c0: 2145 003d 29901000 28301558 79e0: 41b58ab3 296870d0 28200668 292b0e40 7a00: 0001 80001f32a938 29901000 80001e9ae0d8 7a20: 0040 80001e843ac8 29901000 80001e9ae0d8 7a40: 2a9dcf60 0a6d9320 2000 7a60: 80001e8b7bc0 80001e8b7bc0 80001e8b7b80 ffc8 7a80: 80001e8b7ad0 28415418 80001e8b4000 13d16f64 7aa0: 004f dfff2000 13d16f64 0003 [] ptep_set_access_flags+0x138/0x1b8 [] handle_mm_fault+0xa24/0xfa0 [] do_page_fault+0x3d4/0x4c0 [] do_mem_abort+0xac/0x140 My kernel has ARM64_HW_AFDBM enabled, but LS2085 is not ARMv8.1. The code shows it just checks whether ARM64_HW_AFDBM is enabled or not, but doesn't check whether the CPU really has such a capability. So, it might be better to have the capability checked at runtime? Thanks, Yang
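A runtime check along the lines suggested above would read the MMU feature register; a sketch (the helper name is made up; field layout per the ARMv8.1 architecture: ID_AA64MMFR1_EL1.HAFDBS in bits [3:0], where 0 = none, 1 = Access flag only, 2 = Access flag plus dirty state):

static bool cpu_has_hw_afdbm(void)
{
	u64 mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);

	/* need value 2: hardware Access flag *and* dirty state management */
	return (mmfr1 & 0xf) == 2;
}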
Re: [PATCH] mm: move huge_pmd_set_accessed out of huge_memory.c
On 4/21/2016 12:30 AM, Kirill A. Shutemov wrote: On Wed, Apr 20, 2016 at 02:00:11PM -0700, Shi, Yang wrote: Hi folks, I didn't realize pmd_* functions are protected by CONFIG_TRANSPARENT_HUGEPAGE on most architectures before I made this change. Before I fix all the affected architectures' code, I want to check whether you guys think this change is worthwhile or not. Thanks, Yang On 4/20/2016 11:24 AM, Yang Shi wrote: huge_pmd_set_accessed is only called by __handle_mm_fault from memory.c, move the definition to memory.c and make it static like create_huge_pmd and wp_huge_pmd. Signed-off-by: Yang Shi On pte side we have the same functionality open-coded. Should we do the same for pmd? Or change pte side the same way? Sorry, I don't quite understand you. Do you mean pte_* functions? Thanks, Yang
Re: [PATCH] mm: move huge_pmd_set_accessed out of huge_memory.c
On 4/21/2016 2:15 AM, Hugh Dickins wrote: On Wed, 20 Apr 2016, Shi, Yang wrote: Hi folks, I didn't realize pmd_* functions are protected by CONFIG_TRANSPARENT_HUGEPAGE on most architectures before I made this change. Before I fix all the affected architectures' code, I want to check whether you guys think this change is worthwhile or not. Thanks for asking: no, it is not worthwhile. I would much prefer not to have to consider these trivial cleanups in the huge memory area at this time. Kirill and I have urgent work to do in this area, and coping with patch conflicts between different versions of the source will not help any of us. Thanks for your suggestion. I will consider putting such cleanup work on the back burner. Yang Thanks, Hugh Thanks, Yang On 4/20/2016 11:24 AM, Yang Shi wrote: huge_pmd_set_accessed is only called by __handle_mm_fault from memory.c, move the definition to memory.c and make it static like create_huge_pmd and wp_huge_pmd. Signed-off-by: Yang Shi --- include/linux/huge_mm.h | 4 mm/huge_memory.c| 23 --- mm/memory.c | 23 +++ 3 files changed, 23 insertions(+), 27 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 7008623..c218ab7b 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -8,10 +8,6 @@ extern int do_huge_pmd_anonymous_page(struct mm_struct *mm, extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *vma); -extern void huge_pmd_set_accessed(struct mm_struct *mm, - struct vm_area_struct *vma, - unsigned long address, pmd_t *pmd, - pmd_t orig_pmd, int dirty); extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, pmd_t orig_pmd); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index fecbbc5..6c14cb6 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1137,29 +1137,6 @@ out: return ret; } -void huge_pmd_set_accessed(struct mm_struct *mm, - struct vm_area_struct *vma, - unsigned long address, - pmd_t *pmd, pmd_t orig_pmd, - int dirty) -{ - spinlock_t *ptl; - pmd_t entry; - unsigned long haddr; - - ptl = pmd_lock(mm, pmd); - if (unlikely(!pmd_same(*pmd, orig_pmd))) - goto unlock; - - entry = pmd_mkyoung(orig_pmd); - haddr = address & HPAGE_PMD_MASK; - if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty)) - update_mmu_cache_pmd(vma, address, pmd); - -unlock: - spin_unlock(ptl); -} - static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, diff --git a/mm/memory.c b/mm/memory.c index 93897f2..6ced4eb 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3287,6 +3287,29 @@ static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma, return VM_FAULT_FALLBACK; } +static void huge_pmd_set_accessed(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long address, + pmd_t *pmd, pmd_t orig_pmd, + int dirty) +{ + spinlock_t *ptl; + pmd_t entry; + unsigned long haddr; + + ptl = pmd_lock(mm, pmd); + if (unlikely(!pmd_same(*pmd, orig_pmd))) + goto unlock; + + entry = pmd_mkyoung(orig_pmd); + haddr = address & HPAGE_PMD_MASK; + if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty)) + update_mmu_cache_pmd(vma, address, pmd); + +unlock: + spin_unlock(ptl); +} + /* * These routines also need to handle stuff like marking pages dirty * and/or accessed for architectures that don't do it in hardware (most
[BUG linux-next] kernel NULL pointer dereference on linux-next-20160420
Hi folks, I did the below test with huge tmpfs on linux-next-20160420: # mount -t tmpfs -o huge=1 tmpfs /mnt # cd /mnt Then clone the linux kernel. Then I got the below bug; the same test works well on non-huge tmpfs. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [] release_freepages+0x18/0xa0 PGD 0 Oops: [#1] PREEMPT SMP Modules linked in: CPU: 6 PID: 110 Comm: kcompactd0 Not tainted 4.6.0-rc4-next-20160420-WR7.0.0.0_standard #4 Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009 task: 880361708040 ti: 880361704000 task.ti: 880361704000 RIP: 0010:[] [] release_freepages+0x18/0xa0 RSP: 0018:880361707cf8 EFLAGS: 00010282 RAX: RBX: 88036ffde7c0 RCX: 0009 RDX: 1bf1 RSI: 000f RDI: 880361707dd0 RBP: 880361707d20 R08: 0007 R09: 1600 R10: 88036ffde7c0 R11: R12: R13: 880361707dd0 R14: 880361707dc0 R15: 880361707db0 FS: () GS:880363cc() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: CR3: 02206000 CR4: 06e0 Stack: 88036ffde7c0 1a00 880361707dc0 880361707db0 880361707da0 8119f13d 81196239 880361708040 0001 0010 Call Trace: [] compact_zone+0x55d/0x9f0 [] ? fragmentation_index+0x19/0x70 [] kcompactd_do_work+0x10f/0x230 [] kcompactd+0x90/0x1e0 [] ? wait_woken+0xa0/0xa0 [] ? kcompactd_do_work+0x230/0x230 [] kthread+0xdd/0x100 [] ret_from_fork+0x22/0x40 [] ? kthread_create_on_node+0x180/0x180 Code: c1 fa 06 31 f6 e8 a9 9b fd ff eb 98 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 56 41 55 49 89 fd 41 54 53 48 8b 07 <48> 8b 10 48 8d 78 e0 49 39 c5 4c 8d 62 e0 74 70 49 be 00 00 00 RIP [] release_freepages+0x18/0xa0 RSP CR2: ---[ end trace 855da7e142f7311f ]---
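For reference, the faulting function is a short list walk; roughly, from mm/compaction.c of that era (the linux-next version may differ in detail):

static unsigned long release_freepages(struct list_head *freelist)
{
	struct page *page, *next;
	unsigned long count = 0;

	list_for_each_entry_safe(page, next, freelist, lru) {
		list_del(&page->lru);
		__free_page(page);
		count++;
	}

	return count;
}

Reading the Code: line, the fault is on the second dereference (mov (%rax),%rdx with RAX == 0), i.e. freelist->next was NULL — which would be consistent with a freelist head that was never initialized, since an empty list head points to itself rather than to NULL.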
Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages
Hi Kirill, Finally, I got some time to look into and try yours and Hugh's patches, and ran into two problems. 1. A quick boot up test on my ARM64 machine with your v7 tree shows some unexpected error: systemd-journald[285]: Failed to save stream data /run/systemd/journal/streams/8:16863: No space left on device systemd-journald[285]: Failed to save stream data /run/systemd/journal/streams/8:16865: No space left on device Starting DNS forwarder and DHCP server.systemd-journald[285]: Failed to save stream data /run/systemd/journal/streams/8:16867: No space left on device .. systemd-journald[285]: Failed to save stream data /run/systemd/journal/streams/8:16869: No space left on device Starting Postfix Mail Transport Agent... systemd-journald[285]: Failed to save stream data /run/systemd/journal/streams/8:16871: No space left on device Starting Berkeley Internet Name Domain (DNS)... Starting Wait for Network to be Configured... systemd-journald[285]: Failed to save stream data /run/systemd/journal/streams/8:2422: No space left on device [ OK ] Started /etc/rc.local Compatibility. [FAILED] Failed to start DNS forwarder and DHCP server. See 'systemctl status dnsmasq.service' for details. systemd-journald[285]: Failed to save stream data /run/systemd/journal/streams/8:2425: No space left on device [ OK ] Started Serial Getty on ttyS1. [ OK ] Started Serial Getty on ttyS0. [ OK ] Started Getty on tty1. systemd-journald[285]: Failed to save stream data /run/systemd/journal/streams/8:2433: No space left on device [FAILED] Failed to start Berkeley Internet Name Domain (DNS). See 'systemctl status named.service' for details. The /run dir is mounted as tmpfs. x86 boot doesn't get such errors. And, Hugh's patches don't have such a problem. 2. I ran my THP test (generated a program with 4MB text section) on both x86-64 and ARM64 with yours and Hugh's patches (linux-next tree), I got the program execution time reduced by ~12% on x86-64, it looks very impressive. But, on ARM64, there is just ~3% change, and sometimes huge tmpfs may show even worse data than non-hugepage. Both yours and Hugh's patches have the same behavior. Any idea? Thanks, Yang On 4/15/2016 5:23 PM, Kirill A. Shutemov wrote: This is probably the last update before the mm summit. Main focus is on khugepaged stability. khugepaged is in more reasonable shape now. I missed quite a few corner cases on first try. I run this version via LTP, trinity and syzkaller without crashes so far. The patchset is on top of v4.6-rc3 plus Hugh's "easy preliminaries to THPagecache" and Ebru's khugepaged swapin patches from the -mm tree.
Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v7 == Changelog == v7: - khugepaged updates: + fix page leak/page cache corruption on collapse fail; + filter out VMAs not suitable for huge pages due misaligned vm_pgoff; + fix build without CONFIG_SHMEM; + drop few over-protective checks; - fix bogus VM_BUG_ON() in __delete_from_page_cache(); v6: - experimental collapse support; - fix swapout mapped huge pages; - fix page leak in faularound code; - fix excessive huge page allocation with huge=within_size; - rename VM_NO_THP to VM_NO_KHUGEPAGED; - fix condition in hugepage_madvise(); - accounting reworked again; v5: - add FileHugeMapped to /proc/PID/smaps; - make FileHugeMapped in meminfo aligned with other fields; - Documentation/vm/transhuge.txt updated; v4: - first four patch were applied to -mm tree; - drop pages beyond i_size on split_huge_pages; - few small random bugfixes; v3: - huge= mountoption now can have values always, within_size, advice and never; - sysctl handle is replaced with sysfs knob; - MADV_HUGEPAGE/MADV_NOHUGEPAGE is now respected on page allocation via page fault; - mlock() handling had been fixed; - bunch of smaller bugfixes and cleanups. == Design overview == Huge pages are allocated by shmem when it's allowed (by mount option) and there's no entries for the range in radix-tree. Huge page is represented by HPAGE_PMD_NR entries in radix-tree. MM core maps a page with PMD if ->fault() returns huge page and the VMA is suitable for huge pages (size, alignment). There's no need for two requests to the file system: the filesystem returns huge page if it can, graceful fallback to small pages otherwise. As with DAX, split_huge_pmd() is implemented by unmapping the PMD: we can re-fault the page with PTEs later. Basic scheme for split_huge_page() is the same as for anon-THP. Few differences: - File pages are on radix-tree, so we have head->_count offset by HPAGE_PMD_NR. The count got distributed to small pages during split. - mapping->tree_lock prevents non-lockless access to pages under split over radix-tree; - Lockless access is prevented by setting the head->_count to 0 during split, so get_page_unless_z
Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages
On 4/19/2016 7:33 AM, Jerome Marchand wrote: On 04/19/2016 12:55 AM, Shi, Yang wrote: 2. I ran my THP test (generated a program with 4MB text section) on both x86-64 and ARM64 with yours and Hugh's patches (linux-next tree), I got the program execution time reduced by ~12% on x86-64, it looks very impressive. But, on ARM64, there is just ~3% change, and sometimes huge tmpfs may show even worse data than non-hugepage. Both yours and Hugh's patches have the same behavior. Any idea? Just a shot in the dark, but what page size do you use? If you use 4k pages, then hugepage size should be the same as on x86 and a similar behavior could be expected. Otherwise, hugepages would be too big to be taken advantage of by your test program. Jerome I do use 4K pages for both x86-64 and ARM64 in my testing. Thanks, Yang
KASAN can't change FRAME_WARN
Hi Andrey, When I enable KASAN for 4.5 and 4.6 (I didn't try with older versions), I got FRAME_WARN warnings because frame sizes exceed 2048 bytes. Then I found the kconfig looks like: range 0 8192 default 0 if KASAN default 1024 if !64BIT default 2048 if 64BIT In my understanding, FRAME_WARN should be 0 once KASAN is enabled, but it is still 2048. I tried a couple of fixes, i.e. default 0 if KASAN default 1024 if (!KASAN && !64BIT) default 2048 if (!KASAN && 64BIT) But, nothing works, so I have to add "depends on !KASAN" to disable FRAME_WARN completely, but it causes the kernel image size to increase. Any hint is appreciated. Thanks, Yang
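One thing worth checking (an observation about Kconfig semantics, not a confirmed diagnosis): a default only applies when the symbol has no value yet, so an existing .config that already contains CONFIG_FRAME_WARN=2048 keeps 2048 even after KASAN is turned on. The stanza itself looks right, since the first default whose condition holds wins on a fresh config:

config FRAME_WARN
	int "Warn for stack frames larger than (needs gcc 4.4)"
	range 0 8192
	default 0 if KASAN	# first matching default wins, but only on fresh configs
	default 1024 if !64BIT
	default 2048 if 64BIT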
[BUG linux-next] Kernel panic found with linux-next-20160414
Hi folks, When I ran ltp on linux-next-20160414 on my ARM64 machine, I got the below kernel panic: Unable to handle kernel paging request at virtual address ffc007846000 pgd = ffc01e21d000 [ffc007846000] *pgd=, *pud= Internal error: Oops: 9647 [#11] PREEMPT SMP Modules linked in: loop CPU: 7 PID: 274 Comm: systemd-journal Tainted: G D 4.6.0-rc3-next-20160414-WR8.0.0.0_standard+ #9 Hardware name: Freescale Layerscape 2085a RDB Board (DT) task: ffc01e3fcf80 ti: ffc01ea8c000 task.ti: ffc01ea8c000 PC is at copy_page+0x38/0x120 LR is at migrate_page_copy+0x604/0x1660 pc : [] lr : [] pstate: 2145 sp : ffc01ea8ecd0 x29: ffc01ea8ecd0 x28: x27: 1ff7b80240f8 x26: ffc018196f20 x25: ffbdc01e1180 x24: ffbdc01e1180 x23: x22: ffc01e3fcf80 x21: ffc00481f000 x20: ff900a31d000 x19: ffbdc01207c0 x18: 0f00 x17: x16: x15: x14: x13: x12: x11: x10: x9 : x8 : x7 : x6 : x5 : x4 : x3 : x2 : x1 : ffc00481f080 x0 : ffc007846000 Call trace: Exception stack(0xffc021fc2ed0 to 0xffc021fc2ff0) 2ec0: ffbdc00887c0 ff900a31d000 2ee0: ffc021fc30f0 ff9008ff2318 2145 0025 2f00: ffbdc025a280 ffc020adc4c0 41b58ab3 ff900a085fd0 2f20: ff9008200658 ffbdc00887c0 2f40: ff900b0f1320 ffc021fc3078 41b58ab3 ff900a0864f8 2f60: ff9008210010 ffc021fb8960 ff900867bacc 1ff8043f712d 2f80: ffc021fc2fb0 ff9008210564 ffc021fc3070 ffc021fb8940 2fa0: 08221f78 ff900862f9c8 ffc021fc2fe0 ff9008215dc8 2fc0: 1ff8043f8602 ffc021fc ffc00968a000 ffc00221f080 2fe0: f9407e11d1f0 d61f02209103e210 [] copy_page+0x38/0x120 [] migrate_page+0x74/0x98 [] nfs_migrate_page+0x58/0x80 [] move_to_new_page+0x15c/0x4d8 [] migrate_pages+0x7c8/0x11f0 [] compact_zone+0xdfc/0x2570 [] compact_zone_order+0xe0/0x170 [] try_to_compact_pages+0x2e8/0x8f8 [] __alloc_pages_direct_compact+0x100/0x540 [] __alloc_pages_nodemask+0xc40/0x1c58 [] khugepaged+0x468/0x19c8 [] kthread+0x248/0x2c0 [] ret_from_fork+0x10/0x40 Code: d281f012 91020021 f1020252 d503201f (a8000c02) I did some initial investigation and found it is caused by DEBUG_PAGEALLOC and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT. And, mainline 4.6-rc3 works well. It should not be arch-specific although I caught it on ARM64. I suspect this might be caused by Hugh's huge tmpfs patches. Thanks, Yang
Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages
On 4/19/2016 9:50 AM, Andrea Arcangeli wrote: Hello, On Mon, Apr 18, 2016 at 03:55:44PM -0700, Shi, Yang wrote: Hi Kirill, Finally, I got some time to look into and try yours and Hugh's patches, and ran into two problems. One thing that comes to mind to test is this: qemu with -machine accel=kvm -mem-path=/dev/shm/,share=on . Thanks for the suggestion, I will definitely give it a try with KVM. It would be better if Kirill and Hugh could share what benchmark they ran and how much improvement they got since my test case is very simple and may just cover a small part of it. Yang The THP Compound approach in tmpfs may just happen to work already with KVM (or at worst it'd require minor adjustments) because it uses the exact same model KVM is already aware about from THP in anonymous memory, example from arch/x86/kvm/mmu.c: static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t *gfnp, kvm_pfn_t *pfnp, int *levelp) { kvm_pfn_t pfn = *pfnp; gfn_t gfn = *gfnp; int level = *levelp; /* * Check if it's a transparent hugepage. If this would be an * hugetlbfs page, level wouldn't be set to * PT_PAGE_TABLE_LEVEL and there would be no adjustment done * here. */ if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL && PageTransCompound(pfn_to_page(pfn)) && !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) { Not using two different models between THP in tmpfs and THP in anon is essential not just to significantly reduce the size of the kernel code, but also because THP knowledge can't be self contained in the mm/shmem.c file. Having to support two different models would complicate things for secondary MMU drivers (i.e. mmu notifer users) like KVM who also need to create huge mapping in the shadow pagetable layer in arch/x86/kvm if the primary MMU allows for it. x86-64 and ARM64 with yours and Hugh's patches (linux-next tree), I got the program execution time reduced by ~12% on x86-64, it looks very impressive. Agreed, both patchset are impressive works and achieving amazing results! My view is that in terms of long-lived computation from userland point of view, both models are malleable enough and could achieve everything we need in the end, but as far as the overall kernel efficiency is concerned the compound model will always retain a slight advantage in performance by leveraging a native THP compound refcounting that requires just one atomic_inc/dec per THP mapcount instead of 512 of them. Other advantages of the compound model is that it's half in code size despite already including khugepaged (i.e. the same split_huge_page works for both tmpfs and anon) and like said above it won't introduce many complications for drivers like KVM as the model didn't change. Thanks, Andrea
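Expanding Andrea's suggestion into a complete invocation might look like this (a sketch: the image name, memory size, and remount step are illustrative, and huge-page backing of /dev/shm depends on the patchset under test):

# mount -o remount,huge=1 /dev/shm
# qemu-system-x86_64 -machine accel=kvm -m 2048 -mem-path /dev/shm disk.img

With guest RAM backed by a file in huge tmpfs, KVM's transparent_hugepage_adjust() path quoted above should then see PageTransCompound() pages and build PMD-level shadow mappings.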
Warning triggered by lockdep checks for sock_owned_by_user on linux-next-20160420
Hi David, When I ran some tests on an NFS-mounted rootfs, I got the below warning with LOCKDEP enabled on linux-next-20160420: WARNING: CPU: 9 PID: 0 at include/net/sock.h:1408 udp_queue_rcv_skb+0x3d0/0x660 Modules linked in: CPU: 9 PID: 0 Comm: swapper/9 Tainted: G D 4.6.0-rc4-next-20160420-WR7.0.0.0_standard+ #6 Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009 88066fd03a70 8155855f 88066fd03ab0 81062803 058061318ec8 88065d1e39c0 880661318e40 880661318ec8 Call Trace: [] dump_stack+0x67/0x98 Checking out fil [] __warn+0xd3/0xf0 [] warn_slowpath_null+0x1d/0x20 [] udp_queue_rcv_skb+0x3d0/0x660 [] __udp4_lib_rcv+0x4dc/0xc00 [] udp_rcv+0x1a/0x20 [] ip_local_deliver_finish+0xd1/0x2e0 es: 57% (30585/ [] ? ip_local_deliver_finish+0x3f/0x2e0 [] ip_local_deliver+0xc2/0xd0 [] ip_rcv_finish+0x1e2/0x5a0 [] ip_rcv+0x2dc/0x410 [] ? __pskb_pull_tail+0x82/0x400 [] __netif_receive_skb_core+0x3a8/0xa80 [] ? netif_receive_skb_internal+0x1b/0xf0 [] __netif_receive_skb+0x1d/0x60 [] netif_receive_skb_internal+0x55/0xf0 [] ? netif_receive_skb_internal+0x1b/0xf0 [] napi_gro_receive+0xc2/0x180 [] igb_poll+0x5ea/0xdf0 [] net_rx_action+0x15c/0x3d0 [] __do_softirq+0x161/0x413 [] irq_exit+0xd1/0x110 [] do_IRQ+0x62/0xf0 [] common_interrupt+0x8e/0x8e [] ? cpuidle_enter_state+0xc6/0x290 [] cpuidle_enter+0x17/0x20 [] call_cpuidle+0x33/0x50 [] cpu_startup_entry+0x229/0x3b0 [] start_secondary+0x144/0x150 ---[ end trace ba508c424f0d52bf ]--- The warning is triggered by commit fafc4e1ea1a4c1eb13a30c9426fb799f5efacbc3 ("sock: tigthen lockdep checks for sock_owned_by_user"), which checks if slock is held before locking "owned". It looks fine for lock_sock, which just calls lock_sock_nested. But bh_lock_sock is different: it just calls spin_lock, so it doesn't touch dep_map, and then the check will fail even though it is locked. So, I'm wondering what a right fix for it should be: 1. Replace bh_lock_sock with bh_lock_sock_nested in the protocol implementations, but there are a lot of places calling it. 2. Just like lock_sock, just call bh_lock_sock_nested instead of spin_lock. Or are both approaches wrong or not ideal? Thanks, Yang
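For reference, the two variants differ only in their lockdep annotation; in include/net/sock.h they are essentially:

#define bh_lock_sock(__sk)	spin_lock(&((__sk)->sk_lock.slock))
#define bh_lock_sock_nested(__sk) \
				spin_lock_nested(&((__sk)->sk_lock.slock), \
				SINGLE_DEPTH_NESTING)
#define bh_unlock_sock(__sk)	spin_unlock(&((__sk)->sk_lock.slock))

Note both register with lockdep (spin_lock_nested() only changes the nesting subclass), which supports Eric's point in the follow-up below that plain spin_lock() is lockdep-friendly.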
Re: Warning triggered by lockdep checks for sock_owned_by_user on linux-next-20160420
On 4/22/2016 9:50 PM, Eric Dumazet wrote: On Fri, 2016-04-22 at 21:02 -0700, Shi, Yang wrote: Hi David, When I ran some tests on an NFS-mounted rootfs, I got the below warning with LOCKDEP enabled on linux-next-20160420: WARNING: CPU: 9 PID: 0 at include/net/sock.h:1408 udp_queue_rcv_skb+0x3d0/0x660 Modules linked in: CPU: 9 PID: 0 Comm: swapper/9 Tainted: G D 4.6.0-rc4-next-20160420-WR7.0.0.0_standard+ #6 Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009 88066fd03a70 8155855f 88066fd03ab0 81062803 058061318ec8 88065d1e39c0 880661318e40 880661318ec8 Call Trace: [] dump_stack+0x67/0x98 Checking out fil [] __warn+0xd3/0xf0 [] warn_slowpath_null+0x1d/0x20 [] udp_queue_rcv_skb+0x3d0/0x660 [] __udp4_lib_rcv+0x4dc/0xc00 [] udp_rcv+0x1a/0x20 [] ip_local_deliver_finish+0xd1/0x2e0 es: 57% (30585/ [] ? ip_local_deliver_finish+0x3f/0x2e0 [] ip_local_deliver+0xc2/0xd0 [] ip_rcv_finish+0x1e2/0x5a0 [] ip_rcv+0x2dc/0x410 [] ? __pskb_pull_tail+0x82/0x400 [] __netif_receive_skb_core+0x3a8/0xa80 [] ? netif_receive_skb_internal+0x1b/0xf0 [] __netif_receive_skb+0x1d/0x60 [] netif_receive_skb_internal+0x55/0xf0 [] ? netif_receive_skb_internal+0x1b/0xf0 [] napi_gro_receive+0xc2/0x180 [] igb_poll+0x5ea/0xdf0 [] net_rx_action+0x15c/0x3d0 [] __do_softirq+0x161/0x413 [] irq_exit+0xd1/0x110 [] do_IRQ+0x62/0xf0 [] common_interrupt+0x8e/0x8e [] ? cpuidle_enter_state+0xc6/0x290 [] cpuidle_enter+0x17/0x20 [] call_cpuidle+0x33/0x50 [] cpu_startup_entry+0x229/0x3b0 [] start_secondary+0x144/0x150 ---[ end trace ba508c424f0d52bf ]--- The warning is triggered by commit fafc4e1ea1a4c1eb13a30c9426fb799f5efacbc3 ("sock: tigthen lockdep checks for sock_owned_by_user"), which checks if slock is held before locking "owned". It looks fine for lock_sock, which just calls lock_sock_nested. But bh_lock_sock is different: it just calls spin_lock, so it doesn't touch dep_map, and then the check will fail even though it is locked. ?? spin_lock() definitely is lockdep friendly. Yes, this is what I thought too. But, I didn't figure out why the warning was still reported even though spin_lock is called. So, I'm wondering what a right fix for it should be: 1. Replace bh_lock_sock with bh_lock_sock_nested in the protocol implementations, but there are a lot of places calling it. 2. Just like lock_sock, just call bh_lock_sock_nested instead of spin_lock. Or are both approaches wrong or not ideal? I sent a patch yesterday, I am not sure what the status is. Thanks for the patch. I just found your original patch and the discussion with Valdis. I think I ran into the same problem. There is a kernel BUG triggered before the warning, but the "lockdep is off" information is not printed out, although it is really off. Just tried your patch, it works for me. Thanks, Yang diff --git a/include/net/sock.h b/include/net/sock.h index d997ec13a643..db8301c76d50 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1350,7 +1350,8 @@ static inline bool lockdep_sock_is_held(const struct sock *csk) { struct sock *sk = (struct sock *)csk; - return lockdep_is_held(&sk->sk_lock) || + return !debug_locks || + lockdep_is_held(&sk->sk_lock) || lockdep_is_held(&sk->sk_lock.slock); } #endif
Re: [linux-next PATCH] sched: cgroup: enable interrupt before calling threadgroup_change_begin
On 4/23/2016 2:14 AM, Peter Zijlstra wrote: On Fri, Apr 22, 2016 at 08:56:28PM -0700, Yang Shi wrote: When kernel oops happens in some kernel thread, i.e. kcompactd in the test, the below bug might be triggered by the oops handler: What are you trying to fix? You already oopsed the thing is wrecked. Actually, I ran into the below kernel BUG: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [] release_freepages+0x18/0xa0 PGD 0 Oops: [#1] PREEMPT SMP Modules linked in: CPU: 6 PID: 110 Comm: kcompactd0 Not tainted 4.6.0-rc4-next-20160420 #4 Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009 task: 880361732680 ti: 88036173c000 task.ti: 88036173c000 RIP: 0010:[] [] release_freepages+0x18/0xa0 RSP: 0018:88036173fcf8 EFLAGS: 00010282 RAX: RBX: 88036ffde7c0 RCX: 0009 RDX: 1bf1 RSI: 000f RDI: 88036173fdd0 RBP: 88036173fd20 R08: 0007 R09: 1600 R10: 88036ffde7c0 R11: R12: R13: 88036173fdd0 R14: 88036173fdc0 R15: 88036173fdb0 FS: () GS:880363cc() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: CR3: 02206000 CR4: 06e0 Stack: 88036ffde7c0 1a00 88036173fdc0 88036173fdb0 88036173fda0 8119f13d 81196239 880361732680 0001 0010 Call Trace: [] compact_zone+0x55d/0x9f0 [] ? fragmentation_index+0x19/0x70 [] kcompactd_do_work+0x10f/0x230 [] kcompactd+0x90/0x1e0 [] ? wait_woken+0xa0/0xa0 [] ? kcompactd_do_work+0x230/0x230 [] kthread+0xdd/0x100 [] ret_from_fork+0x22/0x40 [] ? kthread_create_on_node+0x180/0x180 Code: c1 fa 06 31 f6 e8 a9 9b fd ff eb 98 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 56 41 55 49 89 fd 41 54 53 48 8b 07 <48> 8b 10 48 8d 78 e0 49 39 c5 4c 8d 62 e0 74 70 49 be 00 00 00 RIP [] release_freepages+0x18/0xa0 RSP CR2: ---[ end trace 2e96d09e0ba6342f ]--- Then the "schedule in atomic context" bug is triggered, which causes the system to hang. But, the system is still alive without the "schedule in atomic context" bug. The previous NULL pointer dereference issue doesn't bring the system down other than killing the kcompactd kthread. Thanks, Yang
Re: [linux-next PATCH] sched: cgroup: enable interrupt before calling threadgroup_change_begin
On 4/25/2016 10:35 AM, Shi, Yang wrote: On 4/23/2016 2:14 AM, Peter Zijlstra wrote: On Fri, Apr 22, 2016 at 08:56:28PM -0700, Yang Shi wrote: When kernel oops happens in some kernel thread, i.e. kcompactd in the test, the below bug might be triggered by the oops handler: What are you trying to fix? You already oopsed the thing is wrecked. Actually, I ran into the below kernel BUG: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [] release_freepages+0x18/0xa0 PGD 0 Oops: [#1] PREEMPT SMP Modules linked in: CPU: 6 PID: 110 Comm: kcompactd0 Not tainted 4.6.0-rc4-next-20160420 #4 Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009 task: 880361732680 ti: 88036173c000 task.ti: 88036173c000 RIP: 0010:[] [] release_freepages+0x18/0xa0 RSP: 0018:88036173fcf8 EFLAGS: 00010282 RAX: RBX: 88036ffde7c0 RCX: 0009 RDX: 1bf1 RSI: 000f RDI: 88036173fdd0 RBP: 88036173fd20 R08: 0007 R09: 1600 R10: 88036ffde7c0 R11: R12: R13: 88036173fdd0 R14: 88036173fdc0 R15: 88036173fdb0 FS: () GS:880363cc() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: CR3: 02206000 CR4: 06e0 Stack: 88036ffde7c0 1a00 88036173fdc0 88036173fdb0 88036173fda0 8119f13d 81196239 880361732680 0001 0010 Call Trace: [] compact_zone+0x55d/0x9f0 [] ? fragmentation_index+0x19/0x70 [] kcompactd_do_work+0x10f/0x230 [] kcompactd+0x90/0x1e0 [] ? wait_woken+0xa0/0xa0 [] ? kcompactd_do_work+0x230/0x230 [] kthread+0xdd/0x100 [] ret_from_fork+0x22/0x40 [] ? kthread_create_on_node+0x180/0x180 Code: c1 fa 06 31 f6 e8 a9 9b fd ff eb 98 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 56 41 55 49 89 fd 41 54 53 48 8b 07 <48> 8b 10 48 8d 78 e0 49 39 c5 4c 8d 62 e0 74 70 49 be 00 00 00 RIP [] release_freepages+0x18/0xa0 RSP CR2: ---[ end trace 2e96d09e0ba6342f ]--- Then the "schedule in atomic context" bug is triggered, which causes the system to hang. But, the system is still alive without the "schedule in atomic context" bug. The previous NULL pointer dereference issue doesn't bring the system down other than killing the kcompactd kthread. BTW, I don't have "panic on oops" set. So, the kernel doesn't panic. Thanks, Yang Thanks, Yang
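For reference, that behavior is controlled by the panic_on_oops sysctl (or the oops=panic boot parameter); with it set, the first oops would have taken the machine down before the secondary bug could trigger:

# echo 1 > /proc/sys/kernel/panic_on_oops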
Re: [PATCH] panic: lockdep: correct lock debugging state check
On 4/26/2016 5:39 AM, Peter Zijlstra wrote: On Mon, Apr 25, 2016 at 08:36:37PM -0700, Yang Shi wrote: When kernel oops happens, lock debugging is turned off by debug_locks_off() in oops_enter() via calling __debug_locks_off(), which sets debug_locks to 0 via xchg(). But, calling __debug_locks_off() to check lock debugging state in add_taint() called by oops_end() will always return false since xchg() returns the old value of debug_locks, which is cleared in oops_enter() already. This prevents add_taint() from printing out lock debugging disable information although LOCKDEP_NOW_UNRELIABLE is passed to it. Check lock debugging state via !debug_locks to fix this. Although !__debug_locks_off() could do the same thing, it may look confusing. What are you smoking? This is the second completely insane patch you send this week. This breaks add_taint() and gains us nothing except trivialities. Who bloody cares about that print if you've just had an OOPS? I apologize in advance if I misunderstand the code; please ignore all the bullshit below. In my understanding, add_taint() should call that pr_warn if LOCKDEP_NOW_UNRELIABLE is passed and lock debugging is disabled. This is what the code tells me. LOCKDEP_NOW_UNRELIABLE is passed via the lock_ok parameter, lock debugging is turned off by debug_locks_off() already, so it should print out something, but it doesn't since __debug_locks_off() always returns 0. So, it looks like the if statement logic is broken. There are alternative ways to fix it; I may have picked a non-ideal one. I do agree not too many people care about that print and such information is too trivial to draw attention from people. However, it doesn't mean the oops print is a perfect place to hide something wrong. I just happened to find this by checking the oops information to try to get some clue for another issue. Then I thought it is just a quick fix, so why shouldn't I do that to make the kernel better. Thanks, Yang
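To spell out the mechanics being debated, condensed from include/linux/debug_locks.h and kernel/panic.c of that era:

/* include/linux/debug_locks.h */
#define __debug_locks_off() xchg(&debug_locks, 0)

/* kernel/panic.c, condensed */
void add_taint(unsigned flag, enum lockdep_ok lockdep_ok)
{
	if (lockdep_ok == LOCKDEP_NOW_UNRELIABLE && __debug_locks_off())
		pr_warn("Disabling lock debugging due to kernel taint\n");

	set_bit(flag, &tainted_mask);
}

Since oops_enter() has already consumed the 1 via debug_locks_off(), the xchg() here returns 0 and the pr_warn() is skipped, which is what the patch description complains about. Note, though, that the check has a side effect: when a taint arrives first, it is what actually turns lock debugging off, so replacing it with a pure !debug_locks test would break that path — presumably the breakage Peter is objecting to.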
Re: [PATCH] arm64: remove redundant FRAME_POINTER kconfig option
On 11/6/2015 8:25 AM, Will Deacon wrote: On Fri, Nov 06, 2015 at 04:21:09PM +, Catalin Marinas wrote: On Fri, Nov 06, 2015 at 12:50:02PM +, Mark Rutland wrote: On Fri, Nov 06, 2015 at 12:30:09PM +, Will Deacon wrote: On Wed, Nov 04, 2015 at 09:37:51AM -0800, Yang Shi wrote: FRAME_POINTER is defined in lib/Kconfig.debug, it is unnecessary to redefine it in arch/arm64/Kconfig.debug. It might be worth noting that this adds a dependency on DEBUG_KERNEL for building with frame pointers. I'm ok with that (it appears to be enabled in defconfig and follows the vast majority of other archs) but it is a change in behaviour. With that: Acked-by: Will Deacon The code in arch/arm64/kernel/stacktrace.c assumes we have frame pointers regardless of FRAME_POINTER. Depending on what the compiler decides to use x29 for, we could get some weird fake unwinding and/or dodgy memory accesses. I think we should first audit the uses of frame pointers to ensure that they are guarded for !FRAME_POINTER. Or we just select FRAME_POINTER in the ARM64 Kconfig entry. Yang, did you see any benefit disabling frame pointers, or was this patch purely based on you spotting a duplicate Kconfig entry? It just spots a duplicate Kconfig entry. FRAME_POINTER is defined in both lib/Kconfig.debug and arch/arm64/Kconfig.debug. The lib/Kconfig.debug one looks like: config FRAME_POINTER bool "Compile the kernel with frame pointers" depends on DEBUG_KERNEL && \ (CRIS || M68K || FRV || UML || \ AVR32 || SUPERH || BLACKFIN || MN10300 || METAG) || \ ARCH_WANT_FRAME_POINTERS default y if (DEBUG_INFO && UML) || ARCH_WANT_FRAME_POINTERS help If you say Y here the resulting kernel image will be slightly larger and slower, but it gives very useful debugging information in case of kernel bugs. (precise oopses/stacktraces/warnings) The common one just depends on DEBUG_KERNEL && ARCH_WANT_FRAME_POINTERS. ARCH_WANT_FRAME_POINTERS is selected by ARM64 kconfig entry. To answer Catalin's question about: However, the patch would allow one to disable FRAME_POINTERS (not sure it has any effect on the aarch64 gcc though). No, it doesn't. Actually, FRAME_POINTER could be disabled regardless of the patch. Thanks, Yang Will
Re: [PATCH] arm64: remove redundant FRAME_POINTER kconfig option
On 11/6/2015 9:35 AM, Catalin Marinas wrote: On Fri, Nov 06, 2015 at 09:23:38AM -0800, Shi, Yang wrote: On 11/6/2015 8:25 AM, Will Deacon wrote:

However, the patch would allow one to disable FRAME_POINTERS (not sure it has any effect on the aarch64 gcc though).

No, it doesn't. Actually, FRAME_POINTER could be disabled regardless of the patch.

In which case I suggest that we always select it, just as a clearer statement that the feature cannot be disabled (and you never know what the compiler people decide to do in the future).

Do you mean select FRAME_POINTER in ARCH_WANT_FRAME_POINTERS? Yes, we could, but this may cause other architectures which select ARCH_WANT_FRAME_POINTERS to have FRAME_POINTER selected too. Or we could do:

        select FRAME_POINTER if ARM64

Thanks,
Yang
Re: [PATCH] arm64: remove redundant FRAME_POINTER kconfig option
On 11/6/2015 9:51 AM, Catalin Marinas wrote: On Fri, Nov 06, 2015 at 09:39:07AM -0800, Shi, Yang wrote: On 11/6/2015 9:35 AM, Catalin Marinas wrote: On Fri, Nov 06, 2015 at 09:23:38AM -0800, Shi, Yang wrote: On 11/6/2015 8:25 AM, Will Deacon wrote:

However, the patch would allow one to disable FRAME_POINTERS (not sure it has any effect on the aarch64 gcc though).

No, it doesn't. Actually, FRAME_POINTER could be disabled regardless of the patch.

In which case I suggest that we always select it, just as a clearer statement that the feature cannot be disabled (and you never know what the compiler people decide to do in the future).

Do you mean select FRAME_POINTER in ARCH_WANT_FRAME_POINTERS? Yes, we could, but this may cause other architectures which select ARCH_WANT_FRAME_POINTERS to have FRAME_POINTER selected too.

This would have been the ideal option, something like:

--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -322,7 +322,7 @@ config ARCH_WANT_FRAME_POINTERS
        help

 config FRAME_POINTER
-       bool "Compile the kernel with frame pointers"
+       bool "Compile the kernel with frame pointers" if !ARCH_WANT_FRAME_POINTERS
        depends on DEBUG_KERNEL && \
                (CRIS || M68K || FRV || UML || \
                 AVR32 || SUPERH || BLACKFIN || MN10300 || METAG) || \

But, as you said, we would need to check the other architectures selecting ARCH_WANT_FRAME_POINTERS.

How about:

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 1d1521c..709255a 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -319,6 +319,7 @@ config DEBUG_SECTION_MISMATCH
 #
 config ARCH_WANT_FRAME_POINTERS
        bool
+       select FRAME_POINTER if ARM64
        help

 config FRAME_POINTER

If other architectures want the same behavior, they could easily append to the if clause. If all arches that select ARCH_WANT_FRAME_POINTERS end up wanting it, the if clause could just be removed.

Yang

In the meantime:

--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -27,6 +27,7 @@ config ARM64
        select CPU_PM if (SUSPEND || CPU_IDLE)
        select DCACHE_WORD_ACCESS
        select EDAC_SUPPORT
+       select FRAME_POINTER
        select GENERIC_ALLOCATOR
        select GENERIC_CLOCKEVENTS
        select GENERIC_CLOCKEVENTS_BROADCAST
Re: [PATCH] arm64: bpf: fix JIT stack setup
Please ignore this one, I forgot to cc the linux-arm-kernel list. Sorry for the inconvenience.

Yang

On 11/6/2015 9:34 PM, Yang Shi wrote:

The ARM64 JIT used FP (x29) as the eBPF fp register, but FP is subject to change during function calls, so it may cause the BPF prog stack base address to change too. Moreover, it pointed to the bottom of the BPF prog stack instead of the top. So, when copying data via bpf_probe_read, the data will be copied to (SP - offset), and it may overwrite the saved FP/LR.

Use x25 to replace FP as the BPF stack base register (fp). Since x25 is a callee-saved register, it stays intact across function calls. It is initialized in the BPF prog prologue each time the BPF prog starts to run. When the BPF prog exits, it can just be tossed. Other than this, the BPF prog stack base needs to be set up before the function call stack. So, the BPF stack layout looks like:

                          high
 original A64_SP =>   0:+----------+ BPF prologue
                        |          | FP/LR and callee saved registers
 BPF fp register => +64:+----------+
                        |          |
                        |   ...    | BPF prog stack
                        |          |
                        |          |
 current A64_SP =>      +----------+
                        |          |
                        |   ...    | Function call stack
                        |          |
                        +----------+
                          low

Signed-off-by: Yang Shi
CC: Zi Shen Lim
CC: Xi Wang
---
 arch/arm64/net/bpf_jit_comp.c | 38 +++---
 1 file changed, 31 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index a44e529..6809647 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -50,7 +50,7 @@ static const int bpf2a64[] = {
        [BPF_REG_8] = A64_R(21),
        [BPF_REG_9] = A64_R(22),
        /* read-only frame pointer to access stack */
-       [BPF_REG_FP] = A64_FP,
+       [BPF_REG_FP] = A64_R(25),
        /* temporary register for internal BPF JIT */
        [TMP_REG_1] = A64_R(23),
        [TMP_REG_2] = A64_R(24),
@@ -155,18 +155,42 @@ static void build_prologue(struct jit_ctx *ctx)
                stack_size += 4; /* extra for skb_copy_bits buffer */
        stack_size = STACK_ALIGN(stack_size);

+       /*
+        * BPF prog stack layout
+        *
+        *                          high
+        * original A64_SP =>   0:+----------+ BPF prologue
+        *                        |          | FP/LR and callee saved registers
+        * BPF fp register => +64:+----------+
+        *                        |          |
+        *                        |   ...    | BPF prog stack
+        *                        |          |
+        *                        |          |
+        * current A64_SP =>      +----------+
+        *                        |          |
+        *                        |   ...    | Function call stack
+        *                        |          |
+        *                        +----------+
+        *                          low
+        */
+
+       /* Save FP and LR registers to stay align with ARM64 AAPCS */
+       emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
+
        /* Save callee-saved register */
        emit(A64_PUSH(r6, r7, A64_SP), ctx);
        emit(A64_PUSH(r8, r9, A64_SP), ctx);
        if (ctx->tmp_used)
                emit(A64_PUSH(tmp1, tmp2, A64_SP), ctx);

-       /* Set up BPF stack */
-       emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);
-
-       /* Set up frame pointer */
+       /* Set up BPF prog stack base register (x25) */
        emit(A64_MOV(1, fp, A64_SP), ctx);

+       /* Set up function call stack */
+       emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);
+       emit(A64_MOV(1, A64_FP, A64_SP), ctx);
+
        /* Clear registers A and X */
        emit_a64_mov_i64(ra, 0, ctx);
        emit_a64_mov_i64(rx, 0, ctx);
@@ -196,8 +220,8 @@ static void build_epilogue(struct jit_ctx *ctx)
        emit(A64_POP(r8, r9, A64_SP), ctx);
        emit(A64_POP(r6, r7, A64_SP), ctx);

-       /* Restore frame pointer */
-       emit(A64_MOV(1, fp, A64_SP), ctx);
+       /* Restore FP/LR registers */
+       emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);

        /* Set return value */
        emit(A64_MOV(1, A64_R(0), r0), ctx);
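[Editor's note: to see why the old prologue was unsafe, here is a toy userspace model of the address arithmetic. This is illustrative only: the numbers are made up, the callee-saved pushes are ignored for simplicity, and it assumes BPF stack slots are addressed at negative offsets from the BPF fp register, as the patch description implies.]

#include <stdio.h>

int main(void)
{
        unsigned long a64_sp = 0x1000;    /* SP after the prologue pushes */
        unsigned long stack_size = 0x200; /* aligned BPF prog stack size  */
        unsigned long off = 16;           /* a BPF stack access: fp - off */

        /* Old layout: SUB first, then fp = new SP, i.e. the BOTTOM of the
         * BPF prog stack. fp - off then lands BELOW the allocated region,
         * where the function call stack (and saved FP/LR) will live. */
        unsigned long old_fp = a64_sp - stack_size;
        printf("old: fp - %lu = %#lx, BPF stack = [%#lx, %#lx) -> outside\n",
               off, old_fp - off, a64_sp - stack_size, a64_sp);

        /* New layout: fp (x25) = SP BEFORE the SUB, i.e. the TOP of the
         * BPF prog stack, so fp - off stays inside the region. */
        unsigned long new_fp = a64_sp;
        printf("new: fp - %lu = %#lx, BPF stack = [%#lx, %#lx) -> inside\n",
               off, new_fp - off, a64_sp - stack_size, a64_sp);

        return 0;
}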
Re: [PATCH] bpf: convert hashtab lock to raw lock
On 11/2/2015 12:59 AM, Thomas Gleixner wrote: On Sun, 1 Nov 2015, Alexei Starovoitov wrote: On Sat, Oct 31, 2015 at 09:47:36AM -0400, Steven Rostedt wrote: On Fri, 30 Oct 2015 17:03:58 -0700 Alexei Starovoitov wrote: On Fri, Oct 30, 2015 at 03:16:26PM -0700, Yang Shi wrote:

When running bpf samples on rt kernel, it reports the below warning:

BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
in_atomic(): 1, irqs_disabled(): 128, pid: 477, name: ping
Preemption disabled at: [] kprobe_perf_func+0x30/0x228
...

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 83c209d..972b76b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -17,7 +17,7 @@ struct bpf_htab {
        struct bpf_map map;
        struct hlist_head *buckets;
-       spinlock_t lock;
+       raw_spinlock_t lock;

How do we address such things in general? I bet there are tons of places around the kernel that call spin_lock from atomic. I'd hate to lose the benefits of lockdep of non-raw spin_lock just to make rt happy.

You won't lose any benefits of lockdep. Lockdep still checks raw_spin_lock(). The only difference between raw_spin_lock and spin_lock is that in -rt spin_lock turns into an rt_mutex() and raw_spin_lock stays a spin lock.

I see. The patch makes sense then. Would be good to document this peculiarity of spin_lock.

I'm working on a document.

Thanks Steven and Thomas for your elaboration and comments.

Yang

Thanks,
tglx
Re: [PATCH] bpf: convert hashtab lock to raw lock
On 10/31/2015 11:37 AM, Daniel Borkmann wrote: On 10/31/2015 02:47 PM, Steven Rostedt wrote: On Fri, 30 Oct 2015 17:03:58 -0700 Alexei Starovoitov wrote: On Fri, Oct 30, 2015 at 03:16:26PM -0700, Yang Shi wrote:

When running bpf samples on rt kernel, it reports the below warning:

BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
in_atomic(): 1, irqs_disabled(): 128, pid: 477, name: ping
Preemption disabled at: [] kprobe_perf_func+0x30/0x228
...

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 83c209d..972b76b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -17,7 +17,7 @@ struct bpf_htab {
        struct bpf_map map;
        struct hlist_head *buckets;
-       spinlock_t lock;
+       raw_spinlock_t lock;

How do we address such things in general? I bet there are tons of places around the kernel that call spin_lock from atomic. I'd hate to lose the benefits of lockdep of non-raw spin_lock just to make rt happy.

You won't lose any benefits of lockdep. Lockdep still checks raw_spin_lock(). The only difference between raw_spin_lock and spin_lock is that in -rt spin_lock turns into an rt_mutex() and raw_spin_lock stays a spin lock.

(Btw, Yang, it would have been nice if your commit description had already included such info: not only that you convert it, but also why it's okay to do so.)

I think Thomas's document will include all the information about rt spin lock/raw spin lock, etc. Alexei & Daniel, if you think such info is necessary, I could definitely add it to the commit log in v2.

The error is that in -rt, you called a mutex and not a spin lock while atomic.

You are right, I think this happens due to the preempt_disable() in the trace_call_bpf() handler.

So, I think the patch seems okay. The dep_map is, btw, union'ed in the struct spinlock case at the same offset as the dep_map in raw_spinlock. It's a bit inconvenient, though, when we add other library code as maps in the future, f.e. things like rhashtable, as they would first need to be converted to raw_spinlock_t as well, but judging from the git log, it looks like common practice.

Yes, it is common practice in -rt to convert sleepable spin locks to raw spin locks to avoid scheduling-while-atomic bugs.

Thanks,
Yang

Thanks,
Daniel
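[Editor's note: for reference, the conversion pattern being discussed looks roughly like this. A sketch only: struct toy_htab and toy_update() are made-up names for illustration, not the actual hashtab code. The point is that raw_spinlock_t stays a true spinning lock on -rt, so it is safe to take with preemption disabled.]

#include <linux/spinlock.h>

struct toy_htab {
        raw_spinlock_t lock;    /* was: spinlock_t, which sleeps on -rt */
        /* ... buckets, etc. ... */
};

static void toy_update(struct toy_htab *htab)
{
        unsigned long flags;

        /* Safe even from a kprobe/BPF handler running with preemption
         * disabled: raw_spin_lock never turns into an rt_mutex. */
        raw_spin_lock_irqsave(&htab->lock, flags);
        /* keep the critical section short and non-sleeping */
        raw_spin_unlock_irqrestore(&htab->lock, flags);
}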