Re: [PATCH net-next v2 01/10] net: dsa: lan9303: Fixed MDIO interface
On 26. juli 2017 18:55, Andrew Lunn wrote:
> On Tue, Jul 25, 2017 at 06:15:44PM +0200, Egil Hjelmeland wrote:
> It is better to use mdiobus_read/write or, if you are nesting mdio
> busses, mdiobus_read_nested/mdiobus_write_nested. Please test this
> code with lockdep enabled.

I have CONFIG_DEBUG_SPINLOCK and CONFIG_DEBUG_MUTEXES. Should I enable
more?

Egil
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] cgroup: add cgroup.stat interface with basic hierarchy stats
On Thu, Jul 27, 2017 at 12:22:43PM -0400, Tejun Heo wrote:
> Hello,
>
> On Thu, Jul 27, 2017 at 05:14:20PM +0100, Roman Gushchin wrote:
> > Add a cgroup.stat interface to the base cgroup control files
> > with the following metrics:
> >
> >   nr_descendants        total number of descendant cgroups
> >   nr_dying_descendants  total number of dying descendant cgroups
> >   max_descendant_depth  maximum descent depth below the current cgroup
>
> Yeah, this'd be great to have. Some comments below.
>
> > +  cgroup.stat
> > +	A read-only flat-keyed file with the following entries:
> > +
> > +	  nr_descendants
> > +		Total number of descendant cgroups.
> > +
> > +	  nr_dying_descendants
> > +		Total number of dying descendant cgroups.
>
> Can you please go into more detail on what's going on with dying
> descendants here?

Sure.
Don't we plan to describe the cgroup/css lifecycle in detail
in a separate section?

> > +static int cgroup_stats_show(struct seq_file *seq, void *v)
> > +{
> > +	struct cgroup_subsys_state *css;
> > +	unsigned long total = 0;
> > +	unsigned long offline = 0;
> > +	int max_level = 0;
> > +
> > +	rcu_read_lock();
> > +	css_for_each_descendant_pre(css, seq_css(seq)) {
> > +		if (css == seq_css(seq))
> > +			continue;
> > +		++total;
>
> Let's do post increment for consistency.

Ok.

> > +		if (!(css->flags & CSS_ONLINE))
> > +			++offline;
> > +		if (css->cgroup->level > max_level)
> > +			max_level = css->cgroup->level;
> > +	}
> > +	rcu_read_unlock();
>
> I wonder whether we want to keep these counters in sync instead of
> trying to gather the number on read. Walking all descendants can get
> expensive pretty quickly and things like nr_descendants will be useful
> for other purposes too.

Ok, given that I'm working on adding the ability to set limits on the
cgroup hierarchy, that seems very reasonable. I'll implement this and
post the updated patch as part of a patchset.

Thanks!
Re: [PATCH net-next v2 01/10] net: dsa: lan9303: Fixed MDIO interface
On Fri, Jul 28, 2017 at 01:08:25PM +0200, Egil Hjelmeland wrote:
> On 26. juli 2017 18:55, Andrew Lunn wrote:
> > On Tue, Jul 25, 2017 at 06:15:44PM +0200, Egil Hjelmeland wrote:
> > It is better to use mdiobus_read/write or, if you are nesting mdio
> > busses, mdiobus_read_nested/mdiobus_write_nested. Please test this
> > code with lockdep enabled.
>
> I have CONFIG_DEBUG_SPINLOCK and CONFIG_DEBUG_MUTEXES. Should I enable
> more?

Hi Egil

Enable CONFIG_LOCKDEP and CONFIG_PROVE_LOCKING. Any lockdep splat you
get while accessing the mdio bus at this point is probably a false
positive, since it is a different mutex. Using the _nested() versions
should avoid these false positives. But you might find other places
where your locking is not right.

Andrew
Re: [PATCH 2/2] printk: Add boottime and real timestamps
On 07/25/2017 09:00 AM, Peter Zijlstra wrote:
> On Tue, Jul 25, 2017 at 08:17:27AM -0400, Prarit Bhargava wrote:
>> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
>> index 5b1662ec546f..6cd38a25f8ea 100644
>> --- a/lib/Kconfig.debug
>> +++ b/lib/Kconfig.debug
>> @@ -1,8 +1,8 @@
>>  menu "printk and dmesg options"
>>
>>  config PRINTK_TIME
>> -	int "Show timing information on printks (0-1)"
>> -	range 0 1
>> +	int "Show timing information on printks (0-3)"
>> +	range 0 3
>>  	default "0"
>>  	depends on PRINTK
>>  	help
>> @@ -13,7 +13,8 @@ config PRINTK_TIME
>>  	  The timestamp is always recorded internally, and exported
>>  	  to /dev/kmsg. This flag just specifies if the timestamp should
>>  	  be included, not that the timestamp is recorded. 0 disables the
>> -	  timestamp and 1 uses the local clock.
>> +	  timestamp and 1 uses the local clock, 2 uses the monotonic clock,
>> +	  and 3 uses the real clock.
>>
>>  	  The behavior is also controlled by the kernel command line
>>  	  parameter printk.time=1. See
>>  	  Documentation/admin-guide/kernel-parameters.rst
>
> choice
> 	prompt "printk default clock"
> 	default PRINTK_TIME_DISABLE
> 	help
> 	  goes here
>
> config PRINTK_TIME_DISABLE
> 	bool "Disabled"
> 	help
> 	  goes here
>
> config PRINTK_TIME_LOCAL
> 	bool "local clock"
> 	help
> 	  goes here
>
> config PRINTK_TIME_MONO
> 	bool "CLOCK_MONOTONIC"
> 	help
> 	  goes here
>
> config PRINTK_TIME_REAL
> 	bool "CLOCK_REALTIME"
> 	help
> 	  goes here
>
> endchoice
>
> config PRINTK_TIME
> 	int
> 	default 0 if PRINTK_TIME_DISABLE
> 	default 1 if PRINTK_TIME_LOCAL
> 	default 2 if PRINTK_TIME_MONO
> 	default 3 if PRINTK_TIME_REAL

Thanks for the above change. I can see that makes the code simpler.

> Although I must strongly discourage using REALTIME, DST will make
> untangling your logs an absolute nightmare. I would simply not provide
> it.

I understand your concern; however, I've been in situations where
REALTIME stamping has pointed me in the direction of where a bug was.
Even with the complicated logs I think it is worthwhile.

P.
Re: [PATCH] cgroup: add cgroup.stat interface with basic hierarchy stats
Hello,

On Fri, Jul 28, 2017 at 02:01:55PM +0100, Roman Gushchin wrote:
> > > +	  nr_dying_descendants
> > > +		Total number of dying descendant cgroups.
> >
> > Can you please go into more detail on what's going on with dying
> > descendants here?
>
> Sure.
> Don't we plan to describe the cgroup/css lifecycle in detail
> in a separate section?

We should, but it'd still be nice to have a short description here too.

Thanks.

--
tejun
Re: [PATCH 2/2] printk: Add boottime and real timestamps
On Fri, 28 Jul 2017, Prarit Bhargava wrote:
> On 07/25/2017 09:00 AM, Peter Zijlstra wrote:
> Thanks for the above change. I can see that makes the code simpler.
>
> > Although I must strongly discourage using REALTIME, DST will make
> > untangling your logs an absolute nightmare. I would simply not provide
> > it.
>
> I understand your concern, however, I've been in situations where
> REALTIME stamping has pointed me in the direction of where a bug was.
> Even with the complicated logs I think it is worthwhile.

As Mark pointed out, ktime_get_real() and the fast variant return UTC.
The timezone mess plus the DST nonsense are done in user space.

Thanks,

	tglx
[PATCH v3 0/5] fs/dcache: Limit # of negative dentries
v2->v3:
- Add a faster pruning rate when the free pool is close to depletion.
- As suggested by James Bottomley, add an artificial delay waiting loop
  before killing a negative dentry and properly clear the
  DCACHE_KILL_NEGATIVE flag if killing doesn't happen.
- Add a new patch to track the number of negative dentries that are
  forcibly killed.

v1->v2:
- Move the new nr_negative field to the end of the dentry_stat_t
  structure, as suggested by Matthew Wilcox.
- With the help of Miklos Szeredi, fix incorrect locking order in
  dentry_kill() by using lock_parent() instead of locking the parent's
  d_lock directly.
- Correctly account for positive to negative dentry transitions.
- Automatic pruning of negative dentries will now ignore the reference
  bit in negative dentries, but not the regular shrinking.

A rogue application can potentially create a large number of negative
dentries in the system, consuming most of the memory available. This
can impact the performance of other applications running on the system.

This patchset introduces changes to the dcache subsystem to limit the
number of negative dentries allowed to be created, thus limiting the
amount of memory that can be consumed by negative dentries.

Patch 1 tracks the number of negative dentries used and disallows the
creation of more when the limit is reached.

Patch 2 extends /proc/sys/fs/dentry-state to report the number of
negative dentries in the system.

Patch 3 enables automatic pruning of negative dentries when the count
is close to the limit so that we won't end up killing recently used
negative dentries.

Patch 4 prevents racing between negative dentry pruning and the umount
operation.

Patch 5 shows the number of forced negative dentry killings in
/proc/sys/fs/dentry-state. End users can then tune the neg_dentry_pc=
kernel boot parameter if they want to reduce forced negative dentry
killings.
Waiman Long (5):
  fs/dcache: Limit numbers of negative dentries
  fs/dcache: Report negative dentry number in dentry-state
  fs/dcache: Enable automatic pruning of negative dentries
  fs/dcache: Protect negative dentry pruning from racing with umount
  fs/dcache: Track count of negative dentries forcibly killed

 Documentation/admin-guide/kernel-parameters.txt |   7 +
 fs/dcache.c                                     | 451 ++--
 include/linux/dcache.h                          |   8 +-
 include/linux/list_lru.h                        |   1 +
 mm/list_lru.c                                   |   4 +-
 5 files changed, 435 insertions(+), 36 deletions(-)

--
1.8.3.1
[PATCH v3 3/5] fs/dcache: Enable automatic pruning of negative dentries
Having a limit for the number of negative dentries may have an
undesirable side effect in that no new negative dentries will be
allowed when the limit is reached. This may have a performance impact
on some workloads.

To prevent this from happening, we need a way to prune the negative
dentries so that new ones can be created before it is too late. This is
done by using the workqueue API to do the pruning gradually when a
threshold is reached, to minimize the performance impact on other
running tasks.

The current threshold is 1/4 of the initial value of the free pool
count. Once the threshold is reached, the automatic pruning process
will be kicked in to replenish the free pool. Each pruning run will
scan at most 256 LRU dentries and 64 dentries per node to minimize the
LRU lock hold time. The pruning rate will be 50 Hz if the free pool
count is less than 1/8 of the original and 10 Hz otherwise.

A short artificial delay loop is added to wait for changes in the
negative dentry count before killing the negative dentry. Sleeping in
this case may be problematic as the callers of dput() may not be in a
sleepable state. Allowing tasks that need negative dentries to do the
pruning synchronously themselves could cause lock and cacheline
contention. The end result may not be better than that of killing
recently created negative dentries.

Signed-off-by: Waiman Long
---
 fs/dcache.c              | 156 +--
 include/linux/list_lru.h |   1 +
 mm/list_lru.c            |   4 +-
 3 files changed, 155 insertions(+), 6 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index fb7e041..3482972 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -134,13 +134,21 @@ struct dentry_stat_t dentry_stat = {
  * Macros and variables to manage and count negative dentries.
  */
 #define NEG_DENTRY_BATCH	(1 << 8)
+#define NEG_PRUNING_SIZE	(1 << 6)
+#define NEG_PRUNING_SLOW_RATE	(HZ/10)
+#define NEG_PRUNING_FAST_RATE	(HZ/50)
 static long neg_dentry_percpu_limit __read_mostly;
 static long neg_dentry_nfree_init __read_mostly; /* Free pool initial value */
 static struct {
 	raw_spinlock_t nfree_lock;
 	long nfree;			/* Negative dentry free pool */
+	struct super_block *prune_sb;	/* Super_block for pruning */
+	int neg_count, prune_count;	/* Pruning counts */
 } ndblk ____cacheline_aligned_in_smp;

+static void prune_negative_dentry(struct work_struct *work);
+static DECLARE_DELAYED_WORK(prune_neg_dentry_work, prune_negative_dentry);
+
 static DEFINE_PER_CPU(long, nr_dentry);
 static DEFINE_PER_CPU(long, nr_dentry_unused);
 static DEFINE_PER_CPU(long, nr_dentry_neg);
@@ -329,6 +337,15 @@ static void __neg_dentry_inc(struct dentry *dentry)
 	 */
 	if (!cnt)
 		dentry->d_flags |= DCACHE_KILL_NEGATIVE;
+
+	/*
+	 * Initiate negative dentry pruning if free pool has less than
+	 * 1/4 of its initial value.
+	 */
+	if (READ_ONCE(ndblk.nfree) < neg_dentry_nfree_init/4) {
+		WRITE_ONCE(ndblk.prune_sb, dentry->d_sb);
+		schedule_delayed_work(&prune_neg_dentry_work, 1);
+	}
 }

 static inline void neg_dentry_inc(struct dentry *dentry)
@@ -770,10 +787,8 @@ static struct dentry *dentry_kill(struct dentry *dentry)
 		 * disappear under the hood even if the dentry
 		 * lock is temporarily released.
 		 */
-		unsigned int dflags;
+		unsigned int dflags = dentry->d_flags;

-		dentry->d_flags &= ~DCACHE_KILL_NEGATIVE;
-		dflags = dentry->d_flags;
 		parent = lock_parent(dentry);
 		/*
 		 * Abort the killing if the reference count or
@@ -964,8 +979,35 @@ void dput(struct dentry *dentry)

 	dentry_lru_add(dentry);

-	if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE))
-		goto kill_it;
+	if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
+		/*
+		 * Kill the dentry if it is really negative and the per-cpu
+		 * negative dentry count has still exceeded the limit even
+		 * after a short artificial delay.
+		 */
+		if (d_is_negative(dentry) &&
+		    (this_cpu_read(nr_dentry_neg) > neg_dentry_percpu_limit)) {
+			int loop = 256;
+
+			/*
+			 * Waiting to transfer free negative dentries from the
+			 * free pool to the percpu count.
+			 */
+			while (--loop) {
+				if (READ_ONCE(ndblk.nfree)) {
+					long cnt = __neg_dentry_nfree_dec();
+
+					this_cpu_sub(nr_dentry_neg, cnt);
[PATCH v3 5/5] fs/dcache: Track count of negative dentries forcibly killed
There is a performance concern about killing recently created negative
dentries. This should rarely happen under normal working conditions. To
understand how often this negative dentry killing is happening, the
/proc/sys/fs/dentry-state file is extended to track this number. This
allows us to see if additional measures will be needed to reduce the
chance of killing negative dentries.

One possible measure is to increase the percentage of system memory
allowed for negative dentries by adding or adjusting the
"neg_dentry_pc=" parameter in the kernel boot command line.

Signed-off-by: Waiman Long
---
 fs/dcache.c            | 4
 include/linux/dcache.h | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 360185e..3796c3f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -145,6 +145,7 @@ struct dentry_stat_t dentry_stat = {
 	long nfree;			/* Negative dentry free pool */
 	struct super_block *prune_sb;	/* Super_block for pruning */
 	int neg_count, prune_count;	/* Pruning counts */
+	atomic_long_t nr_neg_killed;	/* # of negative entries killed */
 } ndblk ____cacheline_aligned_in_smp;

 static void clear_prune_sb_for_umount(struct super_block *sb);
@@ -204,6 +205,7 @@ int proc_nr_dentry(struct ctl_table *table, int write, void __user *buffer,
 	dentry_stat.nr_dentry = get_nr_dentry();
 	dentry_stat.nr_unused = get_nr_dentry_unused();
 	dentry_stat.nr_negative = get_nr_dentry_neg();
+	dentry_stat.nr_killed = atomic_long_read(&ndblk.nr_neg_killed);
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 #endif
@@ -802,6 +804,7 @@ static struct dentry *dentry_kill(struct dentry *dentry)
 			spin_unlock(&parent->d_lock);
 			goto failed;
 		}
+		atomic_long_inc(&ndblk.nr_neg_killed);

 	} else if (unlikely(!spin_trylock(&parent->d_lock))) {
 		if (inode)
@@ -3932,6 +3935,7 @@ static void __init neg_dentry_init(void)

 	raw_spin_lock_init(&ndblk.nfree_lock);
 	spin_lock_init(&ndblk.prune_lock);
+	atomic_long_set(&ndblk.nr_neg_killed, 0);

 	/* 20% in global pool & 80% in percpu free */
 	ndblk.nfree = neg_dentry_nfree_init
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index e42c8fc..227ed83 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -66,7 +66,7 @@ struct dentry_stat_t {
 	long age_limit;		/* age in seconds */
 	long want_pages;	/* pages requested by system */
 	long nr_negative;	/* # of negative dentries */
-	long dummy;
+	long nr_killed;		/* # of negative dentries killed */
 };
 extern struct dentry_stat_t dentry_stat;

--
1.8.3.1
[PATCH v3 4/5] fs/dcache: Protect negative dentry pruning from racing with umount
The negative dentry pruning is done on a specific super_block set in
the ndblk.prune_sb variable. If the super_block is also being unmounted
concurrently, the content of the super_block may no longer be valid.

To protect against such a racing condition, a new lock is added to the
ndblk structure to synchronize the negative dentry pruning and umount
operations. This is a regular spinlock as the pruning operation can be
quite time consuming.

Signed-off-by: Waiman Long
---
 fs/dcache.c | 42 +++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3482972..360185e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -141,11 +141,13 @@ struct dentry_stat_t dentry_stat = {
 static long neg_dentry_nfree_init __read_mostly; /* Free pool initial value */
 static struct {
 	raw_spinlock_t nfree_lock;
+	spinlock_t prune_lock;		/* Lock for protecting pruning */
 	long nfree;			/* Negative dentry free pool */
 	struct super_block *prune_sb;	/* Super_block for pruning */
 	int neg_count, prune_count;	/* Pruning counts */
 } ndblk ____cacheline_aligned_in_smp;

+static void clear_prune_sb_for_umount(struct super_block *sb);
 static void prune_negative_dentry(struct work_struct *work);
 static DECLARE_DELAYED_WORK(prune_neg_dentry_work, prune_negative_dentry);

@@ -1355,6 +1357,7 @@ void shrink_dcache_sb(struct super_block *sb)
 {
 	long freed;

+	clear_prune_sb_for_umount(sb);
 	do {
 		LIST_HEAD(dispose);

@@ -1385,7 +1388,8 @@ static enum lru_status dentry_negative_lru_isolate(struct list_head *item,
 	 * list.
 	 */
 	if ((ndblk.neg_count >= NEG_PRUNING_SIZE) ||
-	    (ndblk.prune_count >= NEG_PRUNING_SIZE)) {
+	    (ndblk.prune_count >= NEG_PRUNING_SIZE) ||
+	    !READ_ONCE(ndblk.prune_sb)) {
 		ndblk.prune_count = 0;
 		return LRU_STOP;
 	}
@@ -1441,15 +1445,24 @@ static void prune_negative_dentry(struct work_struct *work)
 {
 	int freed;
 	long nfree;
-	struct super_block *sb = READ_ONCE(ndblk.prune_sb);
+	struct super_block *sb;
 	LIST_HEAD(dispose);

-	if (!sb)
+	/*
+	 * The prune_lock is used to protect negative dentry pruning from
+	 * racing with concurrent umount operation.
+	 */
+	spin_lock(&ndblk.prune_lock);
+	sb = READ_ONCE(ndblk.prune_sb);
+	if (!sb) {
+		spin_unlock(&ndblk.prune_lock);
 		return;
+	}

 	ndblk.neg_count = ndblk.prune_count = 0;
 	freed = list_lru_walk(&sb->s_dentry_lru, dentry_negative_lru_isolate,
 			      &dispose, NEG_DENTRY_BATCH);
+	spin_unlock(&ndblk.prune_lock);

 	if (freed)
 		shrink_dentry_list(&dispose);
@@ -1472,6 +1485,27 @@ static void prune_negative_dentry(struct work_struct *work)
 		WRITE_ONCE(ndblk.prune_sb, NULL);
 }

+/*
+ * This is called before an umount to clear ndblk.prune_sb if it
+ * matches the given super_block.
+ */
+static void clear_prune_sb_for_umount(struct super_block *sb)
+{
+	if (likely(READ_ONCE(ndblk.prune_sb) != sb))
+		return;
+	WRITE_ONCE(ndblk.prune_sb, NULL);
+	/*
+	 * Need to wait until an ongoing pruning operation, if present,
+	 * is completed.
+	 *
+	 * Clearing ndblk.prune_sb will hasten the completion of pruning.
+	 * In the unlikely event that ndblk.prune_sb is set to another
+	 * super_block, the waiting will last the complete pruning operation
+	 * which shouldn't be that long either.
+	 */
+	spin_unlock_wait(&ndblk.prune_lock);
+}
+
 /**
  * enum d_walk_ret - action to talke during tree walk
  * @D_WALK_CONTINUE:	contrinue walk
@@ -1794,6 +1828,7 @@ void shrink_dcache_for_umount(struct super_block *sb)
 	WARN(down_read_trylock(&sb->s_umount),
 	     "s_umount should've been locked");
+	clear_prune_sb_for_umount(sb);
 	dentry = sb->s_root;
 	sb->s_root = NULL;
 	do_one_tree(dentry);
@@ -3896,6 +3931,7 @@ static void __init neg_dentry_init(void)
 	unsigned long cnt;

 	raw_spin_lock_init(&ndblk.nfree_lock);
+	spin_lock_init(&ndblk.prune_lock);

 	/* 20% in global pool & 80% in percpu free */
 	ndblk.nfree = neg_dentry_nfree_init

--
1.8.3.1
[PATCH v3 2/5] fs/dcache: Report negative dentry number in dentry-state
The number of negative dentries currently in the system is now reported
in the /proc/sys/fs/dentry-state file.

Signed-off-by: Waiman Long
---
 fs/dcache.c            | 16 +++-
 include/linux/dcache.h |  7 ---
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ab10b96..fb7e041 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -135,6 +135,7 @@ struct dentry_stat_t dentry_stat = {
  */
 #define NEG_DENTRY_BATCH	(1 << 8)
 static long neg_dentry_percpu_limit __read_mostly;
+static long neg_dentry_nfree_init __read_mostly; /* Free pool initial value */
 static struct {
 	raw_spinlock_t nfree_lock;
 	long nfree;			/* Negative dentry free pool */
@@ -176,11 +177,23 @@ static long get_nr_dentry_unused(void)
 	return sum < 0 ? 0 : sum;
 }

+static long get_nr_dentry_neg(void)
+{
+	int i;
+	long sum = 0;
+
+	for_each_possible_cpu(i)
+		sum += per_cpu(nr_dentry_neg, i);
+	sum += neg_dentry_nfree_init - ndblk.nfree;
+	return sum < 0 ? 0 : sum;
+}
+
 int proc_nr_dentry(struct ctl_table *table, int write, void __user *buffer,
 		   size_t *lenp, loff_t *ppos)
 {
 	dentry_stat.nr_dentry = get_nr_dentry();
 	dentry_stat.nr_unused = get_nr_dentry_unused();
+	dentry_stat.nr_negative = get_nr_dentry_neg();
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 #endif
@@ -3739,7 +3752,8 @@ static void __init neg_dentry_init(void)

 	raw_spin_lock_init(&ndblk.nfree_lock);
 	/* 20% in global pool & 80% in percpu free */
-	ndblk.nfree = totalram_pages * nr_dentry_page * neg_dentry_pc / 500;
+	ndblk.nfree = neg_dentry_nfree_init
+		    = totalram_pages * nr_dentry_page * neg_dentry_pc / 500;
 	cnt = ndblk.nfree * 4 / num_possible_cpus();
 	if (unlikely(cnt < 2 * NEG_DENTRY_BATCH))
 		cnt = 2 * NEG_DENTRY_BATCH;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 5ffcc46..e42c8fc 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -63,9 +63,10 @@ struct qstr {
 struct dentry_stat_t {
 	long nr_dentry;
 	long nr_unused;
-	long age_limit;		/* age in seconds */
-	long want_pages;	/* pages requested by system */
-	long dummy[2];
+	long age_limit;		/* age in seconds */
+	long want_pages;	/* pages requested by system */
+	long nr_negative;	/* # of negative dentries */
+	long dummy;
 };
 extern struct dentry_stat_t dentry_stat;

--
1.8.3.1
[PATCH v3 1/5] fs/dcache: Limit numbers of negative dentries
The number of positive dentries is limited by the number of files in
the filesystems. The number of negative dentries, however, has no limit
other than the total amount of memory available in the system. So a
rogue application that generates a lot of negative dentries can
potentially exhaust most of the memory available in the system,
impacting performance of other running applications.

To prevent this from happening, the dcache code is now updated to limit
the number of negative dentries in the LRU lists that can be kept as a
percentage of total available system memory. The default is 5% and can
be changed by specifying the "neg_dentry_pc=" kernel command line
option.

If the negative dentry limit is exceeded, newly created negative
dentries will be killed right after use to avoid adding unpredictable
latency to the directory lookup operation.

Signed-off-by: Waiman Long
---
 Documentation/admin-guide/kernel-parameters.txt |   7 +
 fs/dcache.c                                     | 251 +---
 include/linux/dcache.h                          |   1 +
 3 files changed, 227 insertions(+), 32 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 372cc66..7f5497b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2383,6 +2383,13 @@
 	n2=		[NET] SDL Inc. RISCom/N2 synchronous serial card

+	neg_dentry_pc=	[KNL]
+			Range: 1-50
+			Default: 5
+			This parameter specifies the amount of negative
+			dentries allowed in the system as a percentage of
+			total system memory.
+
 	netdev=		[NET] Network devices parameters
 			Format:
 			Note that mem_start is often overloaded to mean
diff --git a/fs/dcache.c b/fs/dcache.c
index f901413..ab10b96 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -130,8 +130,19 @@ struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };

+/*
+ * Macros and variables to manage and count negative dentries.
+ */
+#define NEG_DENTRY_BATCH	(1 << 8)
+static long neg_dentry_percpu_limit __read_mostly;
+static struct {
+	raw_spinlock_t nfree_lock;
+	long nfree;			/* Negative dentry free pool */
+} ndblk ____cacheline_aligned_in_smp;
+
 static DEFINE_PER_CPU(long, nr_dentry);
 static DEFINE_PER_CPU(long, nr_dentry_unused);
+static DEFINE_PER_CPU(long, nr_dentry_neg);

 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
@@ -227,6 +238,92 @@ static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char
 #endif

+/*
+ * There is a system-wide limit to the amount of negative dentries allowed
+ * in the super blocks' LRU lists. The default limit is 5% of the total
+ * system memory. This limit can be changed by using the kernel command line
+ * option "neg_dentry_pc=" to specify the percentage of the total memory
+ * that can be used for negative dentries. That percentage must be in the
+ * 1-50% range.
+ *
+ * To avoid performance problem with a global counter on an SMP system,
+ * the tracking is done mostly on a per-cpu basis. The total limit is
+ * distributed in a 80/20 ratio to per-cpu counters and a global free pool.
+ *
+ * If a per-cpu counter runs out of negative dentries, it can borrow extra
+ * ones from the global free pool. If it has more than its percpu limit,
+ * the extra ones will be returned back to the global pool.
+ */
+
+/*
+ * Decrement negative dentry count if applicable.
+ */
+static void __neg_dentry_dec(struct dentry *dentry)
+{
+	if (unlikely(this_cpu_dec_return(nr_dentry_neg) < 0)) {
+		long *pcnt = get_cpu_ptr(&nr_dentry_neg);
+
+		if ((*pcnt < 0) && raw_spin_trylock(&ndblk.nfree_lock)) {
+			ACCESS_ONCE(ndblk.nfree) += NEG_DENTRY_BATCH;
+			*pcnt += NEG_DENTRY_BATCH;
+			raw_spin_unlock(&ndblk.nfree_lock);
+		}
+		put_cpu_ptr(&nr_dentry_neg);
+	}
+}
+
+static inline void neg_dentry_dec(struct dentry *dentry)
+{
+	if (unlikely(d_is_negative(dentry)))
+		__neg_dentry_dec(dentry);
+}
+
+/*
+ * Decrement the negative dentry free pool by NEG_DENTRY_BATCH & return
+ * the actual number decremented.
+ */
+static long __neg_dentry_nfree_dec(void)
+{
+	long cnt = NEG_DENTRY_BATCH;
+
+	raw_spin_lock(&ndblk.nfree_lock);
+	if (ndblk.nfree < cnt)
+		cnt = ndblk.nfree;
+	ACCESS_ONCE(ndblk.nfree) -= cnt;
+	raw_spin_unlock(&ndblk.nfree_lock);
+	return cnt;
+}
+
+/*
+ * Increment negative dentry count if applicable.
+ */
+static void __neg_dentry_inc(struct dentry *dentry)
+{
+	long cnt = 0, *pcnt;
+
+	if (this_cpu_inc_return(nr_dentry_neg) <= neg_dentry_percpu_limit)
Re: [RFC PATCH v2 00/38] Nested Virtualization on KVM/ARM
Jintack Lim writes:

...

>> I'll share my experiment setup shortly.
>
> I summarized my experiment setup here.
>
> https://github.com/columbia/nesting-pub/wiki/Nested-virtualization-on-ARM-setup

Thanks Jintack!

I was able to test L2 boot up with these instructions. Next, I will try
to run some simple tests. Any suggestions on reducing the L2 bootup
time in my test setup? I think I will try to make the L2 kernel print
fewer messages, and maybe just get rid of some of the userspace
services. I also applied the patch to reduce the timer frequency, btw.

Bandan

>> Even though this work has some limitations and TODOs, I'd appreciate
>> early feedback on this RFC. Specifically, I'm interested in:
>>
>> - Overall design to manage vcpu context for the virtual EL2
>> - Verifying correct EL2 register configurations such as HCR_EL2 and
>>   CPTR_EL2 (Patch 30 and 32)
>> - Patch organization and coding style
>
> I also wonder, if the hardware and/or KVM do not support nested
> virtualization but the userspace uses the nested virtualization
> option, which one is better: giving an error or launching a regular VM
> silently.
>
>> This patch series is based on kvm/next d38338e.
>> The whole patch series including memory, VGIC, and timer patches is
>> available here:
>>
>>   g...@github.com:columbia/nesting-pub.git rfc-v2
>>
>> Limitations:
>> - There are some cases where the target exception level of a VM is
>>   ambiguous when emulating the eret instruction. I'm discussing this
>>   issue with Christoffer and Marc. Meanwhile, I added a temporary
>>   patch (not included in this series; f1beaba in the repo) and used
>>   the 4.10.0 kernel when testing the guest hypervisor with VHE.
>> - Recursive nested virtualization is not tested yet.
>> - Other hypervisors (such as Xen) on KVM are not tested.
>>
>> TODO:
>> - Submit memory, VGIC, and timer patches
>> - Evaluate regular VM performance to see if there's a negative impact
>> - Test other hypervisors such as Xen on KVM
>> - Test recursive nested virtualization
>>
>> v1-->v2:
>> - Added support for the virtual EL2 with VHE
>> - Rewrote commit messages and comments from the perspective of
>>   supporting execution environments to VMs, rather than from the
>>   perspective of the guest hypervisor running in them.
>> - Fixed a few bugs to make it run on the FastModel.
>> - Tested on ARMv8.3 with four configurations. (host/guest,
>>   with/without VHE.)
>> - Rebased to kvm/next
>>
>> [1] https://www.community.arm.com/processors/b/blog/posts/armv8-a-architecture-2016-additions
>>
>> Christoffer Dall (7):
>>   KVM: arm64: Add KVM nesting feature
>>   KVM: arm64: Allow userspace to set PSR_MODE_EL2x
>>   KVM: arm64: Add vcpu_mode_el2 primitive to support nesting
>>   KVM: arm/arm64: Add a framework to prepare virtual EL2 execution
>>   arm64: Add missing TCR hw defines
>>   KVM: arm64: Create shadow EL1 registers
>>   KVM: arm64: Trap EL1 VM register accesses in virtual EL2
>>
>> Jintack Lim (31):
>>   arm64: Add ARM64_HAS_NESTED_VIRT feature
>>   KVM: arm/arm64: Enable nested virtualization via command-line
>>   KVM: arm/arm64: Check if nested virtualization is in use
>>   KVM: arm64: Add EL2 system registers to vcpu context
>>   KVM: arm64: Add EL2 special registers to vcpu context
>>   KVM: arm64: Add the shadow context for virtual EL2 execution
>>   KVM: arm64: Set vcpu context depending on the guest exception level
>>   KVM: arm64: Synchronize EL1 system registers on virtual EL2 entry
>>     and exit
>>   KVM: arm64: Move exception macros and enums to a common file
>>   KVM: arm64: Support to inject exceptions to the virtual EL2
>>   KVM: arm64: Trap SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2
>>   KVM: arm64: Trap CPACR_EL1 access in virtual EL2
>>   KVM: arm64: Handle eret instruction traps
>>   KVM: arm64: Set a handler for the system instruction traps
>>   KVM: arm64: Handle PSCI call via smc from the guest
>>   KVM: arm64: Inject HVC exceptions to the virtual EL2
>>   KVM: arm64: Respect virtual HCR_EL2.TWX setting
>>   KVM: arm64: Respect virtual CPTR_EL2.TFP setting
>>   KVM: arm64: Add macros to support the virtual EL2 with VHE
>>   KVM: arm64: Add EL2 registers defined in ARMv8.1 to vcpu context
>>   KVM: arm64: Emulate EL12 register accesses from the virtual EL2
>>   KVM: arm64: Support a VM with VHE considering EL0 of the VHE host
>>   KVM: arm64: Allow the virtual EL2 to access EL2 states without trap
>>   KVM: arm64: Manage the shadow states when virtual E2H bit enabled
>>   KVM: arm64: Trap and emulate CPTR_EL2 accesses via CPACR_EL1 from
>>     the virtual EL2 with VHE
>>   KVM: arm64: Emulate appropriate VM control system registers
>>   KVM: arm64: Respect the virtual HCR_EL2.NV bit setting
>>   KVM: arm64: Respect the virtual HCR_EL2.NV bit setting for EL12
>>     register traps
>>   KVM: arm64: Respect virtual HCR_EL2.TVM and TRVM settings
>>   KVM: arm64: Respect the virtual HCR_EL2.NV1 bit setting
>>   KVM: arm64: Respect the virtual CPTR_EL2.TCPAC s
Re: [RFC v6 27/62] powerpc: helper to validate key-access permissions of a pte
Ram Pai writes: > --- a/arch/powerpc/mm/pkeys.c > +++ b/arch/powerpc/mm/pkeys.c > @@ -201,3 +201,36 @@ int __arch_override_mprotect_pkey(struct vm_area_struct > *vma, int prot, >*/ > return vma_pkey(vma); > } > + > +static bool pkey_access_permitted(int pkey, bool write, bool execute) > +{ > + int pkey_shift; > + u64 amr; > + > + if (!pkey) > + return true; > + > + pkey_shift = pkeyshift(pkey); > + if (!(read_uamor() & (0x3UL << pkey_shift))) > + return true; > + > + if (execute && !(read_iamr() & (IAMR_EX_BIT << pkey_shift))) > + return true; > + > + if (!write) { > + amr = read_amr(); > + if (!(amr & (AMR_RD_BIT << pkey_shift))) > + return true; > + } > + > + amr = read_amr(); /* delay reading amr uptil absolutely needed */ Actually, this causes amr to be read twice when control enters the "if (!write)" block above but does not return from the if statement nested in it. read_amr should be called only once, right before "if (!write)". -- Thiago Jung Bauermann IBM Linux Technology Center
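For illustration, here is a user-space sketch of the single-read restructuring being suggested. The register accessors are mocked (the real read_amr()/read_iamr()/read_uamor() are privileged SPR reads) and the bit values are simplified placeholders, not the real powerpc definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified, illustrative bit definitions -- not the real powerpc values. */
#define AMR_RD_BIT  0x1UL
#define AMR_WR_BIT  0x2UL
#define IAMR_EX_BIT 0x1UL

/* Mocked SPR accessors so the control flow can run in user space;
 * amr_reads counts how often AMR is actually read. */
static uint64_t mock_amr, mock_iamr, mock_uamor;
static int amr_reads;

static uint64_t read_amr(void)   { amr_reads++; return mock_amr; }
static uint64_t read_iamr(void)  { return mock_iamr; }
static uint64_t read_uamor(void) { return mock_uamor; }
static int pkeyshift(int pkey)   { return pkey * 2; }

/* Restructured so that AMR is read exactly once, right where the
 * review suggests: just before the write/read permission checks. */
static bool pkey_access_permitted(int pkey, bool write, bool execute)
{
	int pkey_shift;
	uint64_t amr;

	if (!pkey)
		return true;

	pkey_shift = pkeyshift(pkey);
	if (!(read_uamor() & (0x3UL << pkey_shift)))
		return true;

	if (execute && !(read_iamr() & (IAMR_EX_BIT << pkey_shift)))
		return true;

	amr = read_amr();	/* single read, right before "if (!write)" */
	if (!write)
		return !(amr & (AMR_RD_BIT << pkey_shift));

	return !(amr & ((AMR_RD_BIT | AMR_WR_BIT) << pkey_shift));
}
```

With this shape, the !write path and the write path share the one read_amr() call, so AMR can never be read twice on a single invocation.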
Re: [RFC PATCH v2 00/38] Nested Virtualization on KVM/ARM
On Fri, Jul 28, 2017 at 4:13 PM, Bandan Das wrote: > Jintack Lim writes: > ... >>> >>> I'll share my experiment setup shortly. >> >> I summarized my experiment setup here. >> >> https://github.com/columbia/nesting-pub/wiki/Nested-virtualization-on-ARM-setup > > Thanks Jintack! I was able to test L2 boot up with these instructions. Thanks for the confirmation! > > Next, I will try to run some simple tests. Any suggestions on reducing the L2 > bootup > time in my test setup ? I think I will try to make the L2 kernel print > less messages; and maybe just get rid of some of the userspace services. > I also applied the patch to reduce the timer frequency btw. I think you can try these kernel parameters: "loglevel=1", with which the kernel prints (almost) nothing during the boot process but the init process will still print something, or "console=none", with which you don't see anything but the login message. I didn't use them because I wanted to see the L2 boot message as soon as possible :) Thanks, Jintack > > Bandan >>> >>> Even though this work has some limitations and TODOs, I'd appreciate early >>> feedback on this RFC. Specifically, I'm interested in: >>> >>> - Overall design to manage vcpu context for the virtual EL2 >>> - Verifying correct EL2 register configurations such as HCR_EL2, CPTR_EL2 >>> (Patch 30 and 32) >>> - Patch organization and coding style >> >> I also wonder if the hardware and/or KVM do not support nested >> virtualization but the userspace uses nested virtualization option, >> which one is better: giving an error or launching a regular VM >> silently. >> >>> >>> This patch series is based on kvm/next d38338e. >>> The whole patch series including memory, VGIC, and timer patches is >>> available >>> here: >>> >>> g...@github.com:columbia/nesting-pub.git rfc-v2 >>> >>> Limitations: >>> - There are some cases that the target exception level of a VM is ambiguous >>> when >>> emulating eret instruction. 
I'm discussing this issue with Christoffer and >>> Marc. Meanwhile, I added a temporary patch (not included in this >>> series. f1beaba in the repo) and used 4.10.0 kernel when testing the guest >>> hypervisor with VHE. >>> - Recursive nested virtualization is not tested yet. >>> - Other hypervisors (such as Xen) on KVM are not tested. >>> >>> TODO: >>> - Submit memory, VGIC, and timer patches >>> - Evaluate regular VM performance to see if there's a negative impact. >>> - Test other hypervisors such as Xen on KVM >>> - Test recursive nested virtualization >>> >>> v1-->v2: >>> - Added support for the virtual EL2 with VHE >>> - Rewrote commit messages and comments from the perspective of supporting >>> execution environments to VMs, rather than from the perspective of the >>> guest >>> hypervisor running in them. >>> - Fixed a few bugs to make it run on the FastModel. >>> - Tested on ARMv8.3 with four configurations. (host/guest. with/without >>> VHE.) >>> - Rebased to kvm/next >>> >>> [1] >>> https://www.community.arm.com/processors/b/blog/posts/armv8-a-architecture-2016-additions >>> >>> Christoffer Dall (7): >>> KVM: arm64: Add KVM nesting feature >>> KVM: arm64: Allow userspace to set PSR_MODE_EL2x >>> KVM: arm64: Add vcpu_mode_el2 primitive to support nesting >>> KVM: arm/arm64: Add a framework to prepare virtual EL2 execution >>> arm64: Add missing TCR hw defines >>> KVM: arm64: Create shadow EL1 registers >>> KVM: arm64: Trap EL1 VM register accesses in virtual EL2 >>> >>> Jintack Lim (31): >>> arm64: Add ARM64_HAS_NESTED_VIRT feature >>> KVM: arm/arm64: Enable nested virtualization via command-line >>> KVM: arm/arm64: Check if nested virtualization is in use >>> KVM: arm64: Add EL2 system registers to vcpu context >>> KVM: arm64: Add EL2 special registers to vcpu context >>> KVM: arm64: Add the shadow context for virtual EL2 execution >>> KVM: arm64: Set vcpu context depending on the guest exception level >>> KVM: arm64: Synchronize EL1 system registers on 
virtual EL2 entry and >>> exit >>> KVM: arm64: Move exception macros and enums to a common file >>> KVM: arm64: Support to inject exceptions to the virtual EL2 >>> KVM: arm64: Trap SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2 >>> KVM: arm64: Trap CPACR_EL1 access in virtual EL2 >>> KVM: arm64: Handle eret instruction traps >>> KVM: arm64: Set a handler for the system instruction traps >>> KVM: arm64: Handle PSCI call via smc from the guest >>> KVM: arm64: Inject HVC exceptions to the virtual EL2 >>> KVM: arm64: Respect virtual HCR_EL2.TWX setting >>> KVM: arm64: Respect virtual CPTR_EL2.TFP setting >>> KVM: arm64: Add macros to support the virtual EL2 with VHE >>> KVM: arm64: Add EL2 registers defined in ARMv8.1 to vcpu context >>> KVM: arm64: Emulate EL12 register accesses from the virtual EL2 >>> KVM: arm64: Support a VM with VHE considering EL0 of the VHE host >>> KVM: arm64: Allow the virtual EL2 to access EL2 states without trap >
Re: [RFC v6 21/62] powerpc: introduce execute-only pkey
Ram Pai writes: > --- a/arch/powerpc/mm/pkeys.c > +++ b/arch/powerpc/mm/pkeys.c > @@ -97,3 +97,60 @@ int __arch_set_user_pkey_access(struct task_struct *tsk, > int pkey, > init_iamr(pkey, new_iamr_bits); > return 0; > } > + > +static inline bool pkey_allows_readwrite(int pkey) > +{ > + int pkey_shift = pkeyshift(pkey); > + > + if (!(read_uamor() & (0x3UL << pkey_shift))) > + return true; > + > + return !(read_amr() & ((AMR_RD_BIT|AMR_WR_BIT) << pkey_shift)); > +} > + > +int __execute_only_pkey(struct mm_struct *mm) > +{ > + bool need_to_set_mm_pkey = false; > + int execute_only_pkey = mm->context.execute_only_pkey; > + int ret; > + > + /* Do we need to assign a pkey for mm's execute-only maps? */ > + if (execute_only_pkey == -1) { > + /* Go allocate one to use, which might fail */ > + execute_only_pkey = mm_pkey_alloc(mm); > + if (execute_only_pkey < 0) > + return -1; > + need_to_set_mm_pkey = true; > + } > + > + /* > + * We do not want to go through the relatively costly > + * dance to set AMR if we do not need to. Check it > + * first and assume that if the execute-only pkey is > + * readwrite-disabled than we do not have to set it > + * ourselves. > + */ > + if (!need_to_set_mm_pkey && > + !pkey_allows_readwrite(execute_only_pkey)) > + return execute_only_pkey; > + > + /* > + * Set up AMR so that it denies access for everything > + * other than execution. > + */ > + ret = __arch_set_user_pkey_access(current, execute_only_pkey, > + (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)); > + /* > + * If the AMR-set operation failed somehow, just return > + * 0 and effectively disable execute-only support. 
> + */ > + if (ret) { > + mm_set_pkey_free(mm, execute_only_pkey); > + return -1; > + } > + > + /* We got one, store it and use it from here on out */ > + if (need_to_set_mm_pkey) > + mm->context.execute_only_pkey = execute_only_pkey; > + return execute_only_pkey; > +} If you follow the code flow in __execute_only_pkey, the AMR and UAMOR are read 3 times in total, and AMR is written twice. IAMR is read and written twice. Since they are SPRs and access to them is slow (or isn't it?), is it worth it to read them once in __execute_only_pkey and pass down their values to the callees, and then write them once at the end of the function? This function is used both by the mmap syscall and the mprotect syscall (but not by pkey_mprotect) if the requested protection is execute-only. -- Thiago Jung Bauermann IBM Linux Technology Center
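As a rough user-space sketch of the read-once-and-pass-down idea raised in the review (mocked accessors and simplified bit layout, purely illustrative, not the real powerpc implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified, illustrative bit definitions -- not the real powerpc values. */
#define AMR_RD_BIT 0x1UL
#define AMR_WR_BIT 0x2UL

/* Mocked SPR accessors; spr_reads counts how many SPR reads occur. */
static uint64_t mock_amr, mock_uamor;
static int spr_reads;

static uint64_t read_amr(void)   { spr_reads++; return mock_amr; }
static uint64_t read_uamor(void) { spr_reads++; return mock_uamor; }
static int pkeyshift(int pkey)   { return pkey * 2; }

/* The helper takes the already-read values instead of re-reading SPRs. */
static bool pkey_allows_readwrite(int pkey, uint64_t amr, uint64_t uamor)
{
	int pkey_shift = pkeyshift(pkey);

	if (!(uamor & (0x3UL << pkey_shift)))
		return true;

	return !(amr & ((AMR_RD_BIT | AMR_WR_BIT) << pkey_shift));
}

/* The caller reads each SPR once up front and passes the values down,
 * the way the review suggests __execute_only_pkey could. */
static bool execute_only_pkey_is_readwrite(int pkey)
{
	uint64_t amr = read_amr();
	uint64_t uamor = read_uamor();

	return pkey_allows_readwrite(pkey, amr, uamor);
}
```

However many helpers the caller invokes afterwards, each SPR is read only once; writes could likewise be batched and performed once at the end of the function, which is what the question about slow SPR access is driving at.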