Re: [PATCH net-next v2 01/10] net: dsa: lan9303: Fixed MDIO interface

2017-07-28 Thread Egil Hjelmeland

On 26 July 2017 18:55, Andrew Lunn wrote:

On Tue, Jul 25, 2017 at 06:15:44PM +0200, Egil Hjelmeland wrote:
It is better to use mdiobus_read/write or if you are nesting mdio
busses, mdiobus_read_nested/mdiobus_write_nested. Please test this
code with lockdep enabled.



I have CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_MUTEXES. Should I enable
more?

Egil
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] cgroup: add cgroup.stat interface with basic hierarchy stats

2017-07-28 Thread Roman Gushchin
On Thu, Jul 27, 2017 at 12:22:43PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Thu, Jul 27, 2017 at 05:14:20PM +0100, Roman Gushchin wrote:
> > Add a cgroup.stat interface to the base cgroup control files
> > with the following metrics:
> > 
> > nr_descendants  total number of descendant cgroups
> > nr_dying_descendantstotal number of dying descendant cgroups
> > max_descendant_depthmaximum descent depth below the current cgroup
> 
> Yeah, this'd be great to have.  Some comments below.
> 
> > +  cgroup.stat
> > +   A read-only flat-keyed file with the following entries:
> > +
> > + nr_descendants
> > +   Total number of descendant cgroups.
> > +
> > + nr_dying_descendants
> > +   Total number of dying descendant cgroups.
> 
> Can you please go into more detail on what's going on with dying
> descendants here?

Sure.
Don't we plan to describe the cgroup/css lifecycle in detail
in a separate section?

> > +static int cgroup_stats_show(struct seq_file *seq, void *v)
> > +{
> > +   struct cgroup_subsys_state *css;
> > +   unsigned long total = 0;
> > +   unsigned long offline = 0;
> > +   int max_level = 0;
> > +
> > +   rcu_read_lock();
> > +   css_for_each_descendant_pre(css, seq_css(seq)) {
> > +   if (css == seq_css(seq))
> > +   continue;
> > +   ++total;
> 
> Let's do post increment for consistency.

Ok.

> 
> > +   if (!(css->flags & CSS_ONLINE))
> > +   ++offline;
> > +   if (css->cgroup->level > max_level)
> > +   max_level = css->cgroup->level;
> > +   }
> > +   rcu_read_unlock();
> 
> I wonder whether we want to keep these counters in sync instead of
> trying to gather the number on read.  Walking all descendants can get
> expensive pretty quickly and things like nr_descendants will be useful
> for other purposes too.

Ok, given that I'm working on adding the ability to set limits
on the cgroup hierarchy, that seems very reasonable. I'll implement this
and post the updated patch as part of a patchset.

Thanks!


Re: [PATCH net-next v2 01/10] net: dsa: lan9303: Fixed MDIO interface

2017-07-28 Thread Andrew Lunn
On Fri, Jul 28, 2017 at 01:08:25PM +0200, Egil Hjelmeland wrote:
> On 26 July 2017 18:55, Andrew Lunn wrote:
> >On Tue, Jul 25, 2017 at 06:15:44PM +0200, Egil Hjelmeland wrote:
> >It is better to use mdiobus_read/write or if you are nesting mdio
> >busses, mdiobus_read_nested/mdiobus_write_nested. Please test this
> >code with lockdep enabled.
> >
> 
> I have CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_MUTEXES. Should I enable
> more?

Hi Egil

Enable CONFIG_LOCKDEP and CONFIG_PROVE_LOCKING.

Any lockdep splats you get while accessing the mdio bus at this point
are probably false positives, since a different mutex is involved. Using
the _nested() versions should avoid these false positives. But you might
find other places where your locking is not right.

Andrew


Re: [PATCH 2/2] printk: Add boottime and real timestamps

2017-07-28 Thread Prarit Bhargava


On 07/25/2017 09:00 AM, Peter Zijlstra wrote:
> On Tue, Jul 25, 2017 at 08:17:27AM -0400, Prarit Bhargava wrote:
>> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
>> index 5b1662ec546f..6cd38a25f8ea 100644
>> --- a/lib/Kconfig.debug
>> +++ b/lib/Kconfig.debug
>> @@ -1,8 +1,8 @@
>>  menu "printk and dmesg options"
>>  
>>  config PRINTK_TIME
>> -int "Show timing information on printks (0-1)"
>> -range 0 1
>> +int "Show timing information on printks (0-3)"
>> +range 0 3
>>  default "0"
>>  depends on PRINTK
>>  help
>> @@ -13,7 +13,8 @@ config PRINTK_TIME
>>The timestamp is always recorded internally, and exported
>>to /dev/kmsg. This flag just specifies if the timestamp should
>>be included, not that the timestamp is recorded. 0 disables the
>> -  timestamp and 1 uses the local clock.
>> +  timestamp and 1 uses the local clock, 2 uses the monotonic clock, and
>> +  3 uses real clock.
>>  
>>The behavior is also controlled by the kernel command line
>>parameter printk.time=1. See 
>> Documentation/admin-guide/kernel-parameters.rst
> 
> 
> choice
>   prompt "printk default clock"
>   default PRINTK_TIME_DISABLE
>   help
>goes here
> 
>   config PRINTK_TIME_DISABLE
>   bool "Disabled"
>   help
>goes here
> 
>   config PRINTK_TIME_LOCAL
>   bool "local clock"
>   help
>goes here
> 
>   config PRINTK_TIME_MONO
>   bool "CLOCK_MONOTONIC"
>   help
>goes here
> 
>   config PRINTK_TIME_REAL
>   bool "CLOCK_REALTIME"
>   help
>goes here
> 
> endchoice
> 
> config PRINTK_TIME
>   int
>   default 0 if PRINTK_TIME_DISABLE
>   default 1 if PRINTK_TIME_LOCAL
>   default 2 if PRINTK_TIME_MONO
>   default 3 if PRINTK_TIME_REAL
> 
> 

Thanks for the above change.  I can see that makes the code simpler.

> Although I must strongly discourage using REALTIME, DST will make
> untangling your logs an absolute nightmare. I would simply not provide
> it.

I understand your concern; however, I've been in situations where REALTIME
stamping has pointed me in the direction of a bug. Even with the
complicated logs, I think it is worthwhile.

P.


Re: [PATCH] cgroup: add cgroup.stat interface with basic hierarchy stats

2017-07-28 Thread Tejun Heo
Hello,

On Fri, Jul 28, 2017 at 02:01:55PM +0100, Roman Gushchin wrote:
> > > +   nr_dying_descendants
> > > + Total number of dying descendant cgroups.
> > 
> > Can you please go into more detail on what's going on with dying
> > descendants here?
>
> Sure.
> Don't we plan to describe the cgroup/css lifecycle in detail
> in a separate section?

We should, but it'd still be nice to have a short description here too.

Thanks.

-- 
tejun


Re: [PATCH 2/2] printk: Add boottime and real timestamps

2017-07-28 Thread Thomas Gleixner
On Fri, 28 Jul 2017, Prarit Bhargava wrote:
> On 07/25/2017 09:00 AM, Peter Zijlstra wrote:
> Thanks for the above change.  I can see that makes the code simpler.
> 
> > Although I must strongly discourage using REALTIME, DST will make
> > untangling your logs an absolute nightmare. I would simply not provide
> > it.
> 
> I understand your concern; however, I've been in situations where REALTIME
> stamping has pointed me in the direction of a bug. Even with the
> complicated logs, I think it is worthwhile.

As Mark pointed out, ktime_get_real() and the fast variant return UTC. The
timezone mess plus the DST nonsense are handled in user space.

Thanks,

tglx



[PATCH v3 0/5] fs/dcache: Limit # of negative dentries

2017-07-28 Thread Waiman Long
 v2->v3:
  - Add a faster pruning rate when the free pool is close to depletion.
  - As suggested by James Bottomley, add an artificial delay waiting
loop before killing a negative dentry and properly clear the
DCACHE_KILL_NEGATIVE flag if killing doesn't happen.
  - Add a new patch to track the number of negative dentries that are
    forcibly killed.

 v1->v2:
  - Move the new nr_negative field to the end of dentry_stat_t structure
as suggested by Matthew Wilcox.
  - With the help of Miklos Szeredi, fix incorrect locking order in
dentry_kill() by using lock_parent() instead of locking the parent's
d_lock directly.
  - Correctly account for positive to negative dentry transitions.
  - Automatic pruning of negative dentries now ignores the reference
    bit in negative dentries; regular shrinking still honors it.

A rogue application can potentially create a large number of negative
dentries in the system consuming most of the memory available. This
can impact performance of other applications running on the system.

This patchset introduces changes to the dcache subsystem to limit
the number of negative dentries allowed to be created thus limiting
the amount of memory that can be consumed by negative dentries.

Patch 1 tracks the number of negative dentries used and disallows
the creation of more when the limit is reached.

Patch 2 enables /proc/sys/fs/dentry-state to report the number of
negative dentries in the system.

Patch 3 enables automatic pruning of negative dentries when the count
is close to the limit so that we won't end up killing recently used
negative dentries.

Patch 4 prevents racing between negative dentry pruning and umount
operation.

Patch 5 shows the number of forced negative dentry killings in
/proc/sys/fs/dentry-state. End users can then tune the neg_dentry_pc=
kernel boot parameter if they want to reduce forced negative dentry
killings.

Waiman Long (5):
  fs/dcache: Limit numbers of negative dentries
  fs/dcache: Report negative dentry number in dentry-state
  fs/dcache: Enable automatic pruning of negative dentries
  fs/dcache: Protect negative dentry pruning from racing with umount
  fs/dcache: Track count of negative dentries forcibly killed

 Documentation/admin-guide/kernel-parameters.txt |   7 +
 fs/dcache.c | 451 ++--
 include/linux/dcache.h  |   8 +-
 include/linux/list_lru.h|   1 +
 mm/list_lru.c   |   4 +-
 5 files changed, 435 insertions(+), 36 deletions(-)

-- 
1.8.3.1



[PATCH v3 3/5] fs/dcache: Enable automatic pruning of negative dentries

2017-07-28 Thread Waiman Long
Having a limit on the number of negative dentries has an undesirable
side effect: no new negative dentries can be created once the limit
is reached. This may have a performance impact on some workloads.

To prevent this from happening, we need a way to prune the negative
dentries so that new ones can be created before it is too late. This
is done by using the workqueue API to do the pruning gradually when a
threshold is reached to minimize performance impact on other running
tasks.

The current threshold is 1/4 of the initial value of the free pool
count. Once the threshold is reached, the automatic pruning process
kicks in to replenish the free pool. Each pruning run scans at most
256 LRU dentries and 64 dentries per node to minimize LRU lock hold
time. The pruning rate is 50 Hz if the free pool count is less than
1/8 of the original and 10 Hz otherwise.

A short artificial delay loop is added to wait for changes in the
negative dentry count before killing the negative dentry. Sleeping
here could be problematic, as the callers of dput() may not be in a
context where sleeping is allowed.

Allowing tasks that need negative dentries to do the pruning
synchronously themselves could cause lock and cacheline contention.
The end result may not be better than simply killing recently created
negative dentries.

Signed-off-by: Waiman Long 
---
 fs/dcache.c  | 156 +--
 include/linux/list_lru.h |   1 +
 mm/list_lru.c|   4 +-
 3 files changed, 155 insertions(+), 6 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index fb7e041..3482972 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -134,13 +134,21 @@ struct dentry_stat_t dentry_stat = {
  * Macros and variables to manage and count negative dentries.
  */
 #define NEG_DENTRY_BATCH   (1 << 8)
+#define NEG_PRUNING_SIZE   (1 << 6)
+#define NEG_PRUNING_SLOW_RATE  (HZ/10)
+#define NEG_PRUNING_FAST_RATE  (HZ/50)
 static long neg_dentry_percpu_limit __read_mostly;
 static long neg_dentry_nfree_init __read_mostly; /* Free pool initial value */
 static struct {
raw_spinlock_t nfree_lock;
long nfree; /* Negative dentry free pool */
+   struct super_block *prune_sb;   /* Super_block for pruning */
+   int neg_count, prune_count; /* Pruning counts */
 } ndblk cacheline_aligned_in_smp;
 
+static void prune_negative_dentry(struct work_struct *work);
+static DECLARE_DELAYED_WORK(prune_neg_dentry_work, prune_negative_dentry);
+
 static DEFINE_PER_CPU(long, nr_dentry);
 static DEFINE_PER_CPU(long, nr_dentry_unused);
 static DEFINE_PER_CPU(long, nr_dentry_neg);
@@ -329,6 +337,15 @@ static void __neg_dentry_inc(struct dentry *dentry)
 */
if (!cnt)
dentry->d_flags |= DCACHE_KILL_NEGATIVE;
+
+   /*
+* Initiate negative dentry pruning if free pool has less than
+* 1/4 of its initial value.
+*/
+   if (READ_ONCE(ndblk.nfree) < neg_dentry_nfree_init/4) {
+   WRITE_ONCE(ndblk.prune_sb, dentry->d_sb);
+   schedule_delayed_work(&prune_neg_dentry_work, 1);
+   }
 }
 
 static inline void neg_dentry_inc(struct dentry *dentry)
@@ -770,10 +787,8 @@ static struct dentry *dentry_kill(struct dentry *dentry)
 * disappear under the hood even if the dentry
 * lock is temporarily released.
 */
-   unsigned int dflags;
+   unsigned int dflags = dentry->d_flags;
 
-   dentry->d_flags &= ~DCACHE_KILL_NEGATIVE;
-   dflags = dentry->d_flags;
parent = lock_parent(dentry);
/*
 * Abort the killing if the reference count or
@@ -964,8 +979,35 @@ void dput(struct dentry *dentry)
 
dentry_lru_add(dentry);
 
-   if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE))
-   goto kill_it;
+   if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
+   /*
+* Kill the dentry if it is really negative and the per-cpu
+* negative dentry count has still exceeded the limit even
+* after a short artificial delay.
+*/
+   if (d_is_negative(dentry) &&
+  (this_cpu_read(nr_dentry_neg) > neg_dentry_percpu_limit)) {
+   int loop = 256;
+
+   /*
+* Waiting to transfer free negative dentries from the
+* free pool to the percpu count.
+*/
+   while (--loop) {
+   if (READ_ONCE(ndblk.nfree)) {
+   long cnt = __neg_dentry_nfree_dec();
+
+   this_cpu_sub(nr_dentry_neg, cnt);
+   

[PATCH v3 5/5] fs/dcache: Track count of negative dentries forcibly killed

2017-07-28 Thread Waiman Long
There is a performance concern about killing recently created negative
dentries. This should rarely happen under normal working conditions. To
understand how often this negative dentry killing is happening, the
/proc/sys/fs/dentry-state file is extended to track this number. This
allows us to see if additional measures will be needed to reduce the
number of negative dentry killings.

One possible measure is to increase the percentage of system
memory allowed for negative dentries by adding or adjusting the
"neg_dentry_pc=" parameter in the kernel boot command line.

Signed-off-by: Waiman Long 
---
 fs/dcache.c| 4 
 include/linux/dcache.h | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 360185e..3796c3f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -145,6 +145,7 @@ struct dentry_stat_t dentry_stat = {
long nfree; /* Negative dentry free pool */
struct super_block *prune_sb;   /* Super_block for pruning */
int neg_count, prune_count; /* Pruning counts */
+   atomic_long_t nr_neg_killed;/* # of negative entries killed */
 } ndblk cacheline_aligned_in_smp;
 
 static void clear_prune_sb_for_umount(struct super_block *sb);
@@ -204,6 +205,7 @@ int proc_nr_dentry(struct ctl_table *table, int write, void 
__user *buffer,
dentry_stat.nr_dentry = get_nr_dentry();
dentry_stat.nr_unused = get_nr_dentry_unused();
dentry_stat.nr_negative = get_nr_dentry_neg();
+   dentry_stat.nr_killed = atomic_long_read(&ndblk.nr_neg_killed);
return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 #endif
@@ -802,6 +804,7 @@ static struct dentry *dentry_kill(struct dentry *dentry)
spin_unlock(&parent->d_lock);
goto failed;
}
+   atomic_long_inc(&ndblk.nr_neg_killed);
 
} else if (unlikely(!spin_trylock(&parent->d_lock))) {
if (inode)
@@ -3932,6 +3935,7 @@ static void __init neg_dentry_init(void)
 
raw_spin_lock_init(&ndblk.nfree_lock);
spin_lock_init(&ndblk.prune_lock);
+   atomic_long_set(&ndblk.nr_neg_killed, 0);
 
/* 20% in global pool & 80% in percpu free */
ndblk.nfree = neg_dentry_nfree_init
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index e42c8fc..227ed83 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -66,7 +66,7 @@ struct dentry_stat_t {
long age_limit; /* age in seconds */
long want_pages;/* pages requested by system */
long nr_negative;   /* # of negative dentries */
-   long dummy;
+   long nr_killed; /* # of negative dentries killed */
 };
 extern struct dentry_stat_t dentry_stat;
 
-- 
1.8.3.1



[PATCH v3 4/5] fs/dcache: Protect negative dentry pruning from racing with umount

2017-07-28 Thread Waiman Long
The negative dentry pruning is done on a specific super_block set
in the ndblk.prune_sb variable. If the super_block is also being
unmounted concurrently, the content of the super_block may no longer
be valid.

To protect against such a race condition, a new lock is added to
the ndblk structure to synchronize the negative dentry pruning and
umount operations. This is a regular spinlock as the pruning operation
can be quite time consuming.

Signed-off-by: Waiman Long 
---
 fs/dcache.c | 42 +++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3482972..360185e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -141,11 +141,13 @@ struct dentry_stat_t dentry_stat = {
 static long neg_dentry_nfree_init __read_mostly; /* Free pool initial value */
 static struct {
raw_spinlock_t nfree_lock;
+   spinlock_t prune_lock;  /* Lock for protecting pruning */
long nfree; /* Negative dentry free pool */
struct super_block *prune_sb;   /* Super_block for pruning */
int neg_count, prune_count; /* Pruning counts */
 } ndblk cacheline_aligned_in_smp;
 
+static void clear_prune_sb_for_umount(struct super_block *sb);
 static void prune_negative_dentry(struct work_struct *work);
 static DECLARE_DELAYED_WORK(prune_neg_dentry_work, prune_negative_dentry);
 
@@ -1355,6 +1357,7 @@ void shrink_dcache_sb(struct super_block *sb)
 {
long freed;
 
+   clear_prune_sb_for_umount(sb);
do {
LIST_HEAD(dispose);
 
@@ -1385,7 +1388,8 @@ static enum lru_status dentry_negative_lru_isolate(struct 
list_head *item,
 * list.
 */
if ((ndblk.neg_count >= NEG_PRUNING_SIZE) ||
-   (ndblk.prune_count >= NEG_PRUNING_SIZE)) {
+   (ndblk.prune_count >= NEG_PRUNING_SIZE) ||
+   !READ_ONCE(ndblk.prune_sb)) {
ndblk.prune_count = 0;
return LRU_STOP;
}
@@ -1441,15 +1445,24 @@ static void prune_negative_dentry(struct work_struct 
*work)
 {
int freed;
long nfree;
-   struct super_block *sb = READ_ONCE(ndblk.prune_sb);
+   struct super_block *sb;
LIST_HEAD(dispose);
 
-   if (!sb)
+   /*
+* The prune_lock is used to protect negative dentry pruning from
+* racing with concurrent umount operation.
+*/
+   spin_lock(&ndblk.prune_lock);
+   sb = READ_ONCE(ndblk.prune_sb);
+   if (!sb) {
+   spin_unlock(&ndblk.prune_lock);
return;
+   }
 
ndblk.neg_count = ndblk.prune_count = 0;
freed = list_lru_walk(&sb->s_dentry_lru, dentry_negative_lru_isolate,
  &dispose, NEG_DENTRY_BATCH);
+   spin_unlock(&ndblk.prune_lock);
 
if (freed)
shrink_dentry_list(&dispose);
@@ -1472,6 +1485,27 @@ static void prune_negative_dentry(struct work_struct 
*work)
WRITE_ONCE(ndblk.prune_sb, NULL);
 }
 
+/*
+ * This is called before an umount to clear ndblk.prune_sb if it
+ * matches the given super_block.
+ */
+static void clear_prune_sb_for_umount(struct super_block *sb)
+{
+   if (likely(READ_ONCE(ndblk.prune_sb) != sb))
+   return;
+   WRITE_ONCE(ndblk.prune_sb, NULL);
+   /*
+* Need to wait until an ongoing pruning operation, if present,
+* is completed.
+*
+* Clearing ndblk.prune_sb will hasten the completion of pruning.
+* In the unlikely event that ndblk.prune_sb is set to another
+* super_block, the waiting will last the complete pruning operation
+* which shouldn't be that long either.
+*/
+   spin_unlock_wait(&ndblk.prune_lock);
+}
+
 /**
  * enum d_walk_ret - action to talke during tree walk
  * @D_WALK_CONTINUE:   contrinue walk
@@ -1794,6 +1828,7 @@ void shrink_dcache_for_umount(struct super_block *sb)
 
WARN(down_read_trylock(&sb->s_umount), "s_umount should've been 
locked");
 
+   clear_prune_sb_for_umount(sb);
dentry = sb->s_root;
sb->s_root = NULL;
do_one_tree(dentry);
@@ -3896,6 +3931,7 @@ static void __init neg_dentry_init(void)
unsigned long cnt;
 
raw_spin_lock_init(&ndblk.nfree_lock);
+   spin_lock_init(&ndblk.prune_lock);
 
/* 20% in global pool & 80% in percpu free */
ndblk.nfree = neg_dentry_nfree_init
-- 
1.8.3.1



[PATCH v3 2/5] fs/dcache: Report negative dentry number in dentry-state

2017-07-28 Thread Waiman Long
The number of negative dentries currently in the system is now reported
in the /proc/sys/fs/dentry-state file.

Signed-off-by: Waiman Long 
---
 fs/dcache.c| 16 +++-
 include/linux/dcache.h |  7 ---
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ab10b96..fb7e041 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -135,6 +135,7 @@ struct dentry_stat_t dentry_stat = {
  */
 #define NEG_DENTRY_BATCH   (1 << 8)
 static long neg_dentry_percpu_limit __read_mostly;
+static long neg_dentry_nfree_init __read_mostly; /* Free pool initial value */
 static struct {
raw_spinlock_t nfree_lock;
long nfree; /* Negative dentry free pool */
@@ -176,11 +177,23 @@ static long get_nr_dentry_unused(void)
return sum < 0 ? 0 : sum;
 }
 
+static long get_nr_dentry_neg(void)
+{
+   int i;
+   long sum = 0;
+
+   for_each_possible_cpu(i)
+   sum += per_cpu(nr_dentry_neg, i);
+   sum += neg_dentry_nfree_init - ndblk.nfree;
+   return sum < 0 ? 0 : sum;
+}
+
 int proc_nr_dentry(struct ctl_table *table, int write, void __user *buffer,
   size_t *lenp, loff_t *ppos)
 {
dentry_stat.nr_dentry = get_nr_dentry();
dentry_stat.nr_unused = get_nr_dentry_unused();
+   dentry_stat.nr_negative = get_nr_dentry_neg();
return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 #endif
@@ -3739,7 +3752,8 @@ static void __init neg_dentry_init(void)
raw_spin_lock_init(&ndblk.nfree_lock);
 
/* 20% in global pool & 80% in percpu free */
-   ndblk.nfree = totalram_pages * nr_dentry_page * neg_dentry_pc / 500;
+   ndblk.nfree = neg_dentry_nfree_init
+   = totalram_pages * nr_dentry_page * neg_dentry_pc / 500;
cnt = ndblk.nfree * 4 / num_possible_cpus();
if (unlikely(cnt < 2 * NEG_DENTRY_BATCH))
cnt = 2 * NEG_DENTRY_BATCH;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 5ffcc46..e42c8fc 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -63,9 +63,10 @@ struct qstr {
 struct dentry_stat_t {
long nr_dentry;
long nr_unused;
-   long age_limit;  /* age in seconds */
-   long want_pages; /* pages requested by system */
-   long dummy[2];
+   long age_limit; /* age in seconds */
+   long want_pages;/* pages requested by system */
+   long nr_negative;   /* # of negative dentries */
+   long dummy;
 };
 extern struct dentry_stat_t dentry_stat;
 
-- 
1.8.3.1



[PATCH v3 1/5] fs/dcache: Limit numbers of negative dentries

2017-07-28 Thread Waiman Long
The number of positive dentries is limited by the number of files
in the filesystems. The number of negative dentries, however,
has no limit other than the total amount of memory available in
the system. So a rogue application that generates a lot of negative
dentries can potentially exhaust most of the available memory,
impacting the performance of other running applications.

To prevent this from happening, the dcache code is now updated to limit
the number of negative dentries that can be kept in the LRU lists,
expressed as a percentage of total system memory. The default is 5%
and can be changed by specifying the "neg_dentry_pc=" kernel command
line option.

If the negative dentry limit is exceeded, newly created negative
dentries will be killed right after use to avoid adding unpredictable
latency to the directory lookup operation.

Signed-off-by: Waiman Long 
---
 Documentation/admin-guide/kernel-parameters.txt |   7 +
 fs/dcache.c | 251 +---
 include/linux/dcache.h  |   1 +
 3 files changed, 227 insertions(+), 32 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 372cc66..7f5497b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2383,6 +2383,13 @@
 
n2= [NET] SDL Inc. RISCom/N2 synchronous serial card
 
+   neg_dentry_pc=  [KNL]
+   Range: 1-50
+   Default: 5
+   This parameter specifies the amount of negative
+   dentries allowed in the system as a percentage of
+   total system memory.
+
netdev= [NET] Network devices parameters
Format: 
Note that mem_start is often overloaded to mean
diff --git a/fs/dcache.c b/fs/dcache.c
index f901413..ab10b96 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -130,8 +130,19 @@ struct dentry_stat_t dentry_stat = {
.age_limit = 45,
 };
 
+/*
+ * Macros and variables to manage and count negative dentries.
+ */
+#define NEG_DENTRY_BATCH   (1 << 8)
+static long neg_dentry_percpu_limit __read_mostly;
+static struct {
+   raw_spinlock_t nfree_lock;
+   long nfree; /* Negative dentry free pool */
+} ndblk cacheline_aligned_in_smp;
+
 static DEFINE_PER_CPU(long, nr_dentry);
 static DEFINE_PER_CPU(long, nr_dentry_unused);
+static DEFINE_PER_CPU(long, nr_dentry_neg);
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 
@@ -227,6 +238,92 @@ static inline int dentry_string_cmp(const unsigned char 
*cs, const unsigned char
 
 #endif
 
+/*
+ * There is a system-wide limit to the amount of negative dentries allowed
+ * in the super blocks' LRU lists. The default limit is 5% of the total
+ * system memory. This limit can be changed by using the kernel command line
+ * option "neg_dentry_pc=" to specify the percentage of the total memory
+ * that can be used for negative dentries. That percentage must be in the
+ * 1-50% range.
+ *
+ * To avoid performance problem with a global counter on an SMP system,
+ * the tracking is done mostly on a per-cpu basis. The total limit is
+ * distributed in a 80/20 ratio to per-cpu counters and a global free pool.
+ *
+ * If a per-cpu counter runs out of negative dentries, it can borrow extra
+ * ones from the global free pool. If it has more than its percpu limit,
+ * the extra ones will be returned back to the global pool.
+ */
+
+/*
+ * Decrement negative dentry count if applicable.
+ */
+static void __neg_dentry_dec(struct dentry *dentry)
+{
+   if (unlikely(this_cpu_dec_return(nr_dentry_neg) < 0)) {
+   long *pcnt = get_cpu_ptr(&nr_dentry_neg);
+
+   if ((*pcnt < 0) && raw_spin_trylock(&ndblk.nfree_lock)) {
+   ACCESS_ONCE(ndblk.nfree) += NEG_DENTRY_BATCH;
+   *pcnt += NEG_DENTRY_BATCH;
+   raw_spin_unlock(&ndblk.nfree_lock);
+   }
+   put_cpu_ptr(&nr_dentry_neg);
+   }
+}
+
+static inline void neg_dentry_dec(struct dentry *dentry)
+{
+   if (unlikely(d_is_negative(dentry)))
+   __neg_dentry_dec(dentry);
+}
+
+/*
+ * Decrement the negative dentry free pool by NEG_DENTRY_BATCH & return
+ * the actual number decremented.
+ */
+static long __neg_dentry_nfree_dec(void)
+{
+   long cnt = NEG_DENTRY_BATCH;
+
+   raw_spin_lock(&ndblk.nfree_lock);
+   if (ndblk.nfree < cnt)
+   cnt = ndblk.nfree;
+   ACCESS_ONCE(ndblk.nfree) -= cnt;
+   raw_spin_unlock(&ndblk.nfree_lock);
+   return cnt;
+}
+
+/*
+ * Increment negative dentry count if applicable.
+ */
+static void __neg_dentry_inc(struct dentry *dentry)
+{
+   long cnt = 0, *pcnt;
+
+   if (this_cpu_inc_return(nr_dentry_neg) <= neg_dentry_percpu_limit)
+  

Re: [RFC PATCH v2 00/38] Nested Virtualization on KVM/ARM

2017-07-28 Thread Bandan Das
Jintack Lim  writes:
...
>>
>> I'll share my experiment setup shortly.
>
> I summarized my experiment setup here.
>
> https://github.com/columbia/nesting-pub/wiki/Nested-virtualization-on-ARM-setup

Thanks Jintack! I was able to test L2 boot-up with these instructions.

Next, I will try to run some simple tests. Any suggestions on reducing
the L2 boot-up time in my test setup? I think I will try to make the L2
kernel print fewer messages, and maybe get rid of some of the userspace
services. I also applied the patch to reduce the timer frequency, btw.

Bandan

>>
>> Even though this work has some limitations and TODOs, I'd appreciate early
>> feedback on this RFC. Specifically, I'm interested in:
>>
>> - Overall design to manage vcpu context for the virtual EL2
>> - Verifying correct EL2 register configurations such as HCR_EL2, CPTR_EL2
>>   (Patch 30 and 32)
>> - Patch organization and coding style
>
> I also wonder which is better if the hardware and/or KVM does not
> support nested virtualization but userspace requests it: returning
> an error or silently launching a regular VM?
>
>>
>> This patch series is based on kvm/next d38338e.
>> The whole patch series including memory, VGIC, and timer patches is available
>> here:
>>
>> g...@github.com:columbia/nesting-pub.git rfc-v2
>>
>> Limitations:
>> - There are some cases that the target exception level of a VM is ambiguous 
>> when
>>   emulating eret instruction. I'm discussing this issue with Christoffer and
>>   Marc. Meanwhile, I added a temporary patch (not included in this
>>   series. f1beaba in the repo) and used 4.10.0 kernel when testing the guest
>>   hypervisor with VHE.
>> - Recursive nested virtualization is not tested yet.
>> - Other hypervisors (such as Xen) on KVM are not tested.
>>
>> TODO:
>> - Submit memory, VGIC, and timer patches
>> - Evaluate regular VM performance to see if there's a negative impact.
>> - Test other hypervisors such as Xen on KVM
>> - Test recursive nested virtualization
>>
>> v1-->v2:
>> - Added support for the virtual EL2 with VHE
>> - Rewrote commit messages and comments from the perspective of supporting
>>   execution environments to VMs, rather than from the perspective of the 
>> guest
>>   hypervisor running in them.
>> - Fixed a few bugs to make it run on the FastModel.
>> - Tested on ARMv8.3 with four configurations. (host/guest. with/without VHE.)
>> - Rebased to kvm/next
>>
>> [1] 
>> https://www.community.arm.com/processors/b/blog/posts/armv8-a-architecture-2016-additions
>>
>> Christoffer Dall (7):
>>   KVM: arm64: Add KVM nesting feature
>>   KVM: arm64: Allow userspace to set PSR_MODE_EL2x
>>   KVM: arm64: Add vcpu_mode_el2 primitive to support nesting
>>   KVM: arm/arm64: Add a framework to prepare virtual EL2 execution
>>   arm64: Add missing TCR hw defines
>>   KVM: arm64: Create shadow EL1 registers
>>   KVM: arm64: Trap EL1 VM register accesses in virtual EL2
>>
>> Jintack Lim (31):
>>   arm64: Add ARM64_HAS_NESTED_VIRT feature
>>   KVM: arm/arm64: Enable nested virtualization via command-line
>>   KVM: arm/arm64: Check if nested virtualization is in use
>>   KVM: arm64: Add EL2 system registers to vcpu context
>>   KVM: arm64: Add EL2 special registers to vcpu context
>>   KVM: arm64: Add the shadow context for virtual EL2 execution
>>   KVM: arm64: Set vcpu context depending on the guest exception level
>>   KVM: arm64: Synchronize EL1 system registers on virtual EL2 entry and
>> exit
>>   KVM: arm64: Move exception macros and enums to a common file
>>   KVM: arm64: Support to inject exceptions to the virtual EL2
>>   KVM: arm64: Trap SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2
>>   KVM: arm64: Trap CPACR_EL1 access in virtual EL2
>>   KVM: arm64: Handle eret instruction traps
>>   KVM: arm64: Set a handler for the system instruction traps
>>   KVM: arm64: Handle PSCI call via smc from the guest
>>   KVM: arm64: Inject HVC exceptions to the virtual EL2
>>   KVM: arm64: Respect virtual HCR_EL2.TWX setting
>>   KVM: arm64: Respect virtual CPTR_EL2.TFP setting
>>   KVM: arm64: Add macros to support the virtual EL2 with VHE
>>   KVM: arm64: Add EL2 registers defined in ARMv8.1 to vcpu context
>>   KVM: arm64: Emulate EL12 register accesses from the virtual EL2
>>   KVM: arm64: Support a VM with VHE considering EL0 of the VHE host
>>   KVM: arm64: Allow the virtual EL2 to access EL2 states without trap
>>   KVM: arm64: Manage the shadow states when virtual E2H bit enabled
>>   KVM: arm64: Trap and emulate CPTR_EL2 accesses via CPACR_EL1 from the
>> virtual EL2 with VHE
>>   KVM: arm64: Emulate appropriate VM control system registers
>>   KVM: arm64: Respect the virtual HCR_EL2.NV bit setting
>>   KVM: arm64: Respect the virtual HCR_EL2.NV bit setting for EL12
>> register traps
>>   KVM: arm64: Respect virtual HCR_EL2.TVM and TRVM settings
>>   KVM: arm64: Respect the virtual HCR_EL2.NV1 bit setting
>>   KVM: arm64: Respect the virtual CPTR_EL2.TCPAC setting

Re: [RFC v6 27/62] powerpc: helper to validate key-access permissions of a pte

2017-07-28 Thread Thiago Jung Bauermann

Ram Pai  writes:
> --- a/arch/powerpc/mm/pkeys.c
> +++ b/arch/powerpc/mm/pkeys.c
> @@ -201,3 +201,36 @@ int __arch_override_mprotect_pkey(struct vm_area_struct 
> *vma, int prot,
>*/
>   return vma_pkey(vma);
>  }
> +
> +static bool pkey_access_permitted(int pkey, bool write, bool execute)
> +{
> + int pkey_shift;
> + u64 amr;
> +
> + if (!pkey)
> + return true;
> +
> + pkey_shift = pkeyshift(pkey);
> + if (!(read_uamor() & (0x3UL << pkey_shift)))
> + return true;
> +
> + if (execute && !(read_iamr() & (IAMR_EX_BIT << pkey_shift)))
> + return true;
> +
> + if (!write) {
> + amr = read_amr();
> + if (!(amr & (AMR_RD_BIT << pkey_shift)))
> + return true;
> + }
> +
> + amr = read_amr(); /* delay reading amr until absolutely needed */

Actually, this causes amr to be read twice when control enters the
"if (!write)" block above but doesn't take the early return from the if
block nested inside it.

read_amr should be called only once, right before "if (!write)".
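A minimal sketch of the restructuring suggested above, with AMR read exactly once, right where it is first needed. The register accessors, constants, and shift helper are illustrative stand-ins (not the kernel's mfspr-based ones), and the final write check is an assumption since the quoted function is cut off before its return:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the kernel's SPR accessors and bit layout;
 * values and helpers here are for demonstration only. */
#define AMR_RD_BIT  0x1UL
#define AMR_WR_BIT  0x2UL
#define IAMR_EX_BIT 0x1UL

static uint64_t uamor_val, amr_val, iamr_val;
static uint64_t read_uamor(void) { return uamor_val; }
static uint64_t read_amr(void)   { return amr_val; }
static uint64_t read_iamr(void)  { return iamr_val; }
static int pkeyshift(int pkey)   { return pkey * 2; }

static bool pkey_access_permitted(int pkey, bool write, bool execute)
{
	int pkey_shift;
	uint64_t amr;

	if (!pkey)
		return true;

	/* Key not enabled in UAMOR: access is permitted. */
	pkey_shift = pkeyshift(pkey);
	if (!(read_uamor() & (0x3UL << pkey_shift)))
		return true;

	if (execute && !(read_iamr() & (IAMR_EX_BIT << pkey_shift)))
		return true;

	/* Single read of AMR, placed right before its first use. */
	amr = read_amr();
	if (!write && !(amr & (AMR_RD_BIT << pkey_shift)))
		return true;

	return !(amr & (AMR_WR_BIT << pkey_shift));
}
```

Keeping the same control flow as the quoted patch, this only moves the read, so the observable behavior is unchanged while the SPR is touched at most once per call.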

-- 
Thiago Jung Bauermann
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 00/38] Nested Virtualization on KVM/ARM

2017-07-28 Thread Jintack Lim
On Fri, Jul 28, 2017 at 4:13 PM, Bandan Das  wrote:
> Jintack Lim  writes:
> ...
>>>
>>> I'll share my experiment setup shortly.
>>
>> I summarized my experiment setup here.
>>
>> https://github.com/columbia/nesting-pub/wiki/Nested-virtualization-on-ARM-setup
>
> Thanks Jintack! I was able to test L2 boot up with these instructions.

Thanks for the confirmation!

>
> Next, I will try to run some simple tests. Any suggestions on reducing
> the L2 bootup time in my test setup? I think I will try to make the L2
> kernel print fewer messages, and maybe get rid of some of the userspace
> services.
> I also applied the patch to reduce the timer frequency btw.

I think you can try those kernel parameters: "loglevel=1", with which
the kernel prints (almost) nothing during the boot process but the
init process still prints something, or "console=none", with which you
see nothing but the login message. I didn't use them because I wanted
to see the L2 boot messages as soon as possible :)
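For reference, those parameters go on the L2 guest's kernel command line; with a QEMU-started L2 that would look roughly like the fragment below. The image name, root device, and the elided rest of the invocation are placeholders, not the actual setup from the wiki:

```sh
# Quiet the L2 boot: keep only the highest-priority kernel messages...
qemu-system-aarch64 ... -kernel Image -append "root=/dev/vda rw loglevel=1"

# ...or suppress console output entirely until the login prompt.
qemu-system-aarch64 ... -kernel Image -append "root=/dev/vda rw console=none"
```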

Thanks,
Jintack

>
> Bandan
>
>>>
>>> Even though this work has some limitations and TODOs, I'd appreciate early
>>> feedback on this RFC. Specifically, I'm interested in:
>>>
>>> - Overall design to manage vcpu context for the virtual EL2
>>> - Verifying correct EL2 register configurations such as HCR_EL2, CPTR_EL2
>>>   (Patch 30 and 32)
>>> - Patch organization and coding style
>>
>> I also wonder: if the hardware and/or KVM does not support nested
>> virtualization but userspace enables the nested virtualization option,
>> which is better, returning an error or silently launching a regular VM?
>>
>>>
>>> This patch series is based on kvm/next d38338e.
>>> The whole patch series including memory, VGIC, and timer patches is 
>>> available
>>> here:
>>>
>>> g...@github.com:columbia/nesting-pub.git rfc-v2
>>>
>>> Limitations:
>>> - There are some cases where the target exception level of a VM is ambiguous
>>>   when emulating the eret instruction. I'm discussing this issue with
>>>   Christoffer and Marc. Meanwhile, I added a temporary patch (not included
>>>   in this series; f1beaba in the repo) and used a 4.10.0 kernel when testing
>>>   the guest hypervisor with VHE.
>>> - Recursive nested virtualization is not tested yet.
>>> - Other hypervisors (such as Xen) on KVM are not tested.
>>>
>>> TODO:
>>> - Submit memory, VGIC, and timer patches
>>> - Evaluate regular VM performance to see if there's a negative impact.
>>> - Test other hypervisors such as Xen on KVM
>>> - Test recursive nested virtualization
>>>
>>> v1-->v2:
>>> - Added support for the virtual EL2 with VHE
>>> - Rewrote commit messages and comments from the perspective of providing
>>>   execution environments to VMs, rather than from the perspective of the
>>>   guest hypervisor running in them.
>>> - Fixed a few bugs to make it run on the FastModel.
>>> - Tested on ARMv8.3 with four configurations. (host/guest. with/without 
>>> VHE.)
>>> - Rebased to kvm/next
>>>
>>> [1] 
>>> https://www.community.arm.com/processors/b/blog/posts/armv8-a-architecture-2016-additions
>>>
>>> Christoffer Dall (7):
>>>   KVM: arm64: Add KVM nesting feature
>>>   KVM: arm64: Allow userspace to set PSR_MODE_EL2x
>>>   KVM: arm64: Add vcpu_mode_el2 primitive to support nesting
>>>   KVM: arm/arm64: Add a framework to prepare virtual EL2 execution
>>>   arm64: Add missing TCR hw defines
>>>   KVM: arm64: Create shadow EL1 registers
>>>   KVM: arm64: Trap EL1 VM register accesses in virtual EL2
>>>
>>> Jintack Lim (31):
>>>   arm64: Add ARM64_HAS_NESTED_VIRT feature
>>>   KVM: arm/arm64: Enable nested virtualization via command-line
>>>   KVM: arm/arm64: Check if nested virtualization is in use
>>>   KVM: arm64: Add EL2 system registers to vcpu context
>>>   KVM: arm64: Add EL2 special registers to vcpu context
>>>   KVM: arm64: Add the shadow context for virtual EL2 execution
>>>   KVM: arm64: Set vcpu context depending on the guest exception level
>>>   KVM: arm64: Synchronize EL1 system registers on virtual EL2 entry and
>>> exit
>>>   KVM: arm64: Move exception macros and enums to a common file
>>>   KVM: arm64: Support to inject exceptions to the virtual EL2
>>>   KVM: arm64: Trap SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2
>>>   KVM: arm64: Trap CPACR_EL1 access in virtual EL2
>>>   KVM: arm64: Handle eret instruction traps
>>>   KVM: arm64: Set a handler for the system instruction traps
>>>   KVM: arm64: Handle PSCI call via smc from the guest
>>>   KVM: arm64: Inject HVC exceptions to the virtual EL2
>>>   KVM: arm64: Respect virtual HCR_EL2.TWX setting
>>>   KVM: arm64: Respect virtual CPTR_EL2.TFP setting
>>>   KVM: arm64: Add macros to support the virtual EL2 with VHE
>>>   KVM: arm64: Add EL2 registers defined in ARMv8.1 to vcpu context
>>>   KVM: arm64: Emulate EL12 register accesses from the virtual EL2
>>>   KVM: arm64: Support a VM with VHE considering EL0 of the VHE host
>>>   KVM: arm64: Allow the virtual EL2 to access EL2 states without trap
>



Re: [RFC v6 21/62] powerpc: introduce execute-only pkey

2017-07-28 Thread Thiago Jung Bauermann

Ram Pai  writes:
> --- a/arch/powerpc/mm/pkeys.c
> +++ b/arch/powerpc/mm/pkeys.c
> @@ -97,3 +97,60 @@ int __arch_set_user_pkey_access(struct task_struct *tsk, 
> int pkey,
>   init_iamr(pkey, new_iamr_bits);
>   return 0;
>  }
> +
> +static inline bool pkey_allows_readwrite(int pkey)
> +{
> + int pkey_shift = pkeyshift(pkey);
> +
> + if (!(read_uamor() & (0x3UL << pkey_shift)))
> + return true;
> +
> + return !(read_amr() & ((AMR_RD_BIT|AMR_WR_BIT) << pkey_shift));
> +}
> +
> +int __execute_only_pkey(struct mm_struct *mm)
> +{
> + bool need_to_set_mm_pkey = false;
> + int execute_only_pkey = mm->context.execute_only_pkey;
> + int ret;
> +
> + /* Do we need to assign a pkey for mm's execute-only maps? */
> + if (execute_only_pkey == -1) {
> + /* Go allocate one to use, which might fail */
> + execute_only_pkey = mm_pkey_alloc(mm);
> + if (execute_only_pkey < 0)
> + return -1;
> + need_to_set_mm_pkey = true;
> + }
> +
> + /*
> +  * We do not want to go through the relatively costly
> +  * dance to set AMR if we do not need to.  Check it
> +  * first and assume that if the execute-only pkey is
> +  * readwrite-disabled than we do not have to set it
> +  * ourselves.
> +  */
> + if (!need_to_set_mm_pkey &&
> + !pkey_allows_readwrite(execute_only_pkey))
> + return execute_only_pkey;
> +
> + /*
> +  * Set up AMR so that it denies access for everything
> +  * other than execution.
> +  */
> + ret = __arch_set_user_pkey_access(current, execute_only_pkey,
> + (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
> + /*
> +  * If the AMR-set operation failed somehow, just return
> +  * 0 and effectively disable execute-only support.
> +  */
> + if (ret) {
> + mm_set_pkey_free(mm, execute_only_pkey);
> + return -1;
> + }
> +
> + /* We got one, store it and use it from here on out */
> + if (need_to_set_mm_pkey)
> + mm->context.execute_only_pkey = execute_only_pkey;
> + return execute_only_pkey;
> +}

If you follow the code flow in __execute_only_pkey, the AMR and UAMOR
are read 3 times in total, and AMR is written twice. IAMR is read and
written twice. Since they are SPRs and access to them is slow (or isn't
it?), is it worth it to read them once in __execute_only_pkey and pass
down their values to the callees, and then write them once at the end of
the function?

This function is used both by the mmap syscall and the mprotect syscall
(but not by pkey_mprotect) if the requested protection is execute-only.
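As a sketch of the caching being proposed, the three registers could be read once into a small snapshot struct, modified in place by the callees, and written back at the end. The struct, stubbed accessors, and dirty flags below are illustrative, not the kernel's actual interface:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stubbed SPR accessors standing in for the mfspr/mtspr-based helpers. */
static uint64_t amr_reg, iamr_reg, uamor_reg;
static uint64_t read_amr(void)   { return amr_reg; }
static uint64_t read_iamr(void)  { return iamr_reg; }
static uint64_t read_uamor(void) { return uamor_reg; }
static void write_amr(uint64_t v)  { amr_reg = v; }
static void write_iamr(uint64_t v) { iamr_reg = v; }

/* Snapshot of the protection-key registers, read once per operation. */
struct pkey_regs {
	uint64_t amr;
	uint64_t iamr;
	uint64_t uamor;
	bool amr_dirty;
	bool iamr_dirty;
};

static void pkey_regs_read(struct pkey_regs *r)
{
	r->amr = read_amr();
	r->iamr = read_iamr();
	r->uamor = read_uamor();
	r->amr_dirty = r->iamr_dirty = false;
}

/* Callees modify the cached copy and mark it dirty instead of
 * touching the SPR directly. */
static void pkey_regs_set_amr_bits(struct pkey_regs *r, uint64_t bits)
{
	r->amr |= bits;
	r->amr_dirty = true;
}

/* One write-back at the end of __execute_only_pkey, only for the
 * registers that actually changed. */
static void pkey_regs_flush(const struct pkey_regs *r)
{
	if (r->amr_dirty)
		write_amr(r->amr);
	if (r->iamr_dirty)
		write_iamr(r->iamr);
}
```

With this shape, the mmap/mprotect path pays for at most one read and one write of each SPR regardless of how many helpers inspect the values in between.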

-- 
Thiago Jung Bauermann
IBM Linux Technology Center
