[PATCH RFC 2/3] mutex: restrict mutex spinning to only one task per mutex

2013-04-04 Thread Waiman Long
The current mutex spinning code allows multiple tasks to spin on a
single mutex concurrently. There are two major problems with this
approach:

 1. It is not very energy efficient, as the spinning tasks are not
doing useful work. The spinning tasks may also block other, more
important or useful tasks from running because preemption is disabled
while spinning. Only one of the spinners will get the mutex at any
time; the other spinners will have to wait much longer to get it.

 2. The mutex data structure on x86-64 should be 32 bytes. The spinning
code spins on lock->owner which, in most cases, should be in the same
64-byte cache line as the lock->wait_lock spinlock. As a result,
the mutex spinners contend for the same cache line as other CPUs
trying to get the spinlock, increasing the time spent both on the
spinlock and on the mutex spinning.

These problems are worse on systems with a large number of CPUs. One
way to reduce the effect of these two problems is to allow only one
task to spin on a mutex at any time.

This patch adds a new spinner field to struct mutex to limit the
number of spinners to one task. That will increase the size of
the mutex by 8 bytes in a 64-bit environment (4 bytes in a 32-bit
environment).
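
Conceptually, the gate added by this patch can be sketched as follows.
The helper names are made up for illustration; the real change is
open-coded in __mutex_lock_common(), as the diff below shows.

/*
 * Illustrative sketch only -- not the literal patch.  Only the first
 * task that installs itself in lock->spinner is allowed to spin.
 */
static inline bool mutex_become_spinner(struct mutex *lock)
{
	if (lock->spinner || cmpxchg(&lock->spinner, NULL, current) != NULL)
		return false;		/* someone is already spinning */
	return true;
}

static inline void mutex_done_spinning(struct mutex *lock)
{
	/* Clear the slot whether or not the mutex was acquired. */
	lock->spinner = NULL;
}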

The AIM7 benchmarks were run on 3.7.10-derived kernels to show the
performance changes with this patch on an 8-socket 80-core system
with hyperthreading off.  The table below shows the mean % change
in performance over a range of users for some AIM7 workloads with
just the less-atomic-operations patch (patch 1) vs. the
less-atomic-operations patch plus this one (patches 1+2).

+--------------+-----------------+-----------------+-----------------+
|   Workload   | mean % change   | mean % change   | mean % change   |
|              | 10-100 users    | 200-1000 users  | 1100-2000 users |
+--------------+-----------------+-----------------+-----------------+
| alltests     |      -0.2%      |      -3.8%      |      -4.2%      |
| five_sec     |      -0.6%      |      -2.0%      |      -2.4%      |
| fserver      |      +2.2%      |     +16.2%      |      +2.2%      |
| high_systime |      -0.3%      |      -4.3%      |      -3.0%      |
| new_fserver  |      +3.9%      |     +16.0%      |      +9.5%      |
| shared       |      -1.7%      |      -5.0%      |      -4.0%      |
| short        |      -7.7%      |      +0.2%      |      +1.3%      |
+--------------+-----------------+-----------------+-----------------+

It can be seen that this patch improves performance for the fserver and
new_fserver workloads at the cost of a slight drop in performance for
the other workloads.

Signed-off-by: Waiman Long 
Reviewed-by: Davidlohr Bueso 
---
 include/linux/mutex.h |3 +++
 kernel/mutex.c|   12 
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 9121595..dd8fdb8 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -50,6 +50,9 @@ struct mutex {
atomic_tcount;
spinlock_t  wait_lock;
struct list_headwait_list;
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+   struct task_struct  *spinner;
+#endif
 #if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
struct task_struct  *owner;
 #endif
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 5e5ea03..965f59f 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -158,7 +158,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 *
 * We can't do this for DEBUG_MUTEXES because that relies on wait_lock
 * to serialize everything.
+*
+* Only the first task is allowed to spin on a given mutex, and that
+* task will put its task_struct pointer into the spinner field.
 */
+   if (lock->spinner || (cmpxchg(&lock->spinner, NULL, current) != NULL))
+   goto slowpath;
 
for (;;) {
struct task_struct *owner;
@@ -175,6 +180,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
(atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
lock_acquired(&lock->dep_map, ip);
mutex_set_owner(lock);
+   lock->spinner = NULL;
preempt_enable();
return 0;
}
@@ -196,6 +202,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 */
arch_mutex_cpu_relax();
}
+
+   /*
+* Done with spinning
+*/
+   lock->spinner = NULL;
+slowpath:
 #endif
spin_lock_mutex(&lock->wait_lock, flags);
 
-- 
1.7.1


[PATCH RFC 1/3] mutex: Make more scalable by doing less atomic operations

2013-04-04 Thread Waiman Long

+--------------+---------------+----------------+-----------------+
|   Workload   | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests     |     +0.6%     |    +104.2%     |     +185.9%     |
| five_sec     |     +1.9%     |      +0.9%     |       +0.9%     |
| fserver      |     +1.4%     |      -7.7%     |       +5.1%     |
| new_fserver  |     -0.5%     |      +3.2%     |       +3.1%     |
| shared       |    +13.1%     |    +146.1%     |     +181.5%     |
| short        |     +7.4%     |      +5.0%     |       +4.2%     |
+--------------+---------------+----------------+-----------------+
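
Since the description of this patch is truncated in this archive, the
core idea can be summarized by the sketch below, pieced together from
the hunks that follow. The helper names are hypothetical; the real
changes are open-coded in __mutex_lock_common():

/*
 * Sketch only: issue the expensive locked instruction only when a plain
 * read of the counter says it can still succeed.
 */
static inline int mutex_try_fastpath(struct mutex *lock)
{
	/* 1 means unlocked; skip the cmpxchg when it cannot succeed */
	return atomic_read(&lock->count) == 1 &&
	       atomic_cmpxchg(&lock->count, 1, 0) == 1;
}

static inline int mutex_try_acquire_as_waiter(struct mutex *lock)
{
	/*
	 * MUTEX_SHOULD_XCHG_COUNT() lets an architecture (x86 below) skip
	 * the xchg when the counter is already negative, i.e. already
	 * marked as having waiters.
	 */
	return MUTEX_SHOULD_XCHG_COUNT(lock) &&
	       atomic_xchg(&lock->count, -1) == 1;
}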

Signed-off-by: Waiman Long 
Reviewed-by: Davidlohr Bueso 
---
 arch/x86/include/asm/mutex.h |   16 
 kernel/mutex.c   |9 ++---
 kernel/mutex.h   |8 
 3 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mutex.h b/arch/x86/include/asm/mutex.h
index 7d3a482..aa6a3ec 100644
--- a/arch/x86/include/asm/mutex.h
+++ b/arch/x86/include/asm/mutex.h
@@ -3,3 +3,19 @@
 #else
 # include 
 #endif
+
+#ifndef __ASM_MUTEX_H
+#define __ASM_MUTEX_H
+
+#ifdef MUTEX_SHOULD_XCHG_COUNT
+#undef MUTEX_SHOULD_XCHG_COUNT
+#endif
+/*
+ * For the x86 architecture, it allows any negative number (besides -1) in
+ * the mutex counter to indicate that some other threads are waiting on the
+ * mutex. So the atomic_xchg() function should not be called in
+ * __mutex_lock_common() if the value of the counter has already been set
+ * to a negative number.
+ */
+#define MUTEX_SHOULD_XCHG_COUNT(mutex) (atomic_read(&(mutex)->count) >= 0)
+#endif
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 52f2301..5e5ea03 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -171,7 +171,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (owner && !mutex_spin_on_owner(lock, owner))
break;
 
-   if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+   if ((atomic_read(&lock->count) == 1) &&
+   (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
lock_acquired(&lock->dep_map, ip);
mutex_set_owner(lock);
preempt_enable();
@@ -205,7 +206,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
list_add_tail(&waiter.list, &lock->wait_list);
waiter.task = task;
 
-   if (atomic_xchg(&lock->count, -1) == 1)
+   if (MUTEX_SHOULD_XCHG_COUNT(lock) &&
+  (atomic_xchg(&lock->count, -1) == 1))
goto done;
 
lock_contended(&lock->dep_map, ip);
@@ -220,7 +222,8 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 * that when we release the lock, we properly wake up the
 * other waiters:
 */
-   if (atomic_xchg(&lock->count, -1) == 1)
+   if (MUTEX_SHOULD_XCHG_COUNT(lock) &&
+  (atomic_xchg(&lock->count, -1) == 1))
break;
 
/*
diff --git a/kernel/mutex.h b/kernel/mutex.h
index 4115fbf..b873f8e 100644
--- a/kernel/mutex.h
+++ b/kernel/mutex.h
@@ -46,3 +46,11 @@ static inline void
 debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
 }
+
+/*
+ * The atomic_xchg() function should not be called in __mutex_lock_common()
+ * if the value of the counter has already been set to -1.
+ */
+#ifndef MUTEX_SHOULD_XCHG_COUNT
+#define MUTEX_SHOULD_XCHG_COUNT(mutex)	(atomic_read(&(mutex)->count) != -1)
+#endif
-- 
1.7.1



[PATCH RFC 0/3] mutex: Improve mutex performance by doing less atomic-ops & spinning

2013-04-04 Thread Waiman Long
This patch set is a collection of 3 different mutex-related patches
aimed at improving mutex performance, especially on systems with a
large number of CPUs. This is achieved by doing fewer atomic operations
and less mutex spinning (when CONFIG_MUTEX_SPIN_ON_OWNER is on).

The first patch reduces the number of atomic operations executed. It
can produce a dramatic performance improvement in the AIM7 benchmark
with a large number of CPUs. For example, there was a more than 3X
improvement in the high_systime workload with a 3.7.10 kernel on
an 8-socket x86-64 system with 80 cores. The 3.8 kernels, on the
other hand, are no longer mutex limited for that workload. So the
performance improvement is only about 1% for the high_systime workload.

Patches 2 and 3 represent different ways to reduce mutex spinning. Of
the two, the third one is better from both a performance perspective
and the fact that no mutex data structure change is needed. See the
individual patch descriptions for more information on those patches.

The table below shows the performance impact on the AIM7 benchmark with
a 3.8.5 kernel running on the same 8-socket system mentioned above:

+--------------+-----------------+-----------------+------------------+
|   Workload   |            Mean % Change 10-100 users                |
|              +-----------------+-----------------+------------------+
|              |   Patches 1+2   |   Patches 1+3   | Relative %Change |
+--------------+-----------------+-----------------+------------------+
| fserver      |      +1.7%      |       0.0%      |      -1.7%       |
| new_fserver  |      -0.2%      |      -1.5%      |      -1.2%       |
+--------------+-----------------+-----------------+------------------+
|   Workload   |           Mean % Change 100-1000 users               |
|              +-----------------+-----------------+------------------+
|              |   Patches 1+2   |   Patches 1+3   | Relative %Change |
+--------------+-----------------+-----------------+------------------+
| fserver      |     +18.6%      |     +43.4%      |     +21.0%       |
| new_fserver  |     +14.0%      |     +23.4%      |      +8.2%       |
+--------------+-----------------+-----------------+------------------+
|   Workload   |           Mean % Change 1100-2000 users              |
|              +-----------------+-----------------+------------------+
|              |   Patches 1+2   |   Patches 1+3   | Relative %Change |
+--------------+-----------------+-----------------+------------------+
| fserver      |     +11.6%      |      +5.1%      |      -5.8%       |
| new_fserver  |     +13.3%      |      +7.6%      |      -5.0%       |
+--------------+-----------------+-----------------+------------------+

So patch 2 is better at low and high load. Patch 3 is better at
intermediate load. For other AIM7 workloads, patch 3 is generally
better.

Waiman Long (3):
  mutex: Make more scalable by doing less atomic operations
  mutex: restrict mutex spinning to only one task per mutex
  mutex: dynamically disable mutex spinning at high load

 arch/x86/include/asm/mutex.h |   16 
 include/linux/mutex.h|3 +++
 kernel/mutex.c   |   21 ++---
 kernel/mutex.h   |8 
 kernel/sched/core.c  |   22 ++
 5 files changed, 67 insertions(+), 3 deletions(-)



[PATCH RFC 3/3] mutex: dynamically disable mutex spinning at high load

2013-04-04 Thread Waiman Long
The Linux mutex code has a MUTEX_SPIN_ON_OWNER configuration
option that was enabled by default in major distributions like Red
Hat. Allowing threads waiting on a mutex to spin while the mutex owner
is running will theoretically reduce the latency of mutex acquisition
at the expense of energy efficiency, as the spinning threads do no
useful work.

This is not a problem on a lightly loaded system where the CPU may
be idle anyway. On a highly loaded system, the spinning tasks may
block other tasks from running, even ones with higher priority,
because the spinning is done with preemption disabled.

This patch disables mutex spinning if the current load is high
enough. The load is considered high if there are 2 or more active tasks
waiting to run on the current CPU. If there is only one task waiting,
it also checks the average load over the past minute (calc_load_tasks);
if that is more than double the number of active CPUs, the load is
considered high as well.  This is a rather simple metric that does
not incur much additional overhead.
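
The heuristic can be summarized by the sketch below. The helper name
is made up for illustration; the actual change is open-coded in
mutex_spin_on_owner(), as shown in the diff further down.

/*
 * Sketch only: return true if the current CPU looks too busy for mutex
 * spinning to be worthwhile.
 */
static inline bool mutex_spin_load_too_high(void)
{
	unsigned int nrun = this_rq()->nr_running;

	if (nrun >= 3)
		return true;	/* 2 or more other tasks waiting to run here */
	if (nrun == 2) {
		/* one other task waiting: also check the global load */
		long active = atomic_long_read(&calc_load_tasks);

		if (active > 2 * num_online_cpus())
			return true;
	}
	return false;
}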

The AIM7 benchmarks were run on 3.7.10-derived kernels to show the
performance changes with this patch on an 8-socket 80-core system
with hyperthreading off.  The table below shows the mean % change
in performance over a range of users for some AIM7 workloads with
just the less-atomic-operations patch (patch 1) vs. the
less-atomic-operations patch plus this one (patches 1+3).

+--------------+-----------------+-----------------+-----------------+
|   Workload   | mean % change   | mean % change   | mean % change   |
|              | 10-100 users    | 200-1000 users  | 1100-2000 users |
+--------------+-----------------+-----------------+-----------------+
| alltests     |       0.0%      |      -0.1%      |      +5.0%      |
| five_sec     |      +1.5%      |      +1.3%      |      +1.3%      |
| fserver      |      +1.5%      |     +25.4%      |      +9.6%      |
| high_systime |      +0.1%      |       0.0%      |      +0.8%      |
| new_fserver  |      +0.2%      |     +11.9%      |     +14.1%      |
| shared       |      -1.2%      |      +0.3%      |      +1.8%      |
| short        |      +6.4%      |      +2.5%      |      +3.0%      |
+--------------+-----------------+-----------------+-----------------+

It can be seen that this patch provides a big performance improvement
for the fserver and new_fserver workloads while remaining generally
positive for the other AIM7 workloads.

Signed-off-by: Waiman Long 
Reviewed-by: Davidlohr Bueso 
---
 kernel/sched/core.c |   22 ++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..f667d63 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3021,9 +3021,31 @@ static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
  */
 int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 {
+   unsigned int nrun;
+
if (!sched_feat(OWNER_SPIN))
return 0;
 
+   /*
+* Mutex spinning should be temporarily disabled if the load on
+* the current CPU is high. The load is considered high if there
+* are 2 or more active tasks waiting to run on this CPU. On the
+* other hand, if there is another task waiting and the global
+* load (calc_load_tasks - including uninterruptible tasks) is
+* bigger than 2X the # of CPUs available, it is also considered
+* to be high load.
+*/
+   nrun = this_rq()->nr_running;
+   if (nrun >= 3)
+   return 0;
+   else if (nrun == 2) {
+   long active = atomic_long_read(&calc_load_tasks);
+   int  ncpu   = num_online_cpus();
+
+   if (active > 2*ncpu)
+   return 0;
+   }
+
rcu_read_lock();
while (owner_running(lock, owner)) {
if (need_resched())
-- 
1.7.1



[PATCH v2 4/4] dcache: don't need to take d_lock in prepend_path()

2013-04-05 Thread Waiman Long
The d_lock was used in prepend_path() to protect dentry->d_name from
being changed under the hood. As the caller of prepend_path() has
to take the rename_lock before calling into it, there is no chance
that d_name will be changed. The d_lock is only needed when the
rename_lock is not held.

Signed-off-by: Waiman Long 
---
 fs/dcache.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9477d80..e3d6543 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2529,6 +2529,7 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name)
  * @buflen: pointer to buffer length
  *
  * Caller holds the rename_lock.
+ * There is no need to lock the dentry as its name cannot be changed.
  */
 static int prepend_path(const struct path *path,
const struct path *root,
@@ -2555,9 +2556,7 @@ static int prepend_path(const struct path *path,
}
parent = dentry->d_parent;
prefetch(parent);
-   spin_lock(&dentry->d_lock);
error = prepend_name(buffer, buflen, &dentry->d_name);
-   spin_unlock(&dentry->d_lock);
if (!error)
error = prepend(buffer, buflen, "/", 1);
if (error)
-- 
1.7.1



[PATCH v2 RFC 3/4] dcache: change rename_lock to a sequence read/write lock

2013-04-05 Thread Waiman Long
The d_path() and related kernel functions currently take a writer
lock on rename_lock because they need to follow pointers. By changing
rename_lock to be the new sequence read/write lock, a reader lock
can be taken and multiple d_path() threads can proceed concurrently
without blocking each other.

It is unlikely that the frequency of filesystem changes and d_path()
name lookups will be high enough to cause writer starvation, so the
current limitation of the read/write lock should be acceptable.

All the sites where rename_lock is referenced were modified to use the
sequence read/write lock declaration and access functions.

When applying this patch to 3.8 or earlier releases, the unused function
d_path_with_unreachable() in fs/dcache.c should be removed to avoid a
compilation warning.
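
The fs/dcache.c hunks that matter most are cut off in this archive:
d_path() and related functions switch from the exclusive writer side to
the new blocking reader side. Conceptually the change looks like the
sketch below (variable names are placeholders and the calls go through
helpers such as path_with_deleted(); this is a sketch of the intent,
not an excerpt of the patch):

/* Before: the seqlock writer side serializes every path lookup. */
write_seqlock(&rename_lock);
error = prepend_path(path, root, &buf, &buflen);
write_sequnlock(&rename_lock);

/*
 * After: the seqrwlock reader side lets multiple lookups run
 * concurrently while still blocking renames (the writer side).
 */
read_seqrwlock(&rename_lock);
error = prepend_path(path, root, &buf, &buflen);
read_seqrwunlock(&rename_lock);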

Signed-off-by: Waiman Long 
---
 fs/autofs4/waitq.c |6 ++--
 fs/ceph/mds_client.c   |4 +-
 fs/cifs/dir.c  |4 +-
 fs/dcache.c|   83 ---
 fs/nfs/namespace.c |6 ++--
 include/linux/dcache.h |4 +-
 kernel/auditsc.c   |4 +-
 7 files changed, 56 insertions(+), 55 deletions(-)

diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index 3db70da..3afc4db 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -197,7 +197,7 @@ rename_retry:
buf = *name;
len = 0;
 
-   seq = read_seqbegin(&rename_lock);
+   seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
spin_lock(&sbi->fs_lock);
for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
@@ -206,7 +206,7 @@ rename_retry:
if (!len || --len > NAME_MAX) {
spin_unlock(&sbi->fs_lock);
rcu_read_unlock();
-   if (read_seqretry(&rename_lock, seq))
+   if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
return 0;
}
@@ -222,7 +222,7 @@ rename_retry:
}
spin_unlock(&sbi->fs_lock);
rcu_read_unlock();
-   if (read_seqretry(&rename_lock, seq))
+   if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
 
return len;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 442880d..da565c4 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1486,7 +1486,7 @@ char *ceph_mdsc_build_path(struct dentry *dentry, int *plen, u64 *base,
 
 retry:
len = 0;
-   seq = read_seqbegin(&rename_lock);
+   seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
for (temp = dentry; !IS_ROOT(temp);) {
struct inode *inode = temp->d_inode;
@@ -1536,7 +1536,7 @@ retry:
temp = temp->d_parent;
}
rcu_read_unlock();
-   if (pos != 0 || read_seqretry(&rename_lock, seq)) {
+   if (pos != 0 || read_seqrwretry(&rename_lock, seq)) {
pr_err("build_path did not end path lookup where "
   "expected, namelen is %d, pos is %d\n", len, pos);
/* presumably this is only possible if racing with a
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index 1cd0162..707d849 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -96,7 +96,7 @@ build_path_from_dentry(struct dentry *direntry)
dfsplen = 0;
 cifs_bp_rename_retry:
namelen = dfsplen;
-   seq = read_seqbegin(&rename_lock);
+   seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
for (temp = direntry; !IS_ROOT(temp);) {
namelen += (1 + temp->d_name.len);
@@ -136,7 +136,7 @@ cifs_bp_rename_retry:
}
}
rcu_read_unlock();
-   if (namelen != dfsplen || read_seqretry(&rename_lock, seq)) {
+   if (namelen != dfsplen || read_seqrwretry(&rename_lock, seq)) {
cFYI(1, "did not end path lookup where expected. namelen=%d "
"dfsplen=%d", namelen, dfsplen);
/* presumably this is only possible if racing with a rename
diff --git a/fs/dcache.c b/fs/dcache.c
index 48c0680..9477d80 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -82,7 +83,7 @@ int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
-__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
+__cacheline_aligned_in_smp DEFINE_SEQRWLOCK(rename_lock);
 
 EXPORT_SYMBOL(rename_lock);
 
@@ -1020,7 +1021,7 @@ static struct dentry *try_to_ascend(struct dentry *old, int locked, unsigned seq
 */
if (new != old->d_parent ||
 (old->d_flags & DCACHE_DENTRY_KILLED) ||
-(!locked && read_seqretry(&rename_lock, seq))) {
+(!locked && read_seqrwretry(&rename_lock, seq))) {

[PATCH v2 0/4] dcache: make dcache more scalable on large system

2013-04-05 Thread Waiman Long
Change log:

v1->v2
 - Include performance improvement in the AIM7 benchmark results because
   of this patch.
 - Modify dget_parent() to avoid taking the lock, if possible, to further
   improve AIM7 benchmark results.

During some perf-record sessions of the kernel running the high_systime
workload of the AIM7 benchmark, it was found that quite a large portion
of the spinlock contention was due to the perf_event_mmap_event()
function itself. This perf kernel function calls d_path() which,
in turn, calls path_get() and dput() indirectly. These 3 functions
were the hottest functions shown in the perf-report output for
the _raw_spin_lock() function on an 8-socket system with 80 cores
(hyperthreading off) running a 3.7.10 kernel with a mutex patch applied:

-  11.97%  reaim  [kernel.kallsyms] [k] _raw_spin_lock
   - _raw_spin_lock
  + 46.17% d_path
  + 20.31% path_get
  + 19.75% dput

In fact, the output of the "perf record -s -a" (without call-graph)
showed:

 11.73%  reaim  [kernel.kallsyms] [k] _raw_spin_lock
  8.85% ls  [kernel.kallsyms] [k] _raw_spin_lock
  3.97%   true  [kernel.kallsyms] [k] _raw_spin_lock

Without using the perf monitoring tool, the actual execution profile
will be quite different. In fact, with this patch set applied, the
output of the same "perf record -s -a" command became:

  2.05%  reaim  [kernel.kallsyms] [k] _raw_spin_lock
  0.30% ls  [kernel.kallsyms] [k] _raw_spin_lock
  0.25%   true  [kernel.kallsyms] [k] _raw_spin_lock

So the time spent in the _raw_spin_lock() function went down from 24.55%
to 2.60%. It can be seen that the performance data collected by the
perf-record command can be heavily skewed in some cases on a system
with a large number of CPUs. This set of patches enables the perf
command to give a more accurate and reliable picture of what is really
happening in the system. At the same time, they can also improve the
general performance of systems, especially those with a large number
of CPUs.

The d_path() function takes the following two locks:

1. dentry->d_lock [spinlock] from dget()/dget_parent()/dput()
2. rename_lock[seqlock]  from d_path()

This set of patches were designed to minimize the locking overhead
of these code paths.

The current kernel takes the dentry->d_lock lock whenever it wants to
increment or decrement the d_count reference count. However, nothing
big really happens until the reference count goes all the way to 1
or 0.  Actually, we don't need to take the lock when the reference
count is bigger than 1. Instead, the atomic cmpxchg() function can be
used to increment or decrement the count in these situations. For
safety, other reference count update operations have to be changed
to use atomic instructions as well.
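
Patch 1 implements this with small helpers such as dcount_inc_cmpxchg()
and dcount_dec_cmpxchg(), which are used in the diffs below but whose
definitions are not included in this excerpt. A plausible sketch of the
increment side, reconstructed from this description only and therefore
an approximation rather than the real helper:

/*
 * Sketch: try to bump d_count without taking d_lock.  Returns 1 on
 * success; 0 means the caller must fall back to the locked path
 * (d_count too low for a lock-free update, or the cmpxchg lost a race).
 */
static inline int dcount_inc_cmpxchg(struct dentry *dentry)
{
	unsigned int cnt = dentry->d_count;

	if (cnt > 1 && cmpxchg(&dentry->d_count, cnt, cnt + 1) == cnt)
		return 1;
	return 0;
}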

The rename_lock is a sequence lock. The d_path() function takes the
writer lock because it needs to traverse different dentries through
pointers to get the full path name. Hence it can't tolerate changes
in those pointers. But taking the writer lock also prevents multiple
d_path() calls from proceeding concurrently.

A solution is to introduce a new lock type where there will be a
second type of reader which can block the writers - the sequence
read/write lock (seqrwlock). The d_path() and related functions will
then be changed to take the reader lock instead of the writer lock.
This will allow multiple d_path() operations to proceed concurrently.

Additional performance testing was conducted using the AIM7
benchmark.  It is mainly the first patch that has an impact on the AIM7
benchmark. Please see the description of the first patch for
more information about the benchmark results.

Incidentally, these patches also have a favorable impact on Oracle
database performance when measured by the Oracle SLOB benchmark.

The following tests with multiple threads were also run on kernels
with and without the patch on an 8-socket 80-core system and a PC
with 4-core i5 processor:

1. find $HOME -size 0b
2. cat /proc/*/maps /proc/*/numa_maps
3. git diff

For both the find-size and cat-maps tests, the performance difference
with hot cache was within a few percentage points and hence within
the margin of error. Single-thread performance was slightly worse,
but multithread performance was generally a bit better. Apparently,
reference count updates aren't a significant factor in those tests. Their
perf traces indicate that there was less spinlock contention in
functions like dput(), but the function itself ran a little bit longer
on average.

The git-diff test showed no difference in performance. There is a
slight increase in system time compensated by a slight decrease in
user time.

Of the 4 patches, patch 3 is dependent on patch 2. The other 2 patches
are independent and can be applied individually.

Signed-off-by: Waiman Long 

Waiman Long (4):
  dcache: Don't take unnecessary lock in d_count update
  dcac

[PATCH RFC v2 2/4] dcache: introduce a new sequence read/write lock type

2013-04-05 Thread Waiman Long
The current sequence lock supports 2 types of lock users:

1. A reader who wants a consistent set of information and is willing
   to retry if the information changes. The information that the
   reader needs cannot contain pointers, because any writer could
   invalidate a pointer that a reader was following. This reader
   will never block but they may have to retry if a writer is in
   progress.
2. A writer who may need to modify content of a data structure. Writer
   blocks only if another writer is in progress.

This type of lock is suitable for cases where there are a large number
of readers and far fewer writers. However, it has the limitation that a
reader who wants to follow pointers, or cannot tolerate unexpected
changes in the protected data structure, must take the writer lock
even if it doesn't need to make any changes.

To support this kind of reader more efficiently, a new lock type is
introduced by this patch: the sequence read/write lock. Two types of
readers are supported by this new lock:

1. Reader who has the same behavior as a sequence lock reader.
2. Reader who may need to follow pointers. This reader will block if
   a writer is in progress. In turn, it blocks a writer if it is in
   progress. Multiple readers of this type can proceed concurrently.
   Taking this reader lock won't update the sequence number.

This new lock type is a combination of the sequence lock and read/write
lock. Hence it has the same limitation as a read/write lock: writers
may be starved if there is a lot of read contention.
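
As a concrete (hypothetical) illustration of the second reader type,
not taken from the patches: a pointer-following walk such as summing
name lengths up a dentry chain could hold the reader side so that
concurrent walks do not serialize each other while renames are still
blocked:

/* Hypothetical example: follow ->d_parent pointers safely while
 * blocking writers (renames) but not other readers doing the same walk.
 */
read_seqrwlock(&rename_lock);
for (d = dentry; !IS_ROOT(d); d = d->d_parent)
	len += d->d_name.len + 1;	/* name plus a '/' separator */
read_seqrwunlock(&rename_lock);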

Signed-off-by: Waiman Long 
---
 include/linux/seqrwlock.h |  137 +
 1 files changed, 137 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/seqrwlock.h

diff --git a/include/linux/seqrwlock.h b/include/linux/seqrwlock.h
new file mode 100644
index 000..c6145ff
--- /dev/null
+++ b/include/linux/seqrwlock.h
@@ -0,0 +1,137 @@
+#ifndef __LINUX_SEQRWLOCK_H
+#define __LINUX_SEQRWLOCK_H
+/*
+ * Sequence Read/Write Lock
+ * 
+ * This new lock type is a combination of the sequence lock and read/write
+ * lock. Three types of lock users are supported:
+ * 1. A reader who wants a consistent set of information and is willing to
+ *    retry if the information changes. The information that the reader
+ *    needs cannot contain pointers, because any writer could invalidate
+ *    a pointer that a reader was following. This reader never blocks but
+ *    may have to retry if a writer is in progress.
+ * 2. A reader who may need to follow pointers. This reader will block if
+ *    a writer is in progress.
+ * 3. A writer who may need to modify the content of a data structure. A
+ *    writer blocks if another writer or the 2nd type of reader is in progress.
+ *
+ * The current implementation is layered on top of the regular read/write
+ * lock. There is a chance that the writers may be starved by the readers.
+ * So be careful when you decide to use this lock.
+ *
+ * Expected 1st type reader usage:
+ * do {
+ * seq = read_seqrwbegin(&foo);
+ * ...
+ *  } while (read_seqrwretry(&foo, seq));
+ *
+ * Expected 2nd type reader usage:
+ * read_seqrwlock(&foo)
+ * ...
+ * read_seqrwunlock(&foo)
+ *
+ * Expected writer usage:
+ * write_seqrwlock(&foo)
+ * ...
+ * write_seqrwunlock(&foo)
+ *
+ * Based on the seqlock.h file
+ * by Waiman Long
+ */
+
+#include 
+#include 
+#include 
+
+typedef struct {
+   unsigned sequence;
+   rwlock_t lock;
+} seqrwlock_t;
+
+#define __SEQRWLOCK_UNLOCKED(lockname) \
+{ 0, __RW_LOCK_UNLOCKED(lockname) }
+
+#define seqrwlock_init(x)  \
+   do {\
+   (x)->sequence = 0;  \
+   rwlock_init(&(x)->lock);\
+   } while (0)
+
+#define DEFINE_SEQRWLOCK(x) \
+   seqrwlock_t x = __SEQRWLOCK_UNLOCKED(x)
+
+/* For writer:
+ * Lock out other writers and 2nd type of readers and update the sequence
+ * number. Don't need preempt_disable() because that is in the read_lock and
+ * write_lock already.
+ */
+static inline void write_seqrwlock(seqrwlock_t *sl)
+{
+   write_lock(&sl->lock);
+   ++sl->sequence;
+   smp_wmb();
+}
+
+static inline void write_seqrwunlock(seqrwlock_t *sl)
+{
+   smp_wmb();
+   sl->sequence++;
+   write_unlock(&sl->lock);
+}
+
+static inline int write_tryseqrwlock(seqrwlock_t *sl)
+{
+   int ret = write_trylock(&sl->lock);
+
+   if (ret) {
+   ++sl->sequence;
+   smp_wmb();
+   }
+   return ret;
+}
+
+/* For 2nd type of reader:
+ * Lock out other writers, but don't update the sequence number
+ */
+static inline void read_seqrwlock(seqrwlock_t *sl)
+{
+   read_lock(&sl->lock);
+}
+
+static inline void read_seqrwunlock(se

[PATCH v2 1/4] dcache: Don't take unnecessary lock in d_count update

2013-04-05 Thread Waiman Long
 Almost all of which
can be attributed to the following 2 kernel functions:
 1. dget_parent (50.14%)
 2. dput (49.48%)

With this patch applied, the time spent on _raw_spin_lock() is only
1.31%, which is a huge improvement.

The impact of this patch on other AIM7 workloads was much more
modest.  The table below shows the mean % change due to this patch on
the same 8-socket system with a 3.7.10 kernel.

+--------------+---------------+----------------+-----------------+
|   Workload   | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests     |     +0.6%     |      +9.7%     |      +2.9%      |
| five_sec     |     +0.4%     |       0.0%     |      +0.3%      |
| fserver      |     +0.7%     |      +2.4%     |      +2.0%      |
| high_systime |     -1.5%     |     -15.2%     |     -38.0%      |
| new_fserver  |     -2.2%     |      +6.5%     |      +0.4%      |
| shared       |     +3.9%     |      +1.1%     |      +6.1%      |
+--------------+---------------+----------------+-----------------+

The regression in the high_systime workload was probably caused by
the decrease in spinlock contention leading to a larger increase in
mutex contention. In fact, after applying a mutex patch to reduce
mutex contention, the performance difference due to the addition of
this dcache patch changed to:

+--------------+---------------+----------------+-----------------+
|   Workload   | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| high_systime |     -0.1%     |      -0.2%     |      +1.2%      |
+--------------+---------------+----------------+-----------------+

Signed-off-by: Waiman Long 
---
 fs/dcache.c|   39 --
 fs/namei.c |2 +-
 include/linux/dcache.h |  101 ++-
 3 files changed, 117 insertions(+), 25 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index fbfae00..48c0680 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -484,7 +484,7 @@ relock:
}
 
if (ref)
-   dentry->d_count--;
+   dcount_dec(dentry);
/*
 * if dentry was on the d_lru list delete it from there.
 * inform the fs via d_prune that this dentry is about to be
@@ -530,10 +530,13 @@ void dput(struct dentry *dentry)
 repeat:
if (dentry->d_count == 1)
might_sleep();
+   if (dcount_dec_cmpxchg(dentry))
+   return;
+
spin_lock(&dentry->d_lock);
BUG_ON(!dentry->d_count);
if (dentry->d_count > 1) {
-   dentry->d_count--;
+   dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
}
@@ -550,7 +553,7 @@ repeat:
dentry->d_flags |= DCACHE_REFERENCED;
dentry_lru_add(dentry);
 
-   dentry->d_count--;
+   dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
 
@@ -621,11 +624,13 @@ EXPORT_SYMBOL(d_invalidate);
 /* This must be called with d_lock held */
 static inline void __dget_dlock(struct dentry *dentry)
 {
-   dentry->d_count++;
+   dcount_inc(dentry);
 }
 
 static inline void __dget(struct dentry *dentry)
 {
+   if (dcount_inc_cmpxchg(dentry))
+   return;
spin_lock(&dentry->d_lock);
__dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
@@ -635,22 +640,14 @@ struct dentry *dget_parent(struct dentry *dentry)
 {
struct dentry *ret;
 
-repeat:
-   /*
-* Don't need rcu_dereference because we re-check it was correct under
-* the lock.
-*/
rcu_read_lock();
-   ret = dentry->d_parent;
-   spin_lock(&ret->d_lock);
-   if (unlikely(ret != dentry->d_parent)) {
-   spin_unlock(&ret->d_lock);
-   rcu_read_unlock();
-   goto repeat;
-   }
+   ret = rcu_dereference(dentry->d_parent);
rcu_read_unlock();
+   if (dcount_inc_cmpxchg(ret))
+   return ret;
+   spin_lock(&ret->d_lock);
BUG_ON(!ret->d_count);
-   ret->d_count++;
+   dcount_inc(ret);
spin_unlock(&ret->d_lock);
return ret;
 }
@@ -780,7 +777,7 @@ static void try_prune_one_dentry(struct dentry *dentry)
while (dentry) {
spin_lock(&dentry->d_lock);
if (dentry->d_count > 1) {
-   dentry->d_count--;
+   dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
}
@@ -1981,7 +1978,7 @@ struct dentry *__d_lookup(const struct dentry *parent, 

Re: [PATCH v2 1/4] dcache: Don't take unnecessary lock in d_count update

2013-04-05 Thread Waiman Long

On 04/05/2013 01:12 PM, Al Viro wrote:

@@ -635,22 +640,14 @@ struct dentry *dget_parent(struct dentry *dentry)
  {
struct dentry *ret;

-repeat:
-   /*
-* Don't need rcu_dereference because we re-check it was correct under
-* the lock.
-*/
rcu_read_lock();
-   ret = dentry->d_parent;
-   spin_lock(&ret->d_lock);
-   if (unlikely(ret != dentry->d_parent)) {
-   spin_unlock(&ret->d_lock);
-   rcu_read_unlock();
-   goto repeat;
-   }
+   ret = rcu_dereference(dentry->d_parent);
rcu_read_unlock();
+   if (dcount_inc_cmpxchg(ret))
+   return ret;
+   spin_lock(&ret->d_lock);

And WTF is going to protect your "ret" from being freed just as you'd done
rcu_read_unlock()?


I think I made a mistake here. I should move the rcu_read_unlock()
down to just before the return statement as well as after the
spin_lock(). Thanks for pointing this out. I will fix that in the next
version. Is there anything else that needs to be fixed?


Regards,
Longman


Re: [PATCH 0/4] dcache: make Oracle more scalable on large systems

2013-02-28 Thread Waiman Long

On 02/22/2013 07:13 PM, Andi Kleen wrote:

That seems to me like an application problem - poking at what the
kernel is doing via diagnostic interfaces so often that it gets in
the way of the kernel actually doing stuff is not a problem the
kernel can solve.

I agree with you that the application shouldn't be doing that, but
if there is a cheap way to lower the d_path overhead that is also
attractive.  There will be always applications doing broken things.
Any scaling problem less in the kernel is good.

But the real fix in this case is to fix the application.

-Andi

Further investigation into the d_path() bottleneck revealed some
interesting facts about Oracle. First of all, the invocation of the
d_path() kernel function is concentrated in only a few processes
rather than distributed across many of them.  On a 1-minute test run,
the following three standard long-running Oracle processes called
into d_path():

1. MMNL - Memory monitor light (gathers and stores AWR statistics) [272]
2. CKPT - Checkpoint process [17]
3. DBRM - DB resource manager (new in 11g) [16]

The numbers within [] are the number of times d_path() was
called, which is not much for a 1-minute interval. Beyond those
standard processes, Oracle also seems to spawn transient processes
(lasting a few seconds) periodically to issue a bunch of d_path() calls
(about 1000) within a short time before they die. I am not sure what
the purpose of those processes is.  In a one-minute interval, 2-7
of those transient processes may be spawned, depending probably on the
activity level. Most of the d_path() calls last for about 1ms. There
are a couple that last for more than 10ms.

Other system daemons that call into d_path() include irqbalance and
automount. irqbalance issues about 2000 d_path() calls in a minute in
a bursty fashion. The contribution of automount is only about 50 in
the same time period, which is not really significant. Regular commands
like cp and ps may also issue a couple of d_path() calls per invocation.

As I was using "perf record --call-graph" command to profile the Oracle
application, I found out that another major user of the d_path()
function happens to be perf_event_mmap_event() of the perf-event
subsystem. It took about 10% of the total d_path() calls. So the
metrics that I collected were skewed a bit because of that.

I am thinking that the impact of my patch on Oracle write performance
is probably due to its impact on the open() system call, which has
to update the reference counts on dentries. In the collected perf
traces, a certain portion of the spinlock time was consumed by
dput() and path_get().  The 2 major consumers of those calls are
d_path() and the open() system call. On a test run with no writers,
I saw significantly fewer open() calls in the perf trace and hence much
less impact on Oracle performance.

I do agree that Oracle should probably fix the application to issue
fewer calls to the d_path() function. However, I would argue that my
patch will still be useful for the following reasons:

1. Changing how the reference counting works (patch 1 of 4) will certainly
   help in situations where processes are issuing intensive batches of file
   system operations, as is the case here.
2. Changing the rename_lock to use a sequence r/w lock (patches 2-4 of 4)
   will help to minimize the overhead of the perf-event subsystem when it
   is activated with the call-graph feature, which is pretty common.

Regards,
Longman



Re: [PATCH 0/4] dcache: make Oracle more scalable on large systems

2013-02-28 Thread Waiman Long

On 02/28/2013 03:39 PM, Waiman Long wrote:


activity level. Most of the d_path() call last for about 1ms. There
are a couple of those that last for more than 10ms.



A correction. The time unit here should be us, not ms. Sorry for the 
mistake.


-Longman


[PATCH 0/4] dcache: make Oracle more scalable on large systems

2013-02-19 Thread Waiman Long
It was found that the Oracle database software issues a lot of calls
to the seq_path() kernel function, which translates a (dentry, mnt)
pair to an absolute path. The seq_path() function will eventually
take the following two locks:

1. dentry->d_lock (spinlock) from dget()/dput()
2. rename_lock(seqlock)  from d_path()

With a lot of database activity, the spinning on these 2 locks takes
a major portion of the kernel time and slows down the database software.

This set of patches were designed to minimize the locking overhead of
this code path and improve Oracle performance on systems with a large
number of CPUs.

The current kernel takes the dentry->d_lock lock whenever it wants to
increment or decrement the d_count reference count. However, nothing
big really happens until the reference count goes all the way to 1
or 0.  Actually, we don't need to take the lock when the reference
count is bigger than 1. Instead, the atomic cmpxchg() function can be
used to increment or decrement the count in these situations. For
safety, other reference count update operations have to be changed
to use atomic instructions as well.

The rename_lock is a sequence lock. The d_path() function takes the
writer lock because it needs to traverse different dentries through
pointers to get the full path name. Hence it can't tolerate changes
in those pointers. But taking the writer lock also prevents multiple
d_path() calls from proceeding concurrently.

A solution is to introduce a new lock type where there will be a
second type of reader which can block the writers - the sequence
read/write lock (seqrwlock). The d_path() and related functions will
then be changed to take the reader lock instead of the writer lock.
This will allow multiple d_path() operations to proceed concurrently.

Performance testing was done using the Oracle SLOB benchmark with the
latest 11.2.0.3 release of Oracle on a 3.8-rc3 kernel. Database files
were put in a tmpfs partition to minimize physical I/O overhead. Huge
pages were used with 30GB of SGA. The test machine was an 8-socket,
80-core HP Proliant DL980 with 1TB of memory and hyperthreading off.
The tests were run 5 times and the averages were taken.

The patch has only a slight positive impact on logical read
performance. The impact on write (redo size) performance, however,
is much greater. The redo size is a proxy for how much database writing
has happened, so a larger value means a higher transaction rate.

+---------+---------+-------------+------------+----------+
| Readers | Writers | Redo Size   | Redo Size  | % Change |
|         |         | w/o patch   | with patch |          |
|         |         |   (MB/s)    |   (MB/s)   |          |
+---------+---------+-------------+------------+----------+
|    8    |   64    |     802     |    903     |  12.6%   |
|   32    |   64    |     798     |    892     |  11.8%   |
|   80    |   64    |     658     |    714     |   8.5%   |
|  128    |   64    |     748     |    907     |  21.3%   |
+---------+---------+-------------+------------+----------+

The table below shows the %system and %user times reported by Oracle's
AWR tool as well as the %time spent in the spinlocking code in kernel
with (inside parenthesis) and without (outside parenthesis) the patch.

+---------+---------+------------+------------+------------+
| Readers | Writers |  % System  |   % User   | % spinlock |
+---------+---------+------------+------------+------------+
|   32    |    0    |  0.3(0.3)  | 39.0(39.0) |  6.3(17.4) |
|   80    |    0    |  0.7(0.7)  | 97.4(94.2) |  2.9(31.7) |
|  128    |    0    |  1.4(1.4)  | 34.4(32.2) | 43.5(62.2) |
|   32    |   64    |  3.8(3.5)  | 55.4(53.6) |  9.1(35.0) |
|   80    |   64    |  3.0(2.9)  | 94.4(93.9) |  4.5(38.8) |
|  128    |   64    |  4.7(4.3)  | 38.2(40.3) | 34.8(58.7) |
+---------+---------+------------+------------+------------+

The following tests with multiple threads were also run on kernels with
and without the patch on both DL980 and a PC with 4-core i5 processor:

1. find $HOME -size 0b
2. cat /proc/*/maps /proc/*/numa_maps
3. git diff

For both the find-size and cat-maps tests, the performance difference
with hot cache was within a few percentage points and hence within
the margin of error. Single-thread performance was slightly worse,
but multithread performance was generally a bit better. Apparently,
reference count updates aren't a significant factor in those tests. Their
perf traces indicate that there was less spinlock contention in
functions like dput(), but the function itself ran a little bit longer
on average.

The git-diff test showed no difference in performance. There is a
slight increase in system time compensated by a slight decrease in
user time.

Signed-off-by: Waiman Long 

Waiman Long (4):
  dcache: Don't take unnecessary lock in d_count update
  dcache: introduce a new sequence read/write lock type
  dcache: change rename_lock to a sequence read/write lock
  dcache: don't

[PATCH 1/4] dcache: Don't take unnecessary lock in d_count update

2013-02-19 Thread Waiman Long
The current code takes the dentry's d_lock lock whenever the d_count
reference count is being updated. In reality, nothing big really
happens until d_count goes to 0 in dput(). So it is not necessary to
take the lock if the reference count won't go to 0.

Without using a lock, multiple threads may update d_count
simultaneously.  Therefore, atomic instructions must be used to
ensure consistency except in shrink_dcache_for_umount*() where the
whole superblock is being dismounted and locking is not needed.

The worst case scenarios are:

1. d_lock taken in dput with d_count = 2 in one thread and another
   thread comes in to atomically decrement d_count without taking
   the lock. This may result in a d_count of 0 with no deleting
   action taken.

2. d_lock taken in dput with d_count = 1 in one thread and another
   thread comes in to atomically increment d_count without taking
   the lock. This may result in the dentry in the deleted state while
   having a d_count of 1.

Without taking the lock, we need to make sure that the decrement or
increment is not performed while other threads are updating d_count
simultaneously. This can be done by using the atomic cmpxchg
instruction, which will fail if the underlying value has changed.  If
the lock is taken, it is safe to use a simpler atomic increment or
decrement instruction.

To make sure that the above worst-case scenarios will not happen,
the dget() function must take the lock if d_count <= 1. Similarly,
the dput() function must take the lock if d_count <= 2. The cmpxchg()
call to update d_count will be tried twice before falling back to
using the lock, as there is a fairly good chance that the cmpxchg()
may fail in a busy situation.

Finally, the CPU must have an instruction-level cmpxchg instruction,
or the emulated cmpxchg() function may be too expensive to
use. Therefore, the above mentioned changes will only be applied if
the __HAVE_ARCH_CMPXCHG flag is set. Most of the major architectures
supported by Linux have this flag set, with the notable exception
of ARM.
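
Putting the two paragraphs above together, the decrement-side helper
used by dput() in the diff below could plausibly look like the sketch
here. This is reconstructed from the description (thresholds, two
cmpxchg attempts, __HAVE_ARCH_CMPXCHG guard); the actual
include/linux/dcache.h hunk is not shown in this excerpt:

#ifdef __HAVE_ARCH_CMPXCHG
/*
 * Sketch: lock-free decrement of d_count.  Returns 1 on success, 0 if
 * the caller must fall back to taking d_lock (d_count <= 2, or the
 * cmpxchg lost the race twice).
 */
static inline int dcount_dec_cmpxchg(struct dentry *dentry)
{
	unsigned int cnt;
	int i;

	for (i = 0; i < 2; i++) {
		cnt = dentry->d_count;
		if (cnt <= 2)
			break;		/* too close to 1/0, take the lock */
		if (cmpxchg(&dentry->d_count, cnt, cnt - 1) == cnt)
			return 1;
	}
	return 0;
}
#else
static inline int dcount_dec_cmpxchg(struct dentry *dentry)
{
	return 0;			/* always use the locked path */
}
#endif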

As for the performance of the updated reference counting code, it
all depends on whether the cmpxchg instruction is used or not. The
original code has 2 atomic instructions to lock and unlock the
spinlock. The new code path has either 1 atomic cmpxchg instruction
or 3 atomic instructions if the lock has to be taken. Depending on
how frequent the cmpxchg instruction is used (d_count > 1 or 2),
the new code can be faster or slower than the original one.

Signed-off-by: Waiman Long 
---
 fs/dcache.c|   23 ++
 fs/namei.c |2 +-
 include/linux/dcache.h |  105 ++-
 3 files changed, 117 insertions(+), 13 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 19153a0..20cc789 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -484,7 +484,7 @@ relock:
}
 
if (ref)
-   dentry->d_count--;
+   dcount_dec(dentry);
/*
 * if dentry was on the d_lru list delete it from there.
 * inform the fs via d_prune that this dentry is about to be
@@ -530,10 +530,13 @@ void dput(struct dentry *dentry)
 repeat:
if (dentry->d_count == 1)
might_sleep();
+   if (dcount_dec_cmpxchg(dentry))
+   return;
+
spin_lock(&dentry->d_lock);
BUG_ON(!dentry->d_count);
if (dentry->d_count > 1) {
-   dentry->d_count--;
+   dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
}
@@ -550,7 +553,7 @@ repeat:
dentry->d_flags |= DCACHE_REFERENCED;
dentry_lru_add(dentry);
 
-   dentry->d_count--;
+   dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
 
@@ -621,11 +624,13 @@ EXPORT_SYMBOL(d_invalidate);
 /* This must be called with d_lock held */
 static inline void __dget_dlock(struct dentry *dentry)
 {
-   dentry->d_count++;
+   dcount_inc(dentry);
 }
 
 static inline void __dget(struct dentry *dentry)
 {
+   if (dcount_inc_cmpxchg(dentry))
+   return;
spin_lock(&dentry->d_lock);
__dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
@@ -650,7 +655,7 @@ repeat:
}
rcu_read_unlock();
BUG_ON(!ret->d_count);
-   ret->d_count++;
+   dcount_inc(ret);
spin_unlock(&ret->d_lock);
return ret;
 }
@@ -782,7 +787,7 @@ static void try_prune_one_dentry(struct dentry *dentry)
while (dentry) {
spin_lock(&dentry->d_lock);
if (dentry->d_count > 1) {
-   dentry->d_count--;
+   dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
}
@@ -1980,7 +198

[PATCH 4/4] dcache: don't need to take d_lock in prepend_path()

2013-02-19 Thread Waiman Long
The d_lock was used in prepend_path() to protect dentry->d_name from
being changed under the hood. As the caller of prepend_path() has
to take the rename_lock before calling into it, there is no chance
that d_name will be changed. The d_lock is only needed when the
rename_lock is not held.

Signed-off-by: Waiman Long 
---
 fs/dcache.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index b1487e2..0e911fc 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2547,6 +2547,7 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name)
  * @buflen: pointer to buffer length
  *
  * Caller holds the rename_lock.
+ * There is no need to lock the dentry as its name cannot be changed.
  */
 static int prepend_path(const struct path *path,
const struct path *root,
@@ -2573,9 +2574,7 @@ static int prepend_path(const struct path *path,
}
parent = dentry->d_parent;
prefetch(parent);
-   spin_lock(&dentry->d_lock);
error = prepend_name(buffer, buflen, &dentry->d_name);
-   spin_unlock(&dentry->d_lock);
if (!error)
error = prepend(buffer, buflen, "/", 1);
if (error)
-- 
1.7.1



[PATCH 3/4] dcache: change rename_lock to a sequence read/write lock

2013-02-19 Thread Waiman Long
The d_path() and related kernel functions currently take a writer
lock on rename_lock because they need to follow pointers. By changing
rename_lock to be the new sequence read/write lock, a reader lock
can be taken and multiple d_path() threads can proceed concurrently
without blocking each other.

It is unlikely that the frequency of filesystem changes and d_path()
name lookups will be high enough to cause writer starvation, so the
current limitation of the read/write lock should be acceptable.

All the sites where rename_lock is referenced were modified to use the
sequence read/write lock declaration and access functions.

Signed-off-by: Waiman Long 
---
 fs/autofs4/waitq.c |6 ++--
 fs/ceph/mds_client.c   |4 +-
 fs/cifs/dir.c  |4 +-
 fs/dcache.c|   87 ---
 fs/nfs/namespace.c |6 ++--
 include/linux/dcache.h |4 +-
 kernel/auditsc.c   |5 ++-
 7 files changed, 59 insertions(+), 57 deletions(-)

diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index 03bc1d3..95eee02 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -199,7 +199,7 @@ rename_retry:
buf = *name;
len = 0;
 
-   seq = read_seqbegin(&rename_lock);
+   seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
spin_lock(&sbi->fs_lock);
for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
@@ -208,7 +208,7 @@ rename_retry:
if (!len || --len > NAME_MAX) {
spin_unlock(&sbi->fs_lock);
rcu_read_unlock();
-   if (read_seqretry(&rename_lock, seq))
+   if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
return 0;
}
@@ -224,7 +224,7 @@ rename_retry:
}
spin_unlock(&sbi->fs_lock);
rcu_read_unlock();
-   if (read_seqretry(&rename_lock, seq))
+   if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
 
return len;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 9165eb8..da6bd2c 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1458,7 +1458,7 @@ char *ceph_mdsc_build_path(struct dentry *dentry, int *plen, u64 *base,
 
 retry:
len = 0;
-   seq = read_seqbegin(&rename_lock);
+   seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
for (temp = dentry; !IS_ROOT(temp);) {
struct inode *inode = temp->d_inode;
@@ -1508,7 +1508,7 @@ retry:
temp = temp->d_parent;
}
rcu_read_unlock();
-   if (pos != 0 || read_seqretry(&rename_lock, seq)) {
+   if (pos != 0 || read_seqrwretry(&rename_lock, seq)) {
pr_err("build_path did not end path lookup where "
   "expected, namelen is %d, pos is %d\n", len, pos);
/* presumably this is only possible if racing with a
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index 8719bbe..4842523 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -96,7 +96,7 @@ build_path_from_dentry(struct dentry *direntry)
dfsplen = 0;
 cifs_bp_rename_retry:
namelen = dfsplen;
-   seq = read_seqbegin(&rename_lock);
+   seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
for (temp = direntry; !IS_ROOT(temp);) {
namelen += (1 + temp->d_name.len);
@@ -136,7 +136,7 @@ cifs_bp_rename_retry:
}
}
rcu_read_unlock();
-   if (namelen != dfsplen || read_seqretry(&rename_lock, seq)) {
+   if (namelen != dfsplen || read_seqrwretry(&rename_lock, seq)) {
cFYI(1, "did not end path lookup where expected. namelen=%d "
"dfsplen=%d", namelen, dfsplen);
/* presumably this is only possible if racing with a rename
diff --git a/fs/dcache.c b/fs/dcache.c
index 20cc789..b1487e2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -82,7 +83,7 @@ int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
-__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
+__cacheline_aligned_in_smp DEFINE_SEQRWLOCK(rename_lock);
 
 EXPORT_SYMBOL(rename_lock);
 
@@ -1030,7 +1031,7 @@ static struct dentry *try_to_ascend(struct dentry *old, int locked, unsigned seq
 */
if (new != old->d_parent ||
 (old->d_flags & DCACHE_DENTRY_KILLED) ||
-(!locked && read_seqretry(&rename_lock, seq))) {
+(!locked && read_seqrwretry(&rename_lock, seq))) {
spin_unlock(&new->d_lock);
new = NULL;
   

[PATCH 2/4] dcache: introduce a new sequence read/write lock type

2013-02-19 Thread Waiman Long
The current sequence lock supports 2 types of lock users:

1. A reader who wants a consistent set of information and is willing
   to retry if the information changes. The information that the
   reader needs cannot contain pointers, because any writer could
   invalidate a pointer that a reader was following. This reader
   will never block but they may have to retry if a writer is in
   progress.
2. A writer who may need to modify content of a data structure. Writer
   blocks only if another writer is in progress.

This type of lock is suitable for cases where there are a large number
of readers and far fewer writers. However, it has the limitation that a
reader who wants to follow pointers, or cannot tolerate unexpected
changes in the protected data structure, must take the writer lock
even if it doesn't need to make any changes.

To support this kind of reader more efficiently, a new lock type is
introduced by this patch: the sequence read/write lock. Two types of
readers are supported by this new lock:

1. Reader who has the same behavior as a sequence lock reader.
2. Reader who may need to follow pointers. This reader will block if
   a writer is in progress. In turn, it blocks a writer if it is in
   progress. Multiple readers of this type can proceed concurrently.
   Taking this reader lock won't update the sequence number.

This new lock type is a combination of the sequence lock and read/write
lock. Hence it has the same limitation as a read/write lock: writers
may be starved if there is a lot of read contention.

Signed-off-by: Waiman Long 
---
 include/linux/seqrwlock.h |  138 +
 1 files changed, 138 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/seqrwlock.h

diff --git a/include/linux/seqrwlock.h b/include/linux/seqrwlock.h
new file mode 100644
index 000..3ff5119
--- /dev/null
+++ b/include/linux/seqrwlock.h
@@ -0,0 +1,138 @@
+#ifndef __LINUX_SEQRWLOCK_H
+#define __LINUX_SEQRWLOCK_H
+/*
+ * Sequence Read/Write Lock
+ * 
+ * This new lock type is a combination of the sequence lock and read/write
+ * lock. Three types of lock users are supported:
+ * 1. A reader who wants a consistent set of information and is willing to
+ *    retry if the information changes. The information that the reader
+ *    needs cannot contain pointers, because any writer could invalidate
+ *    a pointer that a reader was following. This reader never blocks but
+ *    may have to retry if a writer is in progress.
+ * 2. A reader who may need to follow pointers. This reader will block if
+ *    a writer is in progress.
+ * 3. A writer who may need to modify the content of a data structure. A
+ *    writer blocks if another writer or the 2nd type of reader is in progress.
+ *
+ * The current implementation is layered on top of the regular read/write
+ * lock. There is a chance that the writers may be starved by the readers.
+ * So be careful when you decide to use this lock.
+ *
+ * Expected 1st type reader usage:
+ * do {
+ * seq = read_seqrwbegin(&foo);
+ * ...
+ * } while (read_seqrwretry(&foo, seq));
+ *
+ * Expected 2nd type reader usage:
+ * read_seqrwlock(&foo)
+ * ...
+ * read_seqrwunlock(&foo)
+ *
+ * Expected writer usage:
+ * write_seqrwlock(&foo)
+ * ...
+ * write_seqrwunlock(&foo)
+ *
+ * Based on the seqlock.h file
+ * by Waiman Long
+ */
+
+#include 
+#include 
+#include 
+
+typedef struct {
+   unsigned sequence;
+   rwlock_t lock;
+} seqrwlock_t;
+
+#define __SEQRWLOCK_UNLOCKED(lockname) \
+{ 0, __RW_LOCK_UNLOCKED(lockname) }
+
+#define seqrwlock_init(x)  \
+   do {\
+   (x)->sequence = 0;  \
+   rwlock_init(&(x)->lock);\
+   } while (0)
+
+#define DEFINE_SEQRWLOCK(x) \
+   seqrwlock_t x = __SEQRWLOCK_UNLOCKED(x)
+
+/* For writer:
+ * Lock out other writers and 2nd type of readers and update the sequence
+ * number. Don't need preempt_disable() because that is in the read_lock and
+ * write_lock already.
+ */
+static inline void write_seqrwlock(seqrwlock_t *sl)
+{
+   write_lock(&sl->lock);
+   ++sl->sequence;
+   smp_wmb();
+}
+
+static inline void write_seqrwunlock(seqrwlock_t *sl)
+{
+   smp_wmb();
+   sl->sequence++;
+   write_unlock(&sl->lock);
+}
+
+static inline int write_tryseqrwlock(seqrwlock_t *sl)
+{
+   int ret = write_trylock(&sl->lock);
+
+   if (ret) {
+   ++sl->sequence;
+   smp_wmb();
+   }
+   return ret;
+}
+
+/* For 2nd type of reader:
+ * Lock out writers, but don't update the sequence number
+ */
+static inline void read_seqrwlock(seqrwlock_t *sl)
+{
+   read_lock(&sl->lock);
+}
+
+static inline void read_seqrwunlock(se

Re: [PATCH 0/4] dcache: make Oracle more scalable on large systems

2013-02-21 Thread Waiman Long

On 02/21/2013 07:13 PM, Andi Kleen wrote:

Dave Chinner  writes:


On Tue, Feb 19, 2013 at 01:50:55PM -0500, Waiman Long wrote:

It was found that the Oracle database software issues a lot of call
to the seq_path() kernel function which translates a (dentry, mnt)
pair to an absolute path. The seq_path() function will eventually
take the following two locks:

Nobody should be doing reverse dentry-to-name lookups in a quantity
sufficient for it to become a performance limiting factor. What is
the Oracle DB actually using this path for?

Yes calling d_path frequently is usually a bug elsewhere.
Is that through /proc ?

-Andi


A sample strace of Oracle indicates that it opens a lot of /proc
filesystem files, such as stat, maps, etc., many times while running.
Oracle has a very detailed system performance reporting infrastructure
in place to report almost all aspects of system performance through its
AWR reporting tool or the browser-based enterprise manager. Maybe that is
the reason why it is hitting this performance bottleneck.


Regards,
Longman


Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation

2013-08-26 Thread Waiman Long

On 08/22/2013 09:28 AM, Alexander Fyodorov wrote:

22.08.2013, 05:04, "Waiman Long":

On 08/21/2013 11:51 AM, Alexander Fyodorov wrote:
In this case, we should have smp_wmb() before freeing the lock. The
question is whether we need to do a full mb() instead. The x86 ticket
spinlock unlock code is just a regular add instruction except for some
exotic processors. So it is a compiler barrier but not really a memory
fence. However, we may need to do a full memory fence for some other
processors.

The thing is that the x86 ticket spinlock code does have full memory barriers both in the lock() and
unlock() code: the "add" instruction there has a "lock" prefix which implies a full
memory barrier. So it is better to use smp_mb() and let each architecture define it.


I also thought that the x86 spinlock unlock path was an atomic add. It
only came to my realization recently that this is not the case. The
UNLOCK_LOCK_PREFIX will be mapped to "" except for some old 32-bit x86
processors.



At this point, I am inclined to have either a smp_wmb() or smp_mb() at
the beginning of the unlock function and a barrier() at the end.

As the lock/unlock functions can be inlined, it is possible that a
memory variable can be accessed earlier in the calling function and the
stale copy may be used in the inlined lock/unlock function instead of
fetching a new copy. That is why I prefer a more liberal use of
ACCESS_ONCE() for safety purpose.

That is impossible: both lock() and unlock() must have either a full memory
barrier or an atomic operation which returns a value. Both of them prohibit
optimizations and the compiler cannot reuse any global variable. So this usage of
ACCESS_ONCE() is unneeded.

You can read more on this in Documentation/volatile-considered-harmful.txt

And although I already suggested that, have you read 
Documentation/memory-barriers.txt? There is a lot of valuable information there.


I did read Documentation/memory-barriers.txt. I will read 
volatile-considered-harmful.txt.


Regards,
Longman


Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

2013-08-29 Thread Waiman Long

On 08/28/2013 09:40 PM, Linus Torvalds wrote:

Just FYI: I've merged two preparatory patches in my tree for the whole
lockref thing. Instead of applying your four patches as-is during the
merge window, I ended up writing two patches that introduce the
concept and use it in the dentry code *without* introducing any of the
new semantics yet.

Waiman, I attributed the patches to you, even if they don't actually
look much like any of the patches you sent out. And because I was
trying very hard to make sure that no actual semantics changed, my
version doesn't have the dget_parent() lockless update code, for
example. I literally just did a search-and-replace of "->d_count" with
"->d_lockref.count" and then I fixed up a few things by hand (undid
one replacement in a comment, and used the helper functions where they
were semantically identical).

  You don't have to rewrite your patches if you don't want to, I'm
planning on cherry-picking the actual code changes during the merge
window.

   Linus


Thanks for merging the 2 preparatory patches for me. I will rebase my 
patches with the latest linux git tree. A new v8 patch set will be sent 
out sometime next week. I am looking forward to the v3.12 merge window 
which I think is coming soon.


Cheers,
Longman


Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation

2013-08-29 Thread Waiman Long

On 08/27/2013 08:09 AM, Alexander Fyodorov wrote:

I also thought that the x86 spinlock unlock path was an atomic add. It
just comes to my realization recently that this is not the case. The
UNLOCK_LOCK_PREFIX will be mapped to "" except for some old 32-bit x86
processors.

Hmm, I didn't know that. Looking through Google found these rules for x86 
memory ordering:
   * Loads are not reordered with other loads.
   * Stores are not reordered with other stores.
   * Stores are not reordered with older loads.
So x86 memory model is rather strict and memory barrier is really not needed in the unlock path - 
xadd is a store and thus behaves like a memory barrier, and since only lock's owner modifies 
"ticket.head" the "add" instruction need not be atomic.

But this is true only for x86, other architectures have more relaxed memory 
ordering. Maybe we should allow arch code to redefine queue_spin_unlock()? And 
define a version without smp_mb() for x86?


What I have been thinking is to set a flag in an architecture-specific
header file to tell if the architecture needs a memory barrier. The
generic code will then do either a smp_mb() or a barrier() depending on
the presence or absence of the flag. I would prefer to do more in the
generic code, if possible.
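
A rough sketch of that idea (the flag name, the lock field and the
function body below are only illustrative, not from any posted patch):

/* asm/qspinlock.h on a strongly ordered architecture such as x86 */
#define ARCH_QSPINLOCK_STRONG_ORDER

/* generic unlock path */
static inline void queue_spin_unlock(struct qspinlock *lock)
{
#ifdef ARCH_QSPINLOCK_STRONG_ORDER
        barrier();      /* compiler barrier is sufficient */
#else
        smp_mb();       /* full memory barrier for weakly ordered CPUs */
#endif
        ACCESS_ONCE(lock->locked) = 0;  /* assumed name of the lock word */
}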


Regards,
Longman


Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-10-01 Thread Waiman Long

On 10/01/2013 12:48 PM, Tim Chen wrote:

On Mon, 2013-09-30 at 12:36 -0400, Waiman Long wrote:

On 09/30/2013 12:10 PM, Jason Low wrote:

On Mon, 2013-09-30 at 11:51 -0400, Waiman Long wrote:

On 09/28/2013 12:34 AM, Jason Low wrote:

Also, below is what the mcs_spin_lock() and mcs_spin_unlock()
functions would look like after applying the proposed changes.

static noinline
void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node *node)
{
   struct mcs_spin_node *prev;

   /* Init node */
   node->locked = 0;
   node->next   = NULL;

   prev = xchg(lock, node);
   if (likely(prev == NULL)) {
   /* Lock acquired. No need to set node->locked since it
won't be used */
   return;
   }
   ACCESS_ONCE(prev->next) = node;
   /* Wait until the lock holder passes the lock down */
   while (!ACCESS_ONCE(node->locked))
   arch_mutex_cpu_relax();
   smp_mb();

I wonder if a memory barrier is really needed here.

If the compiler can reorder the while (!ACCESS_ONCE(node->locked)) check
so that the check occurs after an instruction in the critical section,
then the barrier may be necessary.


In that case, just a barrier() call should be enough.

The cpu could still be executing out of order load instruction from the
critical section before checking node->locked?  Probably smp_mb() is
still needed.

Tim


But this is the lock function, a barrier() call should be enough to 
prevent the critical section from creeping up there. We certainly need 
some kind of memory barrier at the end of the unlock function.


-Longman


Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path

2013-10-01 Thread Waiman Long

On 10/01/2013 03:33 AM, Ingo Molnar wrote:

* Waiman Long  wrote:


I think Waiman's patches (even the later ones) made the queued rwlocks
be a side-by-side implementation with the old rwlocks, and I think
that was just being unnecessarily careful. It might be useful for
testing to have a config option to switch between the two, but we
might as well go all the way.

It is not actually a side-by-side implementation. A user can choose
either the regular rwlock or the queued one, but never both, by setting a
configuration parameter. However, I now think that maybe we should do it
the lockref way by pre-determining it at the per-architecture level
without a user-visible configuration option.

Well, as I pointed it out to you during review, such a Kconfig driven
locking API choice is a no-go!

What I suggested instead: there's absolutely no problem with providing a
better rwlock_t implementation, backed with numbers and all that.

Thanks,

Ingo


Yes, this is what I am planning to do. The next version of my qrwlock
patch will force the switch to the queue rwlock for the x86 architecture.
The other architectures will have to be done separately.


-Longman



Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-10-01 Thread Waiman Long

On 10/01/2013 05:16 PM, Tim Chen wrote:

On Tue, 2013-10-01 at 16:01 -0400, Waiman Long wrote:


The cpu could still be executing out of order load instruction from the
critical section before checking node->locked?  Probably smp_mb() is
still needed.

Tim

But this is the lock function, a barrier() call should be enough to
prevent the critical section from creeping up there. We certainly need
some kind of memory barrier at the end of the unlock function.

I may be missing something.  My understanding is that barrier() only
prevents the compiler from rearranging instructions, but does not prevent
cpu out-of-order execution (as smp_mb does). So the cpu could read memory
in the next critical section before node->locked is true (i.e. before the
unlock has completed).  If we only have a simple barrier at the end of
mcs_lock, then say the code on CPU1 is

mcs_lock
x = 1;
...
x = 2;
mcs_unlock

and CPU 2 is

mcs_lock
y = x;
...
mcs_unlock

We expect y to be 2 after the "y = x" assignment.  But we
may execute the code as

CPU1                                CPU2

x = 1;
...                                 y = x;  (y=1, out of order load)
x = 2
mcs_unlock
                                    Check node->locked==true
                                    continue executing critical section
                                    (y=1 when we expect y=2)

So we get y to be 1 when we expect that it should be 2.  Adding smp_mb
after the node->locked check in lock code

ACCESS_ONCE(prev->next) = node;
/* Wait until the lock holder passes the lock down */
while (!ACCESS_ONCE(node->locked))
 arch_mutex_cpu_relax();
smp_mb();

should prevent this scenario.

Thanks.
Tim


If the lock and unlock functions are done right, there should be no
overlap of critical sections. So it is the job of the lock/unlock functions
to make sure that critical section code won't leak out. There should be
some kind of memory barrier at the beginning of the lock function and at
the end of the unlock function.

The critical section is also likely to have branches. The CPU may
speculatively execute code on the 2 branches, but one of them will be
discarded once the branch condition is known. Also,
arch_mutex_cpu_relax() is a compiler barrier by itself. So we may not
need a barrier() after all. The while statement is a branch instruction;
any code after it can only be speculatively executed and cannot be
committed until the branch is done.


On x86, the smp_mb() function translates to an mfence instruction which
costs time. That is why I try to get rid of it when it is not necessary.


Regards,
Longman


[PATCH v4 2/3] qrwlock x86: Enable x86 to use queue read/write lock

2013-10-02 Thread Waiman Long
This patch makes the necessary changes at the x86 architecture specific
layer to enable the presence of the CONFIG_QUEUE_RWLOCK kernel option
to replace the read/write lock by the queue read/write lock.

It also enables CONFIG_ARCH_QUEUE_RWLOCK which will force the use
of the queue read/write lock for x86, which tends to have the largest
NUMA machines compared with the other architectures. This patch will
improve the scalability of those large machines.

Signed-off-by: Waiman Long 
---
 arch/x86/Kconfig  |1 +
 arch/x86/include/asm/spinlock.h   |2 ++
 arch/x86/include/asm/spinlock_types.h |4 
 3 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ee2fb9d..14b4dca 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -123,6 +123,7 @@ config X86
select COMPAT_OLD_SIGACTION if IA32_EMULATION
select RTC_LIB
select HAVE_DEBUG_STACKOVERFLOW
+   select QUEUE_RWLOCK
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index bf156de..8fb88c5 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -188,6 +188,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t 
*lock)
cpu_relax();
 }
 
+#ifndef CONFIG_QUEUE_RWLOCK
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.
@@ -270,6 +271,7 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
asm volatile(LOCK_PREFIX WRITE_LOCK_ADD(%1) "%0"
 : "+m" (rw->write) : "i" (RW_LOCK_BIAS) : "memory");
 }
+#endif /* CONFIG_QUEUE_RWLOCK */
 
 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
 #define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
diff --git a/arch/x86/include/asm/spinlock_types.h 
b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..a585635 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -34,6 +34,10 @@ typedef struct arch_spinlock {
 
 #define __ARCH_SPIN_LOCK_UNLOCKED  { { 0 } }
 
+#ifdef CONFIG_QUEUE_RWLOCK
+#include 
+#else
 #include 
+#endif
 
 #endif /* _ASM_X86_SPINLOCK_TYPES_H */
-- 
1.7.1



[PATCH v4 0/3] qrwlock: Introducing a queue read/write lock implementation

2013-10-02 Thread Waiman Long
v3->v4:
 - Optimize the fast path with better cold cache behavior and
   performance.
 - Removing some testing code.
 - Make x86 use queue rwlock with no user configuration.

v2->v3:
 - Make read lock stealing the default and fair rwlock an option with
   a different initializer.
 - In queue_read_lock_slowpath(), check irq_count() and force spinning
   and lock stealing in interrupt context.
 - Unify the fair and classic read-side code path, and make write-side
   to use cmpxchg with 2 different writer states. This slows down the
   write lock fastpath to make the read side more efficient, but is
   still slightly faster than a spinlock.

v1->v2:
 - Improve lock fastpath performance.
 - Optionally provide classic read/write lock behavior for backward
   compatibility.
 - Use xadd instead of cmpxchg for fair reader code path to make it
   immune to reader contention.
 - Run more performance testing.

As mentioned in the LWN article http://lwn.net/Articles/364583/,
the read/write lock suffers from an unfairness problem: it is
possible for a stream of incoming readers to block a waiting writer
from getting the lock for a long time. Also, a waiting reader/writer
contending for a rwlock in local memory will have a higher chance of
acquiring the lock than a reader/writer on a remote node.

This patch set introduces a queue-based read/write lock implementation
that can largely solve this unfairness problem if the lock owners
choose to use the fair variant of the lock.

The queue rwlock has two variants selected at initialization time
- unfair (with read lock stealing) and fair (to both readers and
writers). The unfair rwlock is the default.
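
For example, with the initializers added in patch 3 of this series, a
lock owner picks the variant at definition time (the lock names below
are just placeholders):

static DEFINE_RWLOCK(my_unfair_lock);       /* default: readers may steal */
static DEFINE_RWLOCK_FAIR(my_fair_lock);    /* fair to readers and writers */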

The read lock slowpath will check if the reader is in an interrupt
context. If so, it will force lock spinning and stealing without
waiting in a queue. This is to ensure the read lock will be granted
as soon as possible.

Even the unfair rwlock is fairer than the current version, as there
is a higher chance for writers to get the lock, and it is fair among
the writers.

The queue write lock can also be used as a replacement for ticket
spinlocks that are highly contended if lock size increase is not
an issue.

This patch set currently provides queue read/write lock support on
x86 architecture only. Support for other architectures can be added
later on once architecture specific support infrastructure is added
and proper testing is done.

Signed-off-by: Waiman Long 

Waiman Long (3):
  qrwlock: A queue read/write lock implementation
  qrwlock x86: Enable x86 to use queue read/write lock
  qrwlock: Enable fair queue read/write lock

 arch/x86/Kconfig  |1 +
 arch/x86/include/asm/spinlock.h   |2 +
 arch/x86/include/asm/spinlock_types.h |4 +
 include/asm-generic/qrwlock.h |  256 +
 include/linux/rwlock.h|   15 ++
 include/linux/rwlock_types.h  |   13 ++
 kernel/Kconfig.locks  |7 +
 lib/Makefile  |1 +
 lib/qrwlock.c |  247 +++
 lib/spinlock_debug.c  |   19 +++
 10 files changed, 565 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qrwlock.h
 create mode 100644 lib/qrwlock.c



[PATCH v4 3/3] qrwlock: Enable fair queue read/write lock

2013-10-02 Thread Waiman Long
By default, the queue rwlock is fair among writers and gives preference
to readers, allowing them to steal the lock even if a writer is
waiting. However, there is a desire to have a fair variant of
rwlock that is more deterministic. To enable this, fair variants
of lock initializers are added by this patch to allow lock owners
to choose to use fair rwlock. These newly added initializers all
have the _fair or _FAIR suffix to indicate the desire to use a fair
rwlock. If the QUEUE_RWLOCK config option is not selected, the fair
rwlock initializers will be the same as the regular ones.

Signed-off-by: Waiman Long 
---
 include/linux/rwlock.h   |   15 +++
 include/linux/rwlock_types.h |   13 +
 lib/spinlock_debug.c |   19 +++
 3 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/include/linux/rwlock.h b/include/linux/rwlock.h
index bc2994e..5f2628b 100644
--- a/include/linux/rwlock.h
+++ b/include/linux/rwlock.h
@@ -23,9 +23,24 @@ do { 
\
\
__rwlock_init((lock), #lock, &__key);   \
 } while (0)
+
+# ifdef CONFIG_QUEUE_RWLOCK
+extern void __rwlock_init_fair(rwlock_t *lock, const char *name,
+  struct lock_class_key *key);
+#  define rwlock_init_fair(lock)   \
+do {   \
+   static struct lock_class_key __key; \
+   \
+   __rwlock_init_fair((lock), #lock, &__key);  \
+} while (0)
+# else
+#  define __rwlock_init_fair(l, n, k)  __rwlock_init(l, n, k)
+# endif /* CONFIG_QUEUE_RWLOCK */
 #else
 # define rwlock_init(lock) \
do { *(lock) = __RW_LOCK_UNLOCKED(lock); } while (0)
+# define rwlock_init_fair(lock)\
+   do { *(lock) = __RW_LOCK_UNLOCKED_FAIR(lock); } while (0)
 #endif
 
 #ifdef CONFIG_DEBUG_SPINLOCK
diff --git a/include/linux/rwlock_types.h b/include/linux/rwlock_types.h
index cc0072e..d27c812 100644
--- a/include/linux/rwlock_types.h
+++ b/include/linux/rwlock_types.h
@@ -37,12 +37,25 @@ typedef struct {
.owner = SPINLOCK_OWNER_INIT,   \
.owner_cpu = -1,\
RW_DEP_MAP_INIT(lockname) }
+#define __RW_LOCK_UNLOCKED_FAIR(lockname)  \
+   (rwlock_t)  {   .raw_lock = __ARCH_RW_LOCK_UNLOCKED_FAIR,\
+   .magic = RWLOCK_MAGIC,  \
+   .owner = SPINLOCK_OWNER_INIT,   \
+   .owner_cpu = -1,\
+   RW_DEP_MAP_INIT(lockname) }
 #else
 #define __RW_LOCK_UNLOCKED(lockname) \
(rwlock_t)  {   .raw_lock = __ARCH_RW_LOCK_UNLOCKED,\
RW_DEP_MAP_INIT(lockname) }
+#define __RW_LOCK_UNLOCKED_FAIR(lockname) \
+   (rwlock_t)  {   .raw_lock = __ARCH_RW_LOCK_UNLOCKED_FAIR,\
+   RW_DEP_MAP_INIT(lockname) }
 #endif
 
 #define DEFINE_RWLOCK(x)   rwlock_t x = __RW_LOCK_UNLOCKED(x)
+#define DEFINE_RWLOCK_FAIR(x)  rwlock_t x = __RW_LOCK_UNLOCKED_FAIR(x)
 
+#ifndef __ARCH_RW_LOCK_UNLOCKED_FAIR
+#define __ARCH_RW_LOCK_UNLOCKED_FAIR    __ARCH_RW_LOCK_UNLOCKED
+#endif
 #endif /* __LINUX_RWLOCK_TYPES_H */
diff --git a/lib/spinlock_debug.c b/lib/spinlock_debug.c
index 0374a59..d6ef7ce 100644
--- a/lib/spinlock_debug.c
+++ b/lib/spinlock_debug.c
@@ -49,6 +49,25 @@ void __rwlock_init(rwlock_t *lock, const char *name,
 
 EXPORT_SYMBOL(__rwlock_init);
 
+#ifdef CONFIG_QUEUE_RWLOCK
+void __rwlock_init_fair(rwlock_t *lock, const char *name,
+   struct lock_class_key *key)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+   /*
+* Make sure we are not reinitializing a held lock:
+*/
+   debug_check_no_locks_freed((void *)lock, sizeof(*lock));
+   lockdep_init_map(&lock->dep_map, name, key, 0);
+#endif
+   lock->raw_lock = (arch_rwlock_t) __ARCH_RW_LOCK_UNLOCKED_FAIR;
+   lock->magic = RWLOCK_MAGIC;
+   lock->owner = SPINLOCK_OWNER_INIT;
+   lock->owner_cpu = -1;
+}
+EXPORT_SYMBOL(__rwlock_init_fair);
+#endif /* CONFIG_QUEUE_RWLOCK */
+
 static void spin_dump(raw_spinlock_t *lock, const char *msg)
 {
struct task_struct *owner = NULL;
-- 
1.7.1



[PATCH v4 1/3] qrwlock: A queue read/write lock implementation

2013-10-02 Thread Waiman Long
pin_lock
   1.04%   reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner

Perf profile of kernel (3):

  10.57%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   7.98%   reaim  [kernel.kallsyms]  [k] queue_write_lock_slowpath
   5.83%   reaim  [kernel.kallsyms]  [k] mspin_lock
   2.86%  ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   2.71%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
   1.52%true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   1.51%   reaim  [kernel.kallsyms]  [k] queue_read_lock_slowpath
   1.35%   reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
   1.12%   reaim  [kernel.kallsyms]  [k] zap_pte_range
   1.06%   reaim  [kernel.kallsyms]  [k] perf_event_aux_ctx
   1.01%   reaim  [kernel.kallsyms]  [k] perf_event_aux

Tim Chen also tested the qrwlock with Ingo's patch on a 4-socket
machine.  It was found that the performance improvement of 11% was the
same with either the regular rwlock or the queue rwlock.

Signed-off-by: Waiman Long 
---
 include/asm-generic/qrwlock.h |  256 +
 kernel/Kconfig.locks  |7 +
 lib/Makefile  |1 +
 lib/qrwlock.c |  247 +++
 4 files changed, 511 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qrwlock.h
 create mode 100644 lib/qrwlock.c

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
new file mode 100644
index 000..e94c69c
--- /dev/null
+++ b/include/asm-generic/qrwlock.h
@@ -0,0 +1,256 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long 
+ */
+#ifndef __ASM_GENERIC_QRWLOCK_H
+#define __ASM_GENERIC_QRWLOCK_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#if !defined(__LITTLE_ENDIAN) && !defined(__BIG_ENDIAN)
+#error "Missing either LITTLE_ENDIAN or BIG_ENDIAN definition."
+#endif
+
+#if (CONFIG_NR_CPUS < 65536)
+typedef u16 __nrcpu_t;
+typedef u32 __nrcpupair_t;
+#define QRW_READER_BIAS (1U << 16)
+#else
+typedef u32 __nrcpu_t;
+typedef u64 __nrcpupair_t;
+#define QRW_READER_BIAS (1UL << 32)
+#endif
+
+/*
+ * The queue read/write lock data structure
+ *
+ * Read lock stealing can only happen when there is at least one reader
+ * holding the read lock. When the fair flag is not set, it mimics the
+ * behavior of the regular rwlock at the expense that a perpetual stream
+ * of readers could starve a writer for a long period of time. That
+ * behavior, however, may be beneficial to a workload that is reader heavy
+ * with slow writers, and the writers can wait without undesirable consequence.
+ * This fair flag should only be set at initialization time.
+ *
+ * The layout of the structure is endian-sensitive to make sure that adding
+ * QRW_READER_BIAS to the rw field to increment the reader count won't
+ * disturb the writer and the fair fields.
+ */
+struct qrwnode {
+   struct qrwnode *next;
+   boolwait;   /* Waiting flag */
+};
+
+typedef struct qrwlock {
+   union qrwcnts {
+   struct {
+#ifdef __LITTLE_ENDIAN
+   u8writer;   /* Writer state */
+   u8fair; /* Fair rwlock flag */
+   __nrcpu_t readers;  /* # of active readers  */
+#else
+   __nrcpu_t readers;  /* # of active readers  */
+   u8fair; /* Fair rwlock flag */
+   u8writer;   /* Writer state */
+#endif
+   };
+   __nrcpupair_t rw;   /* Reader/writer number pair */
+   } cnts;
+   struct qrwnode *waitq;  /* Tail of waiting queue */
+} arch_rwlock_t;
+
+/*
+ * Writer state values & mask
+ */
+#define QW_WAITING      1                   /* A writer is waiting      */
+#define QW_LOCKED       0xff                /* A writer holds the lock  */
+#define QW_MASK_FAIR    ((u8)~QW_WAITING)   /* Mask for fair reader     */
+#define QW_MASK_UNFAIR  ((u8)~0)            /* Mask for unfair reader   */
+
+/*
+ * External function declarations
+ */
+extern void queue_read_lock_slowpath(struct qrwlock *lock);
+extern void queue_write_lock_slowpath(struct qrwlock *lock);
+
+/**
+ * queue_read_can_lock- would read_trylock() succe

Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-10-02 Thread Waiman Long

On 09/26/2013 06:42 PM, Jason Low wrote:

On Thu, 2013-09-26 at 14:41 -0700, Tim Chen wrote:

Okay, that would make sense for consistency because we always
first set node->lock = 0 at the top of the function.

If we prefer to optimize this a bit though, perhaps we can
first move the node->lock = 0 so that it gets executed after the
"if (likely(prev == NULL)) {}" code block and then delete
"node->lock = 1" inside the code block.

static noinline
void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node *node)
{
struct mcs_spin_node *prev;

/* Init node */
node->next   = NULL;

prev = xchg(lock, node);
if (likely(prev == NULL)) {
/* Lock acquired */
return;
}
node->locked = 0;


You can remove the locked flag setting statement inside if (prev == 
NULL), but you can't clear the locked flag after xchg(). In the interval 
between xchg() and locked=0, the previous lock owner may come in and set 
the flag. Now if you clear it, the thread will loop forever. You have
to clear it before xchg().


-Longman


Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-10-02 Thread Waiman Long

On 10/02/2013 02:43 PM, Tim Chen wrote:

On Tue, 2013-10-01 at 21:25 -0400, Waiman Long wrote:


If the lock and unlock functions are done right, there should be no
overlap of critical sections. So it is the job of the lock/unlock functions
to make sure that critical section code won't leak out. There should be
some kind of memory barrier at the beginning of the lock function and at
the end of the unlock function.

The critical section is also likely to have branches. The CPU may
speculatively execute code on the 2 branches, but one of them will be
discarded once the branch condition is known. Also,
arch_mutex_cpu_relax() is a compiler barrier by itself. So we may not
need a barrier() after all. The while statement is a branch instruction;
any code after it can only be speculatively executed and cannot be
committed until the branch is done.

But the condition code may be checked after speculative execution?
The condition may not be true during speculative execution and only
turns true when we check the condition, and take that branch?

The thing that bothers me is that without a memory barrier after the while
statement, we could speculatively execute before affirming the lock is
in the acquired state. Then when we check the lock, the lock has been set
to the acquired state in the meantime.
We could be loading some memory entry *before*
node->locked has been set true.  I think a smp_rmb (if not a
smp_mb) should be added after the while statement.


Yes, I think a smp_rmb() makes sense here to correspond to the smp_wmb()
in the unlock path.


BTW, you need to move the node->locked = 0; statement before xchg() if 
you haven't done so.
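
Putting those two points together, the lock function would then look
roughly like this (an illustrative sketch only, based on the code quoted
earlier in this thread):

static noinline
void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node *node)
{
        struct mcs_spin_node *prev;

        /* Init node: locked must be cleared before the node is published */
        node->locked = 0;
        node->next   = NULL;

        prev = xchg(lock, node);        /* also acts as a full barrier */
        if (likely(prev == NULL)) {
                /* Lock acquired */
                return;
        }
        ACCESS_ONCE(prev->next) = node;
        /* Wait until the lock holder passes the lock down */
        while (!ACCESS_ONCE(node->locked))
                arch_mutex_cpu_relax();
        smp_rmb();      /* pairs with the smp_wmb() in the unlock path */
}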


-Longman




Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-10-02 Thread Waiman Long

On 10/02/2013 03:30 PM, Jason Low wrote:

On Wed, Oct 2, 2013 at 12:19 PM, Waiman Long  wrote:

On 09/26/2013 06:42 PM, Jason Low wrote:

On Thu, 2013-09-26 at 14:41 -0700, Tim Chen wrote:

Okay, that would make sense for consistency because we always
first set node->lock = 0 at the top of the function.

If we prefer to optimize this a bit though, perhaps we can
first move the node->lock = 0 so that it gets executed after the
"if (likely(prev == NULL)) {}" code block and then delete
"node->lock = 1" inside the code block.

static noinline
void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node
*node)
{
 struct mcs_spin_node *prev;

 /* Init node */
 node->next   = NULL;

 prev = xchg(lock, node);
 if (likely(prev == NULL)) {
 /* Lock acquired */
 return;
 }
 node->locked = 0;


You can remove the locked flag setting statement inside if (prev == NULL),
but you can't clear the locked flag after xchg(). In the interval between
xchg() and locked=0, the previous lock owner may come in and set the flag.
Now if you clear it, the thread will loop forever. You have to clear it
before xchg().

Yes, in my most recent version, I left locked = 0 in its original
place so that the xchg() can act as a barrier for it.

The other option would have been to put another barrier after locked =
0. I went with leaving locked = 0 in its original place so that we
don't need that extra barrier.


I don't think putting another barrier after locked=0 will work. 
Chronologically, the flag must be cleared before the node address is 
saved in the lock field. There is no way to guarantee that except by 
putting the locked=0 before xchg().


-Longman


Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount

2013-07-08 Thread Waiman Long

On 07/05/2013 02:59 PM, Thomas Gleixner wrote:

On Fri, 5 Jul 2013, Waiman Long wrote:

+ * If the spinlock & reference count optimization feature is disabled,
+ * the spinlock and reference count are accessed separately on their own.
+ */
+struct lockref {
+   unsigned int refcnt;/* Reference count */
+   spinlock_t   lock;
+};
+
+/*
+ * Struct lockref helper functions
+ */
+/*

Function documentation starts with /**


I will fix that.


+ * lockref_get - increments reference count while not locked

This should be: Increments reference count unconditionally.


Will change that.


+ * @lockcnt: pointer to lockref structure
+ */
+static __always_inline void
+lockref_get(struct lockref *lockcnt)

Please avoid these line breaks if the line fits in 80 chars.


Will make the change.


+{
+   spin_lock(&lockcnt->lock);
+   lockcnt->refcnt++;
+   spin_unlock(&lockcnt->lock);
+}
+/*
+ * lockref_put_or_locked - decrements count unless count <= 1 before decrement
+ *                         otherwise the lock will be taken

   lockref_put_or_lock please

Docbook cannot work with multiline comments for the function name.

So make that shorter and add a longer explanation below the @argument
docs.


Will fix that.


+ * @lockcnt: pointer to lockref structure
+ * Return: 1 if count updated successfully or 0 if count <= 1 and lock taken
+ */
+static __always_inline int
+lockref_put_or_locked(struct lockref *lockcnt)
+{
+   spin_lock(&lockcnt->lock);
+   if (likely(lockcnt->refcnt > 1)) {
+   lockcnt->refcnt--;
+   spin_unlock(&lockcnt->lock);
+   return 1;
+   }
+   return 0;   /* Count is 1 & lock taken */

Please no tail comments. They are horrible to parse and it's obvious
from the code.


Will remove the tail comments.


+}
+
+#endif /* CONFIG_SPINLOCK_REFCOUNT */
+#endif /* __LINUX_SPINLOCK_REFCOUNT_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 44511d1..d1f8670 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,8 @@ endif
  config MUTEX_SPIN_ON_OWNER
def_bool y
depends on SMP && !DEBUG_MUTEXES
+
+config SPINLOCK_REFCOUNT
+   def_bool y
+   depends on ARCH_SPINLOCK_REFCOUNT && SMP

This looks wrong. We want three options:

1) Always take the lock
2) Use the generic implementation
3) Use an architecture specific implementation

So we want

config GENERIC_SPINLOCK_REFCOUNT
  bool

config ARCH_SPINLOCK_REFCOUNT
  bool

config SPINLOCK_REFCOUNT
  def_bool y
  depends on GENERIC_SPINLOCK_REFCOUNT || ARCH_SPINLOCK_REFCOUNT
  depends on .

So an architectire can select GENERIC_SPINLOCK_REFCOUNT to get the
generic implementation or ARCH_SPINLOCK_REFCOUNT if it provides its
own special version.

And lib/spinlock_refcount.o depends on GENERIC_SPINLOCK_REFCOUNT


I think it is OK to add the GENERIC option, but I would like to make 
available a slightly different set of options:

1) Always take the lock
2) Use the generic implementation with the default parameters
3) Use the generic implementation with a customized set of parameters
4) Use an architecture specific implementation.

2) set only GENERIC_SPINLOCK_REFCOUNT
3) set both GENERIC_SPINLOCK_REFCOUNT and ARCH_SPINLOCK_REFCOUNT
4) set only ARCH_SPINLOCK_REFCOUNT

The customized parameters will be set in the "asm/spinlock_refcount.h"
file. Currently, there are 2 parameters that can be customized for each
architecture:

1) How much time will the function wait until the lock is free
2) How many attempts to do a lockless cmpxchg to update reference count
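
In other words, an architecture that wants customized behavior would just
provide something like the following in its asm/spinlock_refcount.h (the
macro names are the ones used by the generic implementation; the values
and comments here are only illustrative):

/* how long to wait for the lock to become free before falling back */
#define LOCKREF_WAIT_SHIFT      12
/* how many lockless cmpxchg attempts before taking the lock */
#define LOCKREF_RETRY_COUNT     2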


+/*
+ * The maximum number of waiting spins when the lock was acquiring before
+ * trying to attempt a lockless update. The purpose of the timeout is to
+ * limit the amount of unfairness to the thread that is doing the reference
+ * count update. Otherwise, it is theoretically possible for that thread to
+ * be starved for a really long time causing all kind of problems. If it
+ * times out and the lock is still not free, the code will fall back to the
+ * traditional way of queuing up to acquire the lock before updating the count.
+ *
+ * The actual time spent in the wait loop very much depends on the CPU being
+ * used. On a 2.4GHz Westmere CPU, the execution time of a PAUSE instruction
+ * (cpu_relax) in a 4k loop is about 16us. The lock checking and looping
+ * overhead is about 8us. With 4 cpu_relax's in the loop, it will wait
+ * about 72us before the count reaches 0. With cacheline contention, the
+ * wait time can go up to 3x as much (about 210us). Having multiple
+ * cpu_relax's in the wait loop does seem to reduce cacheline contention
+ * somewhat and give slightly better performance.
+ *
+ * The preset timeout value is rather arbitrary and really depends on the CPU
+ * being used. If customization i

Re: [PATCH v5 00/12] Lockless update of reference count protected by spinlock

2013-07-08 Thread Waiman Long

On 07/05/2013 04:33 PM, Thomas Gleixner wrote:

On Fri, 5 Jul 2013, Waiman Long wrote:

patch 1:Introduce the new lockref data structure
patch 2:Enable x86 architecture to use the feature
patch 3:Rename all d_count references to d_refcount

And after that the mail threading does not work anymore, because you
managed to lose the

In-Reply-To:<1373035656-40600-1-git-send-email-waiman.l...@hp.com>
References:<1373035656-40600-1-git-send-email-waiman.l...@hp.com>

tags in patches 4-12


Thanks for pointing this out. I will change the way I send the patches
to fix this issue in the next revision of the patch set.


Regards,
Longman


Re: [PATCH 1/2 v5] SELinux: Reduce overhead of mls_level_isvalid() function call

2013-07-08 Thread Waiman Long

On 07/08/2013 12:30 PM, Paul Moore wrote:

On Friday, July 05, 2013 01:10:32 PM Waiman Long wrote:

On 06/11/2013 07:49 AM, Stephen Smalley wrote:

On 06/10/2013 01:55 PM, Waiman Long wrote:

...


Signed-off-by: Waiman Long

Acked-by:  Stephen Smalley

Thanks for the Ack. Will that patch go into v3.11?

[NOTE: I added the SELinux list to the CC line; for future reference, be sure to
send your SELinux patches there.]

Your patch looked reasonable to me and Stephen ACK'd it so I went ahead and
pulled the 1/2 patch into my lblnet-next tree.  It is probably an abuse of the
system, but as you noted in the description, it does have an impact on
socket creation so it isn't completely unrelated ;)

If you don't want me to include your patch let me know and I'll drop it.


Sure. I would like to have my patch included. Thanks for letting me know.

Regards,
Longman


[PATCH v6 11/14] nilfs2: replace direct access of d_count with the d_count() helper

2013-07-08 Thread Waiman Long
All readonly references to d_count outside of the core dcache code
should be changed to use the new d_count() helper as they shouldn't
access its value directly.  There is no change in logic and everything
should just work.

Signed-off-by: Waiman Long 
Acked-by: Ryusuke Konishi 
---
 fs/nilfs2/super.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 1427de5..af3ba04 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -996,7 +996,7 @@ static int nilfs_attach_snapshot(struct super_block *s, 
__u64 cno,
 
 static int nilfs_tree_was_touched(struct dentry *root_dentry)
 {
-   return root_dentry->d_count > 1;
+   return d_count(root_dentry) > 1;
 }
 
 /**
-- 
1.7.1



[PATCH v6 02/14] spinlock: Enable x86 architecture to do lockless refcount update

2013-07-08 Thread Waiman Long
This patch enables the x86 architecture to do lockless reference
count update using the generic lockref implementation with default
parameters. Only the x86/Kconfig file needs to be changed.

Signed-off-by: Waiman Long 
---
 arch/x86/Kconfig |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 265c672..6a86061 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -261,6 +261,9 @@ config ARCH_CPU_PROBE_RELEASE
 config ARCH_SUPPORTS_UPROBES
def_bool y
 
+config GENERIC_SPINLOCK_REFCOUNT
+   def_bool y
+
 source "init/Kconfig"
 source "kernel/Kconfig.freezer"
 
-- 
1.7.1



[PATCH v6 01/14] spinlock: A new lockref structure for lockless update of refcount

2013-07-08 Thread Waiman Long
This patch introduces a new set of spinlock_refcount.h header files to
be included by kernel codes that want to do a faster lockless update
of reference count protected by a spinlock.

The new lockref structure consists of just the spinlock and the
reference count data. Helper functions are defined in the new
 header file to access the content of
the new structure. There is a generic structure defined for all
architecture, but each architecture can also optionally define its
own structure and use its own helper functions.

Three new config parameters are introduced:
1. SPINLOCK_REFCOUNT
2. GENERIC_SPINLOCK_REFCOUNT
3. ARCH_SPINLOCK_REFCOUNT

The first one is defined in the kernel/Kconfig.locks which is used
to enable or disable the faster lockless reference count update
optimization. The second and third one have to be defined in each of
the architecture's Kconfig file to enable the optimization for that
architecture. Therefore, each architecture has to opt-in for this
optimization or it won't get it. This allows each architecture plenty
of time to test it out before deciding to use it or replace it with
a better architecture specific solution. The architecture should set
only GENERIC_SPINLOCK_REFCOUNT to use the generic implementation
without customization. By setting only ARCH_SPINLOCK_REFCOUNT,
the architecture will have to provide its own implementation. By
setting both, an architecture uses the generic implementation with
customized parameters.

This optimization won't work on non-SMP systems or when spinlock
debugging is turned on. As a result, it is turned off when either of
them is true. It also won't work with full preempt-RT and so should
be turned off in that case.

To maximize the chance of doing lockless update in the generic version,
the inlined __lockref_add_unless() function will wait for a certain
amount of time if the lock is not free before trying to do the update.
The amount of time is controlled by the LOCKREF_WAIT_SHIFT macro.

The new code also attempts to do a lockless atomic update a few
times before falling back to the old code path of acquiring the lock
before doing the update. Similarly, this is controlled by the
LOCKREF_RETRY_COUNT macro.
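
A condensed sketch of that flow is shown below. This is not the actual
lib/spinlock_refcount.c code: the waiting loop is omitted, the function
signature is assumed, and a 64-bit cmpxchg is assumed to be available.

static __always_inline int
__lockref_add_unless(struct lockref *lc, int add, int unless)
{
        struct lockref old, new;
        int retry;

        for (retry = 0; retry < LOCKREF_RETRY_COUNT; retry++) {
                old.lock_count = ACCESS_ONCE(lc->lock_count);
                if ((int)old.refcnt == unless)
                        break;
                if (spin_is_locked(&old.lock))
                        continue;       /* the real code waits here */
                new.lock_count = old.lock_count;
                new.refcnt += add;
                if (cmpxchg64(&lc->lock_count, old.lock_count,
                              new.lock_count) == old.lock_count)
                        return 1;       /* lockless update done */
        }
        return 0;       /* fall back to spin_lock() + update */
}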

Signed-off-by: Waiman Long 
---
 include/asm-generic/spinlock_refcount.h |   46 +++
 include/linux/spinlock_refcount.h   |  142 
 kernel/Kconfig.locks|   15 ++
 lib/Makefile|2 +
 lib/spinlock_refcount.c |  218 +++
 5 files changed, 423 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/spinlock_refcount.h
 create mode 100644 include/linux/spinlock_refcount.h
 create mode 100644 lib/spinlock_refcount.c

diff --git a/include/asm-generic/spinlock_refcount.h 
b/include/asm-generic/spinlock_refcount.h
new file mode 100644
index 000..d3a4119
--- /dev/null
+++ b/include/asm-generic/spinlock_refcount.h
@@ -0,0 +1,46 @@
+/*
+ * Spinlock with reference count combo
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (c) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long 
+ */
+#ifndef __ASM_GENERIC_SPINLOCK_REFCOUNT_H
+#define __ASM_GENERIC_SPINLOCK_REFCOUNT_H
+
+/*
+ * The lockref structure defines a combined spinlock with reference count
+ * data structure to be embedded in a larger structure. The combined data
+ * structure is always 8-byte aligned. So proper placement of this structure
+ * in the larger embedding data structure is needed to ensure that there is
+ * no hole in it.
+ */
+struct __aligned(sizeof(u64)) lockref {
+   union {
+   u64 lock_count;
+   struct {
+   unsigned intrefcnt; /* Reference count */
+   spinlock_t  lock;
+   };
+   };
+};
+
+/*
+ * Struct lockref helper functions
+ */
+extern void lockref_get(struct lockref *lockcnt);
+extern int  lockref_put(struct lockref *lockcnt);
+extern int  lockref_get_not_zero(struct lockref *lockcnt);
+extern int  lockref_put_or_lock(struct lockref *lockcnt);
+
+#endif /* __ASM_GENERIC_SPINLOCK_REFCOUNT_H */
diff --git a/include/linux/spinlock_refcount.h 
b/include/linux/spinlock_refcount.h
new file mode 100644
index 000..32389a9
--- /dev/null
+++ b/include/linux/spinlock_refcount.h
@@ -0,0 +1,142 @@
+/*
+ * Spinlock with reference count combo data structure
+ *
+ * This program is free software; you can redistribute it and/or mod

[PATCH v6 14/14] dcache: Enable lockless update of refcount in dentry structure

2013-07-08 Thread Waiman Long
e are some abnormalities in the original 3.10 2-node data. Ignoring
that, the performance difference for the other node counts, if any,
is insignificant.

A perf call-graph report of the short workload at 1500 users
without the patch on the same 8-node machine indicates that about
78% of the workload's total time were spent in the _raw_spin_lock()
function. Almost all of which can be attributed to the following 2
kernel functions:
 1. dget_parent (49.91%)
 2. dput (49.89%)

The relevant perf report lines are:
+  78.37%reaim  [kernel.kallsyms] [k] _raw_spin_lock
+   0.09%reaim  [kernel.kallsyms] [k] dput
+   0.05%reaim  [kernel.kallsyms] [k] _raw_spin_lock_irq
+   0.00%reaim  [kernel.kallsyms] [k] dget_parent

With this patch installed, the new perf report lines are:
+  19.65%reaim  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
+   3.94%reaim  [kernel.kallsyms] [k] _raw_spin_lock
+   2.47%reaim  [kernel.kallsyms] [k] lockref_get_not_zero
+   0.62%reaim  [kernel.kallsyms] [k] lockref_put_or_locked
+   0.36%reaim  [kernel.kallsyms] [k] dput
+   0.31%reaim  [kernel.kallsyms] [k] lockref_get
+   0.02%reaim  [kernel.kallsyms] [k] dget_parent

-   3.94%reaim  [kernel.kallsyms] [k] _raw_spin_lock
   - _raw_spin_lock
  + 32.86% SyS_getcwd
  + 31.99% d_path
  + 4.81% prepend_path
  + 4.14% __rcu_process_callbacks
  + 3.73% complete_walk
  + 2.31% dget_parent
  + 1.99% unlazy_walk
  + 1.44% do_anonymous_page
  + 1.22% lockref_put_or_locked
  + 1.16% sem_lock
  + 0.95% task_rq_lock
  + 0.89% selinux_inode_free_security
  + 0.89% process_backlog
  + 0.79% enqueue_to_backlog
  + 0.72% unix_dgram_sendmsg
  + 0.69% unix_stream_sendmsg

The lockref_put_or_locked used up only 1.22% of the _raw_spin_lock
time while dget_parent used only 2.31%.

The impact of this patch on other AIM7 workloads was much more
modest.  The table below shows the mean % change due to this patch on
the same 8-socket system with a 3.10 kernel.

+--+---++-+
|   Workload   | mean % change | mean % change  | mean % change   |
|  | 10-100 users  | 200-1000 users | 1100-2000 users |
+--+---++-+
| alltests | -0.2% | +0.5%  | -0.3%   |
| five_sec | +2.5% | -4.2%  | -4.7%   |
| fserver  | +1.7% | +1.6%  | +0.3%   |
| high_systime | +0.1% | +1.4%  | +5.5%   |
| new_fserver  | +0.4% | +1.2%  | +0.3%   |
| shared   | +0.8% | -0.3%  |  0.0%   |
+--+---++-+

There are slight drops in performance for the five_sec workload,
but slight increase in the high_systime workload.

The checkpatch.pl script reported errors in the d_lock and d_refcount
macros as it wanted parentheses around the actual names. That won't
work for those macros, so the errors should be ignored.

Signed-off-by: Waiman Long 
---
 fs/dcache.c|   18 --
 include/linux/dcache.h |   17 ++---
 2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 20def64..095ee18 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -515,7 +515,9 @@ void dput(struct dentry *dentry)
 repeat:
if (dentry->d_refcount == 1)
might_sleep();
-   spin_lock(&dentry->d_lock);
+   if (lockref_put_or_lock(&dentry->d_lockcnt))
+   return;
+   /* dentry's lock taken */
BUG_ON(!dentry->d_refcount);
if (dentry->d_refcount > 1) {
dentry->d_refcount--;
@@ -611,26 +613,30 @@ static inline void __dget_dlock(struct dentry *dentry)
 
 static inline void __dget(struct dentry *dentry)
 {
-   spin_lock(&dentry->d_lock);
-   __dget_dlock(dentry);
-   spin_unlock(&dentry->d_lock);
+   lockref_get(&dentry->d_lockcnt);
 }
 
 struct dentry *dget_parent(struct dentry *dentry)
 {
struct dentry *ret;
 
+   rcu_read_lock();
+   ret = rcu_dereference(dentry->d_parent);
+   if (lockref_get_not_zero(&ret->d_lockcnt)) {
+   rcu_read_unlock();
+   return ret;
+   }
 repeat:
/*
 * Don't need rcu_dereference because we re-check it was correct under
 * the lock.
 */
-   rcu_read_lock();
-   ret = dentry->d_parent;
+   ret = ACCESS_ONCE(dentry->d_parent);
spin_lock(&ret->d_lock);
if (unlikely(ret != dentry->d_parent)) {
spin_unlock(&ret->d_lock);
rcu_read_unlock();
+   rcu_read_lock();
goto repeat;
}
rcu_read_unlock(

[PATCH v6 12/14] lustre-fs: Use the standard d_count() helper to access refcount

2013-07-08 Thread Waiman Long
The Lustre FS should use the newly defined d_count() helper function
to access the dentry's reference count instead of defining its own
d_refcount() macro for the same purpose. Since the current lustre
code is marked as broken, no build test was attempted for this change.

Signed-off-by: Waiman Long 
---
 .../lustre/include/linux/lustre_patchless_compat.h |2 --
 drivers/staging/lustre/lustre/include/linux/lvfs.h |2 +-
 drivers/staging/lustre/lustre/llite/dcache.c   |8 
 .../staging/lustre/lustre/llite/llite_internal.h   |4 ++--
 drivers/staging/lustre/lustre/llite/llite_lib.c|2 +-
 drivers/staging/lustre/lustre/llite/namei.c|4 ++--
 drivers/staging/lustre/lustre/lvfs/lvfs_linux.c|4 ++--
 7 files changed, 12 insertions(+), 14 deletions(-)

diff --git 
a/drivers/staging/lustre/lustre/include/linux/lustre_patchless_compat.h 
b/drivers/staging/lustre/lustre/include/linux/lustre_patchless_compat.h
index f050808..a8e9c0c 100644
--- a/drivers/staging/lustre/lustre/include/linux/lustre_patchless_compat.h
+++ b/drivers/staging/lustre/lustre/include/linux/lustre_patchless_compat.h
@@ -60,8 +60,6 @@ truncate_complete_page(struct address_space *mapping, struct 
page *page)
ll_delete_from_page_cache(page);
 }
 
-#  define d_refcount(d) ((d)->d_count)
-
 #ifdef ATTR_OPEN
 # define ATTR_FROM_OPEN ATTR_OPEN
 #else
diff --git a/drivers/staging/lustre/lustre/include/linux/lvfs.h 
b/drivers/staging/lustre/lustre/include/linux/lvfs.h
index b4db6cb..eb59ac7 100644
--- a/drivers/staging/lustre/lustre/include/linux/lvfs.h
+++ b/drivers/staging/lustre/lustre/include/linux/lvfs.h
@@ -99,7 +99,7 @@ static inline void l_dput(struct dentry *de)
if (!de || IS_ERR(de))
return;
//shrink_dcache_parent(de);
-   LASSERT(d_refcount(de) > 0);
+   LASSERT(d_count(de) > 0);
dput(de);
 }
 
diff --git a/drivers/staging/lustre/lustre/llite/dcache.c 
b/drivers/staging/lustre/lustre/llite/dcache.c
index 7d6abff..ff0d085 100644
--- a/drivers/staging/lustre/lustre/llite/dcache.c
+++ b/drivers/staging/lustre/lustre/llite/dcache.c
@@ -98,7 +98,7 @@ int ll_dcompare(const struct dentry *parent, const struct 
inode *pinode,
 
CDEBUG(D_DENTRY, "found name %.*s(%p) flags %#x refc %d\n",
   name->len, name->name, dentry, dentry->d_flags,
-  d_refcount(dentry));
+  d_count(dentry));
 
/* mountpoint is always valid */
if (d_mountpoint((struct dentry *)dentry))
@@ -165,7 +165,7 @@ static int ll_ddelete(const struct dentry *de)
   list_empty(&de->d_subdirs) ? "" : "subdirs");
 
/* kernel >= 2.6.38 last refcount is decreased after this function. */
-   LASSERT(d_refcount(de) == 1);
+   LASSERT(d_count(de) == 1);
 
/* Disable this piece of code temproarily because this is called
 * inside dcache_lock so it's not appropriate to do lots of work
@@ -190,7 +190,7 @@ static int ll_set_dd(struct dentry *de)
 
CDEBUG(D_DENTRY, "ldd on dentry %.*s (%p) parent %p inode %p refc %d\n",
de->d_name.len, de->d_name.name, de, de->d_parent, de->d_inode,
-   d_refcount(de));
+   d_count(de));
 
if (de->d_fsdata == NULL) {
struct ll_dentry_data *lld;
@@ -540,7 +540,7 @@ out:
CDEBUG(D_DENTRY, "revalidated dentry %.*s (%p) parent %p "
   "inode %p refc %d\n", de->d_name.len,
   de->d_name.name, de, de->d_parent, de->d_inode,
-  d_refcount(de));
+  d_count(de));
 
ll_set_lock_data(exp, de->d_inode, it, &bits);
 
diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h 
b/drivers/staging/lustre/lustre/llite/llite_internal.h
index 992cd20..5227c5c 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -1529,12 +1529,12 @@ static inline void d_lustre_invalidate(struct dentry 
*dentry, int nested)
 {
CDEBUG(D_DENTRY, "invalidate dentry %.*s (%p) parent %p inode %p "
   "refc %d\n", dentry->d_name.len, dentry->d_name.name, dentry,
-  dentry->d_parent, dentry->d_inode, d_refcount(dentry));
+  dentry->d_parent, dentry->d_inode, d_count(dentry));
 
spin_lock_nested(&dentry->d_lock,
 nested ? DENTRY_D_LOCK_NESTED : DENTRY_D_LOCK_NORMAL);
__d_lustre_invalidate(dentry);
-   if (d_refcount(dentry) == 0)
+   if (d_count(dentry) == 0)
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
 }
diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c 
b/drivers/staging/lustre/lustre/llite/llite_lib.c
index 2311b20..afae

[PATCH v6 13/14] dcache: rename d_count field of dentry to d_refcount

2013-07-08 Thread Waiman Long
Before converting the d_lock and d_count fields of the dentry data
structure to the new lockref structure, we need to consider the
implications of such a change. All current references to d_count and
d_lock have to be updated accordingly.

One way to minimize the churn is to redefine the original field
names as macros for the new names.  For d_lock this works and saves a
lot of changes, as the name is not used anywhere else in the kernel.
For d_count it does not, because the name is already used elsewhere
in the kernel for other purposes.

The dcache.c and namei.c files modify the reference count value
directly.  They are therefore changed to use a new reference count
name, "d_refcount", which is unique in the kernel source code.

Signed-off-by: Waiman Long 
---
 fs/dcache.c|   54 
 fs/namei.c |6 ++--
 include/linux/dcache.h |8 +++---
 3 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 87bdb53..20def64 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -54,7 +54,7 @@
  *   - d_flags
  *   - d_name
  *   - d_lru
- *   - d_count
+ *   - d_refcount
  *   - d_unhashed()
  *   - d_parent and d_subdirs
  *   - childrens' d_child and d_parent
@@ -229,7 +229,7 @@ static void __d_free(struct rcu_head *head)
  */
 static void d_free(struct dentry *dentry)
 {
-   BUG_ON(dentry->d_count);
+   BUG_ON(dentry->d_refcount);
this_cpu_dec(nr_dentry);
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);
@@ -467,7 +467,7 @@ relock:
}
 
if (ref)
-   dentry->d_count--;
+   dentry->d_refcount--;
/*
 * inform the fs via d_prune that this dentry is about to be
 * unhashed and destroyed.
@@ -513,12 +513,12 @@ void dput(struct dentry *dentry)
return;
 
 repeat:
-   if (dentry->d_count == 1)
+   if (dentry->d_refcount == 1)
might_sleep();
spin_lock(&dentry->d_lock);
-   BUG_ON(!dentry->d_count);
-   if (dentry->d_count > 1) {
-   dentry->d_count--;
+   BUG_ON(!dentry->d_refcount);
+   if (dentry->d_refcount > 1) {
+   dentry->d_refcount--;
spin_unlock(&dentry->d_lock);
return;
}
@@ -535,7 +535,7 @@ repeat:
dentry->d_flags |= DCACHE_REFERENCED;
dentry_lru_add(dentry);
 
-   dentry->d_count--;
+   dentry->d_refcount--;
spin_unlock(&dentry->d_lock);
return;
 
@@ -590,7 +590,7 @@ int d_invalidate(struct dentry * dentry)
 * We also need to leave mountpoints alone,
 * directory or not.
 */
-   if (dentry->d_count > 1 && dentry->d_inode) {
+   if (dentry->d_refcount > 1 && dentry->d_inode) {
if (S_ISDIR(dentry->d_inode->i_mode) || d_mountpoint(dentry)) {
spin_unlock(&dentry->d_lock);
return -EBUSY;
@@ -606,7 +606,7 @@ EXPORT_SYMBOL(d_invalidate);
 /* This must be called with d_lock held */
 static inline void __dget_dlock(struct dentry *dentry)
 {
-   dentry->d_count++;
+   dentry->d_refcount++;
 }
 
 static inline void __dget(struct dentry *dentry)
@@ -634,8 +634,8 @@ repeat:
goto repeat;
}
rcu_read_unlock();
-   BUG_ON(!ret->d_count);
-   ret->d_count++;
+   BUG_ON(!ret->d_refcount);
+   ret->d_refcount++;
spin_unlock(&ret->d_lock);
return ret;
 }
@@ -718,7 +718,7 @@ restart:
spin_lock(&inode->i_lock);
hlist_for_each_entry(dentry, &inode->i_dentry, d_alias) {
spin_lock(&dentry->d_lock);
-   if (!dentry->d_count) {
+   if (!dentry->d_refcount) {
__dget_dlock(dentry);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
@@ -734,7 +734,7 @@ EXPORT_SYMBOL(d_prune_aliases);
 
 /*
  * Try to throw away a dentry - free the inode, dput the parent.
- * Requires dentry->d_lock is held, and dentry->d_count == 0.
+ * Requires dentry->d_lock is held, and dentry->d_refcount == 0.
  * Releases dentry->d_lock.
  *
  * This may fail if locks cannot be acquired no problem, just try again.
@@ -764,8 +764,8 @@ static void try_prune_one_dentry(struct dentry *dentry)
dentry = parent;
while (dentry) {
spin_lock(&dentry->d_lock);
-   if (dentry->d_count > 1) {
-   dentry->d_count--;
+   if (dentry->d_refcount > 1) {
+   dentry->d_refcount--;
spin_unlock(&dentry->

[PATCH v6 10/14] nfs: replace direct access of d_count with the d_count() helper

2013-07-08 Thread Waiman Long
All readonly references to d_count outside of the core dcache code
should be changed to use the new d_count() helper as they shouldn't
access its value directly.  There is no change in logic and everything
should just work.

Signed-off-by: Waiman Long 
---
 fs/nfs/dir.c|6 +++---
 fs/nfs/unlink.c |2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index d7ed697..c933bdf 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1721,7 +1721,7 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
dir->i_ino, dentry->d_name.name);
 
spin_lock(&dentry->d_lock);
-   if (dentry->d_count > 1) {
+   if (d_count(dentry) > 1) {
spin_unlock(&dentry->d_lock);
/* Start asynchronous writeout of the inode */
write_inode_now(dentry->d_inode, 0);
@@ -1866,7 +1866,7 @@ int nfs_rename(struct inode *old_dir, struct dentry 
*old_dentry,
dfprintk(VFS, "NFS: rename(%s/%s -> %s/%s, ct=%d)\n",
 old_dentry->d_parent->d_name.name, old_dentry->d_name.name,
 new_dentry->d_parent->d_name.name, new_dentry->d_name.name,
-new_dentry->d_count);
+d_count(new_dentry));
 
/*
 * For non-directories, check whether the target is busy and if so,
@@ -1884,7 +1884,7 @@ int nfs_rename(struct inode *old_dir, struct dentry 
*old_dentry,
rehash = new_dentry;
}
 
-   if (new_dentry->d_count > 2) {
+   if (d_count(new_dentry) > 2) {
int err;
 
/* copy the target dentry's name */
diff --git a/fs/nfs/unlink.c b/fs/nfs/unlink.c
index 1f1f38f..60395ad 100644
--- a/fs/nfs/unlink.c
+++ b/fs/nfs/unlink.c
@@ -479,7 +479,7 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
 
dfprintk(VFS, "NFS: silly-rename(%s/%s, ct=%d)\n",
dentry->d_parent->d_name.name, dentry->d_name.name,
-   dentry->d_count);
+   d_count(dentry));
nfs_inc_stats(dir, NFSIOS_SILLYRENAME);
 
/*
-- 
1.7.1



[PATCH v6 09/14] file locking: replace direct access of d_count with the d_count() helper

2013-07-08 Thread Waiman Long
All readonly references to d_count outside of the core dcache code
should be changed to use the new d_count() helper as they shouldn't
access its value directly.  There is no change in logic and everything
should just work.

Signed-off-by: Waiman Long 
---
 fs/locks.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 04e2c1f..c98e1a1 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1454,7 +1454,7 @@ static int generic_add_lease(struct file *filp, long arg, 
struct file_lock **flp
if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
goto out;
if ((arg == F_WRLCK)
-   && ((dentry->d_count > 1)
+   && ((d_count(dentry) > 1)
|| (atomic_read(&inode->i_count) > 1)))
goto out;
 
-- 
1.7.1



[PATCH v6 03/14] dcache: Add a new helper function d_count() to return refcount

2013-07-08 Thread Waiman Long
This patch adds a new helper function d_count() in dcache.h for
returning the current reference count of the dentry object. It
should be used by all the files outside of the core dcache.c and
namei.c files.

Signed-off-by: Waiman Long 
---
 include/linux/dcache.h |   10 ++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index f42dbe1..7c6bbf0 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -258,6 +258,16 @@ extern int have_submounts(struct dentry *);
 extern void d_rehash(struct dentry *);
 
 /**
+ * d_count - return the reference count in dentry
+ * @entry: dentry pointer
+ * Returns: current value of reference count
+ */
+static inline unsigned int d_count(struct dentry *entry)
+{
+   return entry->d_count;
+}
+
+/**
  * d_add - add dentry to hash queues
  * @entry: dentry to add
  * @inode: The inode to attach to this dentry
-- 
1.7.1



[PATCH v6 08/14] ecrypt-fs: replace direct access of d_count with the d_count() helper

2013-07-08 Thread Waiman Long
All readonly references to d_count outside of the core dcache code
should be changed to use the new d_count() helper as they shouldn't
access its value directly.  There is no change in logic and everything
should just work.

Signed-off-by: Waiman Long 
---
 fs/ecryptfs/inode.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index a2f2bb2..67e9b63 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -358,7 +358,7 @@ static int ecryptfs_lookup_interpose(struct dentry *dentry,
 
lower_mnt = mntget(ecryptfs_dentry_to_lower_mnt(dentry->d_parent));
fsstack_copy_attr_atime(dir_inode, lower_dentry->d_parent->d_inode);
-   BUG_ON(!lower_dentry->d_count);
+   BUG_ON(!d_count(lower_dentry));
 
ecryptfs_set_dentry_private(dentry, dentry_info);
ecryptfs_set_dentry_lower(dentry, lower_dentry);
-- 
1.7.1



[PATCH v6 06/14] coda-fs: replace direct access of d_count with the d_count() helper

2013-07-08 Thread Waiman Long
All readonly references to d_count outside of the core dcache code
should be changed to use the new d_count() helper as they shouldn't
access its value directly.  There is no change in logic and everything
should just work.

Signed-off-by: Waiman Long 
---
 fs/coda/dir.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index 14a1480..190effc 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -526,7 +526,7 @@ static int coda_dentry_revalidate(struct dentry *de, 
unsigned int flags)
if (cii->c_flags & C_FLUSH) 
coda_flag_inode_children(inode, C_FLUSH);
 
-   if (de->d_count > 1)
+   if (d_count(de) > 1)
/* pretend it's valid, but don't change the flags */
goto out;
 
-- 
1.7.1



[PATCH v6 05/14] ceph-fs: replace direct access of d_count with the d_count() helper

2013-07-08 Thread Waiman Long
All readonly references to d_count outside of the core dcache code
should be changed to use the new d_count() helper as they shouldn't
access its value directly.  There is no change in logic and everything
should just work.

Signed-off-by: Waiman Long 
---
 fs/ceph/inode.c  |4 ++--
 fs/ceph/mds_client.c |2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index be0f7e2..bd2289a 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -903,8 +903,8 @@ static struct dentry *splice_dentry(struct dentry *dn, 
struct inode *in,
} else if (realdn) {
dout("dn %p (%d) spliced with %p (%d) "
 "inode %p ino %llx.%llx\n",
-dn, dn->d_count,
-realdn, realdn->d_count,
+dn, d_count(dn),
+realdn, d_count(realdn),
 realdn->d_inode, ceph_vinop(realdn->d_inode));
dput(dn);
dn = realdn;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 74fd289..99890b0 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1553,7 +1553,7 @@ retry:
*base = ceph_ino(temp->d_inode);
*plen = len;
dout("build_path on %p %d built %llx '%.*s'\n",
-dentry, dentry->d_count, *base, len, path);
+dentry, d_count(dentry), *base, len, path);
return path;
 }
 
-- 
1.7.1



[PATCH v6 00/14] Lockless update of reference count protected by spinlock

2013-07-08 Thread Waiman Long
v5->v6:
 - Add a new GENERIC_SPINLOCK_REFCOUNT config parameter for using the
   generic implementation.
 - Add two parameters LOCKREF_WAIT_SHIFT and LOCKREF_RETRY_COUNT which
   can be specified differently for each architecture.
 - Update various spinlock_refcount.* files to incorporate review
   comments.
 - Replace reference of d_refcount() macro in Lustre filesystem code in
   the staging tree to use the new d_count() helper function.

v4->v5:
 - Add a d_count() helper for readonly access of reference count and
   change all references to d_count outside of dcache.c, dcache.h
   and namei.c to use d_count().

v3->v4:
 - Replace helper function access to d_lock and d_count by using
   macros to redefine the old d_lock name to the spinlock and new
   d_refcount name to the reference count. This greatly reduces the
   size of this patchset from 25 to 12 and make it easier to review.

v2->v3:
 - Completely revamp the packaging by adding a new lockref data
   structure that combines the spinlock with the reference
   count. Helper functions are also added to manipulate the new data
   structure. That results in modifying over 50 files, but the changes
   were trivial in most of them.
 - Change initial spinlock wait to use a timeout.
 - Force 64-bit alignment of the spinlock & reference count structure.
 - Add a new way to use the combo by using a new union and helper
   functions.

v1->v2:
 - Add one more layer of indirection to LOCK_WITH_REFCOUNT macro.
 - Add __LINUX_SPINLOCK_REFCOUNT_H protection to spinlock_refcount.h.
 - Add some generic get/put macros into spinlock_refcount.h.

This patchset supports a generic mechanism to atomically update
a reference count that is protected by a spinlock without actually
acquiring the lock itself. If the update doesn't succeed, the caller
will have to acquire the lock and update the reference count in
the old way.  This will help in situations where there is a lot of
spinlock contention because of frequent reference count updates.
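
Roughly, the idea is the following (a simplified sketch with made-up
type and member names, not the actual code in patch 1):

/*
 * The spinlock and the reference count share one 64-bit word, so the
 * count can be bumped with a single cmpxchg() as long as the lock is
 * observed to be free.  If that fails, fall back to the old way.
 */
struct lockref_sketch {
        union {
                u64 lock_count;                 /* lock + count as one word */
                struct {
                        spinlock_t      lock;
                        unsigned int    count;
                };
        };
};

static void lockref_get_sketch(struct lockref_sketch *lr)
{
        struct lockref_sketch old, new;

        old.lock_count = ACCESS_ONCE(lr->lock_count);
        while (arch_spin_value_unlocked(old.lock.rlock.raw_lock)) {
                new = old;
                new.count++;
                if (cmpxchg(&lr->lock_count, old.lock_count,
                            new.lock_count) == old.lock_count)
                        return;         /* lockless update succeeded */
                old.lock_count = ACCESS_ONCE(lr->lock_count);
        }

        spin_lock(&lr->lock);           /* slow path: someone holds the lock */
        lr->count++;
        spin_unlock(&lr->lock);
}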

The d_lock and d_count fields of the struct dentry in dcache.h were
modified to use the new lockref data structure and the d_lock name
is now a macro for the actual spinlock. The d_count name, however,
cannot be reused as it collides with other uses elsewhere in the
kernel. So a new d_refcount macro is now used for the reference count,
and files outside of dcache.c, dcache.h and namei.c are modified to
use the d_count() helper function.

The various patches were applied and built one-by-one to make sure
that none of them broke the build.

This patch set gives a significant performance improvement in the
short workload of the AIM7 benchmark on an 8-socket x86-64 machine
with 80 cores.

patch 1:        Introduce the new lockref data structure
patch 2:        Enable x86 architecture to use the feature
patch 3:        Introduce the new d_count() helper function
patches 4-11:   Rename all d_count references to the d_count() helper
patch 12:       Replace the d_refcount() macro with the d_count() helper
patch 13:       Rename the d_count field to d_refcount
patch 14:       Change the dentry structure to use the lockref
                structure to improve performance for high contention
                cases

Thanks to Thomas Gleixner, Andi Kleen and Linus for their valuable
input in shaping this patchset.

Signed-off-by: Waiman Long 

Waiman Long (14):
  spinlock: A new lockref structure for lockless update of refcount
  spinlock: Enable x86 architecture to do lockless refcount update
  dcache: Add a new helper function d_count() to return refcount
  auto-fs: replace direct access of d_count with the d_count() helper
  ceph-fs: replace direct access of d_count with the d_count() helper
  coda-fs: replace direct access of d_count with the d_count() helper
  config-fs: replace direct access of d_count with the d_count() helper
  ecrypt-fs: replace direct access of d_count with the d_count() helper
  file locking: replace direct access of d_count with the d_count()
helper
  nfs: replace direct access of d_count with the d_count() helper
  nilfs2: replace direct access of d_count with the d_count() helper
  lustre-fs: Use the standard d_count() helper to access refcount
  dcache: rename d_count field of dentry to d_refcount
  dcache: Enable lockless update of refcount in dentry structure

 arch/x86/Kconfig   |3 +
 .../lustre/include/linux/lustre_patchless_compat.h |2 -
 drivers/staging/lustre/lustre/include/linux/lvfs.h |2 +-
 drivers/staging/lustre/lustre/llite/dcache.c   |8 +-
 .../staging/lustre/lustre/llite/llite_internal.h   |4 +-
 drivers/staging/lustre/lustre/llite/llite_lib.c|2 +-
 drivers/staging/lustre/lustre/llite/namei.c|4 +-
 drivers/staging/lustre/lustre/lvfs/lvfs_linux.c|4 +-
 fs/autofs4/expire.c|8 +-
 fs/autofs4/root.c  |2 +-
 fs/ceph/inode.c  

[PATCH v6 07/14] config-fs: replace direct access of d_count with the d_count() helper

2013-07-08 Thread Waiman Long
All readonly references to d_count outside of the core dcache code
should be changed to use the new d_count() helper as they shouldn't
access its value directly.  There is no change in logic and everything
should just work.

Signed-off-by: Waiman Long 
---
 fs/configfs/dir.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
index 64e5323..1d75ec5 100644
--- a/fs/configfs/dir.c
+++ b/fs/configfs/dir.c
@@ -387,7 +387,7 @@ static void remove_dir(struct dentry * d)
if (d->d_inode)
simple_rmdir(parent->d_inode,d);
 
-   pr_debug(" o %s removing done (%d)\n",d->d_name.name, d->d_count);
+   pr_debug(" o %s removing done (%d)\n", d->d_name.name, d_count(d));
 
dput(parent);
 }
-- 
1.7.1



[PATCH v6 04/14] auto-fs: replace direct access of d_count with the d_count() helper

2013-07-08 Thread Waiman Long
All readonly references to d_count outside of the core dcache code
should be changed to use the new d_count() helper as they shouldn't
access its value directly.  There is no change in logic and everything
should just work.

Signed-off-by: Waiman Long 
---
 fs/autofs4/expire.c |8 
 fs/autofs4/root.c   |2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/autofs4/expire.c b/fs/autofs4/expire.c
index 13ddec9..aac0006 100644
--- a/fs/autofs4/expire.c
+++ b/fs/autofs4/expire.c
@@ -109,7 +109,7 @@ cont:
 
spin_lock_nested(&q->d_lock, DENTRY_D_LOCK_NESTED);
/* Already gone or negative dentry (under construction) - try next */
-   if (q->d_count == 0 || !simple_positive(q)) {
+   if (d_count(q) == 0 || !simple_positive(q)) {
spin_unlock(&q->d_lock);
next = q->d_u.d_child.next;
goto cont;
@@ -267,7 +267,7 @@ static int autofs4_tree_busy(struct vfsmount *mnt,
else
ino_count++;
 
-   if (p->d_count > ino_count) {
+   if (d_count(p) > ino_count) {
top_ino->last_used = jiffies;
dput(p);
return 1;
@@ -409,7 +409,7 @@ struct dentry *autofs4_expire_indirect(struct super_block 
*sb,
if (!exp_leaves) {
/* Path walk currently on this dentry? */
ino_count = atomic_read(&ino->count) + 1;
-   if (dentry->d_count > ino_count)
+   if (d_count(dentry) > ino_count)
goto next;
 
if (!autofs4_tree_busy(mnt, dentry, timeout, do_now)) {
@@ -423,7 +423,7 @@ struct dentry *autofs4_expire_indirect(struct super_block 
*sb,
} else {
/* Path walk currently on this dentry? */
ino_count = atomic_read(&ino->count) + 1;
-   if (dentry->d_count > ino_count)
+   if (d_count(dentry) > ino_count)
goto next;
 
expired = autofs4_check_leaves(mnt, dentry, timeout, 
do_now);
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index ca8e555..78f9b0a 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -179,7 +179,7 @@ static struct dentry *autofs4_lookup_active(struct dentry 
*dentry)
spin_lock(&active->d_lock);
 
/* Already gone? */
-   if (active->d_count == 0)
+   if (d_count(active) == 0)
goto next;
 
qstr = &active->d_name;
-- 
1.7.1



Re: [PATCH v6 03/14] dcache: Add a new helper function d_count() to return refcount

2013-07-11 Thread Waiman Long

On 07/08/2013 09:09 PM, Waiman Long wrote:

This patch adds a new helper function d_count() in dcache.h for
returning the current reference count of the dentry object. It
should be used by all the files outside of the core dcache.c and
namei.c files.


I want to know people's thoughts on spinning out patches 3-11 of this 
patch series, as making the d_count() helper the standard way of 
accessing the reference count in dentry outside of dcache.c and namei.c 
should be non-controversial. Merging them first will make revising 
this patch series easier and involve fewer people.


Regards,
Longman


[PATCH RFC 2/2] x86 qrwlock: Enable x86 to use queue read/write lock

2013-07-12 Thread Waiman Long
This patch makes the necessary changes at the x86 architecture specific
layer to enable the presence of the CONFIG_QUEUE_RWLOCK kernel option
to replace the plain read/write lock by the queue read/write lock.

Signed-off-by: Waiman Long 
---
 arch/x86/Kconfig  |3 +++
 arch/x86/include/asm/spinlock.h   |2 ++
 arch/x86/include/asm/spinlock_types.h |4 
 3 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b32ebf9..638dbaa 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2344,6 +2344,9 @@ config X86_DMA_REMAP
bool
depends on STA2X11
 
+config ARCH_QUEUE_RWLOCK
+   def_bool y
+
 source "net/Kconfig"
 
 source "drivers/Kconfig"
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..613a4ff 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -137,6 +137,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t 
*lock)
cpu_relax();
 }
 
+#ifndef CONFIG_QUEUE_RWLOCK
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.
@@ -219,6 +220,7 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
asm volatile(LOCK_PREFIX WRITE_LOCK_ADD(%1) "%0"
 : "+m" (rw->write) : "i" (RW_LOCK_BIAS) : "memory");
 }
+#endif /* CONFIG_QUEUE_RWLOCK */
 
 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
 #define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
diff --git a/arch/x86/include/asm/spinlock_types.h 
b/arch/x86/include/asm/spinlock_types.h
index ad0ad07..afacd36 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -28,6 +28,10 @@ typedef struct arch_spinlock {
 
 #define __ARCH_SPIN_LOCK_UNLOCKED  { { 0 } }
 
+#ifdef CONFIG_QUEUE_RWLOCK
+#include 
+#else
 #include 
+#endif
 
 #endif /* _ASM_X86_SPINLOCK_TYPES_H */
-- 
1.7.1



[PATCH RFC 0/2] qrwlock: Introducing a queue read/write lock implementation

2013-07-12 Thread Waiman Long
This patch set introduces a queue-based read/write lock implementation
that is both faster and fairer than the current read/write lock. It can
also be used as a replacement for highly contended ticket spinlocks if
the lock size increase is not an issue.

There is no change in the interface. By just replacing the current
read/write lock with the queue read/write lock, we can have a faster
and more deterministic system.

Signed-off-by: Waiman Long 

Waiman Long (2):
  qrwlock: A queue read/write lock implementation
  x86 qrwlock: Enable x86 to use queue read/write lock

 arch/x86/Kconfig  |3 +
 arch/x86/include/asm/spinlock.h   |2 +
 arch/x86/include/asm/spinlock_types.h |4 +
 include/asm-generic/qrwlock.h |  124 +
 lib/Kconfig   |   11 ++
 lib/Makefile  |1 +
 lib/qrwlock.c |  246 +
 7 files changed, 391 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qrwlock.h
 create mode 100644 lib/qrwlock.c



[PATCH RFC 1/2] qrwlock: A queue read/write lock implementation

2013-07-12 Thread Waiman Long
This patch introduces a read/write lock implementation that puts waiting
readers and writers into a queue instead of actively contending for the
lock like the regular read/write lock. This improves performance in
highly contended situations by reducing the cache line bouncing effect.

In addition, the queue read/write lock is more deterministic, even
though there is still a small chance of lock stealing if a reader
or writer arrives at just the right moment. Other than that, lock granting
is done in a FIFO manner. The only downside is the size increase in
the lock structure by 4 bytes for 32-bit systems and by 12 bytes for
64-bit systems.
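
A much-simplified sketch of the queuing idea (the real implementation
packs the reader count and writer state more carefully and has separate
fast and slow paths):

/* Waiters serialize on an internal queue lock so that lock grants
 * become roughly FIFO; readers and the writer state share one atomic. */
struct qrwlock_sketch {
        atomic_t        cnts;           /* reader count + writer-present bit */
        arch_spinlock_t wait_lock;      /* queues the waiters */
};

#define _QW_LOCKED      0x01            /* low bit: a writer holds the lock */
#define _QR_BIAS        0x100           /* each reader adds this much */

static void queue_read_lock_sketch(struct qrwlock_sketch *lock)
{
        if (likely(!(atomic_add_return(_QR_BIAS, &lock->cnts) & _QW_LOCKED)))
                return;                         /* no writer: fast path */

        atomic_sub(_QR_BIAS, &lock->cnts);
        arch_spin_lock(&lock->wait_lock);       /* wait in FIFO order */
        atomic_add(_QR_BIAS, &lock->cnts);
        while (atomic_read(&lock->cnts) & _QW_LOCKED)
                cpu_relax();
        arch_spin_unlock(&lock->wait_lock);     /* let the next waiter in */
}

static void queue_read_unlock_sketch(struct qrwlock_sketch *lock)
{
        atomic_sub(_QR_BIAS, &lock->cnts);
}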

This patch allows the replacement of architecture-specific
implementations of the read/write lock by this generic version of the
queue read/write lock. Two new config parameters are introduced:

1. QUEUE_RWLOCK
   A selectable option that enables the building and replacement of
   the architecture-specific read/write lock.
2. ARCH_QUEUE_RWLOCK
   Has to be defined in arch/$(arch)/Kconfig to enable QUEUE_RWLOCK

In terms of single-thread performance (no contention), a 256K
lock/unlock loop was run on a 2.4GHz Westmere x86-64 CPU. The following
table shows the average time for a single lock/unlock sequence:

Lock Type   Time (ns)
-   -
Ticket spinlock   15.7
Read lock 17.0
Write lock17.2
Queue read lock   31.1
Queue write lock  13.6

While the queue read lock is almost double the time of a read lock
or spinlock, the queue write lock is the fastest of them all. The
execution time can probably be reduced a bit by allowing inlining of
the lock fast paths like the other locks.

To see how the queue write lock can be used as a replacement for the
ticket spinlock (just like rwsem can be used as a replacement for mutex),
the mb_cache_spinlock in fs/mbcache.c, which is a bottleneck in the disk
workload (ext4 FS) of the AIM7 benchmark, was converted to both a queue
write lock and a regular write lock. When running on an 8-socket 80-core
DL980 system, the performance improvement is shown in the table below.

+-----------------+------------+------------+-----------+---------+
|  Configuration  |  Mean JPM  |  Mean JPM  | Mean JPM  | qrwlock |
|                 |Vanilla 3.10|3.10-qrwlock|3.10-rwlock| %Change |
+-----------------+------------+------------+-----------+---------+
|                 |              User Range 10 - 100              |
+-----------------+------------+------------+-----------+---------+
| 8 nodes, HT off |   441374   |   532774   |  637205   | +20.7%  |
| 8 nodes, HT on  |   449373   |   584387   |  641965   | +30.1%  |
+-----------------+------------+------------+-----------+---------+
|                 |             User Range 200 - 1000             |
+-----------------+------------+------------+-----------+---------+
| 8 nodes, HT off |   226870   |   354490   |  371593   | +56.3%  |
| 8 nodes, HT on  |   205997   |   314891   |  306378   | +52.9%  |
+-----------------+------------+------------+-----------+---------+
|                 |            User Range 1100 - 2000             |
+-----------------+------------+------------+-----------+---------+
| 8 nodes, HT off |   208234   |   321420   |  343694   | +54.4%  |
| 8 nodes, HT on  |   199612   |   297853   |  252569   | +49.2%  |
+-----------------+------------+------------+-----------+---------+

Apparently, the regular read/write lock performs even better than
the queue read/write lock in some cases.  This is probably due to the
fact that mb_cache_spinlock is in a separate cache line from the data
being manipulated.
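
For reference, the mb_cache_spinlock conversion mentioned above is
mechanical; a rough sketch (the names only loosely mirror fs/mbcache.c
and the surrounding code is omitted):

/* A spinlock that only ever provides mutual exclusion can be replaced
 * by a (queue) rwlock taken in write mode at every call site. */
static DEFINE_RWLOCK(mb_cache_rwlock);  /* was: DEFINE_SPINLOCK(mb_cache_spinlock) */

static void mb_cache_lru_add_sketch(struct list_head *entry,
                                    struct list_head *lru_list)
{
        write_lock(&mb_cache_rwlock);   /* was: spin_lock(&mb_cache_spinlock) */
        list_add_tail(entry, lru_list);
        write_unlock(&mb_cache_rwlock); /* was: spin_unlock(&mb_cache_spinlock) */
}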

Looking at the fserver and new_fserver workloads (ext4 FS), where the
journal->j_state_lock (a read/write lock) is a bottleneck especially
when HT is on, we see a slightly different story. The j_state_lock is
an embedded read/write lock which is in the same cacheline as some
of the data being manipulated. The replacement by a queue read/write
lock gives the following improvement.

+--------------------+-------------+--------------+---------------+
|      Workload      |mean % change|mean % change |mean % change  |
|                    |10-100 users |200-1000 users|1100-2000 users|
+--------------------+-------------+--------------+---------------+
|fserver (HT off)    |    +0.3%    |     -0.1%    |     +1.9%     |
|fserver (HT on)     |    -0.1%    |    +32.2%    |    +34.7%     |
|new_fserver (HT on) |    +0.8%    |     +0.9%    |     +0.9%     |
|new_fserver (HT off)|    -1.2%    |    +29.8%    |    +40.5%     |
+--------------------+-------------+--------------+---------------+

Signed-off-by: Waiman Long 
---
 include/asm-generic/qrwlock.h |  124 +
 lib/Kconfig   |   11 ++
 lib/Makefile  |1 +
 lib/qrwlock.c |  246 +
 4 files changed, 382 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-gene

Re: [PATCH] lockref: use cmpxchg64 explicitly for lockless updates

2013-09-19 Thread Waiman Long

On 09/19/2013 02:11 PM, Linus Torvalds wrote:

On Thu, Sep 19, 2013 at 1:06 PM, Will Deacon  wrote:

The cmpxchg() function tends not to support 64-bit arguments on 32-bit
architectures. This could be either due to use of unsigned long arguments
(like on ARM) or lack of instruction support (cmpxchgq on x86). However,
these architectures may implement a specific cmpxchg64() function to
provide 64-bit cmpxchg support instead

I'm certainly ok with this, but I wonder how much point there is to
use the cmpxchg alternatives for 32-bit architectures at all...

 From a performance standpoint, lockref really is expected to mainly
help with big machines. Only insane people would do big machines with
32-bit kernels these days.
Of course, it may be that cmpxchg is actually faster on some
architectures, but at least on x86-32, cmpxchg8b is traditionally
quite slow.

In other words, I'd actually like to see some numbers if there are
loads where this actually helps and matters...

Linus


I agree that 32-bit machines are not likely to be big enough to 
benefit from this feature. However, I do see a minor problem with the 
current code. If a user tries to turn on CMPXCHG_LOCKREF on a 32-bit 
build, he/she will get a compilation error because a 64-bit data type is 
not supported by cmpxchg() in 32-bit mode. Because of this, my original 
patch also used cmpxchg64(), which is equivalent to cmpxchg() on a 
64-bit machine for a 64-bit data type.


I would suggest either changing the code to use cmpxchg64() or adding 
the dependency line "depends on 64BIT" to CMPXCHG_LOCKREF.


-Longman


Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

2013-09-04 Thread Waiman Long

On 09/03/2013 03:09 PM, Linus Torvalds wrote:

On Tue, Sep 3, 2013 at 8:34 AM, Linus Torvalds
  wrote:

I suspect the tty_ldisc_lock() could be made to go away if we care.

Heh. I just pulled the tty patches from Greg, and the locking has
changed completely.

It may actually fix your AIM7 test-case, because while the global
spinlock remains (it got renamed to "tty_ldiscs_lock" - there's an
added "s"), the common operations now take the per-tty lock to get the
ldisc for that tty, rather than that global spinlock (which just
protects the actual ldisk array now).

That said, I don't know what AIM7 really ends up doing, but your
profile seems to have every access through tty_ldisc_[de]ref() that
now uses only the per-tty lock. Of course, how much that helps ends up
depending on whether AIM7 uses lots of tty's or just one shared one.

Anyway, it might be worth testing my current -git tree.

   Linus


The latest tty patches did work. The tty related spinlock contention is 
now completely gone. The short workload can now reach over 8M JPM which 
is the highest I have ever seen.


The perf profile was:

5.85% reaim  reaim [.] mul_short
4.87% reaim  [kernel.kallsyms] [k] ebitmap_get_bit
4.72% reaim  reaim [.] mul_int
4.71% reaim  reaim [.] mul_long
2.67% reaim  libc-2.12.so  [.] __random_r
2.64% reaim  [kernel.kallsyms] [k] lockref_get_not_zero
1.58% reaim  [kernel.kallsyms] [k] copy_user_generic_string
1.48% reaim  [kernel.kallsyms] [k] mls_level_isvalid
1.35% reaim  [kernel.kallsyms] [k] find_next_bit
1.23% reaim  [kernel.kallsyms] [k] system_call
1.21% reaim  libc-2.12.so  [.] memcpy
1.19% reaim  [kernel.kallsyms] [k] _raw_spin_lock
1.06% reaim  [kernel.kallsyms] [k] avc_has_perm_flags
1.04% reaim  libc-2.12.so  [.] __srandom_r
1.02% reaim  reaim [.] newton_raphson
1.01% reaim  [kernel.kallsyms] [k] update_cfs_rq_blocked_load
0.98% reaim  [kernel.kallsyms] [k] fsnotify
0.94% reaim  [kernel.kallsyms] [k] avtab_search_node
0.91% reaim  libm-2.12.so  [.] __sincos

I have a patch in linux-next that should eliminate ebitmap_get_bit, 
mls_level_isvalid and find_next_bit from the top of the list once it is 
merged.


Regards,
Longman


[PATCH] dcache: Translating dentry into pathname without taking rename_lock

2013-09-04 Thread Waiman Long
When running the AIM7's short workload, Linus' lockref patch eliminated
most of the spinlock contention. However, there were still some left:

 8.46% reaim  [kernel.kallsyms] [k] _raw_spin_lock
 |--42.21%-- d_path
 |  proc_pid_readlink
 |  SyS_readlinkat
 |  SyS_readlink
 |  system_call
 |  __GI___readlink
 |
 |--40.97%-- sys_getcwd
 |  system_call
 |  __getcwd

The big one here is the rename_lock (seqlock) contention in d_path()
and the getcwd system call. This patch will eliminate the need to take
the rename_lock while translating dentries into the full pathnames.

The rename_lock is taken to make sure that no rename operation
can be ongoing while the translation is in progress. However,
only one thread can take the rename_lock, thus blocking all the other
threads that need it even though the translation process won't make
any change to the dentries.

This patch replaces the writer's write_seqlock/write_sequnlock
sequence on the rename_lock in the callers of the prepend_path() and
__dentry_path() functions with the reader's read_seqbegin/read_seqretry
sequence within these two functions. As a result, the code will have to
retry if one or more rename operations have been performed. In addition,
the RCU read lock will be taken during the translation process to make
sure that no dentries will go away.

The dentry's d_lock is also not taken, to further reduce spinlock
contention. However, this means that the name pointer and
other dentry fields may not be valid. As a result, the code was
enhanced to handle the case where the parent pointer or the name
pointer is NULL.

With this patch, the _raw_spin_lock will now account for only 1.2%
of the total CPU cycles for the short workload. This patch also
reduces the effect that running perf has on its own profile,
since the perf command itself can be a heavy user of the d_path()
function depending on the complexity of the workload.

When taking the perf profile of the high-systime workload, the amount
of spinlock contention contributed by running perf without this patch
was about 16%. With this patch, the spinlock contention caused by
running perf goes away and we will have a more accurate
perf profile.

Signed-off-by: Waiman Long 
---
 fs/dcache.c |  118 +--
 1 files changed, 82 insertions(+), 36 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 96655f4..12d07d7 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2512,7 +2512,19 @@ static int prepend(char **buffer, int *buflen, const 
char *str, int namelen)
 
 static int prepend_name(char **buffer, int *buflen, struct qstr *name)
 {
-   return prepend(buffer, buflen, name->name, name->len);
+   /*
+* With RCU path tracing, it may race with rename. Use
+* ACCESS_ONCE() to make sure that it is either the old or
+* the new name pointer. The length does not really matter as
+* the sequence number check will eventually catch any ongoing
+* rename operation.
+*/
+   const char *dname = ACCESS_ONCE(name->name);
+   int   dlen = name->len;
+
+   if (unlikely(!dname || !dlen))
+   return -EINVAL;
+   return prepend(buffer, buflen, dname, dlen);
 }
 
 /**
@@ -2522,7 +2534,12 @@ static int prepend_name(char **buffer, int *buflen, 
struct qstr *name)
  * @buffer: pointer to the end of the buffer
  * @buflen: pointer to buffer length
  *
- * Caller holds the rename_lock.
+ * The function tries to write out the pathname without taking any lock other
+ * than the RCU read lock to make sure that dentries won't go away. It only
+ * checks the sequence number of the global rename_lock as any change in the
+ * dentry's d_seq will be preceded by changes in the rename_lock sequence
+ * number. If the sequence number had been change, it will restart the whole
+ * pathname back-tracing sequence again
  */
 static int prepend_path(const struct path *path,
const struct path *root,
@@ -2533,7 +2550,15 @@ static int prepend_path(const struct path *path,
struct mount *mnt = real_mount(vfsmnt);
bool slash = false;
int error = 0;
+   unsigned seq;
+   char *bptr;
+   int blen;
 
+restart:
+   bptr = *buffer;
+   blen = *buflen;
+   seq = read_seqbegin(&rename_lock);
+   rcu_read_lock();
while (dentry != root->dentry || vfsmnt != root->mnt) {
struct dentry * parent;
 
@@ -2547,22 +2572,38 @@ static int prepend_path(const struct path *path,
continue;
}
parent = dentry->d_parent;
+   if (un

Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

2013-09-04 Thread Waiman Long

On 09/04/2013 11:14 AM, Linus Torvalds wrote:

On Wed, Sep 4, 2013 at 7:52 AM, Waiman Long  wrote:

The latest tty patches did work. The tty related spinlock contention is now
completely gone. The short workload can now reach over 8M JPM which is the
highest I have ever seen.

Good. And this was with the 80-core machine, so there aren't any
scalability issues hiding?

  Linus


Yes, the perf profile was taken on an 80-core machine. There isn't 
any scalability issue hiding for the short workload on an 80-core machine.


However, I am certain that more may pop up when running on an even 
larger machine like the prototype 240-core machine that our team has 
been testing on.


-Longman


Re: [PATCH] dcache: Translating dentry into pathname without taking rename_lock

2013-09-04 Thread Waiman Long

On 09/04/2013 03:05 PM, Waiman Long wrote:

When running the AIM7's short workload, Linus' lockref patch eliminated
most of the spinlock contention. However, there were still some left:

  8.46% reaim  [kernel.kallsyms] [k] _raw_spin_lock
  |--42.21%-- d_path
  |  proc_pid_readlink
  |  SyS_readlinkat
  |  SyS_readlink
  |  system_call
  |  __GI___readlink
  |
  |--40.97%-- sys_getcwd
  |  system_call
  |  __getcwd

The big one here is the rename_lock (seqlock) contention in d_path()
and the getcwd system call. This patch will eliminate the need to take
the rename_lock while translating dentries into the full pathnames.

The need to take the rename_lock is to make sure that no rename
operation can be ongoing while the translation is in progress. However,
only one thread can take the rename_lock thus blocking all the other
threads that need it even though the translation process won't make
any change to the dentries.

This patch will replace the writer's write_seqlock/write_sequnlock
sequence of the rename_lock of the callers of the prepend_path() and
__dentry_path() functions with the reader's read_seqbegin/read_seqretry
sequence within these 2 functions. As a result, the code will have to
retry if one or more rename operations had been performed. In addition,
RCU read lock will be taken during the translation process to make
sure that no dentries will go away.

In addition, the dentry's d_lock is also not taken to further reduce
spinlock contention. However, this means that the name pointer and
other dentry fields may not be valid. As a result, the code was
enhanced to handle the case that the parent pointer or the name
pointer can be NULL.

With this patch, the _raw_spin_lock will now account for only 1.2%
of the total CPU cycles for the short workload. This patch also has
the effect of reducing the effect of running perf on its profile
since the perf command itself can be a heavy user of the d_path()
function depending on the complexity of the workload.

When taking the perf profile of the high-systime workload, the amount
of spinlock contention contributed by running perf without this patch
was about 16%. With this patch, the spinlock contention caused by
the running of perf will go away and we will have a more accurate
perf profile.

Signed-off-by: Waiman Long

In terms of AIM7 performance, this patch gives a performance boost of
about 6-7% on top of Linus' lockref patch on an 8-socket 80-core DL980.

User Range          |   10-100  | 200-1000  | 1100-2000 |
Mean JPM w/o patch  | 4,365,114 | 7,211,115 | 6,964,439 |
Mean JPM with patch | 3,872,850 | 7,655,958 | 7,422,598 |
% Change            |  -11.3%   |   +6.2%   |   +6.6%   |

I am not sure if it is too aggressive to not take the d_lock before
prepend_name(). I can add those locking instructions back if necessary.

Regards,
Longman




Re: [PATCH] dcache: Translating dentry into pathname without taking rename_lock

2013-09-04 Thread Waiman Long

On 09/04/2013 03:11 PM, Al Viro wrote:

On Wed, Sep 04, 2013 at 03:05:23PM -0400, Waiman Long wrote:


  static int prepend_name(char **buffer, int *buflen, struct qstr *name)
  {
-   return prepend(buffer, buflen, name->name, name->len);
+   /*
+* With RCU path tracing, it may race with rename. Use
+* ACCESS_ONCE() to make sure that it is either the old or
+* the new name pointer. The length does not really matter as
+* the sequence number check will eventually catch any ongoing
+* rename operation.
+*/
+   const char *dname = ACCESS_ONCE(name->name);
+   int   dlen = name->len;
+
+   if (unlikely(!dname || !dlen))
+   return -EINVAL;
+   return prepend(buffer, buflen, dname, dlen);

NAK.  A race with d_move() can very well leave you with dname pointing into
an object of length smaller than dlen.  You *can* copy it byte-by-byte
and rely on NUL-termination, but you can't rely on length being accurate -
not without having excluded d_move().


I have thought about that. But if a d_move() is going on, the string in 
the buffer will be discarded as the sequence number will change. So 
whether or not it has an embedded null byte shouldn't matter. That is why 
I didn't add code to do a byte-by-byte copy in this first patch. I can add 
code to do that if you think it is safer to do so.


Regards,
Longman


Re: [PATCH] dcache: Translating dentry into pathname without taking rename_lock

2013-09-04 Thread Waiman Long

On 09/04/2013 03:43 PM, Al Viro wrote:

On Wed, Sep 04, 2013 at 03:33:00PM -0400, Waiman Long wrote:


I have thought about that. But if a d_move() is going on, the string
in the buffer will be discarded as the sequence number will change.
So whether or not it have embedded null byte shouldn't matter. That
is why I didn't add code to do byte-by-byte copy at this first
patch. I can add code to do that if you think it is safer to do so.

Sigh...  Junk in the output is not an issue; reading from invalid address
is, since you might not survive to the sequence number check.  Again,
if p is an address returned by kmalloc(size, ...), dereferencing p + offset
is not safe unless offset is less than size.


Yeah, I understand that. As said in my reply to Linus, I will use 
memchr() to see if there is a null byte within the specified length. If 
one is found, I will assume the string is not valid and return an error 
to the caller.
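
A sketch of that check (name_looks_valid() is a hypothetical helper,
shown only to illustrate the idea):

/* Reject a name that raced with a rename: NULL pointer, bogus length,
 * or an embedded NUL inside the reported length. */
static bool name_looks_valid(const char *dname, int dlen)
{
        return dname && dlen > 0 && !memchr(dname, 0, dlen);
}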


Longman


Re: [PATCH] dcache: Translating dentry into pathname without taking rename_lock

2013-09-04 Thread Waiman Long

On 09/04/2013 04:40 PM, John Stoffel wrote:

"Waiman" == Waiman Long  writes:

Waiman>  In term of AIM7 performance, this patch has a performance boost of
Waiman>  about 6-7% on top of Linus' lockref patch on a 8-socket 80-core DL980.

Waiman>  User Range  |   10-100  | 200-1 | 1100-2000 |
Waiman>  Mean JPM w/o patch  | 4,365,114 | 7,211,115 | 6,964,439 |
Waiman>  Mean JPM with patch | 3,872,850 | 7,655,958 | 7,422,598 |
Waiman>  % Change|  -11.3%   |   +6.2%   |   +6.6%   |

This -11% impact is worisome to me, because at smaller numbers of
users, I would still expect the performance to go up.  So why the big
drop?

Also, how is the impact of these changes on smaller 1 socket, 4 core
systems?  Just because it helps a couple of big boxes, doesn't mean it
won't hurt the more common small case.

John


I don't believe the patch will make it slower with fewer users. It is more 
a result of run-to-run variation. The short workload typically completes 
in a very short time. In the 10-100 user range, the completion times 
range from 0.02-0.11s. With a higher user count, it needs several 
seconds to run and hence the results are more reliable.


Longman


Re: [PATCH] dcache: Translating dentry into pathname without taking rename_lock

2013-09-04 Thread Waiman Long

On 09/04/2013 05:31 PM, Linus Torvalds wrote:

On Wed, Sep 4, 2013 at 12:05 PM, Waiman Long  wrote:

+   rcu_read_unlock();
+   if (read_seqretry(&rename_lock, seq))
+   goto restart;

Btw, you have this pattern twice, and while it's not necessarily
incorrect, it's a bit worrisome for performance.


The rcu_read_unlock sequence in the middle of prepend_path() is not 
likely to be executed. So it shouldn't affect performance except for the 
conditional check.



The rcu_read_unlock() is very possibly going to trigger an immediate
scheduling event, so checking the sequence lock after dropping the
read-lock sounds like it would make it much easier to hit the race
with some rename.


I can put read_seqbegin/read_seqretry within the 
rcu_read_lock/rcu_read_unlock block. However, read_seqbegin() can spin 
for a while if a rename is in progress. So it depends on which is more 
important, a shorter RCU critical section at the expense of more retries 
or vice versa.



I'm also a tiny bit worried about livelocking on the sequence lock in
the presence of lots of renames, so I'm wondering if the locking
should try to approximate what we do for the RCU lookup path: start
off optimistically using just the RCU lock and a sequence point, but
if that fails, fall back to taking the real lock. Maybe using a
counter ("try twice, then get the rename lock for writing")

Hmm?


Yes, I can implement a counter that switches to taking the rename_lock if 
the count reaches 0. It shouldn't be too hard. That should avoid the 
possibility of a livelock.
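
A sketch of that fallback scheme (__prepend_path() is a hypothetical
helper standing in for the lockless body of prepend_path(); this is not
the actual v2 patch):

static int prepend_path_sketch(const struct path *path,
                               const struct path *root,
                               char **buffer, int *buflen)
{
        int error, tries = 2;   /* lockless attempts before blocking renames */
        unsigned seq;

        do {
                seq = read_seqbegin(&rename_lock);
                rcu_read_lock();
                error = __prepend_path(path, root, buffer, buflen);
                rcu_read_unlock();
        } while (read_seqretry(&rename_lock, seq) && --tries > 0);

        if (read_seqretry(&rename_lock, seq)) {
                /* Renames keep racing with us: block them and redo once. */
                write_seqlock(&rename_lock);
                rcu_read_lock();
                error = __prepend_path(path, root, buffer, buflen);
                rcu_read_unlock();
                write_sequnlock(&rename_lock);
        }
        return error;
}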


Longman


Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

2013-09-04 Thread Waiman Long

On 09/04/2013 05:34 PM, Linus Torvalds wrote:

On Wed, Sep 4, 2013 at 12:25 PM, Waiman Long  wrote:

Yes, the perf profile was taking from an 80-core machine. There isn't any
scalability issue hiding for the short workload on an 80-core machine.

However, I am certain that more may pop up when running in an even larger
machine like the prototype 240-core machine that our team has been testing
on.

Sure. Please let us know, I think it's going to be interesting to see
what that shows.

SGI certainly did much larger machines, but their primary target
tended to be all user space, so they had things like "tons of
concurrent page faults in the same process" rather than filename
lookup or the tty layer.

 Linus


I think SGI is more focused on compute-intensive workloads. HP is more 
focused on high-end commercial workloads like SAP HANA. Below is a sample 
perf profile of the high-systime workload on a 240-core prototype 
machine (HT off) with a 3.10-rc1 kernel plus my lockref and seqlock patches:


 9.61%3382925  swapper  [kernel.kallsyms] [k] 
_raw_spin_lock

   |--59.90%-- rcu_process_callbacks
   |--19.41%-- load_balance
   |--9.58%-- rcu_accelerate_cbs
   |--6.70%-- tick_do_update_jiffies64
   |--1.46%-- scheduler_tick
   |--1.17%-- sched_rt_period_timer
   |--0.56%-- perf_adjust_freq_unthr_context
--1.21%-- [...]

 6.34% 99reaim  [kernel.kallsyms] [k] 
_raw_spin_lock

 |--73.96%-- load_balance
 |--11.98%-- rcu_process_callbacks
 |--2.21%-- __mutex_lock_slowpath
 |--2.02%-- rcu_accelerate_cbs
 |--1.95%-- wake_up_new_task
 |--1.70%-- scheduler_tick
 |--1.67%-- xfs_alloc_log_agf
 |--1.24%-- task_rq_lock
 |--1.15%-- try_to_wake_up
  --2.12%-- [...]

 5.39%  2reaim  [kernel.kallsyms] [k] 
_raw_spin_lock_irqsave

 |--95.08%-- rwsem_wake
 |--1.80%-- rcu_process_callbacks
 |--1.03%-- prepare_to_wait
 |--0.59%-- __wake_up
  --1.50%-- [...]

 2.28%  1reaim  [kernel.kallsyms] [k] 
_raw_spin_lock_irq

 |--90.56%-- rwsem_down_write_failed
 |--9.25%-- __schedule
  --0.19%-- [...]

Longman


Re: [PATCH] dcache: Translating dentry into pathname without taking rename_lock

2013-09-05 Thread Waiman Long

On 09/05/2013 12:30 AM, George Spelvin wrote:

As long as you're removing locks from prepend_name and complicating its
innards, I notice that each and every call site follows it by prepending
"/".  How about moving that into prepend_name as well?

Also, if you happen to feel like it, you can delete the slash flag
and replace it with "bptr != *buffer".

Another small tweak would be to the global_root part of the code.
You could move the is_mounted(vfsmnt) test up, and combine the tail of
that code path with the regular exit.  All you have to do is change
the !slash test to:

	if (error >= 0 && bptr == *buffer) {	/* Root directory */
		if (--blen < 0)
			error = -ENAMETOOLONG;
		else
			*--bptr = '/';
	}

This modified form is no more code than an inlined copy of prepend(),
so we haven't actually slowed the fast path, but it avoids corrupting
the return value of 0/1/2 if possible.


Thanks for the suggestions. I will implement them in my v2 patch.

-Longman


Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

2013-09-05 Thread Waiman Long

On 09/05/2013 09:31 AM, Ingo Molnar wrote:

* Waiman Long  wrote:



The latest tty patches did work. The tty related spinlock contention
is now completely gone. The short workload can now reach over 8M JPM
which is the highest I have ever seen.

The perf profile was:

5.85% reaim  reaim [.] mul_short
4.87% reaim  [kernel.kallsyms] [k] ebitmap_get_bit
4.72% reaim  reaim [.] mul_int
4.71% reaim  reaim [.] mul_long
2.67% reaim  libc-2.12.so  [.] __random_r
2.64% reaim  [kernel.kallsyms] [k] lockref_get_not_zero
1.58% reaim  [kernel.kallsyms] [k] copy_user_generic_string
1.48% reaim  [kernel.kallsyms] [k] mls_level_isvalid
1.35% reaim  [kernel.kallsyms] [k] find_next_bit

6%+ spent in ebitmap_get_bit() and mls_level_isvalid() looks like
something worth optimizing.

Is that called very often, or is it perhaps cache-bouncing for some
reason?


The high cycle count is due more to an inefficient algorithm in the 
mls_level_isvalid() function than to cacheline contention in the code. The 
attached patch should address this problem. It is in linux-next and 
hopefully will be merged in 3.12.


-Longman
From 232a29fd04d5345d6af3330f48710cd48a345c10 Mon Sep 17 00:00:00 2001
From: Waiman Long 
Date: Mon, 10 Jun 2013 13:52:36 -0400
Subject: [PATCH 1/2 v5] SELinux: Reduce overhead of mls_level_isvalid() 
function call
To: Stephen Smalley ,
James Morris ,
Eric Paris 
Cc: linux-security-mod...@vger.kernel.org,
linux-kernel@vger.kernel.org,
"Chandramouleeswaran, Aswin" ,
"Norton, Scott J" 

v4->v5:
  - Fix scripts/checkpatch.pl warning.

v3->v4:
  - Merge the 2 separate while loops in ebitmap_contains() into
a single one.

v2->v3:
  - Remove unused local variables i, node from mls_level_isvalid().

v1->v2:
 - Move the new ebitmap comparison logic from mls_level_isvalid()
   into the ebitmap_contains() helper function.
 - Rerun perf and performance tests on the latest v3.10-rc4 kernel.

While running the high_systime workload of the AIM7 benchmark on
a 2-socket 12-core Westmere x86-64 machine running a 3.10-rc4 kernel
(with HT on), it was found that a pretty sizable amount of time was
spent in the SELinux code. Below is the perf trace of a "perf
record -a -s" session for a test run at 1500 users:

  5.04%ls  [kernel.kallsyms] [k] ebitmap_get_bit
  1.96%ls  [kernel.kallsyms] [k] mls_level_isvalid
  1.95%ls  [kernel.kallsyms] [k] find_next_bit

The ebitmap_get_bit() was the hottest function in the perf-report
output.  Both the ebitmap_get_bit() and find_next_bit() functions
were, in fact, called by mls_level_isvalid(). As a result, the
mls_level_isvalid() call consumed 8.95% of the total CPU time of
all the 24 virtual CPUs which is quite a lot. The majority of the
mls_level_isvalid() function invocations come from the socket creation
system call.

Looking at the mls_level_isvalid() function, it checks whether all
the bits set in one ebitmap structure are also set in another one, and
whether the highest set bit is no bigger than the one specified by the
given policydb data structure. It does this in a bit-by-bit manner, so
if the ebitmap structure has many bits set, the iteration loop will be
executed many times.

The current code can be rewritten to use an algorithm similar to that
of the ebitmap_contains() function, with an additional check for the
highest set bit. The ebitmap_contains() function was therefore extended
to cover an optional check for the highest set bit, and the
mls_level_isvalid() function was modified to call ebitmap_contains().
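
The idea, shown on plain bitmaps rather than the node-list based
ebitmap structure (a simplified sketch, not the actual
ebitmap_contains() code):

/*
 * Check that every bit set in 'small' is also set in 'big', one word
 * at a time, and that no bit above 'max_allowed_bit' is set in
 * 'small'.  The real ebitmap_contains() does the same walk over the
 * two ebitmap node lists.
 */
static bool bitmap_contains_sketch(const unsigned long *big,
                                   const unsigned long *small,
                                   unsigned int nbits,
                                   unsigned int max_allowed_bit)
{
        unsigned int i, last;

        for (i = 0; i < BITS_TO_LONGS(nbits); i++)
                if (small[i] & ~big[i])
                        return false;   /* a bit of 'small' is missing in 'big' */

        last = find_last_bit(small, nbits);
        return last >= nbits || last <= max_allowed_bit;  /* empty or within limit */
}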

With that change, the perf trace showed that the CPU time used dropped
to just 0.08% (ebitmap_contains + mls_level_isvalid) of the total,
which is about 100X less than before.

  0.07%ls  [kernel.kallsyms] [k] ebitmap_contains
  0.05%ls  [kernel.kallsyms] [k] ebitmap_get_bit
  0.01%ls  [kernel.kallsyms] [k] mls_level_isvalid
  0.01%ls  [kernel.kallsyms] [k] find_next_bit

The remaining ebitmap_get_bit() and find_next_bit() functions calls
are made by other kernel routines as the new mls_level_isvalid()
function will not call them anymore.

This patch also improves the high_systime AIM7 benchmark result,
though the improvement is not as impressive as is suggested by the
reduction in CPU time spent in the ebitmap functions. The table below
shows the performance change on the 2-socket x86-64 system (with HT
on) mentioned above.

+--+---++-+
|   Workload   | mean % change | mean % change  | mean % change   |
|  | 10-100 users  | 200-1000 users | 1100-2000 users |
+--+---++-+
| high_systime | +0.1% | +0.9%  | +2.6%   |
+--+---++-+

Re: [PATCH] lockref: remove cpu_relax() again

2013-09-05 Thread Waiman Long

On 09/05/2013 11:31 AM, Linus Torvalds wrote:

On Thu, Sep 5, 2013 at 6:18 AM, Heiko Carstens
  wrote:

*If* however the cpu_relax() makes sense on other platforms maybe we could
add something like we have already with "arch_mutex_cpu_relax()":

I actually think it won't.

The lockref cmpxchg isn't waiting for something to change - it only
loops _if_ something has changed, and rather than cpu_relax(), we most
likely want to try to take advantage of the fact that we have the
changed data in our exclusive cacheline, and try to get our ref update
out as soon as possible.

IOW, the lockref loop is not an idle loop like a spinlock "wait for
lock to be released", it's very much an active loop of "oops,
something changed".

And there can't be any livelock, since by definition somebody else
_did_ make progress. In fact, adding the cpu_relax() probably just
makes things much less fair - once somebody else raced on you, the
cpu_relax() now makes it more likely that _another_ cpu does so too.

That said, let's see what Tony's numbers are. On x86, it doesn't seem to
matter, but as Tony noticed, the variability can be quite high (for
me, the numbers tend to be quite stable when running the test program
multiple times in a loop, but then variation between boots or after
having done something else can be quite big - I suspect the cache
access patterns end up varying wildly with different dentry layout and
hash chain depth).

   Linus

I have found that having a cpu_relax() at the bottom of the while
loop in the CMPXCHG_LOOP() macro does help performance in some cases on
x86-64 processors. I saw no noticeable difference for the AIM7
short workload. On the high_systime workload, however, I saw about 5%
better performance with cpu_relax(). Below are the perf profiles of
the 2 high_systime runs at 1500 users on an 80-core DL980 with HT off.

Without cpu_relax():

 4.60% ls  [kernel.kallsyms] [k] _raw_spin_lock
  |--48.50%-- lockref_put_or_lock
  |--46.97%-- lockref_get_or_lock
  |--1.04%-- lockref_get

 2.95% reaim  [kernel.kallsyms] [k] _raw_spin_lock
  |--29.70%-- lockref_put_or_lock
  |--28.87%-- lockref_get_or_lock
  |--0.63%-- lockref_get


With cpu_relax():

 1.67% reaim  [kernel.kallsyms] [k] _raw_spin_lock
  |--14.80%-- lockref_put_or_lock
  |--14.04%-- lockref_get_or_lock

 1.36% ls  [kernel.kallsyms] [k] _raw_spin_lock
  |--44.94%-- lockref_put_or_lock
  |--43.12%-- lockref_get_or_lock

So I would suggest having some kind of conditional cpu_relax() in
the loop.
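
For reference, the kind of change I mean is just a pause hint at the bottom
of a cmpxchg retry loop. The sketch below is a user-space approximation;
cpu_pause() and counter_inc() are illustrative names, not the lockref
CMPXCHG_LOOP code itself:

	#include <stdatomic.h>

	static inline void cpu_pause(void)
	{
	#if defined(__x86_64__) || defined(__i386__)
		__builtin_ia32_pause();		/* rough analogue of cpu_relax() */
	#endif
	}

	static void counter_inc(_Atomic unsigned long *counter)
	{
		unsigned long old = atomic_load_explicit(counter, memory_order_relaxed);

		for (;;) {
			/* try to publish old + 1; on failure 'old' is refreshed */
			if (atomic_compare_exchange_weak(counter, &old, old + 1))
				break;
			cpu_pause();	/* back off slightly before the next attempt */
		}
	}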

-Longman





[PATCH v2 1/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-05 Thread Waiman Long
When running the AIM7's short workload, Linus' lockref patch eliminated
most of the spinlock contention. However, there were still some left:

 8.46% reaim  [kernel.kallsyms] [k] _raw_spin_lock
 |--42.21%-- d_path
 |  proc_pid_readlink
 |  SyS_readlinkat
 |  SyS_readlink
 |  system_call
 |  __GI___readlink
 |
 |--40.97%-- sys_getcwd
 |  system_call
 |  __getcwd

The big one here is the rename_lock (seqlock) contention in d_path()
and the getcwd system call. This patch will eliminate the need to take
the rename_lock while translating dentries into the full pathnames.

The need to take the rename_lock is to make sure that no rename
operation can be ongoing while the translation is in progress. However,
only one thread can take the rename_lock thus blocking all the other
threads that need it even though the translation process won't make
any change to the dentries.

This patch will replace the writer's write_seqlock/write_sequnlock
sequence of the rename_lock of the callers of the prepend_path() and
__dentry_path() functions with the reader's read_seqbegin/read_seqretry
sequence within these 2 functions. As a result, the code will have to
retry if one or more rename operations had been performed. In addition,
RCU read lock will be taken during the translation process to make sure
that no dentries will go away. To prevent live-lock from happening,
the code will switch back to taking the rename_lock if read_seqretry()
fails three times.
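
The overall control flow is roughly as follows (an illustrative sketch only,
not the exact patch; translate_path() is a placeholder name, while
rename_lock and the seqlock/RCU primitives are the existing kernel ones):

	#include <linux/seqlock.h>
	#include <linux/rcupdate.h>

	extern seqlock_t rename_lock;

	static void translate_path(void)	/* buffer arguments elided */
	{
		int retry_cnt = 3;
		unsigned int seq = 0;

	restart:
		if (retry_cnt) {		/* lockless attempt */
			seq = read_seqbegin(&rename_lock);
			rcu_read_lock();
		} else {			/* too many races: lock out renames */
			write_seqlock(&rename_lock);
		}

		/* ... walk the d_parent chain and prepend each name ... */

		if (retry_cnt) {
			retry_cnt--;
			rcu_read_unlock();
			if (read_seqretry(&rename_lock, seq))
				goto restart;
		} else {
			write_sequnlock(&rename_lock);
		}
	}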

To further reduce spinlock contention, this patch also tries not
to take the dentry's d_lock if the dentry name is internal. For an
external dentry name, it is safer to take the d_lock, as racing with
d_move() may cause access to an invalid memory location leading to a
segmentation fault.  For safety, there is also an additional check to
see if the name string is valid (no embedded null).

The following code re-factoring changes are also made:
1. Move prepend('/') into prepend_name().
2. Move the global root check in prepend_path() back to the top of
   the while loop.

With this patch, the _raw_spin_lock will now account for only 1.2%
of the total CPU cycles for the short workload. This patch also
reduces the impact of running perf on its own profile, since the
perf command itself can be a heavy user of the d_path() function
depending on the complexity of the workload.

When taking the perf profile of the high-systime workload, the amount
of spinlock contention contributed by running perf without this patch
was about 16%. With this patch, the spinlock contention caused by
the running of perf will go away and we will have a more accurate
perf profile.

Signed-off-by: Waiman Long 
---
 fs/dcache.c |  213 ++-
 1 files changed, 151 insertions(+), 62 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 96655f4..9703aa6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2510,9 +2510,57 @@ static int prepend(char **buffer, int *buflen, const 
char *str, int namelen)
return 0;
 }
 
-static int prepend_name(char **buffer, int *buflen, struct qstr *name)
+static int prepend_name(char **buffer, int *buflen, struct dentry *dentry)
 {
-   return prepend(buffer, buflen, name->name, name->len);
+   /*
+* With RCU path tracing, it may race with rename. Use
+* ACCESS_ONCE() to make sure that it is either the old or
+* the new name pointer. The length does not really matter as
+* the sequence number check will eventually catch any ongoing
+* rename operation.
+*/
+   const char *dname = ACCESS_ONCE(dentry->d_name.name);
+   u32 dlen = dentry->d_name.len;
+   int error;
+
+   if (likely(dname == (const char *)dentry->d_iname)) {
+   /*
+* Internal dname, the string memory is valid as long
+* as its length is not over the limit.
+*/
+   if (unlikely(dlen > sizeof(dentry->d_iname)))
+   return -EINVAL;
+   } else if (!dname)
+   return -EINVAL;
+   else {
+   const char *ptr;
+   u32 len;
+
+   /*
+* External dname, need to fetch name pointer and length
+* again under d_lock to get a consistent set and avoid
+* racing with d_move() which will take d_lock before
+* acting on the dentries.
+*/
+   spin_lock(&dentry->d_lock);
+   dname = dentry->d_name.name;
+   dlen  = dentry->d_name.len;
+   spin_unlock(&dentry->d_lock);
+
+   if (unlikely(!dname || !dlen))
+ 

[PATCH v2 0/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-05 Thread Waiman Long
Change History

v1->v2:
 - Check for internal vs external dname, taking d_lock only for
   external dname for safety.
 - Replace memchr() by a byte-by-byte checking for loop.
 - Try lockless dentry to pathname conversion 3 times before falling
   back to taking the rename_lock to prevent live-lock.
 - Make code re-factoring suggested by George Spelvin.

Waiman Long (1):
  dcache: Translating dentry into pathname without taking rename_lock

 fs/dcache.c |  213 ++-
 1 files changed, 151 insertions(+), 62 deletions(-)



Re: [PATCH v2 1/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-05 Thread Waiman Long

On 09/05/2013 03:35 PM, Linus Torvalds wrote:

No. Stop all these stupid games.

No d_lock, no trying to make d_name/d_len consistent.

No "compare d_name against d_iname".

No EINVAL.

No idiotic racy "let's fetch each byte one-by one and test them
against NUL", which is just racy and stupid.

Just do what I told you to do: *copy* the name one byte at a time, and
stop copying if you hit NUL. IOW, make prepend() just use "strncpy()"
instead of "memcpy()". You don't even need to check the end result -
if it's bogus, the sequence count will fix it.

No special cases. No games. No crap. Just make "prepend()" able to
handle the special case of "oops, the name length didn't match, but we
must not ever go past the end of the buffer".

We can optimize strncpy() to use word accesses if it ends up ever
being a performance issue. I suspect it never even shows up, but it's
not hard to do if if does.

 Linus


It is not as simple as doing a strncpy(). The pathname was built from 
the leaf up to the root, and from the end of buffer toward the 
beginning. As it goes through the while loop, the buffer will look like:


"    /c"
"  /b/c"
"/a/b/c"

If the content of the string is unreliable, I have to do at least 2 passes:
1) Locate the end of the string and determine the actual length
2) Copy the whole string, or copy it byte-by-byte backward

That is why I am looking for the null byte. Using strncpy() alone won't
work. However, if the actual length doesn't match the dlen, the
name string will be invalid and there is no point in proceeding any further.


I also considered the possible, but unlikely, scenario that during a
rename operation, a short old name could be freed and replaced by a long
new name. The old name could then be allocated to another user who fills
it up with non-NULL bytes. If the buffer happens to sit at the boundary
between a valid and an invalid memory page, the code may step over to an
invalid address while looking for the null byte. The current code probably
won't free the buffer while the RCU lock is being held, but a future code
change may invalidate this assumption. Blindly taking the d_lock, as the
original code does, may be too expensive. That is why I came up with
the internal vs. external dname check and take the lock only for an
external dname, for safety.


I can change the code to just locate the end of the string and copy
it backward, without using strncpy().


-Longman


Re: [PATCH v2 1/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-05 Thread Waiman Long

On 09/05/2013 04:04 PM, Al Viro wrote:

On Thu, Sep 05, 2013 at 02:55:16PM -0400, Waiman Long wrote:

+   const char *dname = ACCESS_ONCE(dentry->d_name.name);
+   u32 dlen = dentry->d_name.len;
+   int error;
+
+   if (likely(dname == (const char *)dentry->d_iname)) {
+   /*
+* Internal dname, the string memory is valid as long
+* as its length is not over the limit.
+*/
+   if (unlikely(dlen > sizeof(dentry->d_iname)))
+   return -EINVAL;
+   } else if (!dname)
+   return -EINVAL;

Can't happen.

+   else {
+   const char *ptr;
+   u32 len;
+
+   /*
+* External dname, need to fetch name pointer and length
+* again under d_lock to get a consistent set and avoid
+* racing with d_move() which will take d_lock before
+* acting on the dentries.
+*/
+   spin_lock(&dentry->d_lock);
+   dname = dentry->d_name.name;
+   dlen  = dentry->d_name.len;
+   spin_unlock(&dentry->d_lock);
+
+   if (unlikely(!dname || !dlen))
+   return -EINVAL;

Can't happen.


+   /*
+* As the length and the content of the string may not be
+* valid, need to scan the string and return EINVAL if there
+* is embedded null byte within the length of the string.
+*/
+   for (ptr = dname, len = dlen; len; len--, ptr++) {
+   if (*ptr == '\0')
+   return -EINVAL;

Egads...  First of all, this is completely pointless - if you've grabbed
->d_name.name and ->d_name.len under ->d_lock, you don't *need* that crap.
At all.  The whole point of that exercise is to avoid taking ->d_lock;
_that_ is where the "read byte by byte until you hit NUL" comes from.
And if you do that, you can bloody well just go ahead and store them in
the target array *as* *you* *go*.  No reason to bother with memcpy()
afterwards.


That is what I thought too. I am just not totally sure about it. So yes,
I can scrap all these additional checks.


As the internal dname buffer is at least 32 bytes, most dentries will 
use the internal buffer instead of allocating from kmem. IOW, the d_lock 
taking code path is unlikely to be used.



Damnit, just grab len and name (no ->d_lock, etc.).  Check if you've got
enough space in the buffer, treat "not enough" as an overflow.  Then
proceed to copy the damn thing over there (starting at *buffer -= len)
byte by byte, stopping when you've copied len bytes *or* when the byte you've
got happens to be NUL.  Don't bother with EINVAL, etc. - just return to
caller and let rename_lock logics take care of the races.  That's it - nothing
more is needed.


OK, I will do that.
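
Something along these lines, I suppose (a sketch of the copy loop Al
describes; the buffer handling follows the existing prepend() convention,
and the dname/dlen arguments stand in for the values fetched from the qstr):

	static int prepend_name_sketch(char **buffer, int *buflen,
				       const char *dname, u32 dlen)
	{
		char *p;

		if (*buflen < dlen + 1)
			return -ENAMETOOLONG;
		*buflen -= dlen + 1;
		p = *buffer -= dlen + 1;
		*p++ = '/';
		while (dlen--) {
			char c = *dname++;

			if (!c)			/* stale length: stop at the NUL */
				break;
			*p++ = c;
		}
		return 0;
	}

Any garbage written because the length and pointer were mismatched gets
thrown away when the caller's rename_lock sequence check fails.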

-Longman


Re: [PATCH v2 1/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-05 Thread Waiman Long

On 09/05/2013 04:42 PM, Linus Torvalds wrote:

On Thu, Sep 5, 2013 at 1:29 PM, Waiman Long  wrote:

It is not as simple as doing a strncpy().

Yes it damn well is.

Stop the f*cking stupid arguments, and instead listen to what I say.

Here. Let me bold-face the most important part for you, so that you
don't miss it in all the other crap:

MAKE prepend() JUST USE "strncpy()" INSTEAD OF "memcpy()".

Nothing else. Seriously. Your "you can't do it because we copy
backwards" arguments are pure and utter garbage, exactly BECAUSE YOU
DON'T CHANGE ANY OF THAT. You can actually use the unreliable length
variable BUT YOU MUST STILL STOP AT A ZERO.

Get it?

You're complicating the whole thing for no good reason. I'm telling
you (and HAVE BEEN telling you multiple times) that you cannot use
"memcpy()" because the length may not be reliable, so you need to
check for zero in the middle and stop early. All your arguments have
been totally pointless, because you don't seem to see that simple and
fundamental issue. You don't change ANYTHING else. But you damn well
not do a "memcpy", you do something that stops when it hits a NUL
character.

We call that function "strncpy()". I'd actually prefer to write it out
by hand (because somebody could implement "strncpy()" as a
questionable function that accesses past the NUL as long as it's
within the 'n'), and because I think we might want to do that
word-at-a-time version of it, but for a first approximation, just do
that one-liner version.

Don't do anything else. Don't do locking. Don't do memchr. Just make
sure that you stop at a NUL character, and don't trust the length,
because the length may not match the pointer. That was always ALL
you needed to do.

   Linus
I am sorry that I misunderstood what you said. I will do what you and Al
advise me to do.


-Longman


[PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-06 Thread Waiman Long
When running the AIM7's short workload, Linus' lockref patch eliminated
most of the spinlock contention. However, there were still some left:

 8.46% reaim  [kernel.kallsyms] [k] _raw_spin_lock
 |--42.21%-- d_path
 |  proc_pid_readlink
 |  SyS_readlinkat
 |  SyS_readlink
 |  system_call
 |  __GI___readlink
 |
 |--40.97%-- sys_getcwd
 |  system_call
 |  __getcwd

The big one here is the rename_lock (seqlock) contention in d_path()
and the getcwd system call. This patch will eliminate the need to take
the rename_lock while translating dentries into the full pathnames.

The need to take the rename_lock is to make sure that no rename
operation can be ongoing while the translation is in progress. However,
only one thread can take the rename_lock thus blocking all the other
threads that need it even though the translation process won't make
any change to the dentries.

This patch will replace the writer's write_seqlock/write_sequnlock
sequence of the rename_lock of the callers of the prepend_path() and
__dentry_path() functions with the reader's read_seqbegin/read_seqretry
sequence within these 2 functions. As a result, the code will have to
retry if one or more rename operations had been performed. In addition,
RCU read lock will be taken during the translation process to make sure
that no dentries will go away. To prevent live-lock from happening,
the code will switch back to taking the rename_lock if read_seqretry()
fails three times.

To further reduce spinlock contention, this patch does not take the
dentry's d_lock when copying the filename from the dentries. Instead,
it treats the name pointer and length as unreliable and just copies
the string byte-by-byte until it hits a null byte or the end of the
string as specified by the length. This should avoid stepping into
an invalid memory address. The error cases are left to be handled by
the sequence number check.

The following code re-factoring changes are also made:
1. Move prepend('/') into prepend_name() to remove one conditional
   check.
2. Move the global root check in prepend_path() back to the top of
   the while loop.

With this patch, the _raw_spin_lock will now account for only 1.2%
of the total CPU cycles for the short workload. This patch also
reduces the impact of running perf on its own profile, since the
perf command itself can be a heavy user of the d_path() function
depending on the complexity of the workload.

When taking the perf profile of the high-systime workload, the amount
of spinlock contention contributed by running perf without this patch
was about 16%. With this patch, the spinlock contention caused by
the running of perf will go away and we will have a more accurate
perf profile.

Signed-off-by: Waiman Long 
---
 fs/dcache.c |  178 ++
 1 files changed, 116 insertions(+), 62 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 96655f4..da095ce 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -89,6 +89,12 @@ EXPORT_SYMBOL(rename_lock);
 static struct kmem_cache *dentry_cache __read_mostly;
 
 /*
+ * Try 3 times of lockless dentry to pathname conversion before falling
+ * back to take the rename_lock.
+ */
+#define D_LOCKLESS_PREPEND_RETRY	3
+
+/*
  * This is the single most critical data structure when it comes
  * to the dcache: the hashtable for lookups. Somebody should try
  * to make this good - I've just made it work.
@@ -2512,7 +2518,33 @@ static int prepend(char **buffer, int *buflen, const 
char *str, int namelen)
 
 static int prepend_name(char **buffer, int *buflen, struct qstr *name)
 {
-   return prepend(buffer, buflen, name->name, name->len);
+   /*
+* With RCU path tracing, it may race with d_move(). Use
+* ACCESS_ONCE() to make sure that either the old or the new name
+* pointer and length are fetched. However, there may be mismatch
+* between length and pointer. The length cannot be trusted, we need
+* to copy it byte-by-byte until the length is reached or a null
+* byte is found. It also prepends "/" at the beginning of the name.
+* The sequence number check at the caller will retry it again when
+* a d_move() does happen. So any garbage in the buffer due to
+* mismatched pointer and length will be discarded.
+*/
+   const char *dname = ACCESS_ONCE(name->name);
+   u32 dlen = ACCESS_ONCE(name->len);
+   char *p;
+
+   if (*buflen < dlen + 1)
+   return -ENAMETOOLONG;
+   *buflen -= dlen + 1;
+   p = *buffer -= dlen + 1;
+   *p++ = '/';
+   while (dlen--) {
+   char c = *dname++;

[PATCH v3 0/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-06 Thread Waiman Long
Change History

v2->v3:
 - Simplify prepend_name() to blindly copy the dname over until it
   reaches a null byte or the specified length leaving the sequence
   check to handle error case.

v1->v2:
 - Check for internal vs external dname, taking d_lock only for
   external dname for safety.
 - Replace memchr() by a byte-by-byte checking for loop.
 - Try lockless dentry to pathname conversion 3 times before falling
   back to taking the rename_lock to prevent live-lock.
 - Make code re-factoring suggested by George Spelvin.

Waiman Long (1):
  dcache: Translating dentry into pathname without taking rename_lock

 fs/dcache.c |  178 ++
 1 files changed, 116 insertions(+), 62 deletions(-)



Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-06 Thread Waiman Long

On 09/06/2013 04:52 PM, Linus Torvalds wrote:

On Fri, Sep 6, 2013 at 9:08 AM, Waiman Long  wrote:

This patch will replace the writer's write_seqlock/write_sequnlock
sequence of the rename_lock of the callers of the prepend_path() and
__dentry_path() functions with the reader's read_seqbegin/read_seqretry
sequence within these 2 functions.

Ok, this actually looks really good.

I do have one comment, from just reading the patch:

I would really like the stuff inside the

restart:
   bptr = *buffer;
   blen = *buflen;
   if (retry_cnt) {
 seq = read_seqbegin(&rename_lock);
 rcu_read_lock();
   } else
 write_seqlock(&rename_lock);

   ... guts of path generation ...

   if (retry_cnt) {
 retry_cnt--;
 rcu_read_unlock();
 if (read_seqretry(&rename_lock, seq))
   goto restart;
   } else
 write_sequnlock(&rename_lock);

could possible be done as a separate function?

Alternatively (or perhaps additionally), maybe the locking could be
done as an inline function too (taking the retry count as an argument)
to make things a bit more easy to understand.


I would prefer putting the begin and end blocks into 2 inlined
helper functions to make the code easier to read. I will work on this
over the weekend.


-Longman



Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-09 Thread Waiman Long

On 09/07/2013 02:07 PM, Al Viro wrote:

On Sat, Sep 07, 2013 at 10:52:02AM -0700, Linus Torvalds wrote:


So I think we could make a more complicated data structure that looks
something like this:

struct seqlock_retry {
   unsigned int seq_no;
   int state;
};

and pass that around. Gcc should do pretty well, especially if we
inline things (but even if not, small structures that fit in 64 bytes
generate reasonable code even on 32-bit targets, because gcc knows
about using two registers for passing data around)..

Then you can make "state" have a retry counter in it, and have a
negative value mean "I hold the lock for writing". Add a couple of
helper functions, and you can fairly easily handle the mixed "try for
reading first, then fall back to writing".

That said, __d_lookup() still shows up as very performance-critical on
some loads (symlinks in particular cause us to fall out of the RCU
cases) so I'd like to keep that using the simple pure read case. I
don't believe you can livelock it, as mentioned. But the other ones
might well be worth moving to a "fall back to write-locking after
tries" model. They might all traverse user-specified paths of fairly
arbitrary depth, no?

So this "seqlock_retry" thing wouldn't _replace_ bare seqlocks, it
would just be a helper thing for this kind of behavior where we want
to normally do things with just the read-lock, but want to guarantee
that we don't live-lock.

Sounds reasonable?

More or less; I just wonder if we are overdesigning here - if we don't
do "repeat more than once", we can simply use the lower bit of seq -
read_seqlock() always returns an even value.  So we could do something
like seqretry_and_lock(lock, &seq):
	if ((*seq & 1) || !read_seqretry(lock, *seq))
		return true;
	*seq |= 1;
	write_seqlock(lock);
	return false;
and seqretry_done(lock, seq):
	if (seq & 1)
		write_sequnlock(lock);
with these loops turning into
	seq = read_seqlock(&rename_lock);
	...
	if (!seqretry_and_lock(&rename_lock, &seq))
		goto again;
	...
	seqretry_done(&rename_lock);


I am fine with trying it once and then locking instead of trying it a few
times. Now, are you planning to have these helper functions for the
dcache layer only or as part of the seqlock infrastructure? If we are
going to touch the seqlock layer, I would suggest adding a blocking
reader that takes the lock but won't update the sequence number, so that
it won't block other sequence readers as my original seqlock patch was
doing.


-Longman


[PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-09 Thread Waiman Long
When running the AIM7's short workload, Linus' lockref patch eliminated
most of the spinlock contention. However, there were still some left:

 8.46% reaim  [kernel.kallsyms] [k] _raw_spin_lock
 |--42.21%-- d_path
 |  proc_pid_readlink
 |  SyS_readlinkat
 |  SyS_readlink
 |  system_call
 |  __GI___readlink
 |
 |--40.97%-- sys_getcwd
 |  system_call
 |  __getcwd

The big one here is the rename_lock (seqlock) contention in d_path()
and the getcwd system call. This patch will eliminate the need to take
the rename_lock while translating dentries into the full pathnames.

The need to take the rename_lock is to make sure that no rename
operation can be ongoing while the translation is in progress. However,
only one thread can take the rename_lock thus blocking all the other
threads that need it even though the translation process won't make
any change to the dentries.

This patch will replace the writer's write_seqlock/write_sequnlock
sequence of the rename_lock of the callers of the prepend_path() and
__dentry_path() functions with the reader's read_seqbegin/read_seqretry
sequence within these 2 functions. As a result, the code will have to
retry if one or more rename operations had been performed. In addition,
RCU read lock will be taken during the translation process to make sure
that no dentries will go away. To prevent live-lock from happening,
the code will switch back to taking the rename_lock if read_seqretry()
fails three times.

To further reduce spinlock contention, this patch does not take the
dentry's d_lock when copying the filename from the dentries. Instead,
it treats the name pointer and length as unreliable and just copies
the string byte-by-byte until it hits a null byte or the end of the
string as specified by the length. This should avoid stepping into
an invalid memory address. The error cases are left to be handled by
the sequence number check.

The following code re-factoring changes are also made:
1. Move prepend('/') into prepend_name() to remove one conditional
   check.
2. Move the global root check in prepend_path() back to the top of
   the while loop.

With this patch, the _raw_spin_lock will now account for only 1.2%
of the total CPU cycles for the short workload. This patch also
reduces the impact of running perf on its own profile, since the
perf command itself can be a heavy user of the d_path() function
depending on the complexity of the workload.

When taking the perf profile of the high-systime workload, the amount
of spinlock contention contributed by running perf without this patch
was about 16%. With this patch, the spinlock contention caused by
the running of perf will go away and we will have a more accurate
perf profile.

Signed-off-by: Waiman Long 
---
 fs/dcache.c |  196 ---
 1 files changed, 133 insertions(+), 63 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ca8e9cd..8186ff9 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -88,6 +88,44 @@ EXPORT_SYMBOL(rename_lock);
 
 static struct kmem_cache *dentry_cache __read_mostly;
 
+/**
+ * read_seqbegin_or_lock - begin a sequence number check or locking block
+ * lock: sequence lock
+ * seq : sequence number to be checked
+ *
+ * First try it once optimistically without taking the lock. If that fails,
+ * take the lock. The sequence number is also used as a marker for deciding
+ * whether to be a reader (even) or writer (odd).
+ * N.B. seq must be initialized to an even number to begin with.
+ */
+static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq)
+{
+   if (!(*seq & 1)) {  /* Even */
+   *seq = read_seqbegin(lock);
+   rcu_read_lock();
+   } else  /* Odd */
+   write_seqlock(lock);
+}
+
+/**
+ * read_seqretry_or_unlock - end a seqretry or lock block & return retry status
+ * lock : sequence lock
+ * seq  : sequence number
+ * Return: 1 to retry operation again, 0 to continue
+ */
+static inline int read_seqretry_or_unlock(seqlock_t *lock, int *seq)
+{
+   if (!(*seq & 1)) {  /* Even */
+   rcu_read_unlock();
+   if (read_seqretry(lock, *seq)) {
+   (*seq)++;   /* Take writer lock */
+   return 1;
+   }
+   } else  /* Odd */
+   write_sequnlock(lock);
+   return 0;
+}
+
 /*
  * This is the single most critical data structure when it comes
  * to the dcache: the hashtable for lookups. Somebody should try
@@ -2647,9 +2685,39 @@ static int prepend(char **buffer, int *buflen, const 
char *str, int namelen)
return 0;
 }
 
+/**
+ * prepend_name - prepend a pathname i

[PATCH v4 0/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-09 Thread Waiman Long
Change History

v3->v4:
 - Extract the begin and end blocks of the rename_lock sequence number
   check into helper functions to make the code easier to read.

v2->v3:
 - Simplify prepend_name() to blindly copy the dname over until it
   reaches a null byte or the specified length leaving the sequence
   check to handle error case.

v1->v2:
 - Check for internal vs external dname, taking d_lock only for
   external dname for safety.
 - Replace memchr() by a byte-by-byte checking for loop.
 - Try lockless dentry to pathname conversion 3 times before falling
   back to taking the rename_lock to prevent live-lock.
 - Make code re-factoring suggested by George Spelvin.

Waiman Long (1):
  dcache: Translating dentry into pathname without taking rename_lock

 fs/dcache.c |  196 ---
 1 files changed, 133 insertions(+), 63 deletions(-)



Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-09 Thread Waiman Long

On 09/09/2013 01:45 PM, Linus Torvalds wrote:

On Mon, Sep 9, 2013 at 10:29 AM, Al Viro  wrote:

I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends
can do that themselves just fine (it needs to be taken when seq is even),
and e.g. d_walk() doesn't need it at all.  Other than that, I'm OK with
this variant.

Hmm.. I think you need the RCU read lock even when you get the write_seqlock().

Yes, getting the seqlock for write implies that you get a spinlock and
in many normal circumstances that basically is equivalent to being
rcu-locked, but afaik in some configurations that is *not* sufficient
protection against an RCU grace period on another CPU. You need to do
a real rcu_read_lock that increments that whole rcu_read_lock_nesting
level, which a spinlock won't do.

And while the rename sequence lock protects against _renames_, it does
not protect against just plain dentries getting free'd under memory
pressure.

So I think the RCU-readlockness really needs to be independent of the
sequence lock.

 Linus


Yes, you are right. It will be safer to take rcu_read_lock() even if we
are taking the rename_lock. It wasn't needed before as d_lock was taken.
I will update the patch to take rcu_read_lock() out of the helpers to
reflect that.


Regards,
Longman


Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-09 Thread Waiman Long

On 09/09/2013 01:29 PM, Al Viro wrote:

On Mon, Sep 09, 2013 at 12:18:13PM -0400, Waiman Long wrote:

+/**
+ * read_seqbegin_or_lock - begin a sequence number check or locking block
+ * lock: sequence lock
+ * seq : sequence number to be checked
+ *
+ * First try it once optimistically without taking the lock. If that fails,
+ * take the lock. The sequence number is also used as a marker for deciding
+ * whether to be a reader (even) or writer (odd).
+ * N.B. seq must be initialized to an even number to begin with.
+ */
+static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq)
+{
+   if (!(*seq & 1)) {  /* Even */
+   *seq = read_seqbegin(lock);
+   rcu_read_lock();
+   } else  /* Odd */
+   write_seqlock(lock);
+}
+static inline int read_seqretry_or_unlock(seqlock_t *lock, int *seq)
+{
+   if (!(*seq & 1)) {  /* Even */
+   rcu_read_unlock();
+   if (read_seqretry(lock, *seq)) {
+   (*seq)++;   /* Take writer lock */
+   return 1;
+   }
+   } else  /* Odd */
+   write_sequnlock(lock);
+   return 0;
+}

I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends
can do that themselves just fine (it needs to be taken when seq is even),
and e.g. d_walk() doesn't need it at all.  Other than that, I'm OK with
this variant.


I think rcu_read_lock() is needed to make sure that the dentry won't be 
freed as we don't take d_lock now.


-Longman


Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-09 Thread Waiman Long

On 09/09/2013 02:36 PM, Al Viro wrote:

On Mon, Sep 09, 2013 at 07:21:11PM +0100, Al Viro wrote:

Actually, it's better for prepend_path() as well, because it's actually

rcu_read_lock();
seq = read_seqbegin(&rename_lock);
again:

if (error)
goto done;

if (!seqretry_and_lock(&rename_lock, seq))
goto again; /* now as writer */
done:
seqretry_done(&rename_lock, seq);
rcu_read_unlock();

Posted variant will sometimes hit the following path:
* seq_readlock()
* start generating the output
* hit an error
[another process has taken and released rename_lock for some reason]
* hit read_seqretry_and_unlock(), which returns 1.
* retry everything with seq_writelock(), despite the error.

It's not too horrible (we won't be looping indefinitely, ignoring error
all along), but it's certainly subtle enough...

FWIW, what I propose is this (just the d_path-related parts):




I am fine with your proposed change as long as it gets the job done. It 
doesn't really matter if you do it or I do it.


Thanks for taking the effort to make it better for us all.

-Longman


Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-09 Thread Waiman Long

On 09/09/2013 03:28 PM, Al Viro wrote:

On Mon, Sep 09, 2013 at 08:10:29PM +0100, Al Viro wrote:

On Mon, Sep 09, 2013 at 02:46:57PM -0400, Waiman Long wrote:


I am fine with your proposed change as long as it gets the job done.

I suspect that the real problem is the unlock part of read_seqretry_or_unlock();
for d_walk() we want to be able to check if we need retry and continue walking
if we do not.  Let's do it that way: I've applied your patch as is, with the
next step being
* split read_seqretry_or_unlock():
need_seqretry() (return (!(seq & 1) && read_seqretry(lock, seq))
done_seqretry() (if (seq & 1) write_sequnlock(lock, seq)),
your if (read_seqretry_or_unlock(&rename_lock, &seq))
goto restart;
becoming
if (need_seqretry(&rename_lock, seq)) {
seq = 1;
goto restart;
}
done_seqretry(&rename_lock, seq);

Then d_walk() is trivially massaged to use of read_seqbegin_or_lock(),
need_seqretry() and done_seqretry().  Give me a few, I'll post it...

OK, how about this?  It splits read_seqretry_or_unlock(), takes
rcu_read_{lock,unlock} in the callers and converts d_walk() to those
primitives.  I've pushed that and your commit into vfs.git#experimental
(head at 48f5ec2, should propagate in a few); guys, please give it a look
and comment.


The changes look good to me. I was planning to take rcu_read_lock() out
and do something similar, but your change is good. BTW, I think Linus
wants to add some comments on why the RCU lock is needed without the
rename_lock, but I can put that in with a follow-up patch once the
current change is merged.


Thanks for your help and inspiration on this patch.

-Longman


Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock

2013-09-09 Thread Waiman Long

On 09/09/2013 08:40 PM, George Spelvin wrote:

I'm really wondering about only trying once before taking the write lock.
Yes, using the lsbit is a cute hack, but are we using it for its cuteness
rather than its effectiveness?

Renames happen occasionally.  If that causes all the current pathname
translations to fall back to the write lock, that is fairly heavy.
Worse, all of those translations will (unnecessarily) bump the write
seqcount, triggering *other* translations to fail back to the write-lock
path.

One patch to fix this would be to have the fallback read algorithm take
sl->lock but *not* touch sl->seqcount, so it wouldn't break concurrent
readers.


Actually, a follow-up patch that I am planning to do is to introduce a 
read_seqlock() primitive in seqlock.h that does exactly that. Then the 
write_seqlock() in this patch will be modified to read_seqlock().


-Longman


Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

2013-09-10 Thread Waiman Long

On 09/10/2013 04:25 PM, Linus Torvalds wrote:

On Tue, Sep 10, 2013 at 12:57 PM, Mace Moneta  wrote:

The (first) patch looks good; no recurrence. It has only taken 3-5 minutes
before, and I've been up for about half an hour now.

Ok, good. It's pushed out.

Al, your third pile of VFS stuff is also merged. Waiman, that means
that your RCU path creation stuff is in. What else did you have
pending for scalability?

 Linus


I need to clean up some comments in the code. The other thing that I 
want to do is to introduce read_seqlock/read_sequnlock() primitives that 
do the locking without incrementing the sequence number. Then all the 
name lookup and translation code can use the new primitives as they 
don't change any of the protected structures. This will prevent one 
sequence number check failure from cascading into a series of failures 
because of the sequence number change. I will have a patch ready by 
tomorrow morning.
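
The expected caller pattern stays the same as today. Roughly (a sketch only;
walk_example() is a made-up caller, while the helpers are the ones already
in fs/dcache.c):

	void walk_example(void)
	{
		int seq = 0;		/* even: start as a lockless sequence reader */

	again:
		read_seqbegin_or_lock(&rename_lock, &seq);

		/* ... lockless traversal; d_path-style callers also hold rcu_read_lock() ... */

		if (need_seqretry(&rename_lock, seq)) {
			seq = 1;	/* odd: redo the walk holding the lock */
			goto again;
		}
		done_seqretry(&rename_lock, seq);
	}

The only behavioural difference after the change is that the locked pass
takes the spinlock without bumping the sequence count, so concurrent
lockless readers are not forced to retry.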


Regards,
Longman


[PATCH 2/2] dcache: use read_seqlock/unlock() in read_seqbegin_or_lock() & friend

2013-09-11 Thread Waiman Long
This patch modifies read_seqbegin_or_lock() and need_seqretry() to
use newly introduced read_seqlock() and read_sequnlock() primitives
so that they won't change the sequence number even if they fall back
to take the lock.  This is OK as no change to the protected data
structure is being made. It will prevent one fallback to lock taking
from cascading into a series of lock takings that reduce performance
because of the sequence number change. It will also allow other
sequence readers to go ahead while a reader lock is taken.

This patch also updates some of the inaccurate comments in the code.

Signed-off-by: Waiman Long 
---
 fs/dcache.c |   31 ---
 1 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 4d9df3c..8191ca5 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -90,8 +90,8 @@ static struct kmem_cache *dentry_cache __read_mostly;
 
 /**
  * read_seqbegin_or_lock - begin a sequence number check or locking block
- * lock: sequence lock
- * seq : sequence number to be checked
+ * @lock: sequence lock
+ * @seq : sequence number to be checked
  *
  * First try it once optimistically without taking the lock. If that fails,
  * take the lock. The sequence number is also used as a marker for deciding
@@ -103,7 +103,7 @@ static inline void read_seqbegin_or_lock(seqlock_t *lock, 
int *seq)
if (!(*seq & 1))/* Even */
*seq = read_seqbegin(lock);
else/* Odd */
-   write_seqlock(lock);
+   read_seqlock(lock);
 }
 
 static inline int need_seqretry(seqlock_t *lock, int seq)
@@ -114,7 +114,7 @@ static inline int need_seqretry(seqlock_t *lock, int seq)
 static inline void done_seqretry(seqlock_t *lock, int seq)
 {
if (seq & 1)
-   write_sequnlock(lock);
+   read_sequnlock(lock);
 }
 
 /*
@@ -2673,9 +2673,9 @@ static int prepend(char **buffer, int *buflen, const char 
*str, int namelen)
 
 /**
  * prepend_name - prepend a pathname in front of current buffer pointer
- * buffer: buffer pointer
- * buflen: allocated length of the buffer
- * name:   name string and length qstr structure
+ * @buffer: buffer pointer
+ * @buflen: allocated length of the buffer
+ * @name:   name string and length qstr structure
  *
  * With RCU path tracing, it may race with d_move(). Use ACCESS_ONCE() to
  * make sure that either the old or the new name pointer and length are
@@ -2713,14 +2713,15 @@ static int prepend_name(char **buffer, int *buflen, 
struct qstr *name)
  * @buffer: pointer to the end of the buffer
  * @buflen: pointer to buffer length
  *
- * The function tries to write out the pathname without taking any lock other
- * than the RCU read lock to make sure that dentries won't go away. It only
- * checks the sequence number of the global rename_lock as any change in the
- * dentry's d_seq will be preceded by changes in the rename_lock sequence
- * number. If the sequence number had been change, it will restart the whole
- * pathname back-tracing sequence again. It performs a total of 3 trials of
- * lockless back-tracing sequences before falling back to take the
- * rename_lock.
+ * The function will first try to write out the pathname without taking any
+ * lock other than the RCU read lock to make sure that dentries won't go away.
+ * It only checks the sequence number of the global rename_lock as any change
+ * in the dentry's d_seq will be preceded by changes in the rename_lock
+ * sequence number. If the sequence number had been changed, it will restart
+ * the whole pathname back-tracing sequence again by taking the rename_lock.
+ * In this case, there is no need to take the RCU read lock as the recursive
+ * parent pointer references will keep the dentry chain alive as long as no
+ * rename operation is performed.
  */
 static int prepend_path(const struct path *path,
const struct path *root,
-- 
1.7.1



[PATCH 1/2] seqlock: Add a new blocking reader type

2013-09-11 Thread Waiman Long
The sequence lock (seqlock) was originally designed for the cases
where the readers do not need to block the writers by making the
readers retry the read operation when the data change.

Since then, the use cases have been expanded to include situations
where a thread does not need to change the data (effectively a reader)
at all but has to take the writer lock because it can't tolerate
changes to the protected structure. Some examples are the d_path()
function and the getcwd() syscall in fs/dcache.c where the functions
take the writer lock on rename_lock even though they don't need
to change anything in the protected data structure at all. This is
inefficient as a reader is now blocking other non-blocking readers
by pretending to be a writer.

This patch tries to eliminate this inefficiency by introducing a new
type of blocking reader to the seqlock locking mechanism. This new
blocking reader will not block other non-blocking readers, but will
block other blocking readers and writers.

Signed-off-by: Waiman Long 
---
 include/linux/seqlock.h |   65 +++---
 1 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 1829905..26be0d9 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -3,15 +3,18 @@
 /*
  * Reader/writer consistent mechanism without starving writers. This type of
  * lock for data where the reader wants a consistent set of information
- * and is willing to retry if the information changes.  Readers never
- * block but they may have to retry if a writer is in
- * progress. Writers do not wait for readers. 
+ * and is willing to retry if the information changes. There are two types
+ * of readers:
+ * 1. Non-blocking readers which never block but they may have to retry if
+ *a writer is in progress. Writers do not wait for non-blocking readers.
+ * 2. Blocking readers which will block if a writer is in progress. A
+ *blocking reader in progress will also block a writer.
  *
- * This is not as cache friendly as brlock. Also, this will not work
+ * This is not as cache friendly as brlock. Also, this may not work well
  * for data that contains pointers, because any writer could
  * invalidate a pointer that a reader was following.
  *
- * Expected reader usage:
+ * Expected non-blocking reader usage:
  * do {
  * seq = read_seqbegin(&foo);
  * ...
@@ -268,4 +271,56 @@ write_sequnlock_irqrestore(seqlock_t *sl, unsigned long 
flags)
spin_unlock_irqrestore(&sl->lock, flags);
 }
 
+/*
+ * The blocking reader locks out other writers, but doesn't update the count.
+ * Acts like a normal spin_lock/unlock.
+ * Don't need preempt_disable() because that is in the spin_lock already.
+ */
+static inline void read_seqlock(seqlock_t *sl)
+{
+   spin_lock(&sl->lock);
+}
+
+static inline void read_sequnlock(seqlock_t *sl)
+{
+   spin_unlock(&sl->lock);
+}
+
+static inline void read_seqlock_bh(seqlock_t *sl)
+{
+   spin_lock_bh(&sl->lock);
+}
+
+static inline void read_sequnlock_bh(seqlock_t *sl)
+{
+   spin_unlock_bh(&sl->lock);
+}
+
+static inline void read_seqlock_irq(seqlock_t *sl)
+{
+   spin_lock_irq(&sl->lock);
+}
+
+static inline void read_sequnlock_irq(seqlock_t *sl)
+{
+   spin_unlock_irq(&sl->lock);
+}
+
+static inline unsigned long __read_seqlock_irqsave(seqlock_t *sl)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(&sl->lock, flags);
+   return flags;
+}
+
+#define read_seqlock_irqsave(lock, flags)  \
+   do { flags = __read_seqlock_irqsave(lock); } while (0)
+
+static inline void
+read_sequnlock_irqrestore(seqlock_t *sl, unsigned long flags)
+{
+   spin_unlock_irqrestore(&sl->lock, flags);
+}
+
 #endif /* __LINUX_SEQLOCK_H */
-- 
1.7.1



Re: [PATCH 1/2] seqlock: Add a new blocking reader type

2013-09-11 Thread Waiman Long

On 09/11/2013 10:55 AM, Al Viro wrote:

On Wed, Sep 11, 2013 at 10:28:26AM -0400, Waiman Long wrote:

The sequence lock (seqlock) was originally designed for the cases
where the readers do not need to block the writers by making the
readers retry the read operation when the data change.

Since then, the use cases have been expanded to include situations
where a thread does not need to change the data (effectively a reader)
at all but have to take the writer lock because it can't tolerate
changes to the protected structure. Some examples are the d_path()
function and the getcwd() syscall in fs/dcache.c where the functions
take the writer lock on rename_lock even though they don't need
to change anything in the protected data structure at all. This is
inefficient as a reader is now blocking other non-blocking readers
by pretending to be a writer.

This patch tries to eliminate this inefficiency by introducing a new
type of blocking reader to the seqlock locking mechanism. This new
blocking reader will not block other non-blocking readers, but will
block other blocking readers and writers.

Umm...  That's misleading - it doesn't _block_, it spins.  Moreover,
seq_readbegin() also spins in the presence of a writer; the main property
of this one is that it keeps writers away.


I used "block" in the sense that it will stop a writer from moving 
forward. I will update the commit log to make that more clear.

Folks, any suggestions on better names?  The semantics we are getting is


I will welcome any better name suggestion and will incorporate that in 
the patch.


-Longman


Re: [PATCH 1/2] seqlock: Add a new blocking reader type

2013-09-11 Thread Waiman Long

On 09/11/2013 01:26 PM, Al Viro wrote:

On Wed, Sep 11, 2013 at 12:33:35PM -0400, Waiman Long wrote:


Folks, any suggestions on better names?  The semantics we are getting is

I will welcome any better name suggestion and will incorporate that
in the patch.

FWIW, the suggestions I've seen so far had been

seq_exreadlock() [ex for exclusive]
seq_exclreadlock() [ditto, and IMO fails the "easily read over the phone"
test - /sekv-excre...ARRGH/]
seq_prot_readlock() [prot for protected, as in DLM protected read]


Following the naming convention in seqlock.h that all functions begin
with read_ or write_, does read_seqexcl_lock() look OK to you?


-Longman


Re: [3.12-rc1] Dependency on module-init-tools >= 3.11 ?

2013-09-12 Thread Waiman Long

On 09/12/2013 06:29 AM, Herbert Xu wrote:

On Thu, Sep 12, 2013 at 07:20:23PM +0900, Tetsuo Handa wrote:

Herbert Xu wrote:

The trouble is not all distros will include the softdep modules in
the initramfs.  So for now I think we will have to live with a fallback.

I see.

Herbert Xu wrote:

OK, can you please try this patch on top of the current tree?

This way at least you'll have a working system until your initramfs
tool is fixed to do the right thing.

I tested the patch and confirmed that the boot failure was solved.

But arch/x86/crypto/crct10dif-pclmul.ko is not included into initramfs and
therefore we cannot benefit from PCLMULQDQ version.

That is expected and is also the status quo.  So once the initrd
generation tool is fixed to include softdeps it will work properly.

Thanks!


I would like to report that I also have the same boot problem on a
RHEL6.4 box with the crypto patch. My workaround is to force the kernel
build to have the crc_t10dif code built in by changing the config file:


4889c4889
< CONFIG_CRYPTO_CRCT10DIF=m
---
> CONFIG_CRYPTO_CRCT10DIF=y
5002c5002
< CONFIG_CRC_T10DIF=m
---
> CONFIG_CRC_T10DIF=y

This solved the boot problem without any additional patch.  Do you think
you should consider changing the configuration default to "y" instead of
"m", or not allowing the "m" option at all?


Thanks!


[PATCH 0/2 v2] dcache: get/release read lock in read_seqbegin_or_lock() & friend

2013-09-12 Thread Waiman Long
Change log
--
v1->v2:
  - Rename the new seqlock primitives to read_seqexcl_lock* and
read_seqexcl_unlock*.
  - Clarify in the commit log and comments about the exclusive nature
of the read lock.

Waiman Long (2):
  seqlock: Add a new locking reader type
  dcache: get/release read lock in read_seqbegin_or_lock() & friend

 fs/dcache.c |   31 +++--
 include/linux/seqlock.h |   68 +++---
 2 files changed, 79 insertions(+), 20 deletions(-)



[PATCH 2/2 v2] dcache: get/release read lock in read_seqbegin_or_lock() & friend

2013-09-12 Thread Waiman Long
This patch modifies read_seqbegin_or_lock() and need_seqretry() to
use newly introduced read_seqexcl_lock() and read_seqexcl_unlock()
primitives so that they won't change the sequence number even if
they fall back to take the lock.  This is OK as no change to the
protected data structure is being made. It will prevent one fallback
to lock taking from cascading into a series of lock taking reducing
performance because of the sequence number change. It will also allow
other sequence readers to go forward while an exclusive reader lock
is taken.

This patch also updates some of the inaccurate comments in the code.

Signed-off-by: Waiman Long 
---
 fs/dcache.c |   31 ---
 1 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 4d9df3c..9e88367 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -90,8 +90,8 @@ static struct kmem_cache *dentry_cache __read_mostly;
 
 /**
  * read_seqbegin_or_lock - begin a sequence number check or locking block
- * lock: sequence lock
- * seq : sequence number to be checked
+ * @lock: sequence lock
+ * @seq : sequence number to be checked
  *
  * First try it once optimistically without taking the lock. If that fails,
  * take the lock. The sequence number is also used as a marker for deciding
@@ -103,7 +103,7 @@ static inline void read_seqbegin_or_lock(seqlock_t *lock, 
int *seq)
if (!(*seq & 1))/* Even */
*seq = read_seqbegin(lock);
else/* Odd */
-   write_seqlock(lock);
+   read_seqexcl_lock(lock);
 }
 
 static inline int need_seqretry(seqlock_t *lock, int seq)
@@ -114,7 +114,7 @@ static inline int need_seqretry(seqlock_t *lock, int seq)
 static inline void done_seqretry(seqlock_t *lock, int seq)
 {
if (seq & 1)
-   write_sequnlock(lock);
+   read_seqexcl_unlock(lock);
 }
 
 /*
@@ -2673,9 +2673,9 @@ static int prepend(char **buffer, int *buflen, const char 
*str, int namelen)
 
 /**
  * prepend_name - prepend a pathname in front of current buffer pointer
- * buffer: buffer pointer
- * buflen: allocated length of the buffer
- * name:   name string and length qstr structure
+ * @buffer: buffer pointer
+ * @buflen: allocated length of the buffer
+ * @name:   name string and length qstr structure
  *
  * With RCU path tracing, it may race with d_move(). Use ACCESS_ONCE() to
  * make sure that either the old or the new name pointer and length are
@@ -2713,14 +2713,15 @@ static int prepend_name(char **buffer, int *buflen, 
struct qstr *name)
  * @buffer: pointer to the end of the buffer
  * @buflen: pointer to buffer length
  *
- * The function tries to write out the pathname without taking any lock other
- * than the RCU read lock to make sure that dentries won't go away. It only
- * checks the sequence number of the global rename_lock as any change in the
- * dentry's d_seq will be preceded by changes in the rename_lock sequence
- * number. If the sequence number had been change, it will restart the whole
- * pathname back-tracing sequence again. It performs a total of 3 trials of
- * lockless back-tracing sequences before falling back to take the
- * rename_lock.
+ * The function will first try to write out the pathname without taking any
+ * lock other than the RCU read lock to make sure that dentries won't go away.
+ * It only checks the sequence number of the global rename_lock as any change
+ * in the dentry's d_seq will be preceded by changes in the rename_lock
+ * sequence number. If the sequence number had been changed, it will restart
+ * the whole pathname back-tracing sequence again by taking the rename_lock.
+ * In this case, there is no need to take the RCU read lock as the recursive
+ * parent pointer references will keep the dentry chain alive as long as no
+ * rename operation is performed.
  */
 static int prepend_path(const struct path *path,
const struct path *root,
-- 
1.7.1
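
For reference, this is roughly how a caller is expected to use the
combined pattern after this change (a minimal sketch modeled on the
fs/dcache.c users; the function name walk_something() is made up here,
and the RCU/error handling details are simplified):

static void walk_something(seqlock_t *lock)
{
	int seq = 0;		/* even: start as a lockless sequence reader */

	rcu_read_lock();
restart:
	read_seqbegin_or_lock(lock, &seq);

	/* ... lockless walk of the protected data structure ... */

	if (need_seqretry(lock, seq)) {
		seq = 1;	/* odd: retry as an exclusive locking reader */
		goto restart;
	}
	done_seqretry(lock, seq);
	rcu_read_unlock();
}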



[PATCH 1/2 v2] seqlock: Add a new locking reader type

2013-09-12 Thread Waiman Long
The sequence lock (seqlock) was originally designed for cases where
the readers do not need to block the writers; instead, the readers
retry the read operation when the data changes.

Since then, the use cases have expanded to include situations where
a thread does not need to change the data at all (it is effectively a
reader) but has to take the writer lock because it can't tolerate
changes to the protected structure. Some examples are the d_path()
function and the getcwd() syscall in fs/dcache.c, where the functions
take the writer lock on rename_lock even though they don't need to
change anything in the protected data structure. This is inefficient,
as a reader pretending to be a writer now blocks other sequence
readers from moving forward.

This patch tries to eliminate this inefficiency by introducing a new
type of locking reader to the seqlock mechanism. This new locking
reader takes an exclusive lock, preventing writers and other locking
readers from going forward. However, it won't affect the progress of
sequence readers, as the sequence number isn't changed.

Signed-off-by: Waiman Long 
---
 include/linux/seqlock.h |   68 +++---
 1 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 1829905..9bd84b5 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -3,15 +3,21 @@
 /*
  * Reader/writer consistent mechanism without starving writers. This type of
  * lock for data where the reader wants a consistent set of information
- * and is willing to retry if the information changes.  Readers never
- * block but they may have to retry if a writer is in
- * progress. Writers do not wait for readers. 
+ * and is willing to retry if the information changes. There are two types
+ * of readers:
+ * 1. Sequence readers which never block a writer but they may have to retry
+ *if a writer is in progress by detecting change in sequence number.
+ *Writers do not wait for a sequence reader.
+ * 2. Locking readers which will wait if a writer or another locking reader
+ *is in progress. A locking reader in progress will also block a writer
+ *from going forward. Unlike the regular rwlock, the read lock here is
+ *exclusive so that only one locking reader can get it.
  *
- * This is not as cache friendly as brlock. Also, this will not work
+ * This is not as cache friendly as brlock. Also, this may not work well
  * for data that contains pointers, because any writer could
  * invalidate a pointer that a reader was following.
  *
- * Expected reader usage:
+ * Expected non-blocking reader usage:
  * do {
  * seq = read_seqbegin(&foo);
  * ...
@@ -268,4 +274,56 @@ write_sequnlock_irqrestore(seqlock_t *sl, unsigned long 
flags)
spin_unlock_irqrestore(&sl->lock, flags);
 }
 
+/*
+ * A locking reader exclusively locks out other writers and locking readers,
+ * but doesn't update the sequence number. Acts like a normal spin_lock/unlock.
+ * Don't need preempt_disable() because that is in the spin_lock already.
+ */
+static inline void read_seqexcl_lock(seqlock_t *sl)
+{
+   spin_lock(&sl->lock);
+}
+
+static inline void read_seqexcl_unlock(seqlock_t *sl)
+{
+   spin_unlock(&sl->lock);
+}
+
+static inline void read_seqexcl_lock_bh(seqlock_t *sl)
+{
+   spin_lock_bh(&sl->lock);
+}
+
+static inline void read_seqexcl_unlock_bh(seqlock_t *sl)
+{
+   spin_unlock_bh(&sl->lock);
+}
+
+static inline void read_seqexcl_lock_irq(seqlock_t *sl)
+{
+   spin_lock_irq(&sl->lock);
+}
+
+static inline void read_seqexcl_unlock_irq(seqlock_t *sl)
+{
+   spin_unlock_irq(&sl->lock);
+}
+
+static inline unsigned long __read_seqexcl_lock_irqsave(seqlock_t *sl)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(&sl->lock, flags);
+   return flags;
+}
+
+#define read_seqexcl_lock_irqsave(lock, flags) \
+   do { flags = __read_seqexcl_lock_irqsave(lock); } while (0)
+
+static inline void
+read_seqexcl_unlock_irqrestore(seqlock_t *sl, unsigned long flags)
+{
+   spin_unlock_irqrestore(&sl->lock, flags);
+}
+
 #endif /* __LINUX_SEQLOCK_H */
-- 
1.7.1
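
As a counterpart to the "Expected non-blocking reader usage" comment in
the header above, a minimal sketch of the new locking-reader usage (using
the read_seqexcl_* names from this version of the patch; they were
renamed on merge, see the follow-up mail):

	/* Locking reader: needs a stable view but makes no modification. */
	read_seqexcl_lock(&foo);
	/* ... walk the protected data; writers and other locking readers wait ... */
	read_seqexcl_unlock(&foo);

	/*
	 * Because the sequence count is not bumped, concurrent sequence
	 * readers using read_seqbegin()/read_seqretry() keep making
	 * progress and never retry on account of this locking reader.
	 */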



Re: [PATCH 0/2 v2] dcache: get/release read lock in read_seqbegin_or_lock() & friend

2013-09-12 Thread Waiman Long

On 09/12/2013 12:38 PM, Linus Torvalds wrote:

On Thu, Sep 12, 2013 at 7:55 AM, Waiman Long  wrote:

Change log
--
v1->v2:
   - Rename the new seqlock primitives to read_seqexcl_lock* and
 read_seqexcl_unlock*.

Applied. Except I peed in the snow and renamed the functions
again. That whole "seqexcl" looked too odd to me. It not only looks a
bit too much like random noise, but it makes it seem a whole different
lock from the "seqlock" thing.

I wanted to pattern the name after "write_seq[un]lock()", since it
most resembles that (not just in implementation, but in usage: the
traditional read-sequence isn't a lock, it's a begin/retry sequence,
so the usage pattern is totally different too, and the naming is
different).

I ended up picking "read_seq[un]lock_excl()". I could have gone with
"excl_" as a prefix too, I guess. Whatever. Now the "_excl" thing
looks a bit like the "_bh"/"_irqX" context modifier, and I think it
matches our normal lock naming pattern better.

 Linus


I think your new names are better than mine. I am not good at naming 
stuff. Thanks for the merge and the rename.


-Longman


Re: [PATCH 0/2 v2] dcache: get/release read lock in read_seqbegin_or_lock() & friend

2013-09-12 Thread Waiman Long

On 09/12/2013 01:30 PM, Linus Torvalds wrote:

On Thu, Sep 12, 2013 at 9:38 AM, Linus Torvalds
  wrote:

On Thu, Sep 12, 2013 at 7:55 AM, Waiman Long  wrote:

Change log
--
v1->v2:
   - Rename the new seqlock primitives to read_seqexcl_lock* and
 read_seqexcl_unlock*.

Applied.

Btw, when I tried to benchmark this, I failed miserably.

Why?


This patch is just a safety guard to prevent occasional bad performance 
caused by unlucky timing. It will not improve performance in most cases 
because the seqbegin/seqretry sequence usually succeeds without an actual retry.



If you do a threaded benchmark of "getcwd()", you end up spending all
your time in a spinlock anyway: get_fs_root_and_pwd() takes the
fs->lock to get the root/pwd.


I am aware that there is another spinlock bottleneck in the fs struct 
for getcwd().



Now, AIM7 probably uses processes, not threads, so you don't see this,
and maybe I shouldn't care. But looking at it, it annoys me
enormously, because the whole get_fs_root_and_pwd() is just stupid.


AIM7 doesn't make many getcwd() calls, so it is not a real bottleneck for 
the benchmark. The lockref patch boosts the short workload performance. 
The prepend_path patch was to fix incorrect perf record data, as 
perf makes heavy use of d_path(). The change made to getcwd() was just a 
side benefit, but getcwd() still has other spinlock bottlenecks.



Putting it all under the RCU lock and then changing it to use
get_fs_root_and_pwd_rcu() that just uses the fs->seq sequence
read-lock looks absolutely trivial.


Yes, I think we can do something similar for this. I will take a look to 
see how it can be fixed.
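
A rough sketch of what such a helper could look like (assuming the
fs->seq seqcount in struct fs_struct that Linus refers to; the function
name follows his suggestion and the details are illustrative, not a
tested implementation):

static void get_fs_root_and_pwd_rcu(struct fs_struct *fs,
				    struct path *root, struct path *pwd)
{
	unsigned seq;

	do {
		seq = read_seqcount_begin(&fs->seq);
		*root = fs->root;
		*pwd = fs->pwd;
	} while (read_seqcount_retry(&fs->seq, seq));
}

The caller would hold rcu_read_lock() instead of taking fs->lock, and
would have to use the copied paths without blocking unless it takes
references on them.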


-Longman


[PATCH] perf: Fix potential compilation error with some compilers

2013-10-08 Thread Waiman Long
The build of the perf tool failed on a SLES11 SP3 system with the
following compilation error:

cc1: warnings being treated as errors
util/scripting-engines/trace-event-perl.c: In function
‘perl_process_tracepoint’:
util/scripting-engines/trace-event-perl.c:285: error: format ‘%lu’
expects type ‘long unsigned int’, but argument 2 has type ‘__u64’

This patch replaces PRIu64, which expands to "lu" here, with an explicit
"llu" to fix the problem, as __u64 is of type "unsigned long long".

Signed-off-by: Waiman Long 
---
 .../perf/util/scripting-engines/trace-event-perl.c |6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/tools/perf/util/scripting-engines/trace-event-perl.c 
b/tools/perf/util/scripting-engines/trace-event-perl.c
index a85e4ae..d6eb9c5 100644
--- a/tools/perf/util/scripting-engines/trace-event-perl.c
+++ b/tools/perf/util/scripting-engines/trace-event-perl.c
@@ -281,8 +281,12 @@ static void perl_process_tracepoint(union perf_event 
*perf_event __maybe_unused,
return;
 
event = find_cache_event(evsel);
+   /*
+* attr.config is a __u64 which requires "%llu" to avoid compilation
+* error/warning with some compilers.
+*/
if (!event)
-   die("ug! no event found for type %" PRIu64, evsel->attr.config);
+   die("ug! no event found for type %llu", evsel->attr.config);
 
pid = raw_field_value(event, "common_pid", data);
 
-- 
1.7.1
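
A small standalone illustration of the format mismatch described above
(this is not code from the perf tree): on an LP64 system where uint64_t
is "unsigned long", PRIu64 expands to "lu", while __u64 is "unsigned
long long", so -Werror turns the -Wformat warning into a hard error.

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
	unsigned long long config = 42;	/* stand-in for a __u64 field */

	/*
	 * With PRIu64 == "lu", the following would trigger -Wformat
	 * (an error under -Werror):
	 *
	 *	printf("type %" PRIu64 "\n", config);
	 */
	printf("type %llu\n", config);	/* matches unsigned long long */
	return 0;
}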



Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

2013-08-29 Thread Waiman Long

On 08/29/2013 07:42 PM, Linus Torvalds wrote:
Waiman? Mind looking at this and testing? Linus 


Sure, I will try out the patch tomorrow morning and see how it works out 
for my test case.


Regards,
Longman


Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation

2013-08-29 Thread Waiman Long

On 08/29/2013 01:03 PM, Alexander Fyodorov wrote:

29.08.2013, 19:25, "Waiman Long":

What I have been thinking is to set a flag in an architecture specific
header file to tell if the architecture need a memory barrier. The
generic code will then either do a smp_mb() or barrier() depending on
the presence or absence of the flag. I would prefer to do more in the
generic code, if possible.

If you use flag then you'll have to check it manually. It is better to add new 
smp_mb variant, I suggest calling it smp_mb_before_store(), and define it to 
barrier() on x86.


I am sorry that I was not clear in my previous mail. I meant a flag/macro 
for compile-time checking rather than runtime checking.
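
A minimal sketch of the compile-time approach described above (the flag
and macro names here are hypothetical, purely for illustration):

/*
 * In an architecture header (e.g. the x86 qspinlock header): x86 does
 * not need a full barrier here, so the flag is simply left undefined.
 * A weakly ordered architecture would define it:
 *
 *	#define ARCH_QSPINLOCK_NEEDS_SMP_MB
 */

/* In the generic queue spinlock code: */
#ifdef ARCH_QSPINLOCK_NEEDS_SMP_MB
# define queue_mb_before_store()	smp_mb()
#else
# define queue_mb_before_store()	barrier()
#endif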


Regards,
Longman


Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

2013-08-30 Thread Waiman Long

On 08/29/2013 11:54 PM, Linus Torvalds wrote:

On Thu, Aug 29, 2013 at 8:12 PM, Waiman Long  wrote:

On 08/29/2013 07:42 PM, Linus Torvalds wrote:

Waiman? Mind looking at this and testing? Linus

Sure, I will try out the patch tomorrow morning and see how it works out for
my test case.

Ok, thanks, please use this slightly updated patch attached here.




I tested your patch on a 2-socket (12 cores, 24 threads) DL380 with 
2.9GHz Westmere-EX CPUs. The test results of your test program (with max 
threads increased to 24 to match the thread count) were:


with patch = 68M
w/o patch = 12M

So it was an almost 6X improvement, which I think is really good. A 
dual-socket machine, these days, shouldn't be considered a "BIG" 
machine; they are pretty common in many organizations.


I have reviewed the patch, and it looks good to me, with the exception 
that I added a cpu_relax() call at the end of the while loop in the 
CMPXCHG_LOOP macro.
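
For context, this is the shape of that loop with the cpu_relax() added
(a simplified sketch of the CMPXCHG_LOOP macro from the lockref patch
under discussion, not necessarily the exact code):

#define CMPXCHG_LOOP(CODE, SUCCESS) do {				\
	struct lockref old;						\
	BUILD_BUG_ON(sizeof(old) != 8);					\
	old.lock_count = ACCESS_ONCE(lockref->lock_count);		\
	while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \
		struct lockref new = old, prev = old;			\
		CODE							\
		old.lock_count = cmpxchg(&lockref->lock_count,		\
					 old.lock_count, new.lock_count); \
		if (likely(old.lock_count == prev.lock_count)) {	\
			SUCCESS;					\
		}							\
		cpu_relax();	/* back off before retrying the cmpxchg */ \
	}								\
} while (0)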


I also got the perf data of the test runs with and without the patch.

With patch:

 29.24%a.out  [kernel.kallsyms][k] lockref_get_or_lock
 19.65%a.out  [kernel.kallsyms][k] lockref_put_or_lock
 14.11%a.out  [kernel.kallsyms][k] dput
  5.37%a.out  [kernel.kallsyms][k] __d_lookup_rcu
  5.29%a.out  [kernel.kallsyms][k] lg_local_lock
  4.59%a.out  [kernel.kallsyms][k] d_rcu_to_refcount
:
  0.13%a.out  [kernel.kallsyms][k] complete_walk
:
  0.01%a.out  [kernel.kallsyms][k] _raw_spin_lock

Without patch:

 93.50%a.out  [kernel.kallsyms][k] _raw_spin_lock
  0.96%a.out  [kernel.kallsyms][k] dput
  0.80%a.out  [kernel.kallsyms][k] kmem_cache_free
  0.75%a.out  [kernel.kallsyms][k] lg_local_lock
  0.48%a.out  [kernel.kallsyms][k] complete_walk
  0.45%a.out  [kernel.kallsyms][k] __d_lookup_rcu

For the other test cases that I am interested in, like the AIM7 
benchmark, your patch may not be as good as my original one. I got 1-3M 
JPM (it varied quite a lot between runs) in the short workloads on an 
80-core system. My original patch got 6M JPM. However, that test was done 
on a 3.10-based kernel, so I need to run more tests to see whether the 
kernel version has an effect on the JPM results.


Anyway, I think this patch is good performance-wise. I remember that a 
while ago a lock contention problem in the dentry code, probably 
involving complete_walk(), was reported internally. This patch will 
certainly help in that case.


I will do more investigation to see how to make this patch work better 
for my test cases.


Thank you for taking the effort to optimize the complete_walk() and 
unlazy_walk() functions, which were not covered in my original patch. That 
will make the patch work even better under more circumstances. I really 
appreciate it.


Best regards,
Longman





Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

2013-08-30 Thread Waiman Long

On 08/30/2013 02:53 PM, Linus Torvalds wrote:
So the perf data would be *much* more interesting for a more varied 
load. I know pretty much exactly what happens with my silly 
test-program, and as you can see it never really gets to the actual 
spinlock, because that test program will only ever hit the fast-path 
case. It would be much more interesting to see another load that may 
trigger the d_lock actually being taken. So:

For the other test cases that I am interested in, like the AIM7 benchmark,
your patch may not be as good as my original one. I got 1-3M JPM (varied
quite a lot in different runs) in the short workloads on a 80-core system.
My original one got 6M JPM. However, the test was done on 3.10 based kernel.
So I need to do more test to see if that has an effect on the JPM results.

I'd really like to see a perf profile of that, particularly with some
call chain data for the relevant functions (ie "what it is that causes
us to get to spinlocks"). Because it may well be that you're hitting
some of the cases that I didn't see, and thus didn't notice.

In particular, I suspect AIM7 actually creates/deletes files and/or
renames them too. Or maybe I screwed up the dget_parent() special case
thing, which mattered because AIM7 did a lot of getcwd() calls or
someting odd like that.

 Linus


Below is the perf data of my short workloads run in an 80-core DL980:

13.60%    reaim  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave

 |--48.79%-- tty_ldisc_try
 |--48.58%-- tty_ldisc_deref
  --2.63%-- [...]

11.31%  swapper  [kernel.kallsyms][k] intel_idle
   |--99.94%-- cpuidle_enter_state
--0.06%-- [...]

 4.86%reaim  [kernel.kallsyms][k] lg_local_lock
 |--59.41%-- mntput_no_expire
 |--19.37%-- path_init
 |--15.14%-- d_path
 |--5.88%-- sys_getcwd
  --0.21%-- [...]

 3.00%reaim  reaim[.] mul_short

 2.41%reaim  reaim[.] mul_long
 |--87.21%-- 0xbc614e
  --12.79%-- (nil)

 2.29%reaim  reaim[.] mul_int

 2.20%reaim  [kernel.kallsyms][k] _raw_spin_lock
 |--12.81%-- prepend_path
 |--9.90%-- lockref_put_or_lock
 |--9.62%-- __rcu_process_callbacks
 |--8.77%-- load_balance
 |--6.40%-- lockref_get
 |--5.55%-- __mutex_lock_slowpath
 |--4.85%-- __mutex_unlock_slowpath
 |--4.83%-- inet_twsk_schedule
 |--4.27%-- lockref_get_or_lock
 |--2.19%-- task_rq_lock
 |--2.13%-- sem_lock
 |--2.09%-- scheduler_tick
 |--1.88%-- try_to_wake_up
 |--1.53%-- kmem_cache_free
 |--1.30%-- unix_create1
 |--1.22%-- unix_release_sock
 |--1.21%-- process_backlog
 |--1.11%-- unix_stream_sendmsg
 |--1.03%-- enqueue_to_backlog
 |--0.85%-- rcu_accelerate_cbs
 |--0.79%-- unix_dgram_sendmsg
 |--0.76%-- do_anonymous_page
 |--0.70%-- unix_stream_recvmsg
 |--0.69%-- unix_stream_connect
 |--0.64%-- net_rx_action
 |--0.61%-- tcp_v4_rcv
 |--0.59%-- __do_fault
 |--0.54%-- new_inode_pseudo
 |--0.52%-- __d_lookup
  --10.62%-- [...]

 1.19%reaim  [kernel.kallsyms][k] mspin_lock
 |--99.82%-- __mutex_lock_slowpath
  --0.18%-- [...]

 1.01%reaim  [kernel.kallsyms][k] lg_global_lock
 |--51.62%-- __shmdt
  --48.38%-- __shmctl

There is more contention in the lglock than I remember from the run on 
3.10. This is an area that I need to look at. In fact, lglock is 
becoming a problem for really large machines with a lot of cores. We have 
a prototype 16-socket machine with 240 cores under development. The cost 
of doing a lg_global_lock will be very high on that type of machine, 
given that it is already high on this 80-core machine. I have been 
thinking that, instead of per-cpu spinlocks, we could change the locking 
to the per-node level. While there will be more contention on 
lg_local_lock, the cost of doing a lg_global_lock will be much lower, and 
contention within the local die should not be too bad. That will require 
either a per-node variable infrastructure or something simulated with the 
existing per-cpu subsystem.
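
A very rough sketch of the per-node idea (all names are hypothetical,
and lockdep annotations as well as CPU/node hotplug handling are omitted
for brevity):

#include <linux/cache.h>
#include <linux/nodemask.h>
#include <linux/spinlock.h>
#include <linux/topology.h>

struct nglock {
	spinlock_t lock ____cacheline_aligned_in_smp;
};

static struct nglock node_locks[MAX_NUMNODES];

/* "Local" side: take the lock of the caller's NUMA node. */
static spinlock_t *ng_local_lock(void)
{
	spinlock_t *lock = &node_locks[numa_node_id()].lock;

	spin_lock(lock);
	return lock;			/* caller unlocks this same lock */
}

static void ng_local_unlock(spinlock_t *lock)
{
	spin_unlock(lock);
}

/* "Global" side: only nr_node_ids locks to take instead of one per CPU. */
static void ng_global_lock(void)
{
	int node;

	for (node = 0; node < nr_node_ids; node++)
		spin_lock(&node_locks[node].lock);
}

static void ng_global_unlock(void)
{
	int node;

	for (node = nr_node_ids - 1; node >= 0; node--)
		spin_unlock(&node_locks[node].lock);
}
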

Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

2013-08-30 Thread Waiman Long

On 08/30/2013 03:40 PM, Al Viro wrote:

On Fri, Aug 30, 2013 at 03:20:48PM -0400, Waiman Long wrote:


There are more contention in the lglock than I remember for the run
in 3.10. This is an area that I need to look at. In fact, lglock is
becoming a problem for really large machine with a lot of cores. We
have a prototype 16-socket machine with 240 cores under development.
The cost of doing a lg_global_lock will be very high in that type of
machine given that it is already high in this 80-core machine. I
have been thinking about instead of per-cpu spinlocks, we could
change the locking to per-node level. While there will be more
contention for lg_local_lock, the cost of doing a lg_global_lock
will be much lower and contention within the local die should not be
too bad. That will require either a per-node variable infrastructure
or simulated with the existing per-cpu subsystem.

Speaking of lglock, there's a low-hanging fruit in that area: we have
no reason whatsoever to put anything but regular files with FMODE_WRITE
on the damn per-superblock list - the *only* thing it's used for is
mark_files_ro(), which will skip everything except those.  And since
read opens normally outnumber the writes quite a bit...  Could you
try the diff below and see if it changes the picture?  files_lglock
situation ought to get better...




Sure. I will try that out, but it probably won't help too much in this 
test case. The perf profile that I sent out in my previous mail is only 
partial. The actual one for lg_global_lock was:


 1.01%reaim  [kernel.kallsyms][k] lg_global_lock
  |
  --- lg_global_lock
  mntput_no_expire
  mntput
  __fput
  fput
  task_work_run
  do_notify_resume
  int_signal
 |
 |--51.62%-- __shmdt
 |
  --48.38%-- __shmctl

So it is the mntput_no_expire() function that is doing all the 
lg_global_lock() calls.


Regards,
Longman


Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

2013-08-30 Thread Waiman Long

On 08/30/2013 03:33 PM, Linus Torvalds wrote:

On Fri, Aug 30, 2013 at 12:20 PM, Waiman Long  wrote:

Below is the perf data of my short workloads run in an 80-core DL980:

Ok, that doesn't look much like d_lock any more. Sure, there's a small
amount of spinlocking going on with lockref being involved, but on the
whole even that looks more like getcwd and other random things.


Yes, d_lock contention isn't a major item in the perf profile. However, 
sometimes a small change can lead to a noticeable improvement in 
performance.

I do agree that getcwd() can probably be hugely optimized. Nobody has
ever bothered, because it's never really performance-critical, and I
think AIM7 ends up just doing something really odd. I bet we could fix
it entirely if we cared enough.

The prepend_path() overhead isn't all due to getcwd(). The correct profile should be:


|--12.81%-- prepend_path
|          |
|          |--67.35%-- d_path
|          |          |
|          |          |--60.72%-- proc_pid_readlink
|          |          |           sys_readlinkat
|          |          |           sys_readlink
|          |          |           system_call_fastpath
|          |          |           __GI___readlink
|          |          |           0x302f64662f666c
|          |          |
|          |           --39.28%-- perf_event_mmap_event
|          |
|           --32.65%-- sys_getcwd
|                      system_call_fastpath
|                      __getcwd

Yes, the perf subsystem itself can contribute a sizeable portion of the 
spinlock contention. In fact, I have also applied my seqlock patch, sent 
a while ago, to the test kernel in order to get a more accurate perf 
profile. The seqlock patch allows concurrent d_path() calls without one 
blocking the others. On the 240-core prototype machine, it was not 
possible to get an accurate perf profile for some workloads because more 
than 50% of the time was spent in spinlock contention caused by the use 
of perf. An accurate perf profile can only be obtained in those cases by 
applying my lockref and seqlock patches. I hope someone will have the 
time to review my seqlock patch to see what additional changes will be 
needed. I would really like to see it merged in some form in 3.12.



I just wonder if it's even worth it (I assume AIM7 is something HP
uses internally, because I've never really heard of anybody else
caring)


Our performance group is actually pretty new. It was formed 2 years ago, 
and we began actively participating in Linux kernel development just 
in the past year.


We use the AIM7 benchmark internally primarily because it is easy to run 
and covers quite a lot of different areas in the kernel. We are also 
using specJBB and SwingBench for performance benchmarking, and we are 
looking for more benchmarks to use in the future.


Regards,
Longman


  1   2   3   4   5   6   7   8   9   10   >