On 11/06/15 01:55, Mateusz Guzik wrote:
On Thu, Nov 05, 2015 at 04:35:22PM -0700, Ian Lepore wrote:
On Thu, 2015-11-05 at 14:19 -0800, John Baldwin wrote:
On Thursday, November 05, 2015 01:45:19 PM Adrian Chadd wrote:
On 5 November 2015 at 11:26, Mateusz Guzik <mjgu...@gmail.com>
wrote:
On Thu, Nov 05, 2015 at 11:04:13AM -0800, John Baldwin wrote:
On Thursday, November 05, 2015 04:26:28 PM Konstantin Belousov
wrote:
On Thu, Nov 05, 2015 at 12:32:18AM +0100, Mateusz Guzik
wrote:
mtx_lock will unconditionally try to grab the lock and if that fails,
will call __mtx_lock_sleep which will immediately try to do the same
atomic op again.

So, the obvious microoptimization is to check the state in __mtx_lock_sleep
and avoid the operation if the lock is not free.

This gives me ~40% speedup in a microbenchmark of 40 find processes
traversing tmpfs and contending on mount mtx (only used as an easy
benchmark, I have WIP patches to get rid of it).

The second part of the patch is optional and just checks the state of
the lock prior to doing any atomic operations, but it gives a very
modest speedup when applied on top of the __mtx_lock_sleep change. As
such, I'm not going to defend this part.
Shouldn't the same consideration be applied to all spinning loops,
i.e. also to the spin/thread mutexes, and to the spinning parts of sx
and lockmgr?

I agree.  I think both changes are good and worth doing in our other
primitives.


I glanced over e.g. rw_rlock and it did not have the issue, but now
that I see _sx_xlock_hard, it could indeed use fixing.

Expect a patch in a few hours for all the primitives I find. I'll
stress test the kernel, but it is unlikely I'll do microbenchmarks for
the remaining primitives.

Is this stuff you're proposing still valid for non-x86 platforms?

Yes.  It just does a read before trying the atomic compare-and-swap,
and falls through to the hard case, as if the atomic op had failed,
whenever the value read would make the compare fail.
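
A minimal sketch of the shape described here, using C11 atomics as a
stand-in for the kernel's atomic_cmpset primitives; the names and the
use of 0 as the "unowned" value are illustrative only, not the actual
mutex(9) interfaces:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Fast path only: returns true if the lock was taken uncontended.  The
 * CAS is attempted only when a plain read says it can succeed; on false
 * the caller falls back to its contended path, exactly as if the CAS
 * itself had failed.
 */
static inline bool
lock_try_fast(_Atomic uintptr_t *lockp, uintptr_t tid)
{
        uintptr_t unowned = 0;          /* stand-in for MTX_UNOWNED */

        return (atomic_load_explicit(lockp, memory_order_relaxed) == 0 &&
            atomic_compare_exchange_strong_explicit(lockp, &unowned, tid,
            memory_order_acquire, memory_order_relaxed));
}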


The atomic ops include barriers; the new pre-read of the variable
doesn't.  Will that cause problems, especially for code inside a loop
where the compiler may decide to shuffle things around?


The lock value is volatile, so the compiler should not screw us up. I
removed the store to the variable due to a different concern: in the
worst case we would have slept instead of spinning (see below).

I suspect the performance gain will be biggest on the platforms where
atomic ops are expensive (I gather from various code comments that's
the case on x86).


I don't know about other architectures, but on x86 atomic op
performance on contended cache lines deteriorates drastically.
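
For illustration, this is the classic test-and-test-and-set structure
the patch moves toward: spin on plain loads, which can be satisfied
from a shared cache line, and only issue the CAS (which forces
exclusive ownership of the line) once the lock has been observed free.
A generic sketch in C11 atomics, not the kernel code:

#include <stdatomic.h>
#include <stdint.h>

/*
 * Generic test-and-test-and-set acquire.  The relaxed loads keep the
 * cache line shared among the waiters; the CAS is attempted only when
 * the lock was seen free, so the line ping-pongs far less under
 * contention.
 */
static void
tatas_lock(_Atomic uintptr_t *lockp, uintptr_t tid)
{
        uintptr_t expected;

        for (;;) {
                if (atomic_load_explicit(lockp, memory_order_relaxed) == 0) {
                        expected = 0;
                        if (atomic_compare_exchange_weak_explicit(lockp,
                            &expected, tid, memory_order_acquire,
                            memory_order_relaxed))
                                return;
                }
                /* Lock held or race lost: keep reading instead of CASing. */
        }
}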

=======================================================

For the most part the patch below makes two simple kinds of changes:
1. in macros for lock/unlock primitives, the lock value is fetched and
   compared against the relevant _UNLOCKED macro prior to the atomic op
2. in functions:

before:
while (!_mtx_obtain_lock(m, tid)) {
        v = m->mtx_lock;
        if (v != MTX_UNOWNED) {
                ....
        }
        .....
}

after:
for (;;) {
        if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
                break;
        v = m->mtx_lock;
        if (v != MTX_UNOWNED) {
                ....
        }
        .....
}

The original patch preloaded the 'v' variable. If the value was
MTX_UNOWNED and _mtx_obtain_lock failed, we just lost the race.  In
order to spin waiting for the other thread to release the lock, we have
to re-read the variable. Thus, for simplicity, it is no longer stored
in 'v'. Note this is contrary to kib's earlier suggestion, which would
put such threads directly to sleep. I don't have proper measurements to
form any strong opinion here, so I went the aforementioned route to
minimise changes.
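
To make the race concrete, a hedged sketch of the committed loop shape,
again with illustrative C11 names and adaptive spinning reduced to a
stub: the lock word is re-read after a failed attempt, so a value
preloaded before the CAS cannot go stale and make the thread block
instead of spinning on the new owner.

#include <stdatomic.h>
#include <stdint.h>

/* Placeholder for spinning while the owner is running; details elided. */
static void
spin_on_owner(uintptr_t owner)
{
        (void)owner;
}

static void
lock_hard(_Atomic uintptr_t *lockp, uintptr_t tid)
{
        uintptr_t v, expected;

        for (;;) {
                if (atomic_load_explicit(lockp, memory_order_relaxed) == 0) {
                        expected = 0;
                        if (atomic_compare_exchange_strong_explicit(lockp,
                            &expected, tid, memory_order_acquire,
                            memory_order_relaxed))
                                break;
                }
                /* Fresh read: after a lost race this sees the real owner. */
                v = atomic_load_explicit(lockp, memory_order_relaxed);
                if (v != 0)
                        spin_on_owner(v);
        }
}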

Note this is a trivial patch to improve the situation a little bit; I
have no interest in trying to polish these primitives at this time.

For overall results, on a machine with 32GB of RAM and the following:
CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz (2800.06-MHz K8-class CPU)
FreeBSD/SMP: 2 package(s) x 10 core(s) x 2 SMT threads

this reduced make -j 40 kernel in a 10.2 jail from ~2:15 to ~2:05.

=======================================================

diff --git a/sys/kern/kern_lock.c b/sys/kern/kern_lock.c
index aa67180..198e93e 100644
--- a/sys/kern/kern_lock.c
+++ b/sys/kern/kern_lock.c
@@ -787,8 +787,10 @@ __lockmgr_args(struct lock *lk, u_int flags, struct lock_object *ilk,
                        break;
                }

-               while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED,
-                   tid)) {
+               for (;;) {
+                       if (lk->lk_lock == LK_UNLOCKED &&
+                           atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid))
+                               break;
  #ifdef HWPMC_HOOKS
                        PMC_SOFT_CALL( , , lock, failed);
  #endif
@@ -1124,7 +1126,11 @@ __lockmgr_args(struct lock *lk, u_int flags, struct lock_object *ilk,
                            __func__, iwmesg, file, line);
                }

-               while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid)) {
+               for (;;) {
+                       if (lk->lk_lock == LK_UNLOCKED &&
+                           atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid))
+                               break;
+
  #ifdef HWPMC_HOOKS
                        PMC_SOFT_CALL( , , lock, failed);
  #endif
diff --git a/sys/kern/kern_mutex.c b/sys/kern/kern_mutex.c
index bec8f6b..51e82a3 100644
--- a/sys/kern/kern_mutex.c
+++ b/sys/kern/kern_mutex.c
@@ -419,7 +419,9 @@ __mtx_lock_sleep(volatile uintptr_t *c, uintptr_t tid, int opts,
        all_time -= lockstat_nsecs(&m->lock_object);
  #endif

-       while (!_mtx_obtain_lock(m, tid)) {
+       for (;;) {
+               if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+                       break;
  #ifdef KDTRACE_HOOKS
                spin_cnt++;
  #endif
@@ -602,8 +604,9 @@ _mtx_lock_spin_cookie(volatile uintptr_t *c, uintptr_t tid, int opts,
  #ifdef KDTRACE_HOOKS
        spin_time -= lockstat_nsecs(&m->lock_object);
  #endif
-       while (!_mtx_obtain_lock(m, tid)) {
-
+       for (;;) {
+               if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+                       break;
                /* Give interrupts a chance while we spin. */
                spinlock_exit();
                while (m->mtx_lock != MTX_UNOWNED) {
@@ -675,7 +678,9 @@ retry:
                            m->lock_object.lo_name, file, line));
                WITNESS_CHECKORDER(&m->lock_object,
                    opts | LOP_NEWORDER | LOP_EXCLUSIVE, file, line, NULL);
-               while (!_mtx_obtain_lock(m, tid)) {
+               for (;;) {
+                       if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+                               break;
                        if (m->mtx_lock == tid) {
                                m->mtx_recurse++;
                                break;
diff --git a/sys/kern/kern_rwlock.c b/sys/kern/kern_rwlock.c
index 6541724..6a904d2 100644
--- a/sys/kern/kern_rwlock.c
+++ b/sys/kern/kern_rwlock.c
@@ -771,7 +771,9 @@ __rw_wlock_hard(volatile uintptr_t *c, uintptr_t tid, const char *file,
        all_time -= lockstat_nsecs(&rw->lock_object);
        state = rw->rw_lock;
  #endif
-       while (!_rw_write_lock(rw, tid)) {
+       for (;;) {
+               if (rw->rw_lock == RW_UNLOCKED && _rw_write_lock(rw, tid))
+                       break;
  #ifdef KDTRACE_HOOKS
                spin_cnt++;
  #endif
diff --git a/sys/kern/kern_sx.c b/sys/kern/kern_sx.c
index 96e117b..2a81c04 100644
--- a/sys/kern/kern_sx.c
+++ b/sys/kern/kern_sx.c
@@ -544,7 +544,10 @@ _sx_xlock_hard(struct sx *sx, uintptr_t tid, int opts, const char *file,
        all_time -= lockstat_nsecs(&sx->lock_object);
        state = sx->sx_lock;
  #endif
-       while (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid)) {
+       for (;;) {
+               if (sx->sx_lock == SX_LOCK_UNLOCKED &&
+                   atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
+                       break;
  #ifdef KDTRACE_HOOKS
                spin_cnt++;
  #endif
diff --git a/sys/sys/mutex.h b/sys/sys/mutex.h
index a9ec072..0443922 100644
--- a/sys/sys/mutex.h
+++ b/sys/sys/mutex.h
@@ -185,7 +185,7 @@ void        thread_lock_flags_(struct thread *, int, const char *, int);
  #define __mtx_lock(mp, tid, opts, file, line) do {                    \
        uintptr_t _tid = (uintptr_t)(tid);                              \
                                                                        \
-       if (!_mtx_obtain_lock((mp), _tid))                              \
+       if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid)))\
                _mtx_lock_sleep((mp), _tid, (opts), (file), (line));    \
        else                                                            \
                LOCKSTAT_PROFILE_OBTAIN_LOCK_SUCCESS(adaptive__acquire, \
@@ -203,7 +203,7 @@ void        thread_lock_flags_(struct thread *, int, const char *, int);
        uintptr_t _tid = (uintptr_t)(tid);                              \
                                                                        \
        spinlock_enter();                                               \
-       if (!_mtx_obtain_lock((mp), _tid)) {                            \
+       if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid))) {\
                if ((mp)->mtx_lock == _tid)                          \
                        (mp)->mtx_recurse++;                         \
                else                                                    \
@@ -232,7 +232,7 @@ void        thread_lock_flags_(struct thread *, int, const char *, int);
                                                                        \
        if ((mp)->mtx_recurse == 0)                                  \
                LOCKSTAT_PROFILE_RELEASE_LOCK(adaptive__release, mp);   \
-       if (!_mtx_release_lock((mp), _tid))                             \
+       if ((mp)->mtx_lock != _tid || !_mtx_release_lock((mp), _tid))        \
                _mtx_unlock_sleep((mp), (opts), (file), (line));        \
  } while (0)

diff --git a/sys/sys/rwlock.h b/sys/sys/rwlock.h
index f8947c5..6b4f505 100644
--- a/sys/sys/rwlock.h
+++ b/sys/sys/rwlock.h
@@ -96,7 +96,7 @@
  #define       __rw_wlock(rw, tid, file, line) do {                            \
        uintptr_t _tid = (uintptr_t)(tid);                              \
                                                                        \
-       if (!_rw_write_lock((rw), _tid))                                \
+       if ((rw)->rw_lock != RW_UNLOCKED || !_rw_write_lock((rw), _tid))\
                _rw_wlock_hard((rw), _tid, (file), (line));             \
        else                                                            \
                LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(rw__acquire, rw, \
@@ -112,7 +112,7 @@
        else {                                                          \
                LOCKSTAT_PROFILE_RELEASE_RWLOCK(rw__release, rw,        \
                    LOCKSTAT_WRITER);                                   \
-               if (!_rw_write_unlock((rw), _tid))                      \
+               if ((rw)->rw_lock != _tid || !_rw_write_unlock((rw), _tid))\
                        _rw_wunlock_hard((rw), _tid, (file), (line));   \
        }                                                               \
  } while (0)
diff --git a/sys/sys/sx.h b/sys/sys/sx.h
index 96a664f..144cab5 100644
--- a/sys/sys/sx.h
+++ b/sys/sys/sx.h
@@ -150,7 +150,8 @@ __sx_xlock(struct sx *sx, struct thread *td, int opts, const char *file,
        uintptr_t tid = (uintptr_t)td;
        int error = 0;

-       if (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
+       if (sx->sx_lock != SX_LOCK_UNLOCKED ||
+           !atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
                error = _sx_xlock_hard(sx, tid, opts, file, line);
        else
                LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(sx__acquire, sx,
@@ -168,7 +169,8 @@ __sx_xunlock(struct sx *sx, struct thread *td, const char *file, int line)
        if (sx->sx_recurse == 0)
                LOCKSTAT_PROFILE_RELEASE_RWLOCK(sx__release, sx,
                    LOCKSTAT_WRITER);
-       if (!atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
+       if (sx->sx_lock != tid ||
+           !atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
                _sx_xunlock_hard(sx, tid, file, line);
  }


Just gave this patch a spin on my RPI2 and it seems to work fine.

--HPS
