Re: Fast sigblock (AKA rtld speedup)

2013-01-08 Thread Konstantin Belousov
On Tue, Jan 08, 2013 at 01:09:51PM +0800, David Xu wrote:
> On 2013/01/08 02:22, Konstantin Belousov wrote:
> > Below is the forward of the patch for which I failed to obtain a private
> > review. Might be, the list generates more responses.
> >
> > Our rtld has a performance bootleneck, typically exposed by the images
> > with the lot of the run-time relocation processing, and by the C++
> > exception handling. We block the signals delivery during the rtld
> > performing the relocations, as well as for the dl_iterate_phdr(3) (the
> > later is used for finding the dwarf unwinding tables).
> >
> > The signal blocking is needed to allow the rtld to resolve the symbols
> > for the signal handlers in the safe way, but also causes 2 syscalls
> > overhead per each rtld entry.
> >
> > The proposed approach allows to shave off those two syscalls, doubling
> > the FreeBSD performance for the (silly) throw/catch C++ microbenchmark.
> >
> > Date: Mon, 13 Aug 2012 15:26:00 +0300
> > From: Konstantin Belousov 
> >
> > ...
> >
> > The basic idea is to implement sigprocmask() as single write into usermode
> > address. If kernel needs to calculate the signal mask for current thread,
> > it takes into the consideration non-zero value of the word at the agreed
> > address. Also, usermode is informed about signals which were put on hold
> > due to fast sigblock active.
> >
> > As I said, on my measurements in microbenchmark that did throw/catch in
> > a loop, I see equal user and system time spent for unpatched system, and
> > same user time with zero system time on patched system.
> >
> > Patch can be improved further, e.g. it would be nice to allow rtld to fall
> > back to sigprocmask(2) if kernel does not support fast sigblock, to prevent
> > flag day. Also, the mask enforced by fast sigblock can be made configurable.
> >
> > Note that libthr already blocks signals by catching them, and not using rtld
> > service in the first line handler. I tried to make the change in the spirit
> > of libthr interceptors, but handoff to libthr appears too complicated to
> > work. In fact, libthr can be changed to start using fast sigblock instead
> > of wrapping sigaction, but this is out of scope of the proposal right now.
> >
> > Please comment.
> >
> > diff --git a/lib/libc/sys/Symbol.map b/lib/libc/sys/Symbol.map
> > index 6888ea0..3b75539 100644
> > --- a/lib/libc/sys/Symbol.map
> > +++ b/lib/libc/sys/Symbol.map
> > @@ -546,6 +546,7 @@ FBSDprivate_1.0 {
> > __sys_extattr_set_link;
> > _extattrctl;
> > __sys_extattrctl;
> > +   __sys_fast_sigblock;
> > _fchdir;
> > __sys_fchdir;
> > _fchflags;
> > diff --git a/libexec/rtld-elf/rtld_lock.c b/libexec/rtld-elf/rtld_lock.c
> > index d1563e5..50c52c6 100644
> > --- a/libexec/rtld-elf/rtld_lock.c
> > +++ b/libexec/rtld-elf/rtld_lock.c
> > @@ -43,6 +43,7 @@
> >*/
> >
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >   #include 
> > @@ -59,8 +60,8 @@ typedef struct Struct_Lock {
> > void *base;
> >   } Lock;
> >
> > -static sigset_t fullsigmask, oldsigmask;
> >   static int thread_flag;
> > +static uint32_t fsigblock;
> >
> >   static void *
> >   def_lock_create()
> > @@ -111,18 +112,26 @@ def_rlock_acquire(void *lock)
> >   }
> >
> >   static void
> > +sig_fastunblock(void)
> > +{
> > +   uint32_t oldval;
> > +
> > +   oldval = atomic_fetchadd_32(&fsigblock, -FAST_SIGBLOCK_INC);
> > +   if (oldval == (FAST_SIGBLOCK_PEND | FAST_SIGBLOCK_INC))
> > +   __sys_fast_sigblock(FAST_SIGBLOCK_UNBLOCK, NULL);
> > +}
> > +
> > +static void
> >   def_wlock_acquire(void *lock)
> >   {
> >   Lock *l = (Lock *)lock;
> > -sigset_t tmp_oldsigmask;
> >
> >   for ( ; ; ) {
> > -   sigprocmask(SIG_BLOCK, &fullsigmask, &tmp_oldsigmask);
> > +   atomic_add_rel_32(&fsigblock, FAST_SIGBLOCK_INC);
> > if (atomic_cmpset_acq_int(&l->lock, 0, WAFLAG))
> > break;
> > -   sigprocmask(SIG_SETMASK, &tmp_oldsigmask, NULL);
> > +   sig_fastunblock();
> >   }
> > -oldsigmask = tmp_oldsigmask;
> >   }
> >
> >   static void
> > @@ -134,7 +143,7 @@ def_lock_release(void *lock)
> > atomic_add_rel_int(&l->lock, -RC_INCR);
> >   else {
> > atomic_add_rel_int(&l->lock, -WAFLAG);
> > -   sigprocmask(SIG_SETMASK, &oldsigmask, NULL);
> > +   sig_fastunblock();
> >   }
> >   }
> >
> > @@ -286,19 +295,7 @@ lockdflt_init()
> >
> >   memcpy(&lockinfo, &deflockinfo, sizeof(lockinfo));
> >   _rtld_thread_init(NULL);
> > -/*
> > - * Construct a mask to block all signals except traps which might
> > - * conceivably be generated within the dynamic linker itself.
> > - */
> > -sigfillset(&fullsigmask);
> > -sigdelset(&fullsigmask, SIGILL);
> > -sigdelset(&fullsigmask, SIGTRAP);
> > -sigdelset(&fullsigmask, SIGABRT);
> > -sigdelset(&fullsigmask, SIGEMT);
> > -sigdelset(&fullsigmask, SIGFPE);
> > -sigdelset(&fullsigmask, SIGBUS);
> > -sigdelset(&fullsigmask, SIGSE

Re: Fast sigblock (AKA rtld speedup)

2013-01-08 Thread Konstantin Belousov
On Tue, Jan 08, 2013 at 01:21:48AM -0500, Alfred Perlstein wrote:
> On 1/7/13 7:08 PM, Konstantin Belousov wrote:
> > On Mon, Jan 07, 2013 at 03:42:01PM -0500, Alfred Perlstein wrote:
> >> On 1/7/13 1:22 PM, Konstantin Belousov wrote:
> >>> Below is the forward of the patch for which I failed to obtain a private
> >>> review. Might be, the list generates more responses.
> >> Very cool.
> >>
> >> Sorry for being rusty here, but is it safe to call fuword in the middle
> >> of issignal()?
> >>
> >> The reason I ask is because it looks like proc lock is required for this
> >> block, and I forget it fuword(9) can fault.
> >>
> >> If fuword(9) can fault then I don't think you can do this.
> >>
> >> You'll need to use something like fuswintr(9) or wire the page down
> >> using vslock(9) or something.
> >>
> >> I am probably missing something?
> > No, you are completely right. This is a serious bug. Thank you.
> >
> > I was careful to code the td_sigblock_val prefetch etc, but failed
> > to do the update properly. There is no reason to execute the update
> > in the issignal() at all, the setting of the pending bit can be postponed
> > to the moment after the postsig() loop was executed.
> >
> > I uploaded the updated but not yet tested patch at
> > http://people.freebsd.org/~kib/misc/rtld-sigblock.1.patch
> 
> This new patch looks like it may have issues with PSTOP.
> 
> there is a lot of code in issignal() that is missed when this code is in 
> place, I have not audited the effect of this, are you sure this is what 
> will work for the majority of cases?
What exactly you are concerned with ? Note that fastblock_mask manipulations
right before setting of TDP_FAST_SIGPENDING is to disallow the postponing
of the delivery of (potentially) synchronous signals caused by traps, as
well as SIGKILL/SIGSTOP.

> 
> Honestly, if I were coding it for correctness I would vslock the pages 
> (or otherwise wire them in) and let issignal() behave normally and not 
> early exit from it.
vslock() is racy, causes user address space fragmentation and was simply
impossibly to use correctly until very recent.


pgpnmAWeG6ebj.pgp
Description: PGP signature


Re: Fast sigblock (AKA rtld speedup)

2013-01-08 Thread Alfred Perlstein

On 1/8/13 9:56 AM, Konstantin Belousov wrote:

On Tue, Jan 08, 2013 at 01:09:51PM +0800, David Xu wrote:

On 2013/01/08 02:22, Konstantin Belousov wrote:

Below is the forward of the patch for which I failed to obtain a private
review. Might be, the list generates more responses.

Our rtld has a performance bootleneck, typically exposed by the images
with the lot of the run-time relocation processing, and by the C++
exception handling. We block the signals delivery during the rtld
performing the relocations, as well as for the dl_iterate_phdr(3) (the
later is used for finding the dwarf unwinding tables).

The signal blocking is needed to allow the rtld to resolve the symbols
for the signal handlers in the safe way, but also causes 2 syscalls
overhead per each rtld entry.

The proposed approach allows to shave off those two syscalls, doubling
the FreeBSD performance for the (silly) throw/catch C++ microbenchmark.

Date: Mon, 13 Aug 2012 15:26:00 +0300
From: Konstantin Belousov 

...

The basic idea is to implement sigprocmask() as single write into usermode
address. If kernel needs to calculate the signal mask for current thread,
it takes into the consideration non-zero value of the word at the agreed
address. Also, usermode is informed about signals which were put on hold
due to fast sigblock active.

As I said, on my measurements in microbenchmark that did throw/catch in
a loop, I see equal user and system time spent for unpatched system, and
same user time with zero system time on patched system.

Patch can be improved further, e.g. it would be nice to allow rtld to fall
back to sigprocmask(2) if kernel does not support fast sigblock, to prevent
flag day. Also, the mask enforced by fast sigblock can be made configurable.

Note that libthr already blocks signals by catching them, and not using rtld
service in the first line handler. I tried to make the change in the spirit
of libthr interceptors, but handoff to libthr appears too complicated to
work. In fact, libthr can be changed to start using fast sigblock instead
of wrapping sigaction, but this is out of scope of the proposal right now.

Please comment.

diff --git a/lib/libc/sys/Symbol.map b/lib/libc/sys/Symbol.map
index 6888ea0..3b75539 100644
--- a/lib/libc/sys/Symbol.map
+++ b/lib/libc/sys/Symbol.map
@@ -546,6 +546,7 @@ FBSDprivate_1.0 {
__sys_extattr_set_link;
_extattrctl;
__sys_extattrctl;
+   __sys_fast_sigblock;
_fchdir;
__sys_fchdir;
_fchflags;
diff --git a/libexec/rtld-elf/rtld_lock.c b/libexec/rtld-elf/rtld_lock.c
index d1563e5..50c52c6 100644
--- a/libexec/rtld-elf/rtld_lock.c
+++ b/libexec/rtld-elf/rtld_lock.c
@@ -43,6 +43,7 @@
*/

   #include 
+#include 
   #include 
   #include 
   #include 
@@ -59,8 +60,8 @@ typedef struct Struct_Lock {
void *base;
   } Lock;

-static sigset_t fullsigmask, oldsigmask;
   static int thread_flag;
+static uint32_t fsigblock;

   static void *
   def_lock_create()
@@ -111,18 +112,26 @@ def_rlock_acquire(void *lock)
   }

   static void
+sig_fastunblock(void)
+{
+   uint32_t oldval;
+
+   oldval = atomic_fetchadd_32(&fsigblock, -FAST_SIGBLOCK_INC);
+   if (oldval == (FAST_SIGBLOCK_PEND | FAST_SIGBLOCK_INC))
+   __sys_fast_sigblock(FAST_SIGBLOCK_UNBLOCK, NULL);
+}
+
+static void
   def_wlock_acquire(void *lock)
   {
   Lock *l = (Lock *)lock;
-sigset_t tmp_oldsigmask;

   for ( ; ; ) {
-   sigprocmask(SIG_BLOCK, &fullsigmask, &tmp_oldsigmask);
+   atomic_add_rel_32(&fsigblock, FAST_SIGBLOCK_INC);
if (atomic_cmpset_acq_int(&l->lock, 0, WAFLAG))
break;
-   sigprocmask(SIG_SETMASK, &tmp_oldsigmask, NULL);
+   sig_fastunblock();
   }
-oldsigmask = tmp_oldsigmask;
   }

   static void
@@ -134,7 +143,7 @@ def_lock_release(void *lock)
atomic_add_rel_int(&l->lock, -RC_INCR);
   else {
atomic_add_rel_int(&l->lock, -WAFLAG);
-   sigprocmask(SIG_SETMASK, &oldsigmask, NULL);
+   sig_fastunblock();
   }
   }

@@ -286,19 +295,7 @@ lockdflt_init()

   memcpy(&lockinfo, &deflockinfo, sizeof(lockinfo));
   _rtld_thread_init(NULL);
-/*
- * Construct a mask to block all signals except traps which might
- * conceivably be generated within the dynamic linker itself.
- */
-sigfillset(&fullsigmask);
-sigdelset(&fullsigmask, SIGILL);
-sigdelset(&fullsigmask, SIGTRAP);
-sigdelset(&fullsigmask, SIGABRT);
-sigdelset(&fullsigmask, SIGEMT);
-sigdelset(&fullsigmask, SIGFPE);
-sigdelset(&fullsigmask, SIGBUS);
-sigdelset(&fullsigmask, SIGSEGV);
-sigdelset(&fullsigmask, SIGSYS);
+__sys_fast_sigblock(FAST_SIGBLOCK_SETPTR, &fsigblock);
   }

   /*
@@ -319,7 +316,10 @@ _rtld_thread_init(struct RtldLockInfo *pli)

if (pli == NULL)
pli = &deflockinfo;
-
+   else {
+   fsigblock = 0;
+   __sys_fast_sigblock(FAST_SIGBLOCK_UNSETPTR, NULL);
+   }

fo

Re: Fast sigblock (AKA rtld speedup)

2013-01-08 Thread Konstantin Belousov
On Tue, Jan 08, 2013 at 10:02:22AM -0500, Alfred Perlstein wrote:
> I think it would be great if it works, I just haven't had time to fully 
> understand your more recent patch.  If you feel it's safe and maybe do a 
> few test programs to see what happens, that would be best.

I revived the test programs I used during the development of the patch, see
http://people.freebsd.org/~kib/misc/fast_sigblock.c
http://people.freebsd.org/~kib/misc/fast_sigblock2.c

Also, I adapted the code to the specific case you asked about, in
http://people.freebsd.org/~kib/misc/fast_sigblock3.c
When run from the shell, it reports 'Killed'.


pgp4Dcuv38C5e.pgp
Description: PGP signature