On Tue, Jul 28, 2020 at 10:02:07AM +0300, Paul Irofti wrote:
> 
> [...]
> 
> Is the issue with LFENCE slowing down the network stack settled? That was
> the argument against it last time.

... a month passes.  Nobody says anything.

This "it might slow down the network stack" thing keeps coming up, and
yet nobody can point to (a) who expressed this concern or (b) what the
penalty is in practice.

Note that the alternative is "your timecounter might not be monotonic
between threads".  For me, that's already a dealbreaker.

But for sake of discussion let's look at some data.  For those of you
watching from home, please follow along!  I would like to know what
your results look like.

To start, here is a microbenchmarking program for clock_gettime(2) on
amd64.  If you have the userspace timecounter, then

        clock_gettime(CLOCK_MONOTONIC, ...);

is a suitable surrogate for nanouptime(9), so this microbenchmark can
tell us how nanouptime(9) or nanotime(9) would be impacted by a
comparable change to the kernel timecounter.

--

/*
 * clock_gettime-bench.c
 */
#include <err.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static uint64_t
rdtsc_lfence(void)
{
        uint32_t hi, lo;

        __asm volatile("lfence; rdtsc; lfence" : "=d" (hi), "=a" (lo));
        return ((uint64_t)hi << 32) | lo;
}

int
main(int argc, char *argv[])
{
        struct timespec now;
        uint64_t begin, end;
        long long count, i;
        const char *errstr;

        if (argc != 2) {
                fprintf(stderr, "usage: %s count\n", getprogname());
                return 1;
        }
        count = strtonum(argv[1], 1, LLONG_MAX, &errstr);
        if (errstr != NULL)
                errx(1, "count is %s: %s", errstr, argv[1]);

        begin = rdtsc_lfence();
        for (i = 0; i < count; i++)
                clock_gettime(CLOCK_MONOTONIC, &now);
        end = rdtsc_lfence();

        printf("%lld\t%llu\n", count, end - begin);

        return 0;
}

--

Now consider a benchmark of 100K clock_gettime(2) calls against the
userspace timecounter.

$ clock_gettime-bench 100000
100000  15703664
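That second number is the total cycle count for the whole loop.  To put
it in per-call terms, divide column 2 by column 1; for the sample above
(my arithmetic, not from the benchmark program itself):

```shell
# per-call cost in cycles = total cycles / number of calls
echo '100000 15703664' | awk '{printf "%.1f\n", $2 / $1}'
```

which prints 157.0, i.e. about 157 cycles per clock_gettime(2) call on
the unpatched libc.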

Let's collect 10K of these benchmarks -- our samples -- atop an
unpatched libc.  Use the shell script below.  Note that we throw out
samples where we hit a context switch.

--

#! /bin/sh

[ $# -ne 1 ] && exit 1
RESULTS=$1
shift

TIME=$(mktemp) || exit 1
TMP=$(mktemp) || exit 1

# Collect 10K samples.
i=0
while [ $i -lt 10000 ]; do
        # Call clock_gettime(2) 100K times.
        /usr/bin/time -l ~/scratch/clock_gettime-bench 100000 > $TMP 2> $TIME
        # Ignore this sample if a context switch occurred.
        if egrep -q '[1-9][0-9]* +(in)?voluntary context' $TIME; then
                continue
        fi
        cat $TMP >> $RESULTS
        i=$((i + 1))
done

rm $TMP $TIME

--

Run it like this:

$ ksh bench.sh unpatched.out

That will take ~5-10 minutes at most.

Next, we'll patch libc to add the LFENCE to the userspace timecounter.

Index: usertc.c
===================================================================
RCS file: /cvs/src/lib/libc/arch/amd64/gen/usertc.c,v
retrieving revision 1.2
diff -u -p -r1.2 usertc.c
--- usertc.c    8 Jul 2020 09:17:48 -0000       1.2
+++ usertc.c    22 Aug 2020 22:18:47 -0000
@@ -19,10 +19,10 @@
 #include <sys/timetc.h>
 
 static inline u_int
-rdtsc(void)
+rdtsc_lfence(void)
 {
        uint32_t hi, lo;
-       asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
+       asm volatile("lfence; rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)lo)|(((uint64_t)hi)<<32);
 }
 
@@ -31,7 +31,7 @@ tc_get_timecount(struct timekeep *tk, u_
 {
        switch (tk->tk_user) {
        case TC_TSC:
-               *tc = rdtsc();
+               *tc = rdtsc_lfence();
                return 0;
        }
 
--

Recompile and reinstall libc.

Then rerun the benchmark.  Be careful not to overwrite our results
from the unpatched libc:

$ ksh bench.sh patched.out

--

Alright, now let's compare the results.  I'm not a mathemagician so I
use ministat and trust it implicitly.  A stat jock could probably do
this in R or with some Python, but I am not that clever, so ministat
it is.

There is no ministat port for OpenBSD, but it is pretty trivial to
clone this github repo and build it on -current:

https://github.com/thorduri/ministat

--

Okay, you have ministat?

Let's compare the results.  We want the 2nd column of the benchmark
output, the cycle counts (-C2).  I'm not interested in the graph (-q),
given our population size.  We have N=10000, so let's push the
confidence interval up (-c99.5).

$ ~/repo/ministat/ministat -C2 -q -c99.5 unpatched.out patched.out
x unpatched.out
+ patched.out
    N           Min           Max        Median           Avg        Stddev
x 10000      13752102      18019218      14442398      14431918     237842.31
+ 10000      15196970      16992030      15721390      15779178     181623.5
Difference at 99.5% confidence
        1.34726e+06 +/- 9247.11
        9.33528% +/- 0.064074%
        (Student's t, pooled s = 211608)

So, in conclusion, prefixing RDTSC with LFENCE incurs a penalty.  It
is unambiguously slower.

What is the penalty?  About 9.3% here, which works out to roughly
13-14 cycles per clock_gettime(2) call on my machine.
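You can recover that per-call figure directly from the ministat
averages, since each sample is 100K calls (again my arithmetic, using
the two Avg values from the output above):

```shell
# extra cycles per call = (patched avg - unpatched avg) / calls per sample
echo '14431918 15779178' | awk '{printf "%.1f\n", ($2 - $1) / 100000}'
```

which prints 13.5.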

Something else worth noting is that with the LFENCE in place the
stddev is smaller.  So, reading the clock, though slower, takes a more
*consistent* amount of time with the LFENCE in place.

--

I don't think this penalty is so bad.

Thoughts?
