On Fri, 6 Jul 2018, Rodney W. Grimes wrote:
On Fri, Jul 6, 2018, 12:27 PM Rodney W. Grimes <free...@pdx.rh.cn85.dnsmgr.net> wrote:
On Fri, Jul 6, 2018 at 9:52 AM, Rodney W. Grimes <free...@pdx.rh.cn85.dnsmgr.net> wrote:
On Fri, Jul 6, 2018 at 9:32 AM, Rodney W. Grimes <free...@pdx.rh.cn85.dnsmgr.net> wrote:
Author: hselasky
Date: Fri Jul 6 10:13:42 2018
New Revision: 336025
URL: https://svnweb.freebsd.org/changeset/base/336025
Log:
  Make sure kernel modules built by default are portable between UP
  and SMP systems by extending defined(SMP) to include
  defined(KLD_MODULE).

  This is a regression issue after r335873.

  Discussed with: mmacy@
  Sponsored by:   Mellanox Technologies
Though this fixes the issue, it also means that now when
anyone intentionally builds a UP kernel with modules
they are getting SMP support in the modules and I am
not sure they would want that. I know I don't.
On UP systems, these additional opcodes are harmless. They take a few extra
cycles (since they lock an uncontested bus) and add a couple extra memory
barriers (which will be NOPs). On MP systems, atomics now work by default.
Had we not defaulted like this, all modules built outside of a kernel build
env would have broken atomics. Given that (a) the overwhelming majority of
systems (99% or more) are SMP and (b) the MP code merely adds a few cycles
to what's already a not-too-expensive operation, this was the right choice.
It simply doesn't matter for systems that are relevant to the project
today. While one could try to optimize this a little (for example, by
having SMP defined to be 0 or 1, say, and changing all the #ifdef SMP to
#if (defined(SMP) && SMP != 0)), it's likely not going to matter enough
for anybody to make the effort. UP on x86 is simply not relevant enough
to optimize for. Even in VMs, people typically run SMP kernels even when
they allocate just one CPU to the VM.
So while we still support the UP config, and we'll let people build
optimized kernels for x86, we've flipped the switch from modules
pessimized for SMP to modules pessimized for UP, which seems like quite
the reasonable trade-off.

Were it practical to do so, I'd suggest de-orbiting UP on x86. However,
it's a lot of work for not much benefit and we'd need to invent a lot of
craziness to get there.
Trivial to fix this with:
+#if defined(SMP) || !defined(_KERNEL) || defined(KLD_MODULE) || !defined(KLD_UP_MODULES)
Nope. Not so trivial. Who defines KLD_UP_MODULES?
Call it SMP_KLD_MODULES, and it gets defined the same place SMP does.
Not so simple. SMP is defined in the config file, and winds up in one of
No problem, that is where I would be defining this anyway, or in the
latest case removing it and SMP for my UP kernel build.
the option files. It will be absent for standalone builds,
I am ok with that. And it would be reasonable to default to SMP.
though. These changes tweak the default to be inlined and to include the
sequence that works everywhere.
And really, it's absolutely not worth it unless someone shows up with
numbers to show the old 'function call to optimal routine' is actually
faster than the new 'inline to slightly unoptimal code'. Since I think
the function call overhead is larger than the pessimizations, I'm not
sure what the fuss is about.
I have no issues with the SMP case converting from function calls to
inline locks, I just want to retain the exact same code I had
before any of these changes, and that was a UP-built system
without any SMP locking. Is it too much to ask to keep what
already worked?
This doesn't enable or disable locks in the mutex sense. It just changes
the atomic ops for the kernel from a function call to an inlined function.
The inlining is more efficient than the call, even with the overhead added
by always inlining the same stuff. It still is faster than before.
And userland has done this forever...

So I honestly think even UP builds are better off, even if it's not
hyper-optimized for UP. The lock instruction prefix is minimal overhead
(a cycle I think).
I do not believe that LOCK is a one-cycle cost, and Bruce seems to have
evidence that it is not. And in my head I know that it can not be that
simple, as it causes lots of very special things to happen in the
pipeline to ensure you are locked.
It is definitely not one cycle. It is a couple dozen depending on the
machine. You can benchmark yourself with the attached patch.

Here's two atomics in a loop with and without lock on a Zen:

desktop# ./nonatomic
0 8838: 8 ns
desktop# ./atomic
0 32887: 32 ns

This is at ~3GHz, so three instructions per ns. Cut this in half for
per-atomic cost: 12 ns of overhead, or 36 cycles. One of them is a fetchadd
so there is extra cost associated with that. This is also worst case
because the machine can't execute anything other than the atomics. If
there were other instructions that didn't operate on overlapping memory
addresses they could run in parallel.
Jeff
This is different than the mutexes we optimize for the UP case
(and which aren't affected by this change). It's really not a big deal.

CPUs are not getting any faster, cycles are cycles, and I think we
should at least investigate further before we just start making
assumptions about the lock prefix being a 1-cycle cheap thing to do.
Modified:
  head/sys/amd64/include/atomic.h
  head/sys/i386/include/atomic.h

Modified: head/sys/amd64/include/atomic.h
==============================================================================
--- head/sys/amd64/include/atomic.h	Fri Jul  6 10:10:00 2018	(r336024)
+++ head/sys/amd64/include/atomic.h	Fri Jul  6 10:13:42 2018	(r336025)
@@ -132,7 +132,7 @@ void atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_
  * For userland, always use lock prefixes so that the binaries will run
  * on both SMP and !SMP systems.
  */
-#if defined(SMP) || !defined(_KERNEL)
+#if defined(SMP) || !defined(_KERNEL) || defined(KLD_MODULE)
 #define	MPLOCKED	"lock ; "
 #else
 #define	MPLOCKED
@@ -354,7 +354,7 @@ atomic_testandclear_long(volatile u_long *p, u_int v)
  */
 #define	OFFSETOF_MONITORBUF	0x100
-#if defined(SMP)
+#if defined(SMP) || defined(KLD_MODULE)
 static __inline void
 __storeload_barrier(void)
 {
Modified: head/sys/i386/include/atomic.h
==============================================================================
--- head/sys/i386/include/atomic.h	Fri Jul  6 10:10:00 2018	(r336024)
+++ head/sys/i386/include/atomic.h	Fri Jul  6 10:13:42 2018	(r336025)
@@ -143,7 +143,7 @@ void atomic_subtract_64(volatile uint64_t *, uint64_t
  * For userland, always use lock prefixes so that the binaries will run
  * on both SMP and !SMP systems.
  */
-#if defined(SMP) || !defined(_KERNEL)
+#if defined(SMP) || !defined(_KERNEL) || defined(KLD_MODULE)
 #define	MPLOCKED	"lock ; "
 #else
 #define	MPLOCKED
@@ -302,7 +302,7 @@ atomic_testandclear_int(volatile u_int *p, u_int v)
  */
 #if defined(_KERNEL)
-#if defined(SMP)
+#if defined(SMP) || defined(KLD_MODULE)
 #define	__storeload_barrier()	__mbk()
 #else /* _KERNEL && UP */
 #define	__storeload_barrier()	__compiler_membar()
--
Rod Grimes
rgri...@freebsd.org
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>

#define	ITER	1000000

static __inline u_int
atomic_fetchadd_int(volatile u_int *p, u_int v)
{

	__asm __volatile(
	"	lock xaddl %0,%1 ;	"
	"# atomic_fetchadd_int"
	: "+r" (v),			/* 0 */
	  "+m" (*p)			/* 1 */
	: : "cc");
	return (v);
}

static __inline void
atomic_add_int(volatile u_int *p, u_int v)
{

	__asm __volatile(
	"	lock addl %0,%1 ;	"
	: "+r" (v),			/* 0 */
	  "+m" (*p)			/* 1 */
	: : "cc");
}

int
main(int argc, char **argv)
{
	struct timeval start, end;
	volatile u_int foo = 0;		/* u_int to match the atomic ops */
	uint64_t ns;
	int i;

	gettimeofday(&start, NULL);
	for (i = 0; i < ITER; i++) {
		atomic_add_int(&foo, 1);
		atomic_fetchadd_int(&foo, -1);
	}
	gettimeofday(&end, NULL);
	end.tv_sec -= start.tv_sec;
	end.tv_usec -= start.tv_usec;
	if (end.tv_usec < 0) {		/* normalize after subtraction */
		end.tv_sec--;
		end.tv_usec += 1000000;
	}
	/* usec per (ITER / 1000) iterations == ns per iteration */
	ns = (end.tv_sec * 1000000 + end.tv_usec) / (ITER / 1000);
	printf("%ld %ld: %ju ns\n", (long)end.tv_sec, (long)end.tv_usec,
	    (uintmax_t)ns);
	return (0);
}
_______________________________________________
svn-src-all@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"