Re: svn commit: r267029 - head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs

Bruce Evans Wed, 04 Jun 2014 06:13:50 -0700

On Wed, 4 Jun 2014, Bryan Drewery wrote:

On 6/3/2014 4:06 PM, Alexander Motin wrote:

Author: mav
Date: Tue Jun  3 21:06:03 2014
New Revision: 267029
URL: http://svnweb.freebsd.org/changeset/base/267029


Log:
  Replace gethrtime() with cpu_ticks(), as source of random for the taskqueue
  selection.  gethrtime() in our port updated with HZ rate, so unusable for
  this specific purpose, completely draining benefit of multiple taskqueues.

  MFC after:    2 weeks

Modified:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c

Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
==============================================================================
--- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c   Tue Jun  3 
21:02:19 2014        (r267028)
+++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c   Tue Jun  3 
21:06:03 2014        (r267029)
@@ -953,7 +953,7 @@ spa_taskq_dispatch_ent(spa_t *spa, zio_t
        if (tqs->stqs_count == 1) {
                tq = tqs->stqs_taskq[0];
        } else {
-               tq = tqs->stqs_taskq[gethrtime() % tqs->stqs_count];
+               tq = tqs->stqs_taskq[cpu_ticks() % tqs->stqs_count];
        }

        taskq_dispatch_ent(tq, func, arg, flags, ent);


Bugs remaining in this include:
- the comment before this function still says that gethrtime() is used:

/*
 * Dispatch a task to the appropriate taskq for the ZFS I/O type and priority.
 * Note that a type may have multiple discrete taskqs to avoid lock contention
 * on the taskq itself. In that case we choose which taskq at random by using
 * the low bits of gethrtime().
 */

- timing functions are a bogus source of random bits.  The get_cyclecount()
  function should be used for this, though its existence is a bug and its
  name and implementation are worse.

On some arches (actually just i386, but should be arm too), cputick()
is the same as get_cyclecount().  cputick() has to be correct for
timing, but get_cyclecount() doesn't, so get_cyclecount() can be more
optimized, but on some arches get_cyclecount() is correct for timing
too.  The implementation of this varies gratuitously with the arch.
Only arm is very bad:

- arm: get_cyclecount() is binuptime() with bogus scaling.  This shows
  another bug in the above change -- depending on the details of the
  ticker and its scaling, there may be no random low bits at all.
  stqs_count is dynamically initialized and it is not clear if it is
  a power of 2, or a prime...  The bogus scaling is:

        binuptime(&bt);
        return ((uint64_t)bt.sec << 56 | bt.frac >> 8);

  The 8 lower bits may or may not be random (or even nonzero) but they are
  always discarded.  The remaining lower bits may or may not be random
  (or even zero).  56 top bits of the seconds part are always discarded.
  Old versions were better.  They xor'ed the seconds and fraction part.
  This is inconsistent with the bad function name, but better for
  randomness.

  It is still an error to use only the low bits.  Using all the bits modulo
  a prime would work well:

        tq = tqs->stqs_taskq[cpu_ticks() % BIGPRIME % tqs->stqs_count];

  The magic BIGPRIME only needs to be relatively prime to the infinite-
  precision maxof(typeof(get_cyclecount()) + 1), so it doesn't depend
  on the details of get_cyclecount().  For general use, BIGPRIME should
  be 64 bits, but 32 bits should be enough for anyone and the the
  64-bit division for excessive generality would be slow.  The existing
  64-bit division is already excessively general and gratuitously slow.
  Both cpu_ticks() and gethrtime() are 64 bits.  The division should have
  been reduced to 32 bits using (uint32_t)getmumbletime() % count.

  When ntpd is active, the lower bits in the fraction normally contain
  significant randomness from micro-adjustments, so they should not be
  discarded.  Otherwise, there is really no randomness in lower bits,
  but they may change in step with high bits depending on the details
  of the clock frequency and scaling.

- i386: get_cyclecount() is cputicks().  This is different from amd64
  because rdtsc() is not always available and cputicks() handles the
  details of sometimes using a timecounter.  This is unnecessarily
  slow when cputicks() doesn't use rdtsc().  cputicks() isn't inline,
  and it uses lots of function pointers.  It even uses a function
  pointer for rdtsc().  cputicks() is much more important than
  get_cyclecount().  It should be optimized.  arm should use it even
  if it is not optimized.

get_cyclecount() is abused as a cycle count in:
- netinet/sctp_os_bsd.h.  microtime() with an x86 TSC timecounter is
  plenty fast and accurate enough for timestamping 1 GBps ethernet.
  (I sometimes use nanotime(), but find that actually using the
  nanoseconds resolution usually just makes the output too verbose
  with noise in the last 3 decimal digits, so the output formatting
  routine needs to reduce to microseconds.)
- kern/ktr_time.c.  The undocumented unsupported option KTR_TIME defaults
  to being get_cyclecount().  KTR needs to timestamp low-level functions,
  so the timestamp function needs to be lower-level and nanotime() might
  not be low enough.  However, i386 sometimes uses the timecounter,
  and arm always does, to nanotime() must be low enough.  nanotime() also
  gives nicely biased and scaled timestamps.  KTR has null support for
  scaling the timestamps that it makes.  In the kernel, it just prints
  raw values in %10.10lld format.  ktrdump(1) is further from knowing how
  to scale them.  It just prints them or differences of them in %16ju
  format.  16 is excessively wide, yet not wide enough to print
  UINT64_MAX (that takes 20 decimal digits.  Only 16 hex digits).  At
  least it doesn't use the long long abomination.
- kern/subr_bus.c.  get_cyclecount() is not abused, but its result is
  assigned to a variable named attachtime.  This is fed to random_harvest()
  as the vatiable named *entropy, so it is clearly random and not a time.

The existence of get_cyclecount() is a bug since all uses are satisifed
with a timestamp and it is better to optimize the timestamp functions
than have special versions of them for just randomness.  It was created
before cputick() existed.  cputick() is a fast enough timestamp function
for all random uses (it basically uses a cycle counter if available,
else falls back to using the timecounter).

However, bugs pointed out above show that it is confusing to use
functions whose names are related to timers as a source of randomness.
A name with "random" in it would be better.  Also, some arches may
have a random register (in the CPU so it is fast to read) but no
x86-style cycle counter or fast timecounter.  These arches should use
the random register for the random function.  The random register just
doesn't need to be very wide or crypto quality.

Some of the bugs are documented in get_cyclecount(9).  It is documented
as monotonically increasing, although the xor version on arm was far
from that and the current version wraps every 256 seconds.

Bruce
_______________________________________________
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"

Re: svn commit: r267029 - head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs

Reply via email to