On Sat, Aug 05, 2023 at 01:33:05AM -0400, A Tammy wrote:
>
> On 8/5/23 00:49, Scott Cheloha wrote:
> > On Sat, Aug 05, 2023 at 12:17:48AM -0400, aisha wrote:
> >> On 22/09/10 01:53PM, Visa Hankala wrote:
> >>> On Wed, Aug 31, 2022 at 04:48:37PM -0400, aisha wrote:
> >>>> I've added a patch which adds support for NOTE_{,U,M,N}SECONDS for
> >>>> EVFILT_TIMER in the kqueue interface.
> >>>
> >>> It sort of makes sense to add an option to specify timeouts in
> >>> sub-millisecond precision.  It feels like complete overengineering
> >>> to add multiple time units on the level of the kernel interface.
> >>> However, it looks like FreeBSD and NetBSD have already done this,
> >>> following macOS' lead...
> >>>
> >>>> I've also added NOTE_ABSTIME but haven't done any actual
> >>>> implementation there, as I am not sure how the `data` field should
> >>>> be interpreted (is it absolute time in seconds since the epoch?).
> >>>
> >>> I think FreeBSD and NetBSD take NOTE_ABSTIME as time since the
> >>> epoch.
> >>>
> >>> Below is a revised patch that takes into account some corner cases.
> >>> It tries to be API-compatible with FreeBSD and NetBSD.  I have
> >>> adjusted the NOTE_{,M,U,N}SECONDS flags so that they are enum-like.
> >>>
> >>> The manual page bits are from NetBSD.
> >>>
> >>> It is quite late to introduce a feature like this within this
> >>> release cycle.  Until now, the timer code has ignored the fflags
> >>> field.  There might be pieces of software that are careless with
> >>> struct kevent and that would break as a result of this patch.
> >>> Programs that are widely used on different BSDs are probably fine
> >>> already, though.
> >>
> >> Sorry, I had forgotten this patch for a long time!!!  I've been
> >> running with this for a while now and it's been working nicely.
> >
> > Where is this being used in ports?  I think having "one of each" for
> > seconds, milliseconds, microseconds, and nanoseconds is (as visa
> > noted) way, way over-the-top.
>
> I was using it with a port that I sent out a while ago but never got
> into the tree (it was before I joined the project) -
> https://marc.info/?l=openbsd-ports&m=165715874509440&w=2
If nothing in ports is using this, I am squeamish about adding it.
Once we add it, we're stuck maintaining it, warts and all.

If www/workflow were in the tree I could see the upside.  Is it in
ports?

It looks like workflow actually wants timerfd(2) from Linux and is
simulating timerfd(2) with EVFILT_TIMER and NOTE_NSECONDS:

https://github.com/sogou/workflow/blob/80b3dfbad2264bcd79ba37811c66421490e337d2/src/kernel/poller.c#L227

I think timerfd(2) is the superior interface here.  It keeps the POSIX
interval timer semantics without all the signal delivery baggage.  It
also supports multiple clocks and starting a periodic timeout from an
absolute starting time.

So, if the goal is "add www/workflow to ports", adding timerfd(2)
might be the right thing.

> I also agree with it being over the top, but that's the way it is in
> NetBSD and FreeBSD.  I'm also fine with breaking compatibility and
> only keeping nano; no preference either way.

Well, if we're going to add it (if), we should add all of it.  The
vast majority of the code is not conversion code: if we add support
for NOTE_NSECONDS, adding support for the other units is trivial, and
there is value in being fully compatible with other implementations.

> > The original EVFILT_TIMER supported only milliseconds, yes.  Given
> > that it debuted in the late 90s, I think that was a bad choice.
> > But when milliseconds were insufficiently precise, the obvious
> > thing would be to add support for nanoseconds... and then stop.
> >
> > The decision to use the UTC clock with no option to select a
> > different clockid_t for NOTE_ABSTIME is also unfortunate.
>
> Yes, and furthermore this was very unclear, as I couldn't find it in
> the man pages for either NetBSD or FreeBSD.
>
> > Grumble.
> >
> >> I had an unrelated question inlined.
> >>
> >> [...]
> >>>  static void
> >>> -filt_timer_timeout_add(struct knote *kn)
> >>> +filt_timeradd(struct knote *kn, struct timespec *ts)
> >>>  {
> >>> -	struct timeval tv;
> >>> +	struct timespec expiry, now;
> >>>  	struct timeout *to = kn->kn_hook;
> >>>  	int tticks;
> >>>
> >>> -	tv.tv_sec = kn->kn_sdata / 1000;
> >>> -	tv.tv_usec = (kn->kn_sdata % 1000) * 1000;
> >>> -	tticks = tvtohz(&tv);
> >>> -	/* Remove extra tick from tvtohz() if timeout has fired before. */
> >>> +	if (kn->kn_sfflags & NOTE_ABSTIME) {
> >>> +		nanotime(&now);
> >>> +		if (timespeccmp(ts, &now, >)) {
> >>> +			timespecsub(ts, &now, &expiry);
> >>> +			/* XXX timeout_at_ts */
> >>> +			timeout_add(to, tstohz(&expiry));
> >
> > visa:
> >
> > we should use timeout_abs_ts() here.  I need to adjust it, though.
> >
> >>> +		} else {
> >>> +			/* Expire immediately. */
> >>> +			filt_timerexpire(kn);
> >>> +		}
> >>> +		return;
> >>> +	}
> >>> +
> >>> +	tticks = tstohz(ts);
> >>> +	/* Remove extra tick from tstohz() if timeout has fired before. */
> >>>  	if (timeout_triggered(to))
> >>>  		tticks--;
> >>
> >> I always wondered why one tick was removed; is one tick really
> >> that important?  And does a timeout firing only cost one tick?
> >
> > When you convert a duration to a count of ticks with tstohz(), it
> > adds an extra tick to the result to keep you from undershooting
> > your timeout.  You start counting your timeout at the start of the
> > *next* tick, otherwise the timeout might fire early.  However,
> > after the timeout has expired once, you no longer need the extra
> > tick because you can (more or less) assume that the timeout is
> > running at the start of the new tick.
> >
> > I know that sounds a little fuzzy, but in practice it works.
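To make the rounding concrete, here is a simplified userland model of
the conversion.  This is illustrative only: HZ and duration_to_ticks()
are invented for the example, and the kernel's tvtohz()/tstohz() also
guard against overflow.

#include <stdint.h>

#define HZ		100			/* example: 10 ms per tick */
#define TICK_NSEC	(1000000000ULL / HZ)

/*
 * Convert a duration to a tick count.  Rounding up plus one extra
 * tick guarantees the timeout never fires early: it may be armed
 * midway through the current tick.  Once a periodic timeout has
 * fired, it is rearmed at (roughly) the start of a tick, so the
 * extra tick is dropped again.
 */
static int
duration_to_ticks(uint64_t nsecs, int already_fired)
{
	uint64_t tticks;

	tticks = (nsecs + TICK_NSEC - 1) / TICK_NSEC + 1;
	if (already_fired)
		tticks--;
	return (tticks > 0) ? (int)tticks : 1;
}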
>
> Haha, these are the kind of weird idiosyncrasies that are fun to know
> about.  Thank you very much for the explanation! :D
>
> So I went around looking at how large a tick really is, and it seems
> like we get it through kern.clockrate?? (from man tick)
>
> aisha@fwall ~ $ sysctl kern.clockrate
> kern.clockrate=tick = 10000, hz = 100, profhz = 1000, stathz = 100
>
> so presumably each tick is 1/10000 of a second (is this right?), [...]

kern.clockrate's "tick" member represents the number of microseconds
in a hardclock tick.  It's just 1,000,000 / hz.  With hz = 100 that
works out to 10,000 microseconds per tick, i.e. 1/100 of a second, not
1/10000.

> and things are getting scheduled in terms of ticks, so how is it even
> possible to get nanosecond level accuracy there?

We have a nanosecond resolution timeout API, but it isn't super useful
yet because the timeout layer doesn't use the clock interrupt API.  I
am hoping to add this in the next release cycle.

> From more looking around it seems like at least x86 has the TSC,
> which provides better resolution (presumably similar things exist for
> other archs), but I don't see it being used anywhere here in an
> obvious fashion.  man pctr doesn't mention it being used for time
> measurement.

Every practical OpenBSD platform has access to a nice clock:
fixed-frequency, high resolution (1us or better), and high precision
(reads are fast).

--

Here is a revised patch:

- Only validate inputs in filt_timervalidate().  Do the input
  conversion in a separate routine, filt_timer_sdata_to_nsecs().

- Schedule the timeout in filt_timerstart().  Return zero if the
  absolute time has already expired and the timeout was not scheduled.
  The caller can then call filt_timerexpire().  This duplicates some
  code across filt_timerattach() and filt_timermodify(), but I think
  it's a little less magical: filt_timerstart() does *one* thing and
  leaves error handling to the caller.

- If the input isn't an absolute timeout, we need to round sdata up
  from 0 to 1.  This is what FreeBSD does.  I think this is bad
  behavior: a periodic timeout of zero is meaningless, and the
  sensible thing would be to reject the input with EINVAL.  But I
  didn't design the API, so that ship has sailed.

- Use the high resolution timeout API instead of the tick-based API.
  In particular, we can use the UTC clock for absolute timeouts, just
  like FreeBSD does.

- In filt_timerexpire(), use timeout_advance() to count any
  expirations we missed due to processing delays.

The UTC timeout support in kern_timeout.c is a rough draft.  There's a
lot going on in there.  But if we included it we would be more
compatible with FreeBSD.
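For illustration, this is how the new flags are meant to be used from
userland.  A hypothetical sketch, not part of the patch; the semantics
shown (unit flags in fflags, NOTE_ABSTIME measured against the UTC
clock) follow the FreeBSD API this patch mirrors:

/* Hypothetical example, assuming this patch is applied. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

#include <err.h>
#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct kevent kev[2], ev;
	struct timespec now;
	int kq, n;

	if ((kq = kqueue()) == -1)
		err(1, "kqueue");

	/* Periodic timer 1: fires every 1500 microseconds. */
	EV_SET(&kev[0], 1, EVFILT_TIMER, EV_ADD, NOTE_USECONDS, 1500, NULL);

	/* One-shot timer 2: fires when the UTC clock passes now + 2s. */
	clock_gettime(CLOCK_REALTIME, &now);
	EV_SET(&kev[1], 2, EVFILT_TIMER, EV_ADD | EV_ONESHOT,
	    NOTE_ABSTIME | NOTE_SECONDS, now.tv_sec + 2, NULL);

	if (kevent(kq, kev, 2, NULL, 0, NULL) == -1)
		err(1, "kevent");

	for (;;) {
		if ((n = kevent(kq, NULL, 0, &ev, 1, NULL)) == -1)
			err(1, "kevent");
		if (n > 0) {
			printf("timer %lu fired, %lld expiration(s)\n",
			    (unsigned long)ev.ident, (long long)ev.data);
		}
	}
}

Note that NOTE_MSECONDS is 0x00000000, so existing programs that leave
fflags zeroed keep the historical millisecond behavior.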
Index: sys/event.h
===================================================================
RCS file: /cvs/src/sys/sys/event.h,v
retrieving revision 1.69
diff -u -p -r1.69 event.h
--- sys/event.h	10 Feb 2023 14:34:17 -0000	1.69
+++ sys/event.h	8 Aug 2023 15:38:39 -0000
@@ -122,6 +122,13 @@ struct kevent {
 /* data/hint flags for EVFILT_DEVICE, shared with userspace */
 #define NOTE_CHANGE	0x00000001		/* device change event */
 
+/* additional flags for EVFILT_TIMER */
+#define NOTE_MSECONDS	0x00000000		/* data is milliseconds */
+#define NOTE_SECONDS	0x00000001		/* data is seconds */
+#define NOTE_USECONDS	0x00000002		/* data is microseconds */
+#define NOTE_NSECONDS	0x00000003		/* data is nanoseconds */
+#define NOTE_ABSTIME	0x00000010		/* timeout is absolute */
+
 /*
  * This is currently visible to userland to work around broken
  * programs which pull in <sys/proc.h> or <sys/selinfo.h>.
Index: kern/kern_event.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_event.c,v
retrieving revision 1.196
diff -u -p -r1.196 kern_event.c
--- kern/kern_event.c	11 Apr 2023 00:45:09 -0000	1.196
+++ kern/kern_event.c	8 Aug 2023 15:38:39 -0000
@@ -449,55 +449,127 @@ filt_proc(struct knote *kn, long hint)
 	return (kn->kn_fflags != 0);
 }
 
-static void
-filt_timer_timeout_add(struct knote *kn)
+#define NOTE_TIMER_UNITMASK \
+	(NOTE_SECONDS | NOTE_MSECONDS | NOTE_USECONDS | NOTE_NSECONDS)
+
+static int
+filt_timervalidate(int flags, int64_t sdata)
+{
+	if (flags & ~(NOTE_TIMER_UNITMASK | NOTE_ABSTIME))
+		return (EINVAL);
+
+	switch (flags & NOTE_TIMER_UNITMASK) {
+	case NOTE_SECONDS:
+	case NOTE_MSECONDS:
+	case NOTE_USECONDS:
+	case NOTE_NSECONDS:
+		break;
+	default:
+		return (EINVAL);
+	}
+
+	if (sdata < 0)
+		return (EINVAL);
+
+	return (0);
+}
+
+static uint64_t
+filt_timer_sdata_to_nsecs(const struct knote *kn)
+{
+	int unit = kn->kn_sfflags & NOTE_TIMER_UNITMASK;
+
+	switch (unit) {
+	case NOTE_SECONDS:
+		return SEC_TO_NSEC(kn->kn_sdata);
+	case NOTE_MSECONDS:
+		return MSEC_TO_NSEC(kn->kn_sdata);
+	case NOTE_USECONDS:
+		return USEC_TO_NSEC(kn->kn_sdata);
+	case NOTE_NSECONDS:
+		return kn->kn_sdata;
+	default:
+		panic("%s: invalid EVFILT_TIMER unit: %d", __func__, unit);
+	}
+}
+
+/*
+ * Attempt to schedule the timeout.  Returns zero if the timeout is
+ * not scheduled because the absolute time has already expired.
+ */
+static int
+filt_timerstart(struct knote *kn)
 {
-	struct timeval tv;
+	struct timespec expiry, now, timeout;
 	struct timeout *to = kn->kn_hook;
-	int tticks;
 
-	tv.tv_sec = kn->kn_sdata / 1000;
-	tv.tv_usec = (kn->kn_sdata % 1000) * 1000;
-	tticks = tvtohz(&tv);
-	/* Remove extra tick from tvtohz() if timeout has fired before. */
-	if (timeout_triggered(to))
-		tticks--;
-	timeout_add(to, (tticks > 0) ? tticks : 1);
+	NSEC_TO_TIMESPEC(filt_timer_sdata_to_nsecs(kn), &timeout);
+	if (kn->kn_sfflags & NOTE_ABSTIME) {
+		nanotime(&now);
+		if (timespeccmp(&timeout, &now, <=))
+			return 0;
+		expiry = timeout;
+		timeout_set_flags(to, filt_timerexpire, kn, KCLOCK_UTC, 0);
+	} else {
+		nanouptime(&now);
+		timespecadd(&now, &timeout, &expiry);
+		timeout_set_flags(to, filt_timerexpire, kn, KCLOCK_UPTIME, 0);
+	}
+	timeout_abs_ts(to, &expiry);
+	return 1;
 }
 
 void
 filt_timerexpire(void *knx)
 {
+	uint64_t count;
 	struct knote *kn = knx;
 	struct kqueue *kq = kn->kn_kq;
+	struct timeout *to = kn->kn_hook;
 
-	kn->kn_data++;
+	/*
+	 * One-shot timers and absolute timers expire only once.
+	 * Periodic timers, on the other hand, may expire faster
+	 * than we can service them.  timeout_advance() reschedules
+	 * a periodic timer while computing how many times the timer
+	 * expired.
+	 */
+	if ((kn->kn_flags & EV_ONESHOT) || (kn->kn_sfflags & NOTE_ABSTIME))
+		count = 1;
+	else
+		timeout_advance(to, filt_timer_sdata_to_nsecs(kn), &count);
+	kn->kn_data += count;
 
 	mtx_enter(&kq->kq_lock);
 	knote_activate(kn);
 	mtx_leave(&kq->kq_lock);
-
-	if ((kn->kn_flags & EV_ONESHOT) == 0)
-		filt_timer_timeout_add(kn);
 }
 
-
 /*
- * data contains amount of time to sleep, in milliseconds
+ * data contains a timeout.  fflags clarifies what the timeout means.
  */
 int
 filt_timerattach(struct knote *kn)
 {
 	struct timeout *to;
+	int error;
+
+	error = filt_timervalidate(kn->kn_sfflags, kn->kn_sdata);
+	if (error != 0)
+		return (error);
 
 	if (kq_ntimeouts > kq_timeoutmax)
 		return (ENOMEM);
 	kq_ntimeouts++;
 
-	kn->kn_flags |= EV_CLEAR;	/* automatically set */
-	to = malloc(sizeof(*to), M_KEVENT, M_WAITOK);
-	timeout_set(to, filt_timerexpire, kn);
+	if ((kn->kn_sfflags & NOTE_ABSTIME) == 0) {
+		kn->kn_flags |= EV_CLEAR;	/* automatically set */
+		if (kn->kn_sdata == 0)
+			kn->kn_sdata = 1;
+	}
+	to = malloc(sizeof(*to), M_KEVENT, M_WAITOK | M_ZERO);
 	kn->kn_hook = to;
-	filt_timer_timeout_add(kn);
+	if (!filt_timerstart(kn))
+		filt_timerexpire(kn);
 
 	return (0);
 }
@@ -505,11 +577,11 @@ filt_timerattach(struct knote *kn)
 void
 filt_timerdetach(struct knote *kn)
 {
-	struct timeout *to;
+	struct timeout *to = kn->kn_hook;
 
-	to = (struct timeout *)kn->kn_hook;
 	timeout_del_barrier(to);
 	free(to, M_KEVENT, sizeof(*to));
+	kn->kn_hook = NULL;
 	kq_ntimeouts--;
 }
 
@@ -518,6 +590,14 @@ filt_timermodify(struct kevent *kev, str
 {
 	struct kqueue *kq = kn->kn_kq;
 	struct timeout *to = kn->kn_hook;
+	int error;
+
+	error = filt_timervalidate(kev->fflags, kev->data);
+	if (error != 0) {
+		kev->flags |= EV_ERROR;
+		kev->data = error;
+		return (0);
+	}
 
 	/* Reset the timer.  Any pending events are discarded. */
 
@@ -531,9 +611,13 @@ filt_timermodify(struct kevent *kev, str
 	kn->kn_data = 0;
 	knote_assign(kev, kn);
 
-	/* Reinit timeout to invoke tick adjustment again. */
-	timeout_set(to, filt_timerexpire, kn);
-	filt_timer_timeout_add(kn);
+	if ((kn->kn_sfflags & NOTE_ABSTIME) == 0) {
+		kn->kn_flags |= EV_CLEAR;	/* automatically set */
+		if (kn->kn_sdata == 0)
+			kn->kn_sdata = 1;
+	}
+	if (!filt_timerstart(kn))
+		filt_timerexpire(kn);
 
 	return (0);
 }
@@ -551,7 +635,6 @@ filt_timerprocess(struct knote *kn, stru
 
 	return (active);
 }
-
 /*
  * filt_seltrue:
Index: sys/timeout.h
===================================================================
RCS file: /cvs/src/sys/sys/timeout.h,v
retrieving revision 1.47
diff -u -p -r1.47 timeout.h
--- sys/timeout.h	31 Dec 2022 16:06:24 -0000	1.47
+++ sys/timeout.h	8 Aug 2023 15:38:39 -0000
@@ -27,6 +27,7 @@
 #ifndef _SYS_TIMEOUT_H_
 #define _SYS_TIMEOUT_H_
 
+#include <sys/queue.h>
 #include <sys/time.h>
 
 struct circq {
@@ -36,6 +37,7 @@ struct circq {
 
 struct timeout {
 	struct circq to_list;			/* timeout queue, don't move */
+	TAILQ_ENTRY(timeout) to_utc_link;	/* UTC queue link */
 	struct timespec to_abstime;		/* absolute time to run at */
 	void (*to_func)(void *);		/* function to call */
 	void *to_arg;				/* function argument */
@@ -85,10 +87,12 @@ int timeout_sysctl(void *, size_t *, voi
 
 #define KCLOCK_NONE	(-1)	/* dummy clock for sanity checks */
 #define KCLOCK_UPTIME	0	/* uptime clock; time since boot */
-#define KCLOCK_MAX	1
+#define KCLOCK_UTC	1	/* UTC clock; time since unix epoch */
+#define KCLOCK_MAX	2
 
 #define TIMEOUT_INITIALIZER_FLAGS(_fn, _arg, _kclock, _flags) {	\
 	.to_list = { NULL, NULL },					\
+	.to_utc_link = { NULL, NULL },					\
 	.to_abstime = { .tv_sec = 0, .tv_nsec = 0 },			\
 	.to_func = (_fn),						\
 	.to_arg = (_arg),						\
@@ -112,6 +116,7 @@ int timeout_add_usec(struct timeout *, i
 int	timeout_add_nsec(struct timeout *, int);
 
 int	timeout_abs_ts(struct timeout *, const struct timespec *);
+int	timeout_advance(struct timeout *, uint64_t, uint64_t *);
 
 int	timeout_del(struct timeout *);
 int	timeout_del_barrier(struct timeout *);
@@ -119,6 +124,7 @@ void timeout_barrier(struct timeout *);
 
 void	timeout_adjust_ticks(int);
 void	timeout_hardclock_update(void);
+void	timeout_reset_kclock_offset(int, const struct timespec *);
 void	timeout_startup(void);
 
 #endif /* _KERNEL */
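An aside on KCLOCK_UTC: with the timeout.h changes above, kernel code
could arm a timeout against the wall clock directly.  A hypothetical
sketch (my_expire() and my_arm_utc() are invented for illustration),
mirroring what filt_timerstart() does for NOTE_ABSTIME:

/* Hypothetical: run my_expire() when the UTC clock passes *deadline. */
#include <sys/timeout.h>

struct timeout my_to;

void
my_expire(void *arg)
{
	/* handle the expiration; runs from the softclock */
}

void
my_arm_utc(const struct timespec *deadline)
{
	timeout_set_flags(&my_to, my_expire, NULL, KCLOCK_UTC, 0);
	timeout_abs_ts(&my_to, deadline);
}

When the UTC clock jumps forward, timeout_reset_kclock_offset() (in
the kern_timeout.c changes below) requeues every pending UTC timeout
so the jump is noticed promptly.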
Index: kern/kern_timeout.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_timeout.c,v
retrieving revision 1.95
diff -u -p -r1.95 kern_timeout.c
--- kern/kern_timeout.c	29 Jul 2023 06:52:08 -0000	1.95
+++ kern/kern_timeout.c	8 Aug 2023 15:38:39 -0000
@@ -75,6 +75,7 @@ struct circq timeout_wheel_kc[BUCKETS];
 struct circq timeout_new;		/* [T] New, unscheduled timeouts */
 struct circq timeout_todo;		/* [T] Due or needs rescheduling */
 struct circq timeout_proc;		/* [T] Due + needs process context */
+TAILQ_HEAD(, timeout) timeout_utc;	/* [T] UTC-based timeouts */
 time_t timeout_level_width[WHEELCOUNT];	/* [I] Wheel level width (seconds) */
 struct timespec tick_ts;		/* [I] Length of a tick (1/hz secs) */
 
@@ -166,15 +167,22 @@ struct lock_type timeout_spinlock_type =
 	((needsproc) ? &timeout_sleeplock_obj : &timeout_spinlock_obj)
 #endif
 
+void	kclock_nanotime(int, struct timespec *);
 void	softclock(void *);
 void	softclock_create_thread(void *);
 void	softclock_process_kclock_timeout(struct timeout *, int);
 void	softclock_process_tick_timeout(struct timeout *, int);
 void	softclock_thread(void *);
+int	timeout_abs_ts_locked(struct timeout *, const struct timespec *);
 void	timeout_barrier_timeout(void *);
 uint32_t timeout_bucket(const struct timeout *);
+void	timeout_dequeue(struct timeout *);
+void	timeout_enqueue(struct circq *, struct timeout *);
 uint32_t timeout_maskwheel(uint32_t, const struct timespec *);
 void	timeout_run(struct timeout *);
+uint64_t timespec_advance_nsec(struct timespec *, uint64_t,
+	    const struct timespec *);
+void	u64_sat_add(uint64_t *, uint64_t, uint64_t);
 
 /*
  * The first thing in a struct timeout is its struct circq, so we
@@ -228,6 +236,7 @@ timeout_startup(void)
 	CIRCQ_INIT(&timeout_new);
 	CIRCQ_INIT(&timeout_todo);
 	CIRCQ_INIT(&timeout_proc);
+	TAILQ_INIT(&timeout_utc);
 	for (b = 0; b < nitems(timeout_wheel); b++)
 		CIRCQ_INIT(&timeout_wheel[b]);
 	for (b = 0; b < nitems(timeout_wheel_kc); b++)
@@ -252,6 +261,25 @@ timeout_proc_init(void)
 }
 
 void
+timeout_reset_kclock_offset(int kclock, const struct timespec *offset)
+{
+	struct kclock *kc = &timeout_kclock[kclock];
+	struct timeout *to;
+
+	KASSERT(kclock == KCLOCK_UTC);
+
+	mtx_enter(&timeout_mutex);
+	if (kclock == KCLOCK_UTC && timespeccmp(&kc->kc_offset, offset, <)) {
+		TAILQ_FOREACH(to, &timeout_utc, to_utc_link) {
+			CIRCQ_REMOVE(&to->to_list);
+			CIRCQ_INSERT_TAIL(&timeout_todo, &to->to_list);
+		}
+	}
+	kc->kc_offset = *offset;
+	mtx_leave(&timeout_mutex);
+}
+
+void
 timeout_set(struct timeout *new, void (*fn)(void *), void *arg)
 {
 	timeout_set_flags(new, fn, arg, KCLOCK_NONE, 0);
@@ -273,6 +301,28 @@ timeout_set_proc(struct timeout *new, vo
 	timeout_set_flags(new, fn, arg, KCLOCK_NONE, TIMEOUT_PROC);
 }
 
+void
+timeout_dequeue(struct timeout *to)
+{
+	KASSERT(ISSET(to->to_flags, TIMEOUT_ONQUEUE));
+
+	CIRCQ_REMOVE(&to->to_list);
+	if (to->to_kclock == KCLOCK_UTC)
+		TAILQ_REMOVE(&timeout_utc, to, to_utc_link);
+	CLR(to->to_flags, TIMEOUT_ONQUEUE);
+}
+
+void
+timeout_enqueue(struct circq *queue, struct timeout *to)
+{
+	KASSERT(!ISSET(to->to_flags, TIMEOUT_ONQUEUE));
+
+	CIRCQ_INSERT_TAIL(queue, &to->to_list);
+	if (to->to_kclock == KCLOCK_UTC)
+		TAILQ_INSERT_TAIL(&timeout_utc, to, to_utc_link);
+	SET(to->to_flags, TIMEOUT_ONQUEUE);
+}
+
 int
 timeout_add(struct timeout *new, int to_ticks)
 {
@@ -297,14 +347,13 @@ timeout_add(struct timeout *new, int to_
 	 */
 	if (ISSET(new->to_flags, TIMEOUT_ONQUEUE)) {
 		if (new->to_time - ticks < old_time - ticks) {
-			CIRCQ_REMOVE(&new->to_list);
-			CIRCQ_INSERT_TAIL(&timeout_new, &new->to_list);
+			timeout_dequeue(new);
+			timeout_enqueue(&timeout_new, new);
 		}
 		tostat.tos_readded++;
 		ret = 0;
 	} else {
-		SET(new->to_flags, TIMEOUT_ONQUEUE);
-		CIRCQ_INSERT_TAIL(&timeout_new, &new->to_list);
+		timeout_enqueue(&timeout_new, new);
 	}
 #if NKCOV > 0
 	if (!kcov_cold)
@@ -383,13 +432,23 @@ timeout_add_nsec(struct timeout *to, int
 int
 timeout_abs_ts(struct timeout *to, const struct timespec *abstime)
 {
-	struct timespec old_abstime;
-	int ret = 1;
+	int status;
 
 	mtx_enter(&timeout_mutex);
+	status = timeout_abs_ts_locked(to, abstime);
+	mtx_leave(&timeout_mutex);
+	return status;
+}
+
+int
+timeout_abs_ts_locked(struct timeout *to, const struct timespec *abstime)
+{
+	struct timespec old_abstime;
+	int ret = 1;
 
+	MUTEX_ASSERT_LOCKED(&timeout_mutex);
 	KASSERT(ISSET(to->to_flags, TIMEOUT_INITIALIZED));
-	KASSERT(to->to_kclock != KCLOCK_NONE);
+	KASSERT(to->to_kclock > KCLOCK_NONE && to->to_kclock < KCLOCK_MAX);
 
 	old_abstime = to->to_abstime;
 	to->to_abstime = *abstime;
@@ -397,14 +456,13 @@ timeout_abs_ts(struct timeout *to, const
 
 	if (ISSET(to->to_flags, TIMEOUT_ONQUEUE)) {
 		if (timespeccmp(abstime, &old_abstime, <)) {
-			CIRCQ_REMOVE(&to->to_list);
-			CIRCQ_INSERT_TAIL(&timeout_new, &to->to_list);
+			timeout_dequeue(to);
+			timeout_enqueue(&timeout_new, to);
 		}
 		tostat.tos_readded++;
 		ret = 0;
 	} else {
-		SET(to->to_flags, TIMEOUT_ONQUEUE);
-		CIRCQ_INSERT_TAIL(&timeout_new, &to->to_list);
+		timeout_enqueue(&timeout_new, to);
 	}
 #if NKCOV > 0
 	if (!kcov_cold)
@@ -412,9 +470,26 @@ timeout_abs_ts(struct timeout *to, const
 #endif
 	tostat.tos_added++;
+	return ret;
+}
+
+int
+timeout_advance(struct timeout *to, uint64_t intvl, uint64_t *ocount)
+{
+	struct timespec next, now;
+	uint64_t count;
+	int status;
+
+	mtx_enter(&timeout_mutex);
+	kclock_nanotime(to->to_kclock, &now);
+	next = to->to_abstime;
+	count = timespec_advance_nsec(&next, intvl, &now);
+	status = timeout_abs_ts_locked(to, &next);
 	mtx_leave(&timeout_mutex);
 
-	return ret;
+	if (ocount != NULL)
+		*ocount = count;
+	return status;
 }
 
 int
@@ -424,8 +499,7 @@ timeout_del(struct timeout *to)
 
 	mtx_enter(&timeout_mutex);
 	if (ISSET(to->to_flags, TIMEOUT_ONQUEUE)) {
-		CIRCQ_REMOVE(&to->to_list);
-		CLR(to->to_flags, TIMEOUT_ONQUEUE);
+		timeout_dequeue(to);
 		tostat.tos_cancelled++;
 		ret = 1;
 	}
@@ -468,11 +542,10 @@ timeout_barrier(struct timeout *to)
 
 	mtx_enter(&timeout_mutex);
 
 	barrier.to_time = ticks;
-	SET(barrier.to_flags, TIMEOUT_ONQUEUE);
 	if (procflag)
-		CIRCQ_INSERT_TAIL(&timeout_proc, &barrier.to_list);
+		timeout_enqueue(&timeout_proc, &barrier);
 	else
-		CIRCQ_INSERT_TAIL(&timeout_todo, &barrier.to_list);
+		timeout_enqueue(&timeout_todo, &barrier);
 
 	mtx_leave(&timeout_mutex);
 
@@ -496,19 +569,18 @@ uint32_t
 timeout_bucket(const struct timeout *to)
 {
 	struct timespec diff, shifted_abstime;
-	struct kclock *kc;
+	struct kclock *kc = &timeout_kclock[to->to_kclock];
 	uint32_t level;
 
-	KASSERT(to->to_kclock == KCLOCK_UPTIME);
-	kc = &timeout_kclock[to->to_kclock];
-
+	KASSERT(to->to_kclock > KCLOCK_NONE && to->to_kclock < KCLOCK_MAX);
 	KASSERT(timespeccmp(&kc->kc_lastscan, &to->to_abstime, <));
+
 	timespecsub(&to->to_abstime, &kc->kc_lastscan, &diff);
 	for (level = 0; level < nitems(timeout_level_width) - 1; level++) {
 		if (diff.tv_sec < timeout_level_width[level])
 			break;
 	}
-	timespecadd(&to->to_abstime, &kc->kc_offset, &shifted_abstime);
+	timespecsub(&to->to_abstime, &kc->kc_offset, &shifted_abstime);
 	return level * WHEELSIZE + timeout_maskwheel(level, &shifted_abstime);
 }
 
@@ -620,7 +692,6 @@ timeout_run(struct timeout *to)
 
 	MUTEX_ASSERT_LOCKED(&timeout_mutex);
 
-	CLR(to->to_flags, TIMEOUT_ONQUEUE);
 	SET(to->to_flags, TIMEOUT_TRIGGERED);
 
 	fn = to->to_func;
@@ -652,14 +723,13 @@ softclock_process_kclock_timeout(struct
 		tostat.tos_scheduled++;
 		if (!new)
 			tostat.tos_rescheduled++;
-		CIRCQ_INSERT_TAIL(&timeout_wheel_kc[timeout_bucket(to)],
-		    &to->to_list);
+		timeout_enqueue(&timeout_wheel_kc[timeout_bucket(to)], to);
 		return;
 	}
 	if (!new && timespeccmp(&to->to_abstime, &kc->kc_late, <=))
 		tostat.tos_late++;
 	if (ISSET(to->to_flags, TIMEOUT_PROC)) {
-		CIRCQ_INSERT_TAIL(&timeout_proc, &to->to_list);
+		timeout_enqueue(&timeout_proc, to);
 		return;
 	}
 	timeout_run(to);
@@ -675,13 +745,13 @@ softclock_process_tick_timeout(struct ti
 		tostat.tos_scheduled++;
 		if (!new)
 			tostat.tos_rescheduled++;
-		CIRCQ_INSERT_TAIL(&BUCKET(delta, to->to_time), &to->to_list);
+		timeout_enqueue(&BUCKET(delta, to->to_time), to);
 		return;
 	}
 	if (!new && delta < 0)
 		tostat.tos_late++;
 	if (ISSET(to->to_flags, TIMEOUT_PROC)) {
-		CIRCQ_INSERT_TAIL(&timeout_proc, &to->to_list);
+		timeout_enqueue(&timeout_proc, to);
 		return;
 	}
 	timeout_run(to);
@@ -697,11 +767,8 @@ softclock_process_tick_timeout(struct ti
 void
 softclock(void *arg)
 {
-	struct timeout *first_new, *to;
-	int needsproc, new;
-
-	first_new = NULL;
-	new = 0;
+	struct timeout *first_new = NULL, *to;
+	int needsproc, new = 0;
 
 	mtx_enter(&timeout_mutex);
 	if (!CIRCQ_EMPTY(&timeout_new))
@@ -709,7 +776,7 @@ softclock(void *arg)
 	CIRCQ_CONCAT(&timeout_todo, &timeout_new);
 	while (!CIRCQ_EMPTY(&timeout_todo)) {
 		to = timeout_from_circq(CIRCQ_FIRST(&timeout_todo));
-		CIRCQ_REMOVE(&to->to_list);
+		timeout_dequeue(to);
 		if (to == first_new)
 			new = 1;
 		if (to->to_kclock != KCLOCK_NONE)
@@ -758,7 +825,7 @@ softclock_thread(void *arg)
 		mtx_enter(&timeout_mutex);
 		while (!CIRCQ_EMPTY(&timeout_proc)) {
 			to = timeout_from_circq(CIRCQ_FIRST(&timeout_proc));
-			CIRCQ_REMOVE(&to->to_list);
+			timeout_dequeue(to);
 			timeout_run(to);
 			tostat.tos_run_thread++;
 		}
@@ -768,6 +835,108 @@ softclock_thread(void *arg)
 	splx(s);
 }
 
+void
+kclock_nanotime(int kclock, struct timespec *now)
+{
+	switch (kclock) {
+	case KCLOCK_UPTIME:
+		nanouptime(now);
+		return;
+	case KCLOCK_UTC:
+		nanotime(now);
+		return;
+	default:
+		panic("%s: invalid kclock: %d", __func__, kclock);
+	}
+}
+
+void
+u64_sat_add(uint64_t *sum, uint64_t a, uint64_t b)
+{
+	if (a + b < a)
+		*sum = UINT64_MAX;
+	else
+		*sum = a + b;
+}
+
+/*
+ * Given an interval timer with a period of intvl that last expired
+ * at absolute time abs, find the timer's next expiration time and
+ * write it back to abs.  If abs has not yet expired, abs is not
+ * modified.
+ *
+ * Returns the number of intervals that have elapsed.  If the number
+ * of elapsed intervals would overflow a 64-bit integer, UINT64_MAX is
+ * returned.  Note that abs marks the end of the first interval: if
+ * abs has not expired, zero intervals have elapsed.
+ */
+uint64_t
+timespec_advance_nsec(struct timespec *abs, uint64_t intvl,
+    const struct timespec *now)
+{
+	struct timespec base, diff, minbase, next, intvl_product;
+	struct timespec intvl_product_max, intvl_ts;
+	uint64_t count = 0, quo;
+
+	/* Unusual case: abs has not expired, no intervals have elapsed. */
+	if (timespeccmp(now, abs, <)) {
+		if (intvl == 0)
+			panic("%s: intvl is zero", __func__);
+		return 0;
+	}
+
+	/* Typical case: abs has expired and only one interval has elapsed. */
+	NSEC_TO_TIMESPEC(intvl, &intvl_ts);
+	timespecadd(abs, &intvl_ts, &next);
+	if (timespeccmp(now, &next, <)) {
+		*abs = next;
+		return 1;
+	}
+
+	/*
+	 * Annoying case: two or more intervals have elapsed.
+	 *
+	 * Find a base within interval-product range of the current time.
+	 * Under normal circumstances abs will already be within range,
+	 * but for the sake of correctness we handle cases where enormous
+	 * expanses of time have passed between abs and now.
+	 */
+	quo = UINT64_MAX / intvl;
+	NSEC_TO_TIMESPEC(quo * intvl, &intvl_product_max);
+	timespecsub(now, &intvl_product_max, &minbase);
+	base = *abs;
+	if (__predict_false(timespeccmp(&base, &minbase, <))) {
+		while (timespeccmp(&base, &minbase, <)) {
+			timespecadd(&base, &intvl_product_max, &base);
+			u64_sat_add(&count, count, quo);
+		}
+	}
+
+	/*
+	 * We have a base within range.  Now find the interval-product
+	 * that, when added to the base, gets us just past the current
+	 * time to the most imminent expiration point.
+	 *
+	 * If the product would overflow a 64-bit integer we advance the
+	 * base by one interval and retry.  This can happen at most once.
+	 *
+	 * The next expiration is then the sum of the base and the
+	 * interval-product.
+	 */
+	for (;;) {
+		timespecsub(now, &base, &diff);
+		quo = TIMESPEC_TO_NSEC(&diff) / intvl;
+		if (__predict_true(intvl * quo <= UINT64_MAX - intvl))
+			break;
+		timespecadd(&base, &intvl_ts, &base);
+		u64_sat_add(&count, count, quo);
+	}
+	NSEC_TO_TIMESPEC(intvl * (quo + 1), &intvl_product);
+	timespecadd(&base, &intvl_product, abs);
+	u64_sat_add(&count, count, quo + 1);
+	return count;
+}
+
 #ifndef SMALL_KERNEL
 void
 timeout_adjust_ticks(int adj)
@@ -791,8 +960,8 @@ timeout_adjust_ticks(int adj)
 			/* when moving a timeout forward need to reinsert it */
 			if (to->to_time - ticks < adj)
 				to->to_time = new_ticks;
-			CIRCQ_REMOVE(&to->to_list);
-			CIRCQ_INSERT_TAIL(&timeout_todo, &to->to_list);
+			timeout_dequeue(to);
+			timeout_enqueue(&timeout_todo, to);
 		}
 	}
 	ticks = new_ticks;
@@ -824,6 +993,8 @@ db_kclock(int kclock)
 	switch (kclock) {
 	case KCLOCK_UPTIME:
 		return "uptime";
+	case KCLOCK_UTC:
+		return "utc";
 	default:
 		return "invalid";
 	}
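To illustrate the catch-up arithmetic in timespec_advance_nsec(), here
is a standalone model of the common path, with the timespec handling
and 64-bit overflow guards stripped out.  The advance() helper and the
numbers in main() are invented for the example:

#include <stdint.h>
#include <stdio.h>

/*
 * Model of timespec_advance_nsec(): abs is the timer's last
 * expiration (ns), intvl its period (ns), now the current time (ns).
 * Advance abs to the most imminent future expiration and return how
 * many expirations have elapsed.
 */
static uint64_t
advance(uint64_t *abs, uint64_t intvl, uint64_t now)
{
	uint64_t quo;

	if (now < *abs)			/* abs has not expired yet */
		return 0;
	quo = (now - *abs) / intvl;	/* whole intervals we fell behind */
	*abs += (quo + 1) * intvl;	/* skip to just past "now" */
	return quo + 1;
}

int
main(void)
{
	/* Period 10ms; last expiry at t=50ms; the clock now reads t=87ms. */
	uint64_t abs = 50000000, count;

	count = advance(&abs, 10000000, 87000000);

	/*
	 * Expirations at 50, 60, 70, and 80ms were due, so count is 4
	 * and the next expiration is scheduled for t=90ms.
	 */
	printf("count=%llu next=%llums\n", (unsigned long long)count,
	    (unsigned long long)(abs / 1000000));
	return 0;
}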